Published in PHP Architect on 19 Apr 2005

Markup Basics

There are many ways to mark up content. In a plain text environment, there are some common forms of markup that have been adopted specifically for the purpose of being easy to interpret by a human. Examples are *bold*, /italics/, and _underline_.

The markup format most familiar to web developers is HTML. The same examples in HTML are <b>bold</b>, <i>italics</i>, and <u>underline</u>.

BBCode introduces a new vocabulary, and unfortunately there is no standard to which developers can adhere. However, the simplest elements are consistently implemented. Examples include [b]bold[/b], [i]italics[/i], and [u]underline[/u].

If the BBCode vocabulary were limited to these simple elements, it would offer very little benefit over HTML, unless you happen to think square brackets are more user-friendly than angled brackets.

HTML Versus BBCode

To assess the advantages of BBCode, it is best to compare two implementations: one that allows users to enter a subset of HTML, and one that allows a subset of BBCode.

Rather than use an existing solution such as PEAR::HTML_BBCodeParser for this discussion, I perform a manual translation of the markup in order to make the comparison as controlled as possible.

As regular readers of Security Corner know, input must always be filtered. When you're allowing users to enter very complex data, creating a whitelist of acceptable characters can be very difficult. Because of this, many developers employ very weak filtering rules for such input and rely on the escaping performed by htmlentities() for protection.

While htmlentities() can save you from poorly filtered data, relying on escaping alone is not ideal. Because an attacker can send any type of data, it’s equally unwise to rely on BBCode for protection — you can’t assume that the attackers will abide by your rules unless you enforce those rules in your programming logic.

To better illustrate these points, consider a simple form that allows anonymous users to provide a comment:

  <form action="comment.php" method="POST">
  <p>Comment: <input type="text" name="comment" /></p>
  <p><input type="submit" /></p>
  </form>
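Before a comment submitted through this form is stored or displayed, it should pass through a filter. The following is only a minimal sketch of what such a filter might look like: the function name filter_comment(), the 500-byte limit, and the UTF-8 requirement are all assumptions made for illustration, not rules taken from this column. Note that it rejects invalid input rather than trying to repair it.

```php
<?php

// Hypothetical filter: returns the comment unchanged if it passes every
// check, or null if it should be rejected. Nothing is modified.
function filter_comment($input, int $maxLength = 500): ?string
{
    // A field sent as comment[]=x arrives as an array, not a string.
    if (!is_string($input)) {
        return null;
    }

    // Reject empty comments and anything over the (assumed) byte limit.
    if ($input === '' || strlen($input) > $maxLength) {
        return null;
    }

    // With the "u" modifier, preg_match() fails on invalid UTF-8,
    // so an empty pattern doubles as an encoding check.
    if (preg_match('//u', $input) !== 1) {
        return null;
    }

    return $input;
}
```

In comment.php, this might be called as filter_comment($_POST['comment'] ?? null), with a null result treated as a rejected submission.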

From a security perspective, the two implementations differ mainly at the point where the output is escaped and presented as part of the page, so that users can view previous comments.

If a subset of HTML is allowed, the approach is to escape every character and then deliberately reverse the escaping for the permitted tags, allowing them to be interpreted:

  <?php

  foreach ($comments as $comment) {
    // Escape everything first (this assumes the page is served as UTF-8).
    $comment = htmlentities($comment, ENT_QUOTES, 'UTF-8');

    // Then restore only the permitted tags.
    $comment = str_replace('&lt;b&gt;', '<b>', $comment);
    $comment = str_replace('&lt;/b&gt;', '</b>', $comment);

    $comment = str_replace('&lt;i&gt;', '<i>', $comment);
    $comment = str_replace('&lt;/i&gt;', '</i>', $comment);

    $comment = str_replace('&lt;u&gt;', '<u>', $comment);
    $comment = str_replace('&lt;/u&gt;', '</u>', $comment);

    echo "<p>{$comment}</p>";
  }

  ?>

If the same markup is allowed, but only with BBCode, this example becomes the following:

  <?php

  foreach ($comments as $comment) {
    // Escape everything first (this assumes the page is served as UTF-8).
    $comment = htmlentities($comment, ENT_QUOTES, 'UTF-8');

    // Then translate only the permitted BBCode tags into HTML.
    $comment = str_replace('[b]', '<b>', $comment);
    $comment = str_replace('[/b]', '</b>', $comment);

    $comment = str_replace('[i]', '<i>', $comment);
    $comment = str_replace('[/i]', '</i>', $comment);

    $comment = str_replace('[u]', '<u>', $comment);
    $comment = str_replace('[/u]', '</u>', $comment);

    echo "<p>{$comment}</p>";
  }

  ?>
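To make the comparison concrete, the two loops can be folded into a single helper in which the only thing that changes is the table of permitted tokens. This is a sketch rather than a recommended implementation: render_comment() and both token tables are hypothetical names introduced here, and UTF-8 is assumed.

```php
<?php

// Hypothetical helper: escape everything, then restore only the
// tokens listed in $tokens.
function render_comment(string $comment, array $tokens): string
{
    $escaped = htmlentities($comment, ENT_QUOTES, 'UTF-8');

    // strtr() with an array replaces every listed token in one pass.
    return strtr($escaped, $tokens);
}

// Permitted HTML subset, written as it appears after escaping.
$htmlTokens = [
    '&lt;b&gt;' => '<b>', '&lt;/b&gt;' => '</b>',
    '&lt;i&gt;' => '<i>', '&lt;/i&gt;' => '</i>',
    '&lt;u&gt;' => '<u>', '&lt;/u&gt;' => '</u>',
];

// The same subset expressed as BBCode.
$bbcodeTokens = [
    '[b]' => '<b>', '[/b]' => '</b>',
    '[i]' => '<i>', '[/i]' => '</i>',
    '[u]' => '<u>', '[/u]' => '</u>',
];

$attack = '<script>alert(1)</script>';

$fromHtml   = render_comment('<b>bold</b> ' . $attack, $htmlTokens);
$fromBbcode = render_comment('[b]bold[/b] ' . $attack, $bbcodeTokens);

// Both lines print: <b>bold</b> &lt;script&gt;alert(1)&lt;/script&gt;
echo $fromHtml, "\n";
echo $fromBbcode, "\n";
```

In both cases the script tag stays escaped, and it is the call to htmlentities() that keeps it so, not the choice of bracket.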

As you can clearly see, there is very little difference in the treatment of this data. While some might argue that using BBCode allows you to use strip_tags() to eliminate any HTML, it’s important to realize that this is no safer than using strip_tags() with its optional second parameter, which allows some HTML tags. In either case, strip_tags() has a few notable weaknesses:

  1. It is a blacklist approach.
  2. It violates the security principle that says that invalid data should not be modified in order to make it valid.
  3. It is not as exhaustive as htmlentities().
  4. It does not consider character encoding.

There is, in fact, no security benefit to allowing BBCode versus a subset of HTML.
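One way to see why strip_tags() is less exhaustive than htmlentities() is to give both the same hostile input. The input string below is invented for illustration; the behavior shown (attributes surviving inside an allowed tag) is what the PHP manual warns about.

```php
<?php

// Invented example input: an allowed tag carrying an event handler,
// followed by a script element.
$input = '<b onmouseover="alert(1)">bold</b><script>alert(2)</script>';

// Allowing <b> keeps the whole tag, attributes included. The script
// tags are removed, but the data has been silently modified.
$stripped = strip_tags($input, '<b>');

// Escaping leaves nothing for the browser to interpret as markup.
$escaped = htmlentities($input, ENT_QUOTES, 'UTF-8');

echo $stripped, "\n"; // <b onmouseover="alert(1)">bold</b>alert(2)
echo $escaped, "\n";
```

The event handler passes straight through strip_tags(), while the escaped version contains no angle brackets or raw quotes at all.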

Why Use BBCode?

BBCode isn't entirely useless. Some BBCode markup can potentially be easier for users to remember and understand. For example, consider using a red font:

  [color=red]red text[/color]

There isn't an HTML equivalent that's quite this intuitive. Of course, this could be made just as easy with something that closely resembles HTML markup:

  <red>red text</red>
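A tag that carries a value, such as [color=red], cannot be handled with a fixed str_replace() pair, because the value itself has to be checked. The following is a minimal sketch of one way this might be done, assuming UTF-8 and a small whitelist of color names. ALLOWED_COLORS and render_color() are hypothetical names, and the span markup is just one possible output.

```php
<?php

// Hypothetical whitelist; any other color is left as escaped text.
const ALLOWED_COLORS = ['red', 'green', 'blue'];

function render_color(string $comment): string
{
    // Escape first, exactly as in the simpler examples.
    $escaped = htmlentities($comment, ENT_QUOTES, 'UTF-8');

    // The value may contain only lowercase letters, so an escaped
    // quote or space can never be part of a match.
    return preg_replace_callback(
        '/\[color=([a-z]+)\](.*?)\[\/color\]/s',
        function (array $m): string {
            if (!in_array($m[1], ALLOWED_COLORS, true)) {
                return $m[0]; // unknown color: show the text as typed
            }
            return '<span style="color: ' . $m[1] . '">' . $m[2] . '</span>';
        },
        $escaped
    );
}

// Prints: <span style="color: red">red text</span>
echo render_color('[color=red]red text[/color]'), "\n";
```

Because the value is matched against a whitelist rather than copied blindly into the style attribute, an attempt such as [color=red" onmouseover="x] never produces a span.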

Another potential advantage of BBCode is that it helps to eliminate collisions between HTML that users want to be interpreted and HTML that they do not. For example, I might intend to bold something:

  This comment is <b>bold</b>.

Instead, I might intend to tell someone else how to bold something in HTML:

  You bold things like this: <b>bold</b>.

If BBCode were allowed, it would be easier for a user to distinguish between these two scenarios. Without BBCode, the user has to enter a comment like this:

  You bold things like this: &lt;b&gt;bold&lt;/b&gt;.

After htmlentities(), this becomes:

  You bold things like this: &amp;lt;b&amp;gt;bold&amp;lt;/b&amp;gt;.

Therefore, the bold tags will not get translated back, but they will be displayed in the browser as the user intends.
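This double escaping can be checked with a short script. It is a sketch that reuses the str_replace() translation for the bold tags only, with UTF-8 assumed; the variable names are illustrative.

```php
<?php

// What the user has to type in order to show the tags literally.
$typed = 'You bold things like this: &lt;b&gt;bold&lt;/b&gt;.';

// Every & becomes &amp;, so &lt; turns into &amp;lt;
$escaped = htmlentities($typed, ENT_QUOTES, 'UTF-8');

// The translation step looks for &lt;b&gt; and finds no match.
$rendered = str_replace(
    ['&lt;b&gt;', '&lt;/b&gt;'],
    ['<b>', '</b>'],
    $escaped
);

// Prints: You bold things like this: &amp;lt;b&amp;gt;bold&amp;lt;/b&amp;gt;.
echo $rendered, "\n";
```

The browser then decodes &amp;lt; to the literal text &lt;, which is what the reader sees on the page.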

This is where the gap in user-friendliness becomes clear, and I think this is the strongest case in favor of implementing BBCode. Of course, any time users want to explain to other users the actual markup vocabulary used in the comments, the situation is going to be slightly complicated using either approach.

Until Next Time…

Hopefully, you now realize that BBCode is not something that increases the security of your application in any way. It can, however, offer some advantages over a subset of HTML. The appropriate choice depends upon your own needs and the opinions of your users. Choose whichever method best suits you, but don't fool yourself into thinking that security has anything to do with the decision.

Until next month, be safe.