Allowing HTML and Preventing XSS

13 Mar 2007

One of the most common problems faced by web developers is allowing some HTML without creating XSS vulnerabilities in the process. This problem comes up more and more often due to the rise of social networking and other Web 2.0 properties that embolden users.

Sorry, I couldn't resist using the word embolden. :-)

There have been numerous solutions to this problem, some of which are pretty good. In a previous post where I casually mentioned this topic, a few people made some recommendations.

Of course, BBCode inevitably comes up during these types of discussions, but I really hate the idea of using yet another markup language just because I'm too lazy to deal with HTML, especially if the markup language doesn't even try to be user-friendly. Edward Yang, the author of HTML Purifier, seems to agree:

"BBCode came to life when developers were too lazy to parse HTML correctly and decided to invent their own markup language. As with all products of laziness, the result is completely inconsistent, unstandardized, and widely adopted."

Why isn't there a good, standard solution to this problem? I think it's because everyone (including me) has slightly different requirements. Creating a solution that caters to everyone's needs is likely to yield an overly complex and error-prone approach, so it's not necessarily bad that multiple solutions exist.

For my new blog, I want to let readers mark up their comments to help them communicate more effectively. One of the most essential features is the ability to format code, because unformatted code can be difficult to follow. It's also important that no content is removed. I detest commenting on blogs where my comment is passed through something like strip_tags(), effectively mangling what I'm trying to say. It reminds me of using an IM client that tries to identify smilies and replace them with images, often making responses difficult to decipher.

I have reviewed several existing solutions, experimented with solutions that use DOM and Tidy, and eventually resorted to a dirt-simple approach that I'd like to share with you now.

I don't recommend using this approach until it has been reviewed and vetted by others. (Use at your own risk.)

The fundamental concept is to make the content safe by default, then carefully translate specific patterns back to valid (standards-compliant) markup. The basic framework, which allows no markup, is as follows:

    <?php

    /* Normalize Newlines */
    $html = str_replace("\r", "\n", $html);
    $html = preg_replace("!\n\n+!", "\n", $html);

    /* Escaped (Safe) by Default */
    $html = htmlentities($html, ENT_QUOTES, 'UTF-8');

    /* Make Paragraphs */
    $lines = explode("\n", $html);
    foreach ($lines as $key => $line) {
        $lines[$key] = "<p>{$line}</p>";
    }
    $html = implode("\n", $lines);

    ?>

This lets people type plain comments without the need for any markup, and they can still discuss anything they want without losing part of their comment. Of course, this is the easy part, because no HTML is allowed.
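To see the effect of the framework, the steps above can be wrapped in a function and run against a hostile comment. This is only a sketch: the function name plain_comment() is hypothetical and not part of the original code, and the type declarations assume PHP 7 or later.

```php
<?php

// Hypothetical wrapper around the three steps above; the name is illustrative.
function plain_comment(string $html): string
{
    // Normalize newlines and collapse runs of blank lines.
    $html = str_replace("\r", "\n", $html);
    $html = preg_replace("!\n\n+!", "\n", $html);

    // Escape everything, so no markup survives by default.
    $html = htmlentities($html, ENT_QUOTES, 'UTF-8');

    // Wrap each remaining line in a paragraph.
    $lines = explode("\n", $html);
    foreach ($lines as $key => $line) {
        $lines[$key] = "<p>{$line}</p>";
    }

    return implode("\n", $lines);
}

echo plain_comment("Nice post!\r\n\r\n<script>alert('XSS')</script>"), "\n";
// <p>Nice post!</p>
// <p>&lt;script&gt;alert(&#039;XSS&#039;)&lt;/script&gt;</p>
```

The script tag is displayed as text rather than removed, so the commenter's words stay intact.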

Allowing simple tags like <em> can be accomplished like this:

    <?php

    /* Emphasized Text */
    $html = preg_replace('!&lt;em&gt;(.*?)&lt;/em&gt;!m',
                         '<em>$1</em>',
                         $html);

    ?>

Keep in mind that this replacement is taking place after $html has been escaped, so whatever is matched by .*? (and represented by $1) is already escaped. I don't use greedy matching for this particular pattern, so .*? matches as little as possible to satisfy the pattern. You might prefer greedy matching (.* without the question mark), and ultimately, it only makes a difference in edge cases, such as when users want to use <em> tags as well as talk about them. Allowing users to preview comments before posting gives them the opportunity to correct any problems that arise from such cases.
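The difference between the two quantifiers is easiest to see side by side. The following sketch applies a lazy and a greedy variant of the pattern to the same escaped input; the sample string and variable names are made up for illustration.

```php
<?php

// An already-escaped comment containing two separate <em> elements.
$html = '&lt;em&gt;one&lt;/em&gt; and &lt;em&gt;two&lt;/em&gt;';

// Lazy (.*?): each pair of tags is matched on its own.
$lazy = preg_replace('!&lt;em&gt;(.*?)&lt;/em&gt;!', '<em>$1</em>', $html);

// Greedy (.*): one match runs from the first opening tag to the last closing tag.
$greedy = preg_replace('!&lt;em&gt;(.*)&lt;/em&gt;!', '<em>$1</em>', $html);

echo $lazy, "\n";   // <em>one</em> and <em>two</em>
echo $greedy, "\n"; // <em>one&lt;/em&gt; and &lt;em&gt;two</em>
```

In both cases the captured text stays escaped, so the choice affects only how the comment is displayed, not whether it is safe.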

Allowing <blockquote> is also pretty straightforward:

    <?php

    /* Blockquotes */
    $html = preg_replace('!^&lt;blockquote&gt;(?:&lt;p&gt;)?(.*?)(?:&lt;\/p&gt;)?&lt;\/blockquote&gt;$!m',
                         '<blockquote><p>$1</p></blockquote>',
                         $html);

    ?>

As you can see, I want to accommodate users who forget to use <p> tags, but I want to make sure the output is valid regardless.
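Here is a small sketch of that normalization. It assumes the replacement runs on escaped text in which each blockquote occupies a line by itself; the post does not say where this step sits relative to the paragraph wrapping, so treat the input below as illustrative.

```php
<?php

$pattern = '!^&lt;blockquote&gt;(?:&lt;p&gt;)?(.*?)(?:&lt;\/p&gt;)?&lt;\/blockquote&gt;$!m';
$replacement = '<blockquote><p>$1</p></blockquote>';

// Two already-escaped lines: one without <p> tags, one with them.
$html = "&lt;blockquote&gt;No paragraph tags&lt;/blockquote&gt;\n"
      . "&lt;blockquote&gt;&lt;p&gt;With paragraph tags&lt;/p&gt;&lt;/blockquote&gt;";

$result = preg_replace($pattern, $replacement, $html);

echo $result, "\n";
// <blockquote><p>No paragraph tags</p></blockquote>
// <blockquote><p>With paragraph tags</p></blockquote>
```

Both forms of input produce the same structure, because the optional groups consume the user's own <p> tags when they are present.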

Links are a bit trickier. Consider the following:

    <?php

    /* Links */
    $html = preg_replace('!&lt;a +href=&quot;(.*?)&quot;(?: +title=&quot;(.*?)&quot;)? *&gt;(.*?)&lt;/a&gt;!m',
                         '<a href="$1" title="$2">$3</a>',
                         $html);

    ?>

The content represented by $1 is already escaped, but it has a special meaning in this context. Users who click the link text ($3) will initiate a request to the URL identified by $1. Imagine a link to javascript:alert('XSS'). Although the script runs only when someone clicks the link, the result is still undesirable, because users might be tricked into clicking a link that executes malicious JavaScript. For this reason, you might consider restricting the pattern further:

    <?php

    /* Links */
    $html = preg_replace('!&lt;a +href=&quot;((?:ht|f)tps?://.*?)&quot;(?: +title=&quot;(.*?)&quot;)? *&gt;(.*?)&lt;/a&gt;!m',
                         '<a href="$1" title="$2">$3</a>',
                         $html);

    ?>
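As a quick check of the restricted pattern, this sketch escapes two sample comments and applies the replacement to each. The sample links and variable names are invented for the example.

```php
<?php

$pattern = '!&lt;a +href=&quot;((?:ht|f)tps?://.*?)&quot;(?: +title=&quot;(.*?)&quot;)? *&gt;(.*?)&lt;/a&gt;!m';
$replacement = '<a href="$1" title="$2">$3</a>';

$comments = [
    '<a href="http://example.org/" title="Example">a safe link</a>',
    '<a href="javascript:alert(1)">a hostile link</a>',
];

$results = [];
foreach ($comments as $comment) {
    // Escape first, then translate only the patterns that are allowed.
    $html = htmlentities($comment, ENT_QUOTES, 'UTF-8');
    $results[] = preg_replace($pattern, $replacement, $html);
}

echo implode("\n", $results), "\n";
// <a href="http://example.org/" title="Example">a safe link</a>
// &lt;a href=&quot;javascript:alert(1)&quot;&gt;a hostile link&lt;/a&gt;
```

The second comment never matches, so it stays escaped and is shown as literal text instead of becoming a clickable link.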

I'm also allowing inline <code> tags as well as blocks of code. For the latter, I'm using the e modifier and my code highlighting technique.
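The e modifier was deprecated in PHP 5.5 and removed in PHP 7.0, so on current versions the same idea is expressed with preg_replace_callback(). The sketch below is an assumption-laden illustration, not the code used on this site: highlight_code() is a hypothetical stand-in for the highlighting step, and the <code> pattern is a guess at how blocks might be marked.

```php
<?php

// Hypothetical stand-in for the highlighting step, which the post does not show.
function highlight_code(string $escapedCode): string
{
    return '<pre><code>' . $escapedCode . '</code></pre>';
}

// Already-escaped input containing a block of code.
$html = '&lt;code&gt;echo &quot;hi&quot;;&lt;/code&gt;';

// The callback receives the match array; $matches[1] is the escaped code.
$html = preg_replace_callback(
    '!&lt;code&gt;(.*?)&lt;/code&gt;!s',
    function (array $matches): string {
        return highlight_code($matches[1]);
    },
    $html
);

echo $html, "\n"; // <pre><code>echo &quot;hi&quot;;</code></pre>
```

Because the callback is ordinary PHP rather than a string that gets evaluated, user input is never interpreted as code.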

You can try all of this for yourself by commenting on this post, and I'll be releasing the code once it has matured a bit more.

Please let me know if you discover any problems.