About the Author

Chris Shiflett

Hi, I’m Chris: entrepreneur, community leader, husband, and father. I live and work in Boulder, CO.


Allowing HTML and Preventing XSS

One of the most common problems faced by web developers is allowing some HTML without creating XSS vulnerabilities in the process. This problem comes up more and more often due to the rise of social networking and other Web 2.0 properties that embolden users.

Sorry, I couldn't resist using the word embolden. :-)

There have been numerous solutions to this problem, some of which are pretty good. In a previous post where I casually mentioned this topic, a few people made some recommendations.

Of course, BBCode inevitably comes up during these types of discussions, but I really hate the idea of using yet another markup language just because I'm too lazy to deal with HTML, especially if the markup language doesn't even try to be user-friendly. Edward Yang, the author of HTML Purifier, seems to agree:

BBCode came to life when developers were too lazy to parse HTML correctly and decided to invent their own markup language. As with all products of laziness, the result is completely inconsistent, unstandardized, and widely adopted.

Why isn't there a good, standard solution to this problem? I think it's because everyone (including me) has slightly different requirements. Creating a solution that caters to everyone's needs is likely to yield an overly-complex and error-prone approach, so it's not necessarily bad that multiple solutions exist.

For my new blog, I want to let readers mark up their comments to help them communicate more effectively. One of the most essential features is the ability to format code, because unformatted code can be difficult to follow. It's also important that no content is removed. I detest commenting on blogs where my comment is passed through something like strip_tags(), effectively mangling what I'm trying to say. It reminds me of using an IM client that tries to identify smilies and replace them with images, often making responses difficult to decipher.

I have reviewed several existing solutions, experimented with solutions that use DOM and Tidy, and eventually resorted to a dirt-simple approach that I'd like to share with you now.

I don't recommend using this approach until it has been reviewed and vetted by others. (Use at your own risk.)

The fundamental concept is to make the content safe by default, then carefully translate specific patterns back to valid (standards-compliant) markup. The basic framework, which allows no markup, is as follows:

<?php
 
/* Normalize Newlines */
$html = str_replace("\r", "\n", $html);
$html = preg_replace("!\n\n+!", "\n", $html);
 
/* Escaped (Safe) by Default */
$html = htmlentities($html, ENT_QUOTES, 'UTF-8');
 
/* Make Paragraphs */
$lines = explode("\n", $html);
foreach ($lines as $key => $line) {
    $lines[$key] = "<p>{$line}</p>";
}
$html = implode("\n", $lines);
 
?>

This lets people type plain comments without the need for any markup, and they can still discuss anything they want without losing part of their comment. Of course, this is the easy part, because no HTML is allowed.
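To see what the framework produces, the three steps can be wrapped in a function and fed a comment that contains markup. This is only a sketch: the function name escape_to_paragraphs() and the sample input are assumptions for illustration, not part of the original code.

```php
<?php

/**
 * Hypothetical wrapper around the three steps above.
 * The function name is an assumption for illustration.
 */
function escape_to_paragraphs(string $html): string
{
    // Normalize newlines and collapse runs of them into one
    $html = str_replace("\r", "\n", $html);
    $html = preg_replace("!\n\n+!", "\n", $html);

    // Escape everything, so no user-supplied markup survives
    $html = htmlentities($html, ENT_QUOTES, 'UTF-8');

    // Wrap each remaining line in a paragraph
    $lines = explode("\n", $html);
    foreach ($lines as $key => $line) {
        $lines[$key] = "<p>{$line}</p>";
    }

    return implode("\n", $lines);
}

echo escape_to_paragraphs("I like <b>bold</b>\r\n\r\n<script>alert(1)</script>");
// <p>I like &lt;b&gt;bold&lt;/b&gt;</p>
// <p>&lt;script&gt;alert(1)&lt;/script&gt;</p>
```

Nothing the user typed is lost; the tags are simply displayed as text instead of being interpreted.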

Allowing simple tags like <em> can be accomplished like this:

<?php
 
/* Emphasized Text */
$html = preg_replace('!&lt;em&gt;(.*?)&lt;/em&gt;!m',
                     '<em>$1</em>',
                     $html);
 
?>

Keep in mind that this replacement is taking place after $html has been escaped, so whatever is matched by .*? (and represented by $1) is already escaped. I don't use greedy matching for this particular pattern, so .*? matches as little as possible to satisfy the pattern. You might prefer greedy matching, and ultimately, it only makes a difference in edge cases, such as when users want to use <em> tags as well as talk about them. Allowing users to preview comments before posting gives them the opportunity to correct any problems that arise from such cases.
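The difference between the two matching styles shows up as soon as a comment contains more than one pair of tags. The sample string below is an assumption chosen to make that visible:

```php
<?php

// Already-escaped input containing two separate <em> pairs
$html = 'a &lt;em&gt;b&lt;/em&gt; c &lt;em&gt;d&lt;/em&gt;';

// Non-greedy: each pair is converted on its own
$lazy = preg_replace('!&lt;em&gt;(.*?)&lt;/em&gt;!', '<em>$1</em>', $html);

// Greedy: the first opening tag is paired with the last closing tag
$greedy = preg_replace('!&lt;em&gt;(.*)&lt;/em&gt;!', '<em>$1</em>', $html);

echo $lazy, "\n";   // a <em>b</em> c <em>d</em>
echo $greedy, "\n"; // a <em>b&lt;/em&gt; c &lt;em&gt;d</em>
```

Either way, the text inside the resulting <em> element stays escaped, so the choice affects presentation rather than safety.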

Allowing <blockquote> is also pretty straightforward:

<?php
 
/* Blockquotes */
$html = preg_replace('!^&lt;blockquote&gt;(?:&lt;p&gt;)?(.*?)(?:&lt;\/p&gt;)?&lt;\/blockquote&gt;$!m',
                     '<blockquote><p>$1</p></blockquote>',
                     $html);
 
?>

As you can see, I want to accommodate users who forget to use <p> tags, but I want to make sure the output is valid regardless.
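Because the pattern is anchored with ^ and $, it only matches a blockquote that occupies a whole line, which is no longer the case once every line has been wrapped in <p> tags. One possible arrangement, shown here as a sketch and not as the code this blog actually runs, is to convert blockquotes before making paragraphs and then skip the lines that were already converted:

```php
<?php

// Escaped, newline-normalized input (sample text is an assumption)
$html = "First line\n&lt;blockquote&gt;Quoted text&lt;/blockquote&gt;\nLast line";

// Convert blockquotes while each one still sits on its own line
$html = preg_replace('!^&lt;blockquote&gt;(?:&lt;p&gt;)?(.*?)(?:&lt;/p&gt;)?&lt;/blockquote&gt;$!m',
                     '<blockquote><p>$1</p></blockquote>',
                     $html);

// Make paragraphs, leaving converted blockquotes unwrapped
$lines = explode("\n", $html);
foreach ($lines as $key => $line) {
    if (strpos($line, '<blockquote>') !== 0) {
        $lines[$key] = "<p>{$line}</p>";
    }
}
$html = implode("\n", $lines);

echo $html;
// <p>First line</p>
// <blockquote><p>Quoted text</p></blockquote>
// <p>Last line</p>
```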

Links are a bit trickier. Consider the following:

<?php
 
/* Links */
$html = preg_replace('!&lt;a +href=&quot;(.*?)&quot;(?: +title=&quot;(.*?)&quot;)? *&gt;(.*?)&lt;/a&gt;!m',
                     '<a href="$1" title="$2">$3</a>',
                     $html);
 
?>

The content represented by $1 is already escaped, but it has a special meaning in this context. Users who click the link text ($3) will initiate a request to the URL identified by $1. Imagine a link to javascript:alert('XSS'). Although this isn't actually XSS, the result is still undesirable, because users might be tricked into clicking a link that executes malicious JavaScript. For this reason, you might consider restricting the pattern further:

<?php
 
/* Links */
$html = preg_replace('!&lt;a +href=&quot;((?:ht|f)tps?://.*?)&quot;(?: +title=&quot;(.*?)&quot;)? *&gt;(.*?)&lt;/a&gt;!m',
                     '<a href="$1" title="$2">$3</a>',
                     $html);
 
?>
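A quick check of the restricted pattern against one acceptable link and one javascript: link might look like the following. The function name convert_links() and the example URLs are assumptions for illustration.

```php
<?php

// Hypothetical wrapper around the restricted link pattern above
function convert_links(string $html): string
{
    return preg_replace('!&lt;a +href=&quot;((?:ht|f)tps?://.*?)&quot;(?: +title=&quot;(.*?)&quot;)? *&gt;(.*?)&lt;/a&gt;!m',
                        '<a href="$1" title="$2">$3</a>',
                        $html);
}

// Both inputs are escaped first, exactly as the framework does
$good = htmlentities('<a href="http://example.org/">example</a>', ENT_QUOTES, 'UTF-8');
$bad  = htmlentities('<a href="javascript:alert(\'XSS\')">click</a>', ENT_QUOTES, 'UTF-8');

echo convert_links($good), "\n"; // <a href="http://example.org/" title="">example</a>
echo convert_links($bad), "\n";  // unchanged, so it is displayed as plain text
```

The rejected link is not removed; it simply remains escaped, which keeps with the goal of never discarding content.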

I'm also allowing inline <code> tags as well as blocks of code. For the latter, I'm using the e modifier and my code highlighting technique.
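The e modifier was deprecated in PHP 5.5 and removed in PHP 7.0, so on current versions of PHP the usual substitute is preg_replace_callback(). The following is a minimal sketch under that assumption; highlight_block() is a hypothetical stand-in, because the highlighting routine itself is not shown in this post.

```php
<?php

// Hypothetical stand-in for the highlighting routine, which the post
// does not show. It receives text that is already escaped.
function highlight_block(string $escapedCode): string
{
    return '<pre><code>' . $escapedCode . '</code></pre>';
}

$html = '&lt;code&gt;echo &quot;hi&quot;;&lt;/code&gt;';

// The s modifier lets . span newlines inside a block of code
$html = preg_replace_callback(
    '!&lt;code&gt;(.*?)&lt;/code&gt;!s',
    function (array $matches): string {
        return highlight_block($matches[1]);
    },
    $html
);

echo $html; // <pre><code>echo &quot;hi&quot;;</code></pre>
```

Unlike the e modifier, the callback receives the match as data and never evaluates it as PHP code.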

You can try all of this for yourself by commenting on this post, and I'll be releasing the code once it has matured a bit more.

Please let me know if you discover any problems.

About this post

Allowing HTML and Preventing XSS was posted on Tue, 13 Mar 2007.

67 comments

1.Hossein said:

Hello,

http://ilia.ws/archives/131-Filter-...P-5.2-news.html

Could you explain how to use this filter? You didn't answer my question about XML.

Thanks in advance.

Wed, 14 Mar 2007 at 03:20:56 GMT Link


2.Chris Shiflett said:

Hi Hossein,

I might blog about the new filter extension at some point, but that doesn't really apply to what I'm talking about in this post, because I'm not filtering anything.

I do plan to talk about filtering, validating, and sanitizing soon, because I think these words are misused frequently.

Regarding your XML question, I'm not sure what it is, but I won't necessarily know the answer. :-)

Thanks for the comment.

Wed, 14 Mar 2007 at 03:30:17 GMT Link


3.Ed Eliot said:

Nice article. The ability to insert JavaScript in the href could be construed as XSS because, although in your example with alert the result isn't particularly useful, someone could construct a link that grabs user cookies, appends them to a URL, and redirects the user on click. As you've done, the best solution is to look for and explicitly require valid HTTP/FTP addresses, thereby disallowing the JavaScript pseudo-protocol entirely.

Wed, 14 Mar 2007 at 04:04:07 GMT Link


4.Edward Yang said:

Hey, thanks for quoting me!

Your dirt-simple approach gives me the willies, but it shouldn't cause XSS problems yet. I ran a few tests using the preview feature, and there's a bit more going on under the hood than you let on. ;-)

Anyway, you'll want to fix this first. Tags need to be properly nested within each other.

Your method, actually, is a lot like the BBCode method. Part of the effectiveness of BBCode (not that it is very effective) is that you don't have to treat brackets that aren't part of any actual expression specially: they're duds to browsers. In your case, entity-ized < and > are the "new brackets": if you missed one of them, no sweat: it's already escaped.

This will guard against most XSS attacks, I suppose. But it won't work out very well when it comes to standards-compliance, especially if you decide to increase the amount of allowed tags.

Wed, 14 Mar 2007 at 04:31:09 GMT Link


5.Chris Shiflett said:

Hi Edward,

I was hoping you'd comment. I know you're very experienced in this area.

Tags need to be properly nested within each other.

You're right, and this is something my dirt-simple approach doesn't handle at this time. What does HTML Purifier do in these cases?

It seems like changing the pattern to not match < and > might keep the output standards-compliant without adding much complexity.
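One way to read that idea, sketched here as an assumption and not as tested production code, is to forbid an escaped < or > inside the captured text, so a pair of tags is only converted when nothing tag-like sits between them:

```php
<?php

// The captured text may not contain &lt; or &gt;, so overlapping or
// nested tags are left escaped instead of being converted.
$pattern = '!&lt;em&gt;((?:(?!&lt;|&gt;).)*)&lt;/em&gt;!';

$plain   = 'some &lt;em&gt;emphasis&lt;/em&gt; here';
$overlap = '&lt;em&gt;one &lt;code&gt;two&lt;/em&gt; three&lt;/code&gt;';

echo preg_replace($pattern, '<em>$1</em>', $plain), "\n";   // some <em>emphasis</em> here
echo preg_replace($pattern, '<em>$1</em>', $overlap), "\n"; // unchanged
```

The trade-off is that legitimate nesting is refused along with the overlapping kind.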

My initial focus is on allowing HTML, preventing XSS, and preserving content, but enforcing standards is also important.

Thanks for commenting.

Wed, 14 Mar 2007 at 04:45:24 GMT Link


6.Jordan said:

I have this sneaking suspicion you just wanted to show off your exclamation alert and the pull-quote element.

Of course, if I had such well designed elements on my blog I'd be tempted to throw them in randomly as well. ;-)

Wed, 14 Mar 2007 at 05:45:07 GMT Link


7.Jordan said:

Also -- looks like mailto links are currently broken (see my profile page). Intentional or one of those features you're working on but isn't currently allowed by the pattern matching?

Wed, 14 Mar 2007 at 05:49:59 GMT Link


8.Chris Shiflett said:

Hi Jordan,

looks like mailto links are currently broken

They're not broken; the mailto scheme just happens to not be one of the allowed ones (currently only http, https, ftp, and ftps).

I'll consider adding mailto to this list. :-)

Wed, 14 Mar 2007 at 05:56:08 GMT Link


9.Jakub Vrana said:

The code is very poor for the following reasons:

1. \n and \n\n are treated the same way but \n usually means <br> and \n\n <p>.

2. Running code in the presented order doesn't convert <blockquote>s, because they no longer begin at the start of a line after the Make Paragraphs part. Converting them before the Make Paragraphs part produces invalid nesting: P, BLOCKQUOTE, P.

3. It's impossible to write tags in literal (e.g. if you want to say that emphasize is opened by [em] and closed by [/em]).

4. Edward Yang already mentioned the problem with overlapping tags.

5. In XHTML, not only quotes but also apostrophes can be used to enclose attribute values.

6. In XHTML, not only space but also any other white character (\s in PCRE) can be used to separate attributes.

7. Perfectly valid <a title="" href=""> is not allowed.

8. [a href="abc" hreflang="cs"][/a] builds very nice [a href="abc&quot; hreflang=&quot;cs"][/a].

If you wish to allow syntax similar to XHTML, it should be real XHTML otherwise it's more confusing than BBCode.

I'll send you other comments to private mail.

Wed, 14 Mar 2007 at 08:36:25 GMT Link


10.Jakub Vrana said:

[9] In point 8, I wrote [a href="abcAMPquot; hreflang=AMPquot;cs"][/a]. AMPquot;s were converted to " by comment system. Moreover, http:// should be present in href of the example.
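The behavior described in point 8 can be reproduced with the link pattern from the post, and one possible tightening is to stop each attribute value at the first escaped quote. Both the sample input and the stricter pattern below are assumptions for illustration:

```php
<?php

$input = htmlentities('<a href="http://abc" hreflang="cs">x</a>', ENT_QUOTES, 'UTF-8');

// Pattern from the post: .*? is free to run past the closing &quot;
$original = '!&lt;a +href=&quot;((?:ht|f)tps?://.*?)&quot;(?: +title=&quot;(.*?)&quot;)? *&gt;(.*?)&lt;/a&gt;!m';

// Sketch of a stricter variant: attribute values may not contain &quot;
$strict = '!&lt;a +href=&quot;((?:ht|f)tps?://(?:(?!&quot;).)*)&quot;(?: +title=&quot;((?:(?!&quot;).)*)&quot;)? *&gt;(.*?)&lt;/a&gt;!';

$replacement = '<a href="$1" title="$2">$3</a>';

echo preg_replace($original, $replacement, $input), "\n";
// <a href="http://abc&quot; hreflang=&quot;cs" title="">x</a>

echo preg_replace($strict, $replacement, $input), "\n";
// unchanged, so the unsupported attribute is shown as plain text
```

The original output is still escaped and therefore not an injection, but the href no longer points where the commenter intended.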

Wed, 14 Mar 2007 at 08:50:23 GMT Link


11.myself said:

It's strange, from a "security expert" to hear that bbcodes were created because users are "lazy".

Bbcode were invented for security reasons...

Wed, 14 Mar 2007 at 12:25:02 GMT Link


12.Nico said:

At least I would replace all the new line handling with less code. You could use preg_split with /[\r\n]+/ or even simpler:

$html = preg_replace('/(?:^|[\r\n]+)([^\r\n]*)(?=[\r\n]+|$)/', "<p>\\1</p>\n", $html);
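The preg_split() variant mentioned above might look like the following sketch. The function name is an assumption, and the input is expected to be escaped already.

```php
<?php

// Hypothetical helper: split on any run of \r or \n, drop empty pieces,
// and wrap what is left in paragraphs.
function paragraphs_via_split(string $escaped): string
{
    $lines = preg_split('/[\r\n]+/', $escaped, -1, PREG_SPLIT_NO_EMPTY);

    $paragraphs = array_map(function (string $line): string {
        return "<p>{$line}</p>";
    }, $lines);

    return implode("\n", $paragraphs);
}

echo paragraphs_via_split("one\r\n\r\ntwo\n");
// <p>one</p>
// <p>two</p>
```

The PREG_SPLIT_NO_EMPTY flag also takes care of leading and trailing newlines, which would otherwise produce empty paragraphs.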

I don't know if bbcode has anything to do with security, but it's not about laziness. They are just easier to write for the average user. Good examples are [url], [img] or even [right].

PS: Without javascript the preview goes to the homepage. And I hope '+' is only lost in the preview - it's a simple space. Seems to be a url encoding problem.

Wed, 14 Mar 2007 at 12:59:12 GMT Link


13.Gabe said:

Just wanted to let you know that on my system ( XP SP2, Firefox 1.5.0.10 ) the code view for the blocks of code hides the horizontal scrollbar. If I toggle off the code view, the horizontal scrollbar appears. Use the <blockquote> code as an example.

FYI

Wed, 14 Mar 2007 at 13:16:30 GMT Link


14.Andy T said:

It's nice to see an article that actually tells people how to allow in legitimate content. Most of the stuff out there is the very simplistic "reject it if it isn't alphanumeric" type stuff that's not very helpful for beginners. In the real world we need to allow things like HTML.

Good work.

Wed, 14 Mar 2007 at 15:20:26 GMT Link


15.Chris Shiflett said:

Hi Jakub,

Thanks for the detailed critique. I'll respond to a few of your points.

\n and \n\n are treated the same way but \n usually means <br> and \n\n <p>.

True, but I'm picky, and I don't like <br>. :-) Because this is my blog, I get to decide what I allow and how I want things spaced.

Of course, changing this is easy, and no one is forced to share my opinion.

Running code in the presented order doesn't convert <blockquote>

As Edward correctly pointed out, there's a bit more going on, but I don't want to complicate the discussion with edge cases. Allowing <blockquote> just requires some extra code when it's time to make paragraphs.

It's impossible to write tags in literal (e.g. if you want to say that emphasize is opened by [em] and closed by [/em]).

I could allow HTML entities to give people more control, but I'm undecided about whether I want that. I prefer to keep it simple.

Regardless, most people have enough common sense to know that using the allowed markup in the allowed way results in it being interpreted. People discussing <em> tags can do so.

Edward Yang already mentioned the problem with overlapping tags.

I already mentioned a simple solution.

In XHTML, not only quotes but also apostrophes can be used to enclose attribute values.

I'm very picky about the markup I allow, and the style guide indicates exactly what's allowed.

If you wish to allow syntax similar to XHTML, it should be real XHTML otherwise it's more confusing than BBCode.

I don't want to allow syntax similar to XHTML; I only want to allow XHTML.

Wed, 14 Mar 2007 at 15:50:18 GMT Link


16.Jakub Vrana said:

I don't want to allow syntax similar to XHTML; I only want to allow XHTML.

But you don't allow it. XHTML valid <a href=''> doesn't work even if <a href> is supported, <a title="" href=""> doesn't work even if <a href title> is supported. It's not XHTML, it's syntax similar to XHTML thus it is confusing.

People discussing <em> tags can do so.

"Emphasized text is opened by the tag and closed by the tag ." Does it work? With the provided code, there's no way to write about EM tags. If entities would be allowed then it would be possible but it's written neither in the article nor in the Style Guide.

Wed, 14 Mar 2007 at 16:01:44 GMT Link


17.Chris Shiflett said:

Gabe wrote:

Just wanted to let you know that on my system ( XP SP2, Firefox 1.5.0.10 ) the code view for the blocks of code hides the horizontal scrollbar.

This is intentional. The scroll bar is ugly, but since it is sometimes necessary, there's a link to toggle the code view. This is also helpful for browsers that insist on prepending each line with # when pasting code.

Andy T wrote:

Good work.

Thanks, Andy!

Jakub Vrana wrote:

But you don't allow it. XHTML valid <a href=''> doesn't work even if <a href> is supported, <a title="" href=""> doesn't work even if <a href title> is supported. It's not XHTML, it's syntax similar to XHTML thus it is confusing.

Sorry to confuse you, Jakub. I allow a subset of XHTML, not all XHTML. I'm not sure why this confuses you, but if you can think of a way to make this clearer, please let me know.

Regardless, you are wrong to suggest that the allowed markup is not XHTML.

Wed, 14 Mar 2007 at 16:11:40 GMT Link


18.Edward Yang said:

Alrighty, let's see.

You're right, and this is something my dirt-simple approach doesn't handle at this time. What does HTML Purifier do in these cases?

If you want to keep the code dirt simple, my approach of decomposing the document into tags and text, and then running through a tag balancer, is not feasible. For your purposes, disallowing nested tags will probably do the trick, then.

I'll consider adding mailto to this list. :-)

Mod down! :-P There is little to no reason for email address links to be posted in blog comments.

It's strange, from a "security expert" to hear that bbcodes were created because users are "lazy"...Bbcode were invented for security reasons...

Yes, they were invented for security reasons. And yes, I stand by my assertion that developers are lazy. A naive implementation of BBCode is very small and easy to implement. However, recent and more secure versions have greatly grown in complexity. BBcode was established through laziness and ignorance, and was preserved through ubiquity. Its wide usage has nothing to do with its merits.

True, but I'm picky, and I don't like <br>. :-) Because this is my blog, I get to decide what I allow and how I want things spaced.

I tend to hold with the camp that only one line break should not break the flow of the paragraph, and two line breaks starts a new paragraph. I agree: there's no reason to need to allow <br> unless you're publishing poetry. However, support for lists would be pretty interesting.

I could allow HTML entities to give people more control, but I'm undecided about whether I want that. I prefer to keep it simple.

Since this is a web-application security blog, I believe that it is absolutely essential that the users be able to post arbitrary plaintext. I remember trying to comment in RSnake's blog, but WordPress kept on scrubbing the contents.

As a compromise, I would suggest unconditionally escaping everything between a code tag, so that [code][em]foo[/em][/code] renders as [em]foo[/em].
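With the escape-first approach from the post, the content of a code tag is already escaped, so the remaining work is to keep the later replacements from touching it. One way to do that, sketched here with placeholder tokens as an assumption and not as the blog's actual implementation, is to set code spans aside before the other patterns run:

```php
<?php

$raw = 'Use <code><em>foo</em></code> for <em>emphasis</em>.';

// Strip NUL bytes so user input cannot collide with the placeholders
$html = str_replace("\x00", '', $raw);
$html = htmlentities($html, ENT_QUOTES, 'UTF-8');

// Convert code spans first and park them behind numbered placeholders
$codes = [];
$html = preg_replace_callback(
    '!&lt;code&gt;(.*?)&lt;/code&gt;!s',
    function (array $m) use (&$codes): string {
        $codes[] = '<code>' . $m[1] . '</code>';
        return "\x00" . (count($codes) - 1) . "\x00";
    },
    $html
);

// Other replacements now only see text outside the code spans
$html = preg_replace('!&lt;em&gt;(.*?)&lt;/em&gt;!', '<em>$1</em>', $html);

// Put the code spans back
$html = preg_replace_callback(
    '!\x00(\d+)\x00!',
    function (array $m) use ($codes): string {
        return $codes[(int) $m[1]];
    },
    $html
);

echo $html;
// Use <code>&lt;em&gt;foo&lt;/em&gt;</code> for <em>emphasis</em>.
```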

Sorry to confuse you, Jakub. I allow a subset of XHTML, not all XHTML. I'm not sure why this confuses you, but if you can think of a way to make this clearer, please let me know.

This has to do with the concept of "Be liberal with what you accept, but conservative with what you output." Ideally speaking, your parser should be able to normalize all those different forms to a standard <a href="" title=""> declaration.

I'd like to write something a little more general than replies, so that'll be in another comment.

Wed, 14 Mar 2007 at 19:37:48 GMT Link


19.Edward Yang said:

I've done a bit of thinking about it, and I've come to the conclusion that HTML Purifier can be configured to have exactly the same behavior as your simple filter, albeit with a bit more support for the edge cases.

<?php
 
// Dotted directive names are the form current HTML Purifier releases expect
$config = HTMLPurifier_Config::createDefault();
 
$config->set('HTML.AllowedElements', 'a,em,blockquote,p,code,pre');
 
$config->set('HTML.AllowedAttributes', 'a.href,a.title');
 
?>

...these configurations will do the element and attribute limitations you are seeking (it will also require you to use pre for block usage). You will also need to implement a few bits of custom code:

1. Pre-processing auto-paragraphing, which can be done using preg_split, although you'll need to be careful not to do anything to the pre blocks and make sure that the <p> tags don't wrap around other elements.

2. Pre-processing pre-block armoring, essentially involves preg_replace_callback, matching the innards of <pre> and entity-izing everything in them.

3. Post-processing code beautification, after HTML Purifier's done with the output, match all <pre> tags with preg_replace_callback and run it through your source beautifier to get pretty output (you may need to de-entitize them).
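The second step, pre-block armoring, could be sketched roughly as follows. This is a standalone illustration that does not call HTML Purifier itself, and the function name armor_pre_blocks() is an assumption.

```php
<?php

// Hypothetical pre-processing step: entity-ize everything inside <pre>
// so that a later filter treats the innards as text, not markup.
function armor_pre_blocks(string $html): string
{
    return preg_replace_callback(
        '!<pre>(.*?)</pre>!is',
        function (array $m): string {
            return '<pre>' . htmlspecialchars($m[1], ENT_QUOTES, 'UTF-8') . '</pre>';
        },
        $html
    );
}

echo armor_pre_blocks('Before <pre><em>foo</em> & bar</pre> after');
// Before <pre>&lt;em&gt;foo&lt;/em&gt; &amp; bar</pre> after
```

This simple pattern only recognizes a bare <pre> tag without attributes, which fits the restricted element list above.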

Just putting another option on the table. These three functionalities are actually on the HTML Purifier 2/3 roadmap for core features, although the implementation may be a little different.

Ooh, by the way, don't forget to check the character encoding of input text! :-)

Wed, 14 Mar 2007 at 19:48:31 GMT Link


20.Jon Tan said:

Edward, thanks for your input on this, and to everyone who is contributing their considered thoughts and alternatives.

Thanks to Chris' generosity and the contributions of others, the final comments filter will go in to a web log application - currently in the final stages - that will be open source. We've talked about a comments widget for existing blog apps too.

Still a way to go, but we're getting there!

Wed, 14 Mar 2007 at 20:32:10 GMT Link


21.Ilia Alshanetsky said:

I think allowing HTML is very dangerous and any "safe-HTML" is a step back; there will be ways to inject HTML/JavaScript through it. If not due to the parser, then via use of a browser-supported syntax that does not meet the spec.

Consider things like

<block

quote> tag for example or embed \0 chars, etc...

Wed, 14 Mar 2007 at 21:08:26 GMT Link


22.Edward Yang said:

I think allowing HTML is very dangerous and any "safe-HTML" is a step back; there will be ways to inject HTML/JavaScript through it. If not due to the parser, then via use of a browser-supported syntax that does not meet the spec.

While HTML was not designed with security in mind, I don't think this mentality is correct. I've previously proposed that by insisting standards compliance, you protect yourself against browser quirks. While there wasn't that much discussion on it, I think that it is very possible to do HTML safely. Otherwise, HTML Purifier wouldn't have happened. ;-)

Anyway, as the filter stands right now, those two common vectors you mentioned won't work because Chris's regexes wouldn't match them.

Wed, 14 Mar 2007 at 21:17:34 GMT Link


23.Jacques Distler said:

Do you really expect to sanitize HTML using RegExps? That seems doomed to failure.

I explain a real solution on my blog.

Thu, 15 Mar 2007 at 05:04:41 GMT Link


24.Jacques Distler said:

OK. That worked slightly better than I thought. I'd have to think a bit ...

Thu, 15 Mar 2007 at 05:14:10 GMT Link


25.Jeremy Harnois said:

I apologize for going off-topic, but I have a question about the "More Paragraphs" code. Is there any reason you didn't use something like this:

$html = '<p>'.str_replace("\n","</p>\n<p>",$html).'</p>';

Thu, 15 Mar 2007 at 12:00:55 GMT Link


26.Larry said:

Don't hit me...

But why not use PHP's strip_tags() and the "allowed_tags" option?

Thu, 15 Mar 2007 at 12:04:22 GMT Link


27.Paul Reinheimer said:

The strip_tags() function, when combined with allowed tags, doesn't do much for you in terms of security. You would be able to do something like <b onClick="evil-stuff-here"> in a bold tag if it's allowed. Tag allowances allow for attributes, many of which are evil :(
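A short demonstration of that behavior, using a made-up input string:

```php
<?php

$input = '<b onclick="evil()">bold</b> <script>alert(1)</script>';

// <b> is allowed, so the whole tag survives, attributes included.
// The <script> tags are stripped, but their text content is kept.
$output = strip_tags($input, '<b>');

echo $output; // <b onclick="evil()">bold</b> alert(1)
```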

Thu, 15 Mar 2007 at 17:56:03 GMT Link


28.Edward Yang said:

I explain a real solution on my blog.

It's an interesting bit of code, and definitely a step in the right direction, but it's not going to give you standards compliant code, and is no good for Shiflett's situation: why the heck would we need SVG pictures in our blog comments ;-)

I apologize for going off-topic, but I have a question about the "More Paragraphs" code. Is there any reason you didn't use something like this:

There's a number of methods for auto-paragraphing code. Your particular implementation would add oodles of empty paragraph tags when there are double-spaces, although this can be corrected.
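The empty paragraphs are easy to reproduce, and collapsing runs of newlines first (as the Normalize Newlines step in the post does) is one way to correct them. The sample string is an assumption:

```php
<?php

$html = "one\n\ntwo";

// The one-liner applied directly: the blank line becomes an empty paragraph
$direct = '<p>' . str_replace("\n", "</p>\n<p>", $html) . '</p>';

// Collapsing runs of newlines first avoids the empty paragraph
$collapsed = preg_replace("!\n\n+!", "\n", $html);
$fixed = '<p>' . str_replace("\n", "</p>\n<p>", $collapsed) . '</p>';

echo $direct, "\n\n", $fixed, "\n";
// <p>one</p>
// <p></p>
// <p>two</p>
//
// <p>one</p>
// <p>two</p>
```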

Thu, 15 Mar 2007 at 20:36:24 GMT Link


29.Jacques Distler said:

but it's not going to give you standards compliant code,

The Wiki software it was written for is served as application/xhtml+xml to compatible browsers. It, therefore must be well-formed XHTML at all times. And, indeed, since all user-content is parsed to a tree, and then serialized by REXML, the output is unfailingly well-formed.

Ensuring well-formedness is, however, not the purpose of XSS-sanitization.

Despite the extremely limited support for entering "XHTML" (you can't even enter a link to Bugzilla, let alone use single quotes for attributes, or type an unordered list or ...), I came within one character of an XSS exploit in the comment above. And that was without really trying very hard. Anyone who attempts to extend this code to include a more realistic subset of "XHTML" runs the risk of opening a security hole.

why the heck would we need SVG pictures in our blog comments ;-)

Why not? I allow MathML (but not, currently, SVG) in my blog comments. And the Wiki certainly allows both MathML and SVG, in addition to an actual reasonable subset of XHTML (as opposed to "XHTML," which only vaguely looks like XHTML).

Thu, 15 Mar 2007 at 21:03:01 GMT Link


30.Edward Yang said:

therefore must be well-formed XHTML

Call me pedantic, but well-formed != standards compliant. Otherwise, I could have cut out more than two thirds of the code in HTML Purifier ;-) Well-formedness is actually the easy part: as long as you can parse HTML into a DOM, you can serialize it back into a well-formed XML document no problem'o. You're absolutely correct in this respect.

But it is trivially easy to cause your wiki pages to fail validation. My favorite example is <strong><div>No block elements in inline context!</div></strong>, demonstrating the lack of child validation.

Despite the extremely limited support for entering "XHTML" (you can't even enter a link to Bugzilla, let alone use single quotes for attributes, or type an unordered list or ...), I came within one character of an XSS exploit in the comment above. And that was without really trying very hard. Anyone who attempts to extend this code to include a more realistic subset of "XHTML" runs the risk of opening a security hole.

Oh yes, of course. Chris Shiflett is a smart guy, and I trust that he has no illusions about the scalability of this approach. But if you get to that point, use HTML Purifier. ;-) Even though it's written in a different language, I strongly encourage you to go check it out.

And yes, I agree the lack of lists is a bummer. I also don't think implementing them will be possible with this paradigm, unless you resort to something really hacky.

Why not? I allow MathML (but not, currently, SVG) in my blog comments. And the Wiki certainly allows both MathML and SVG, in addition to an actual reasonable subset of XHTML

Because this is not a graphic design blog, this is a web application security blog. ;-) Plus, I am of the school that SVG should not be embedded in pages and should be treated as separate files. I am also of the school that until Internet Explorer gets reasonable SVG support, you must rasterize it into a format like PNG. I love MediaWiki's (the software that runs Wikipedia) implementation of this stuff.

(as opposed to "XHTML," which only vaguely looks like XHTML).

The moment we tossed in auto-paragraphing, code formatting and dual block/inline <code> tags, it stopped being a subset of XHTML. But limiting to double-quotes, limiting attribute ordering and limiting tags fits perfectly into the idea of a subset of XHTML behavior. Whether or not that behavior involves only tags is another story. ;-)

BTW, what do you mean by Bugzilla links?

Thu, 15 Mar 2007 at 21:53:52 GMT Link


31.Jordan said:

Mod down! :-P There is little to no reason for email address links to be posted in blog comments.

@Edward: See my original comment -- I wasn't suggesting it for all comments, but it does seem appropriate for the personal profile page where I was trying to use it, doesn't it?

http://shiflett.org/community/members/jordan

Thu, 15 Mar 2007 at 23:25:42 GMT Link


32.Jacques Distler said:

Call me pedantic, but well-formed != standards compliant.

Of course: well-formed⊆valid⊆conformant.

[No utf-8, eh? That's real internationalization.]

On my blog, all comments are run through a local copy of the W3C Validator. So all 7500+ comments are guaranteed to be valid XHTML+MathML. This does not, in any way, prevent commenters from entering nonconformant markup.

(Indeed, since XML DTDs are less expressive than SGML DTDs, there are conformance requirements that are enforced by the W3C Validator on HTML4 documents that are not enforced on XHTML documents.)

For the Wiki software, there's a very nice extended Markdown syntax (which generates <table>s and footnotes and definition-lists and allows one to add attributes to any of the Markdown-generated markup and ...), that users are strongly encouraged to use.

This does not prevent them from entering well-formed, but invalid XHTML markup by hand, but generally they don't. And the extra usability tax involved in ensuring that the content is not merely well-formed, but actually valid exceeds whatever marginal benefit might thereby be accrued.

But if you get to that point, use HTML Purifier. ;-) Even though it's written in a different language, I strongly encourage you to go check it out.

From a cursory perusal, it looks quite nice.

It's case-insensitive and not namespace-aware (to be fair, namespace prefixes are hard-coded in my code, which is to say it isn't really namespace-aware either). Which would make it less than useful for XHTML (and/or MathML and/or SVG). But for HTML4, it looks like it should work quite nicely.

Plus, I am of the school that SVG should not be embedded in pages and should be treated as separate files.

Either way, if you want to allow SVG on a Wiki, you have to sanitize it. And allowing it inline is a lot less dangerous than allowing the <object> element to include it from a separate file.

(Wiki syntax allows file-includes along the lines of [[!include file.svg]]. Either way, that counts as inline.)

Fri, 16 Mar 2007 at 02:37:09 GMT Link


33.Jacques Distler said:

I wrote:

[No utf-8, eh? That's real internationalization.]

Apparently not in the Preview, but it's OK in the published comment? Strange...

Fri, 16 Mar 2007 at 02:43:47 GMT Link


34.Edward Z Yang said:

(Not posting logged in, since somehow my OpenID account on this website got really messed up, and I get served blank pages for all the blog posts).

See my original comment -- I wasn't suggesting it for all comments, but it does seem appropriate for the personal profile page where I was trying to use it, doesn't it?

::grins sheepishly:: Completely missed it. A profile page definitely calls for a different set of allowed elements.

Although... there's always spam-bots...

Apparently not in the Preview, but it's OK in the published comment? Strange...

It's a bug. The AJAX call must not be UTF-8 safe. Shiflett's blog is served in UTF-8, however, so that's why it worked in the actual comment.

Of course: well-formed⊆valid⊆conformant.

Agreed. Beg and plead all we want, but we cannot read the minds of our users and transform their tag soup into semantically correct markup. Well, not until we build neural interfaces into these computers!

On my blog, all comments are run through a local copy of the W3C Validator. So all 7500+ comments are guaranteed to be valid XHTML+MathML.

I'm partial to that approach. On one hand, it's very easy to implement. On the other hand, it's not very user-friendly. (But it's all about audience)

Indeed, since XML DTDs are less expressive than SGML DTDs, there are conformance requirements that are enforced by the W3C Validator on HTML4 documents that are not enforced on XHTML documents.

I stopped trusting the DTDs a long time ago. Things like SGML exclusions and the chameleon nature of <ins> and <del> elements makes things very silly unless you pay attention to the specs very closely and write bundles of custom code. Thankfully, that task has not been too difficult.

For the Wiki software, there's a very nice extended Markdown syntax [snip]

Yeah, raw HTML's not very user-friendly. But WYSIWYG editors are so shiny. :-)

I personally write all my documents in HTML, with a text-editor. This gives me fine-grained control over every aspect of my document without having to wrangle with a markup language's parser quirks or switching back to HTML just to do a remotely complicated data-structure. This approach doesn't work for everyone, but it's what I do.

From a cursory perusal, it looks quite nice.

Thanks!

It's case-insensitive and not namespace-aware (to be fair, namespace prefixes are hard-coded in my code, which is to say it isn't really namespace-aware either). Which would make it less than useful for XHTML (and/or MathML and/or SVG). But for HTML4, it looks like it should work quite nicely.

Its parsing routines are very much founded in the conventions of HTML, where when it was XML-ized as XHTML they decided to lowercase everything. Namespaces are a bit more tricky, because they're fluid. You classically write <xsl:stylesheet> but you just as well could write <i:stylesheet> as long as you set the xmlns properly. Until I get that right, I'll stick to just supporting xml:lang (the only namespaced attribute I allow!)

Either way, if you want to allow SVG on a Wiki, you have to sanitize it. And allowing it inline is a lot less dangerous than allowing the <object> element to include it from a separate file. (Wiki syntax allows file-includes along the lines of [[!include file.svg]]. Either way, that counts as inline.)

Object tags get a bad rap for being very dangerous. But as long as you carefully validate the data attribute to ensure that it is a valid file-include (and not a request for a file that lies on a remote server; that's nigh impossible to guard against!), attach the appropriate type, and sanitize the separate file sufficiently (you are sanitizing the SVG files on the server, right?) they can be quite harmless. Although, once again, Internet Explorer does not support SVG, so you'll need to offer rasterized versions.

Fri, 16 Mar 2007 at 03:24:12 GMT Link


35.Jacques Distler said:

Object tags get a bad rap for being very dangerous. But as long as you carefully validate the data attribute to ensure that it is a valid file-include (and not a request for a file that lies on a remote server; that's nigh impossible to guard against!), attach the appropriate type, and sanitize the separate file sufficiently ...

That was my impression, too. But, as a first go-around, I figured I would go along with at least that part of the conventional wisdom. Perhaps <object> sanitization will be in a future version.

(you are sanitizing the SVG files on the server, right?)

Of course! That's why there are those white-lists of SVG elements and attributes. XML is XML and CSS is CSS, though, so the sanitization logic is exactly the same.

Although, once again, Internet Explorer does not support SVG.

As with MathML, it does through a plugin.

Pushing these technologies may not be "mainstream." But IE users are not completely shut out.

Fri, 16 Mar 2007 at 03:55:55 GMT Link


36.Jakub Vrana said:

Regardless, you are wrong to suggest that the allowed markup is not XHTML.

I gave you three reasons why your syntax is not XHTML at all:

1. Apostrophes are allowed in XHTML, not in your code.

2. Attributes are allowed in any order in XHTML, not in your code.

3. Any white character is allowed to separate attributes, only spaces in your code.

Your code is not even a subpart of XHTML (as only some tags are allowed), it's just a different language.
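The three objections above are easy to reproduce with any strict tag pattern. The following is a minimal sketch, not the pattern this blog actually uses: STRICT_ANCHOR and matches_strict_anchor() are hypothetical names, and the pattern is an assumption chosen only to show double quotes, fixed attribute order, and single spaces being enforced.

```php
<?php
// Hypothetical strict pattern (an assumption, not the blog's real code):
// double quotes only, href before title, exactly one space between attributes.
const STRICT_ANCHOR = '#\A<a href="([^"<>]*)" title="([^"<>]*)">\z#';

function matches_strict_anchor(string $tag): bool
{
    return preg_match(STRICT_ANCHOR, $tag) === 1;
}

// The canonical form is accepted.
$canonical = matches_strict_anchor('<a href="/x" title="t">');

// 1. Apostrophes are valid XHTML attribute delimiters, but are rejected here.
$apostrophes = matches_strict_anchor("<a href='/x' title='t'>");

// 2. Attribute order is free in XHTML, but a different order is rejected here.
$reordered = matches_strict_anchor('<a title="t" href="/x">');

// 3. A tab is valid whitespace between attributes, but is rejected here.
$tabbed = matches_strict_anchor("<a href=\"/x\"\ttitle=\"t\">");
```

Everything the strict pattern rejects is simply left escaped, so the trade-off is usability rather than safety.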

It seems like changing the pattern to not match < and > might keep the output standards-compliant without adding much complexity.

Your "simple solution" is not going to work because perfectly valid [a][em][/em][/a] would not be recognized by your new code.

Since this is a web-application security blog, I believe that it is absolutely essential that the users be able to post arbitrary plaintext.

And this is unfortunately not the case of this blog. If I want to write about entities then they are silently converted (it's not documented anywhere). Furthermore, preview works in a different way than actual processing (and doesn't work without JavaScript at all).

Fri, 16 Mar 2007 at 10:23:28 GMT Link


37.Jeremy Harnois said:

There's a number of methods for auto-paragraphing code. Your particular implementation would add oodles of empty paragraph tags when there are double-spaces, although this can be corrected.

Independently yes, but Chris' code handles this in the "Normalize Newlines" section. I was curious to know if there was a reason he went with code that created additional variables and looped versus a more simplistic approach.

Fri, 16 Mar 2007 at 13:30:02 GMT Link


38.Edward Yang said:

Perhaps <object> sanitization will be in a future version.

It'll be tough. Object is probably the most complex element in the entire HTML spec.

As with MathML, it does through a plugin. Pushing these technologies may not be "mainstream." But IE users are not completely shut out.

I am aware that both functionalities are supported by plugins. However, it is so much nicer when things are natively supported. The average Joe is not, without a lot of prodding, going to install a MathML plugin just to see a few equations (then again, he probably has no business seeing them anyway).

Your code is not even a subpart of XHTML (as only some tags are allowed), it's just a different language.

Let's look at it a different way. Is valid Shiflett-code valid XHTML? If we ignore paragraphing and the <pre> and <code> duality, yes.

Whether or not this is sanctioned by W3C as a valid subset is a different matter.

Your "simple solution" is not going to work because perfectly valid [a][em][/em][/a] would not be recognized by your new code.

Voila, another edge case. Personally, I think that it's a reasonable tradeoff (you can do without an emphasis in link text!), but they do accumulate.

And this is unfortunately not the case of this blog. If I want to write about entities then they are silently converted (it's not documented anywhere).

Agreed. The style guide should mention that character entities are not supported, and if you want those characters you have to type in the literal Unicode character (which I personally favor).

Furthermore, preview works in different way than actual processing (and doesn't work without JavaScript at all).

Actually, I think it does work the same way. You can't inspect the source code though. Preview without JavaScript would be nice. :-)

Independently yes, but Chris' code handles this in the "Normalize Newlines" section. I was curious to know if there was a reason he went with code that created additional variables and looped versus a more simplistic approach.

You got me. The normalization code would protect against that. I guess he's doing some other checks on each line of the code, maybe that's how he ensures <blockquote>s don't get wrapped in p tags.
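To make the interaction between newline normalization and auto-paragraphing concrete, here is a minimal sketch. It is not Chris's code: auto_paragraph() is a hypothetical name, and it assumes the text has already been escaped and that block-level elements such as <blockquote> are handled elsewhere.

```php
<?php
// Hypothetical helper: wrap blank-line-separated blocks in <p> tags.
// Assumes $text is already escaped; <blockquote> handling is out of scope.
function auto_paragraph(string $text): string
{
    // Normalize Windows (\r\n) and old Mac (\r) newlines to \n first.
    $text = str_replace(["\r\n", "\r"], "\n", $text);

    // Treat any run of two or more newlines as ONE paragraph break,
    // so extra blank lines never produce empty <p></p> pairs.
    $blocks = preg_split('/\n{2,}/', trim($text));

    $paragraphs = [];
    foreach ($blocks as $block) {
        if ($block !== '') {
            $paragraphs[] = '<p>' . $block . '</p>';
        }
    }

    return implode("\n", $paragraphs);
}

$html = auto_paragraph("first\r\n\r\n\r\nsecond");
```

Collapsing runs of newlines in the split pattern is what prevents the empty paragraphs mentioned above; a per-line loop would only be needed for extra checks on each line.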

Fri, 16 Mar 2007 at 17:22:02 GMT Link


39.Larry said:

I've thought through this quite a bit and have been reading this thread with interest.

I've concluded that really, bbcode is the simplest and most straightforward answer. There is no standards issue, you convert the user supplied data to the same, standards compliant html every time; and everything else is escaped by default.

It would seem lame (or lazy, as you say) to use an alternate markup language only if it wasn't pretty much already an accepted markup for this use. Even people I know with no knowledge of web programming know how to use bbcode in a forum because of the prevalence of things like phpBB.

People who are used to using <i> and <b> but don't really know html also find it pretty obvious to use [i] and [b] tags as an alternate.

I really can't see the point in doing all this to support some exact use of html with all the issues discussed here when you can just use bbcode.

I must really be missing something. (always possible).
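The "escape everything, then convert known codes" idea described above can be sketched in a few lines. This is a deliberately tiny illustration, not phpBB's or PEAR's parser: bbcode_to_html() and its two-entry tag map are assumptions, and it does nothing about overlapping or badly nested codes, which is where real BBCode libraries grow large.

```php
<?php
// Hypothetical minimal converter: escape first, then translate two BBCodes.
function bbcode_to_html(string $input): string
{
    // Everything is escaped by default, so raw HTML can never get through.
    $html = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

    // Assumed mapping from BBCode names to HTML element names.
    $map = ['b' => 'strong', 'i' => 'em'];

    foreach ($map as $code => $tag) {
        // Non-greedy match so each opening code pairs with the nearest close.
        $pattern = '#\[' . $code . '\](.*?)\[/' . $code . '\]#s';
        $html = preg_replace($pattern, '<' . $tag . '>$1</' . $tag . '>', $html);
    }

    return $html;
}

$html = bbcode_to_html('[b]bold[/b] and <script>');
```

An unmatched code such as a lone [b] is left as literal text, which keeps the output well-formed for simple input.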

Fri, 16 Mar 2007 at 22:54:35 GMT Link


40.Edward Yang said:

I must really be missing something. (always possible).

I think (and this is an assumption on my part) that you've never seen a BBCode library before. They're huge; the PEAR parser and the phpBB BBCode parser are cases in point.

Sun, 18 Mar 2007 at 02:57:37 GMT Link


41.Chris Shiflett said:

Edward Yang wrote:

I've previously proposed that by insisting standards compliance, you protect yourself against browser quirks.

I agree, and I'm not aware of any evidence to the contrary.

Have you considered compiling a list of last year's browser-related vulnerabilities to see what percentage of them require quirks mode?

Jacques Distler wrote:

Do you really expect to sanitize HTML using RegExps?

Yes I do, although I wouldn't describe any of this as sanitizing.

As an aside, Andrei has some slides available from a great talk he gives on regular expressions.

Jacques Distler wrote:

I came within one character of an XSS exploit in the comment above.

Are you playing horseshoes?

Please use the preview feature for testing exploits, and I promise to take your criticism more seriously if you discover any problems.

Edward Yang wrote:

The AJAX call must not be UTF-8 safe. Shiflett's blog is served in UTF-8, however, so that's why it worked in the actual comment.

The response to the Ajax request indicates the proper character encoding:

Content-Type: text/html; charset=utf-8

If Internet Explorer (I'm just making an educated guess) misinterprets this, then there is potentially a new XSS attack vector for IE. That doesn't relate directly to the current topic, but it's interesting nonetheless.

Jakub Vrana wrote:

I gave you three reasons why your syntax is not XHTML at all.

You tried, yes, but your logic is flawed. I've considered blogging about logic in the past, but for now, there's a Wikipedia article that describes the specific flaw in your argument.

Edward already posed a better argument, which is that the code blocks aren't valid XHTML. Another example is that I let people use <blockquote> without paragraphs. In fact, I don't even require paragraphs at all.

Jeremy Harnois wrote:

I was curious to know if there was a reason he went with code that created additional variables and looped versus a more simplistic approach.

I do more than create paragraphs, but that's a topic I omitted from the discussion.

Edward Yang wrote:

Preview without JavaScript would be nice. :-)

You can disable JavaScript to force a non-JS preview.

Sun, 18 Mar 2007 at 20:58:23 GMT Link


42.Jakub Vrana said:

your logic is flawed

My logic is brilliant, the misunderstanding came from different use of "is". I used it as "has equivalent syntax as", you used it as "produces". In other words, I used it as equivalence, you used it as implication.

You can disable JavaScript to force a non-JS preview.

Did you even try it in any browser? Two people (me and Nico) reported to you that it doesn't work.

Mon, 19 Mar 2007 at 10:11:45 GMT Link


43.Chris Shiflett said:

Nico wrote:

Without javascript the preview goes to the homepage.

This has been fixed. Thanks for pointing it out.

Nico wrote:

I hope '+' is only lost in the preview - it's a simple space. Seems to be a url encoding problem.

You're right. At first glance, it seems that I am misunderstanding how encodeURI() works, because it is not encoding the +. Here's a simple test I just tried:

<script>alert(encodeURI('1+1=2'));</script>

Jakub Vrana wrote:

Two people (me and Nico) reported to you that it doesn't work.

I missed that part of Nico's comment. Sorry, Nico!

Mon, 19 Mar 2007 at 11:52:51 GMT Link


44.dgx said:

Hello Chris!

Do you know Texy?

It is a complex solution for formatting comments (and blog posts, etc.). Yes, it has its own intuitive markup, but that is not the point. You can use only HTML tags.

The main benefit of Texy is that it is bulletproof: it ensures the well-formedness of the resulting code. Look at this.

The added value is support for typography rules. Most people don't know how to write typographically correct quotes, dashes, and ellipses; Texy does.

It also divides very long words with respect for language rules (see the HTML code), can be used together with a syntax highlighter, etc.

Texy is highly configurable: allowed tags, classes, IDs, support for rel-nofollow, etc. The English documentation is not very good, I know.

Wed, 21 Mar 2007 at 14:24:51 GMT Link


45.Andrew Millne said:

I may be missing something, but why hasn't something like this been created based on a whitelist? I'm probably oversimplifying as I'm only at an intermediate level, but what is wrong with having, say, a list of allowed tags and associated allowed attributes?

That way you could allow a <p> tag and associated attributes like class, id, etc., but strip anything else, all using regular expressions.
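A whitelist along these lines can be sketched as follows. Note the swap: instead of stripping whatever is not allowed, this sketch escapes everything and then re-enables the allowed tags, so no part of a comment is removed. allow_basic_tags() and its tag list are hypothetical, and attributes are deliberately not supported; each allowed attribute would need its own carefully written pattern.

```php
<?php
// Hypothetical whitelist filter: escape everything, then restore a few
// attribute-free tags. Anything not on the list stays visible but inert.
function allow_basic_tags(string $input): string
{
    $html = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

    // Assumed whitelist; only exact, attribute-free forms are restored.
    $allowed = 'em|strong|code';

    return preg_replace(
        '#&lt;(/?)(' . $allowed . ')&gt;#',
        '<$1$2>',
        $html
    );
}

$html = allow_basic_tags('<em>hi</em> <script>alert(1)</script>');
```

Because the pattern only matches the exact escaped form of an allowed tag, something like <em onclick="..."> is not restored and is displayed as text instead.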

Like I say I'm no expert but to a novice like me it seems that this should be an easy problem to solve and I am curious as to why it never seems to come to any form of conclusion.

If somebody could explain the finer points I am missing then maybe I could advance my knowledge of PHP security.

Thanks

Sun, 08 Apr 2007 at 12:58:40 GMT Link


46.Lorenzo Campanis said:

I was just wondering...

Wouldn't it be simpler to just use a user-friendly WYSIWYG editor for input texts?

Reading most of the comments, I believe it could be a lot easier to just have PHP strip any possible XHTML code that was inserted by the user.

This way everything would be consistent throughout, and the desired formatting would be kept intact... Cross-browser compatibility has become a lot more flexible nowadays. Thank God!

Just asking though, why wouldn't that be better, at the end of the day?

PS: Chris, I really enjoyed your speech in Madison Av. this week, Good job! ;)

Thu, 26 Apr 2007 at 00:29:05 GMT Link


47.Chris Shiflett said:

I believe it could be a lot easier to just have PHP strip any possible XHTML code that was inserted by the user.

I don't really like the idea of mangling someone's comment, just because they have chosen to discuss something that I'm too lazy to deal with.

Although writing your own solution might seem like a hassle, using HTML Purifier is easy. :-)

Chris, I really enjoyed your speech in Madison Av. this week, Good job!

Thanks a lot, Lorenzo!

Thu, 26 Apr 2007 at 00:37:26 GMT Link


48.Coffinboy said:

Hi Chris, I've managed to painlessly fix the blockquote-issue Jakub mentioned earlier.

It seems the preview wasn't too happy with my RegExp so I posted it at pastebin.

Thu, 12 Jul 2007 at 14:36:13 GMT Link


49.Wes Mahler said:

Chris,

Dropped you an email, but just wanted to make sure it got on here:

This is from a post on php.net - thoughts?

-- comment on php.net --

Okay, so maybe this SHOULD be posted under Urlencode, but there's more talk of foiling XSS attacks here than there, so…

Be VERY careful validating submitted data not to miss something. By that I mean EVERYTHING passed in the $_POST array, including keys (the names of the form fields themselves) is susceptible to XSS attacks. Any hack can add whatever they want to your form and submit it to your script:

<input type="hidden" name="<script>alert('…the form_fields_NAMES can get you, too!');</script>" value="We all validate form_field_VALUES, but…">

Step one of course is to adopt a sensible naming convention for your form fields, to wit: name="always_lower_case" (underscores do NOT get encoded because they are valid URL characters). So, you should never find a "%" in one of your form field NAMES. Here's what I do:

foreach($_POST as $key => $val) {

// scrubbing the field NAME...

if(preg_match('/%/', urlencode($key)*)) die('FATAL::XSS hack attempt detected. Your IP has been logged.');

// okay, got here, now scrubbing the field VALUE...

[ scrub $val here by using htmlentities or a custom replacement function ];

...;

}

* %3Cscript%3Ealert%28%27%85the+form_fields_NAMES+can+get+you%...

P.S. Yes, remove the asterisk!
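For reference, the same idea can be written as a small runnable sketch. This swaps the quoted urlencode()-and-percent test for an explicit allow-list pattern, which is a different technique with the same goal. has_only_safe_keys() is a hypothetical name, and the lower_case_with_underscores convention is taken from the quoted comment.

```php
<?php
// Hypothetical check: every field NAME must follow the naming convention
// (lowercase letters, digits, underscores). Values still need escaping
// separately at output time.
function has_only_safe_keys(array $fields): bool
{
    foreach (array_keys($fields) as $key) {
        // Numeric-looking keys arrive as integers, hence the cast.
        if (preg_match('/\A[a-z0-9_]+\z/', (string) $key) !== 1) {
            return false;
        }
    }

    return true;
}

$ok  = has_only_safe_keys(['comment' => 'hello', 'user_name' => 'x']);
$bad = has_only_safe_keys(['<script>alert(1)</script>' => 'value']);
```

In a real script this would be called with $_POST. Escaping keys whenever they are echoed remains necessary either way.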

Thu, 20 Sep 2007 at 12:13:20 GMT Link


50.Santosh Patnaik said:

htmLawed is a new HTML purifier/filter PHP script (single 45 kb file) like HTML Purifier. It is highly customizable, and has many features including good anti-XSS capability.

Sat, 03 Nov 2007 at 18:45:09 GMT Link


51.yop said:

Iñtërnâçiônàlizæçiøn

Iñtërnâçiônàlizæçiøn

Iñtërnâçiônàlizæçiøn

Iñtërnâçiônàlizæçiøn

Iñtërnâçiônàlizæçiøn

Wed, 14 Nov 2007 at 04:03:57 GMT Link


52.Josh Stodola said:

Could somebody please port this to .NET?

After all, it is only a matter of time before ASP.NET is used a helluva lot more than PHP (for painfully obvious reasons), so please, somebody smarter than myself that understands both languages please convert it for the rest of the developer world to utilize! Best regards...

Mon, 17 Dec 2007 at 21:41:21 GMT Link


53.John Sharp said:

Reading most of the comments, I believe it could be a lot easier to just have PHP strip any possible XHTML code that was inserted by the user.

Sat, 29 Dec 2007 at 15:40:14 GMT Link


54.Alex Perry said:

Brilliant! I always used strip_tags, I'll get out of the habit I guess.

Thu, 24 Jan 2008 at 07:44:21 GMT Link


55.Mark Berne said:

I don't understand ... Why not use PHP's strip_tags() and the "allowed_tags" option?

Sat, 23 Feb 2008 at 09:38:42 GMT Link


56.Brad Jasper said:

#55, Exactly.

Strip tags with allowed_tags accomplishes the exact same thing.

Not to mention it's already built in to PHP...

Am I missing something here?

Tue, 22 Apr 2008 at 19:25:25 GMT Link


57.Chris Shiflett said:

Mark and Brad,

I know there are a lot of comments now, so it's hard to keep them organized, but I think this has already been answered pretty succinctly by Paul.

A brutally honest summary is that using strip_tags() in the way you suggest creates XSS vulnerabilities, which is precisely the topic of this post.
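A short sketch shows the problem. strip_tags() removes tags that are not in the allowed list, but it leaves the attributes of allowed tags untouched, so an event handler rides along. The sample comment string is made up for illustration.

```php
<?php
// Made-up hostile comment: an allowed tag carrying an event handler.
$comment = '<b onmouseover="alert(1)">hover me</b><script>alert(2)</script>';

// The <script> tags are removed, but the onmouseover attribute on <b> survives.
$filtered = strip_tags($comment, '<b>');

// Escaping instead keeps the whole comment visible and inert.
$escaped = htmlspecialchars($comment, ENT_QUOTES, 'UTF-8');
```

The filtered string still contains executable markup, while the escaped string contains no raw angle brackets at all.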

Hope that helps.

Wed, 23 Apr 2008 at 01:35:55 GMT Link


58.Muttley said:

Thanks for this, Shiffers. I've been working on a similar thing, using a similar method, so it's nice to know that I'm on the right trail. I hadn't considered the protocols for the links, so that could save me a few brown points.

Bonsoir.

Sat, 10 May 2008 at 07:52:50 GMT Link


59.Geld Lenen said:

Hi Chris, I really like your blog design and I want to use it on my personal Geld Lenen site.

Where did you get it, or did you create it by yourself?

Mon, 26 May 2008 at 13:55:32 GMT Link


60.Ash Searle said:

It might be worth changing your example code from using htmlentities to htmlspecialchars.

Running text through htmlentities often leads to problems when content ends up in XML feeds where html-entities mean nothing.
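The difference is easy to see side by side. This is a small sketch with a made-up sample string; the accented character is written as UTF-8 bytes so the example does not depend on the file's encoding.

```php
<?php
// "café & <em>" with the é spelled out as UTF-8 bytes.
$text = "caf\xC3\xA9 & <em>";

// htmlentities() converts every character that has a named entity,
// including é, which an XML feed will not understand as &eacute;.
$entities = htmlentities($text, ENT_QUOTES, 'UTF-8');

// htmlspecialchars() converts only the characters that are special in
// markup, leaving é as a literal UTF-8 character.
$special = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
```

Both calls neutralize the markup; only the first one introduces HTML-specific named entities.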

I was going to mention some examples like eacute; etc. (i.e. accent characters in European languages); sadly, I can't figure out how to get an ampersand past your filter...

Sat, 05 Jul 2008 at 13:53:51 GMT Link


61.DB said:

If you just want to allow paragraph input in a text area it's easier to use PHP's nl2br() function (http://php.net/nl2br). This function, available in PHP 4 and up, automatically converts new lines to HTML line breaks.

For Example (PHP CODE):

//prevent xss by stripping all HTML(php.net/strip_tags)

$input = strip_tags($_POST['fieldname'], "");

//convert \n to <br>

$input = nl2br($input);

Note: this does not work for stripping XSS from a WYSIWYG editor while keeping non-XSS HTML.

Mon, 08 Sep 2008 at 15:06:14 GMT Link


62.Chris Shiflett said:

Hi DB,

The only thing nl2br() does is prepend newlines with <br /> tags.

Combining this with strip_tags() helps prevent XSS, and depending upon context, it can be pretty effective. However, it is an imperfect solution to the problem, and more importantly, it mangles comments. In fact, several of the comments above would be mangled using your code.

So, while I agree that your approach is easier, it is a poor solution. If you want a better example that incorporates nl2br(), try this:

<?php
 
header('Content-Type: text/html; charset=UTF-8');
$html = array();
 
$html['comment'] = nl2br(htmlentities($_POST['comment'], ENT_QUOTES, 'UTF-8'));
 
?>

Like your example, this also does not allow HTML, but it offers better protection against XSS and doesn't mangle comments.

Mon, 08 Sep 2008 at 15:23:13 GMT Link


63.Matthew Bonner said:

I have to say that I think you have started off on the wrong foot.

You are starting by filtering, where you really want to start by splitting.

Once you have split everything up into a tree structure of HTML you should then go through each element filtering out what you consider dangerous.

I hate converting content into 1 thing, to convert it into something else. I know it sometimes has to be done, especially with the likes of < PHP 6 but hopefully when PHP 6 comes out it will answer more of our prayers.

I think br tags are ugly and I agree that using them for spacing paragraphs is wrong. So wrong that in fact they are not br tags in XHTML 2.

That leads me onto the unfinalised XHTML 2 which, when it is finalised, will cause more headaches. I have tried to come up with a solution for years and due to my lack of knowledge when it comes to regular expressions I think I am fighting a losing battle.

I truly think the answer to this would be using some really clever regular expressions to firstly break up input into a tree or use some of the built-in XML functions to do so:

http://uk.php.net/manual/en/functio...into-struct.php

Then, to slow your page down and complicate things further, more regular expressions to remove elements that are dangerous or bloody annoying. I am debating how good PHP's XML functions are, so to be even more of a pain, if all else fails, filter more before using the XML functions.

This is a topic that I think will never end, the solution will be one that is hard to find, and coming up with something that gives enough options without being the size of HTML Purifier will probably be even more impossible.

But at least I agree on 2 things, HTML should be allowed in user input and bbcode is the laziest and most inconsistent popular thing I have seen in the programming community.

Tue, 16 Sep 2008 at 22:30:32 GMT Link


64.Roman said:

I have a class that uses whitelists for tags, attributes and protocols, and ensures that all tags are opened and closed in a sane fashion. It comments out the rest of the symbols using htmlspecialchars without deleting anything. 124 lines of code. Feedback would be appreciated.

I have to post tinyurl, because ampersands seem to get deleted.

http://tinyurl.com/HtmlFilter

Fri, 12 Dec 2008 at 18:47:09 GMT Link


65.leuke filmpjes said:

Somebody familiar with HTML has only to learn which subset he may use. This is so much simpler than learning about your current BBCode definition.

Tue, 16 Dec 2008 at 02:00:52 GMT Link


66.David JM Emmett said:

Prelude: I prefer to use an MVC Style architecture.

I've always thought that the best way to go about things is to structure your code so that the View is a series of calls to a DOMDocument using PHP DOM XML.

Personally, I find this to be the best and safest way to validate users' HTML input: I load the user's input into a DOMNode and remove all attributes that I think may be dangerous.
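A DOM-based pass of this kind might look like the sketch below. It is an illustration under stated assumptions, not David's code: strip_unsafe_attributes() is a hypothetical name, it keeps an allow-list of attributes rather than removing a list of dangerous ones, and it does not check URL schemes in href or handle non-ASCII input, both of which a real filter would need.

```php
<?php
// Hypothetical DOM pass: drop every attribute that is not on the allow-list.
// Assumes ext-dom is available; href values are NOT checked here.
function strip_unsafe_attributes(string $fragment, array $allowed = ['href', 'title']): string
{
    $doc = new DOMDocument();

    // Wrap the fragment so it has a single known root, and silence
    // libxml warnings about tag soup.
    libxml_use_internal_errors(true);
    $doc->loadHTML('<div>' . $fragment . '</div>');
    libxml_clear_errors();

    $root = $doc->getElementsByTagName('div')->item(0);

    // Snapshot the live lists before modifying the tree.
    foreach (iterator_to_array($root->getElementsByTagName('*'), false) as $element) {
        foreach (iterator_to_array($element->attributes, false) as $attribute) {
            if (!in_array($attribute->nodeName, $allowed, true)) {
                $element->removeAttributeNode($attribute);
            }
        }
    }

    // Serialize only the children of the wrapper.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }

    return $out;
}

$clean = strip_unsafe_attributes('<p onclick="alert(1)">hi <a href="/x" onmouseover="y()">link</a></p>');
```

Working on a parsed tree means odd quoting or attribute order in the input no longer matters, at the cost of the parser silently repairing malformed markup.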

Wed, 14 Jan 2009 at 14:37:27 GMT Link

