Edward Yang’s profile

MIT student who plays oboe and spends an inordinate amount of time fiddling with software

Latest Comments


Don't forget about Doxygen!

Posted in PHP Advent Calendar Day 2.

Sun, 02 Dec 2007 at 21:25:23 GMT


Putting words in my mouth? :-)

My apologies. That was the vibe I got from your last post on that thread. And I know another programmer who I greatly respect who stated that I shouldn't worry too much about the advisory.

Because both htmlentities() and htmlspecialchars() use ISO-8859-1 by default, you could just make sure that the Content-Type header indicates ISO-8859-1, but I think it's a good habit to be explicit.

I think a programmer should have wrapper functions for these two functions, so that they can easily change the encoding globally if need be.
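
A minimal sketch of such a wrapper, assuming the application keeps its encoding in a single constant. The names APP_CHARSET and app_escape() are hypothetical and chosen only for illustration; htmlspecialchars() and ENT_QUOTES are the real PHP built-ins. (Later PHP releases changed the default charset of these functions, which is one more reason to pass it explicitly.)

```php
<?php
// Hypothetical application-wide setting: change it here to switch encodings everywhere.
const APP_CHARSET = 'UTF-8';

/**
 * Escape text for output in HTML element content or a quoted attribute value.
 * app_escape() is an illustrative name, not part of PHP.
 */
function app_escape(string $text): string
{
    // Pass the charset explicitly instead of relying on the function's default.
    return htmlspecialchars($text, ENT_QUOTES, APP_CHARSET);
}

echo app_escape('<a href="x">Tom & Jerry</a>'), "\n";
```

Sending a Content-Type header that names the same charset keeps the browser and the escaping function in agreement.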

Posted in Character Encoding and XSS.

Wed, 30 May 2007 at 21:42:26 GMT


It's a bit scary to see the misinformation prevalent about what htmlspecialchars() and htmlentities() do, although I don't blame you guys. Concerned Citizen is right: in some research I did previously, I established that the character encoding parameter in these functions does not actually fix up the character encoding of the string.

But here's the kicker: this behavior is mostly benign, since the usual exploits are extremely difficult to pull off (Chris Shiflett doesn't believe in the seriousness of it). I, personally, balk because it's extremely unclean: your application will not do anything about form feeds, or null bytes, or any of those icky control characters, and they will get passed back to the browser. This is the price you pay for a string format that by default is binary-compatible.

There's no cross-compatible, built-in way to fix this. The typical approach, however, is to use iconv when it is present and roll your own UTF-8 cleaner when it isn't.
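
One way that approach might look, as a rough sketch rather than a vetted implementation. The function name clean_utf8() is hypothetical. The iconv branch assumes that converting UTF-8 to UTF-8 with //IGNORE drops invalid bytes; some iconv builds report failure instead, so the sketch falls back to the hand-rolled pattern in that case. Stripping control characters other than tab, LF and CR is an added assumption about what "clean" should mean here.

```php
<?php
/**
 * Return $text with invalid UTF-8 byte sequences and most control characters removed.
 * clean_utf8() is a hypothetical helper name.
 */
function clean_utf8(string $text): string
{
    if (function_exists('iconv')) {
        // //IGNORE asks iconv to drop bytes it cannot convert. Some builds
        // report failure instead, so fall through to the regex in that case.
        $converted = @iconv('UTF-8', 'UTF-8//IGNORE', $text);
        if ($converted !== false) {
            // iconv keeps control characters; strip all but tab, LF and CR.
            return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', '', $converted);
        }
    }

    // Hand-rolled fallback: keep only well-formed UTF-8 sequences.
    // The match is byte-oriented, so the /u modifier is deliberately absent.
    $valid = '/[\x09\x0A\x0D\x20-\x7E]
        | [\xC2-\xDF][\x80-\xBF]
        | \xE0[\xA0-\xBF][\x80-\xBF]
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
        | \xED[\x80-\x9F][\x80-\xBF]
        | \xF0[\x90-\xBF][\x80-\xBF]{2}
        | [\xF1-\xF3][\x80-\xBF]{3}
        | \xF4[\x80-\x8F][\x80-\xBF]{2}/x';
    preg_match_all($valid, $text, $matches);
    return implode('', $matches[0]);
}
```

Running input through something like this before escaping means form feeds, null bytes and stray high bytes never reach the browser.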

Posted in Character Encoding and XSS.

Wed, 30 May 2007 at 11:43:52 GMT


I must really be missing something. (always possible).

I think (and this is an assumption on my part) that you've never seen a BBCode library before. They're huge; the PEAR parser and the phpBB BBCode parser are cases in point.

Posted in Allowing HTML and Preventing XSS.

Sun, 18 Mar 2007 at 02:57:37 GMT


Perhaps <object> sanitization will be in a future version.

It'll be tough. Object is probably the most complex element in the entire HTML spec.

As with MathML, it does through a plugin. Pushing these technologies may not be "mainstream." But IE users are not completely shut out.

I am aware that both functionalities are supported by plugins. However, it is so much nicer when things are natively supported. The average Joe is not, without a lot of prodding, going to install a MathML plugin just to see a few equations (then again, he probably has no business seeing them anyway).

Your code is not even a subpart of XHTML (as only some tags are allowed), it's just a different language.

Let's look at it a different way. Is valid Shiflett-code valid XHTML? If we ignore paragraphing and the <pre> and <code> duality, yes.

Whether or not this is sanctioned by W3C as a valid subset is a different matter.

Your "simple solution" is not going to work because perfectly valid [a][em][/em][/a] would not be recognized by your new code.

Voila, another edge case. Personally, I think that it's a reasonable tradeoff (you can do without an emphasis in link text!), but they do accumulate.

And this is unfortunately not the case with this blog. If I want to write about entities then they are silently converted (it's not documented anywhere).

Agreed. The style guide should mention that character entities are not supported, and if you want those characters you have to type in the literal Unicode character (which I personally favor).

Furthermore, preview works in a different way than actual processing (and doesn't work without JavaScript at all).

Actually, I think it does work the same way. You can't inspect the source code though. Preview without JavaScript would be nice. :-)

Independently yes, but Chris' code handles this in the "Normalize Newlines" section. I was curious to know if there was a reason he went with code that created additional variables and looped versus a more simplistic approach.

You got me. The normalization code would protect against that. I guess he's doing some other checks on each line of the code, maybe that's how he ensures <blockquote>s don't get wrapped in p tags.

Posted in Allowing HTML and Preventing XSS.

Fri, 16 Mar 2007 at 17:22:02 GMT


Can anyone confirm whether or not the exploit still works? If it has been fixed as David suggests, it is possible that Amazon saw this blog post and fixed it, in which case full disclosure works. (Or, possibly, the timing was extremely lucky, and Amazon had just gotten around to fixing it.)

Posted in My Amazon Anniversary.

Fri, 16 Mar 2007 at 16:39:20 GMT


therefore must be well-formed XHTML

Call me pedantic, but well-formed != standards compliant. Otherwise, I could have cut out more than two thirds of the code in HTML Purifier ;-) Well-formedness is actually the easy part: as long as you can parse HTML into a DOM, you can serialize it back into a well-formed XML document no problem'o. You're absolutely correct in this respect.

But it is trivially easy to cause your wiki pages to fail validation. My favorite example is <strong><div>No block elements in inline context!</div></strong>, demonstrating the lack of child validation.
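
To make the distinction concrete, here is a small sketch that accepts the example above as well-formed and still flags it. It assumes PHP's DOM extension is available; the function name and the two element lists are hypothetical and cover only a handful of tags, nowhere near the full XHTML content model.

```php
<?php
/**
 * Report whether a well-formed fragment nests a block element inside an inline one.
 * The element lists are illustrative, not the full XHTML content model.
 */
function has_block_inside_inline(string $xml): bool
{
    $inline = ['a', 'code', 'em', 'strong'];
    $block  = ['blockquote', 'div', 'p', 'pre'];

    $doc = new DOMDocument();
    // Parsing succeeds for any well-formed input: this is the easy part.
    if (!@$doc->loadXML($xml)) {
        throw new InvalidArgumentException('Input is not well-formed XML.');
    }

    foreach ($doc->getElementsByTagName('*') as $element) {
        if (!in_array($element->nodeName, $block, true)) {
            continue;
        }
        // Walk up the ancestors looking for an inline element.
        for ($node = $element->parentNode; $node instanceof DOMElement; $node = $node->parentNode) {
            if (in_array($node->nodeName, $inline, true)) {
                return true;
            }
        }
    }
    return false;
}
```

A real validator has to check every parent and child pair against the content model, which is where most of the code goes.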

Despite the extremely limited support for entering "XHTML" (you can't even enter a link to Bugzilla, let alone use single quotes for attributes, or type an unordered list or ...), I came within one character of an XSS exploit in the comment above. And that was without really trying very hard. Anyone who attempts to extend this code to include a more realistic subset of "XHTML" runs the risk of opening a security hole.

Oh yes, of course. Chris Shiflett is a smart guy, and I trust that he has no illusions about the scalability of this approach. But if you get to that point, use HTML Purifier. ;-) Even though it's written in a different language, I strongly encourage you to go check it out.

And yes, I agree the lack of lists is a bummer. I also don't think implementing them will be possible with this paradigm, unless you resort to something really hacky.

Why not? I allow MathML (but not, currently, SVG) in my blog comments. And the Wiki certainly allows both MathML and SVG, in addition to an actual reasonable subset of XHTML

Because this is not a graphic design blog, this is a web application security blog. ;-) Plus, I am of the school that SVG should not be embedded in pages and should be treated as separate files. I am also of the school that until Internet Explorer gets reasonable SVG support, you must rasterize it into a format like PNG. I love MediaWiki's (the software that runs Wikipedia) implementation of this stuff.

(as opposed to "XHTML," which only vaguely looks like XHTML).

The moment we tossed in auto-paragraphing, code formatting and dual block/inline <code> tags, it stopped being a subset of XHTML. But limiting to double-quotes, limiting attribute ordering and limiting tags fits perfectly into the idea of a subset of XHTML behavior. Whether or not that behavior involves only tags is another story. ;-)

BTW, what do you mean by Bugzilla links?

Posted in Allowing HTML and Preventing XSS.

Thu, 15 Mar 2007 at 21:53:52 GMT


I explain a real solution on my blog.

It's an interesting bit of code, and definitely a step in the right direction, but it's not going to give you standards-compliant code, and is no good for Shiflett's situation: why the heck would we need SVG pictures in our blog comments? ;-)

I apologize for going off-topic, but I have a question about the "More Paragraphs" code. Is there any reason you didn't use something like this:

There are a number of methods for auto-paragraphing code. Your particular implementation would add oodles of empty paragraph tags when there are double-spaces, although this can be corrected.
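
The snippet under discussion is not reproduced here, so the following is not that code. It is a sketch of one possible correction, reading "double-spaces" as runs of consecutive blank lines (an assumption). The function name auto_paragraph() is hypothetical, and the input is assumed to be escaped already.

```php
<?php
/**
 * Wrap blank-line-separated chunks of text in <p> tags without
 * emitting empty paragraphs. auto_paragraph() is an illustrative name.
 */
function auto_paragraph(string $text): string
{
    // Normalize newlines first so one pattern covers every platform.
    $text = str_replace(["\r\n", "\r"], "\n", $text);

    // Treat any run of blank lines as a single separator and drop empty chunks.
    $chunks = preg_split('/\n\s*\n/', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    $paragraphs = [];
    foreach ($chunks as $chunk) {
        $paragraphs[] = '<p>' . trim($chunk) . '</p>';
    }
    return implode("\n", $paragraphs);
}
```

This does nothing special for <pre> or <blockquote>, so block elements would still need separate handling.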

Posted in Allowing HTML and Preventing XSS.

Thu, 15 Mar 2007 at 20:36:24 GMT


I think allowing HTML is very dangerous and any "safe-HTML" is a step back; there will be ways to inject HTML/JavaScript through it. If not through the parser, then via use of browser-supported syntax that does not meet the spec.

While HTML was not designed with security in mind, I don't think this mentality is correct. I've previously proposed that by insisting on standards compliance, you protect yourself against browser quirks. While there wasn't that much discussion on it, I think that it is very possible to do HTML safely. Otherwise, HTML Purifier wouldn't have happened. ;-)

Anyway, as the filter stands right now, those two common vectors you mentioned won't work because Chris's regexes wouldn't match them.

Posted in Allowing HTML and Preventing XSS.

Wed, 14 Mar 2007 at 21:17:34 GMT


I've done a bit of thinking about it, and I've come to the conclusion that HTML Purifier can be configured to have exactly the same behavior as your simple filter, albeit with a bit more support for the edge cases.

$config = HTMLPurifier_Config::createDefault();
$config->set('HTML', 'AllowedElements', 'a,em,blockquote,p,code,pre');
$config->set('HTML', 'AllowedAttributes', 'a.href,a.title');

...these configurations will do the element and attribute limitations you are seeking (it will also require you to use pre for block usage). You will also need to implement a few bits of custom code:

1. Pre-processing auto-paragraphing, which can be done using preg_split, although you'll need to be careful not to do anything to the pre blocks and make sure that the <p> tags don't wrap around other elements.

2. Pre-processing pre-block armoring, which essentially involves using preg_replace_callback to match the innards of <pre> and entity-ize everything in them.

3. Post-processing code beautification: after HTML Purifier is done with the output, match all <pre> blocks with preg_replace_callback and run their contents through your source beautifier to get pretty output (you may need to de-entitize them first).
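
As a rough illustration of step 2, the armoring could be sketched like this. The function name armor_pre_blocks() is hypothetical, and the pattern assumes plain <pre> tags with no attributes and no nesting; preg_replace_callback() and htmlspecialchars() are the real PHP built-ins.

```php
<?php
/**
 * Entity-encode everything between <pre> and </pre> so the contents
 * survive later filtering as literal text. Illustrative helper only.
 */
function armor_pre_blocks(string $html): string
{
    $armored = preg_replace_callback(
        '#<pre>(.*?)</pre>#s', // non-greedy; the s flag lets "." span newlines
        function (array $match): string {
            // Encode &, < and > inside the block; quotes are left alone.
            return '<pre>' . htmlspecialchars($match[1], ENT_NOQUOTES, 'UTF-8') . '</pre>';
        },
        $html
    );

    // preg_replace_callback() returns null on a regex error; keep the input then.
    return $armored ?? $html;
}
```

The post-processing step can reuse the same pattern, decoding the entities before handing each block to the beautifier.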

Just putting another option on the table. These three functionalities are actually on the HTML Purifier 2/3 roadmap for core features, although the implementation may be a little different.

Ooh, by the way, don't forget to check the character encoding of input text! :-)

Posted in Allowing HTML and Preventing XSS.

Wed, 14 Mar 2007 at 19:48:31 GMT


  • Twitter: @ezyang
  • Location: Cambridge, MA
  • Joined: October 2006
  • Comments: 15