Character Encoding

Published in PHP Architect on 28 Feb 2006

I want to give thanks to Ilia Alshanetsky, who has agreed to take over Security Corner. It has been my pleasure to be the author of this column for the past few years. I think it’s valuable to hear from different sources of security expertise. Ilia is a well-known PHP expert and educator, and I’m confident that you’ll learn a lot from what he has to say.

Character encoding is a vast topic that I don’t plan to cover in much detail. In fact, the purpose of this month’s Security Corner is to illustrate why character encoding matters, not to explain character encoding mechanics. I highly encourage you to learn as much as you can about character encoding, because I think it will not only make you a better developer, but also lead to web apps that are more accessible.


I use escaping to describe all techniques that represent data in such a way that it is preserved in a different context. From a PHP developer’s perspective, there are three primary contexts that involve escaping: URLs, SQL queries, and HTML.

The format of URLs does not support other character encodings, so urlencode() sufficiently preserves data within a URL (e.g., as the value of a query string parameter). However, properly escaping data to be used in SQL queries or HTML requires more attention.
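As a small illustration of the URL case, here is a minimal sketch that places a value in a query string with urlencode(). The parameter name q and the example.org address are placeholders chosen for this example, not part of any real application.

```php
<?php

// Hypothetical value containing characters that have special meaning in a URL.
$value = 'rock & roll?';

// urlencode() replaces each reserved character with %XX and each space with +.
$url = 'http://example.org/search?q=' . urlencode($value);

echo $url, "\n"; // http://example.org/search?q=rock+%26+roll%3F

// Decoding the parameter gives back the original value unchanged.
$roundTrip = urldecode('rock+%26+roll%3F');
```

Because the escaped form uses only plain ASCII characters, the result does not depend on a character encoding being declared anywhere.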


Escaping HTML is typically performed with htmlentities() or htmlspecialchars(), but the simplest use of these functions does not indicate which character encoding to use:

  <?php

  $html = array();

  $html['username'] = htmlentities($clean['username']);

  echo "<p>Welcome back, {$html['username']}.</p>";

  ?>

This example suggests that the username has already been filtered (hence $clean['username']), so it’s unlikely to cause problems when used in the context of HTML. If $_POST['username'] were used instead, however, this would be vulnerable to XSS, despite the use of htmlentities().

The best way to illustrate this is to recreate Google’s recent XSS vulnerability. I used the following example when I blogged about this:

  <?php

  header('Content-Type: text/html; charset=UTF-7');

  $string = "<script>alert('XSS');</script>";
  $string = mb_convert_encoding($string, 'UTF-7');

  echo htmlentities($string);

  ?>

The Content-Type header indicates a character encoding of UTF-7. Browsers such as Internet Explorer automatically detect the encoding, in which case this line can be removed, but I wanted to make sure the example works in any browser.

The next two lines create $string, which represents the attack: a typical XSS payload encoded with UTF-7. By default, htmlentities() assumes the character encoding is ISO-8859-1, so it misinterprets the characters used in the attack and fails to escape them properly. Thus, if you try this example yourself, you should see the script run and the browser display an alert box containing the text XSS.
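To see why the escaping has no effect, it can help to inspect the encoded payload directly. The sketch below is only an inspection aid, assuming the mbstring extension is available (the example above needs it too); the variable names are my own.

```php
<?php

$attack = "<script>alert('XSS');</script>";
$utf7   = mb_convert_encoding($attack, 'UTF-7', 'UTF-8');

// UTF-7 writes "<" and ">" in a base64-like form (for example +ADw- and
// +AD4-), so the encoded payload holds neither character for an escaping
// function to find.
$hasAngleBrackets = strpbrk($utf7, '<>') !== false;

// Decoding as UTF-7, which is what the browser does, restores the markup.
$decoded = mb_convert_encoding($utf7, 'UTF-8', 'UTF-7');

var_dump($hasAngleBrackets);    // whether any "<" or ">" survived encoding
var_dump($decoded === $attack); // whether decoding restores the original
```

The point of the sketch is that the dangerous characters only come into existence once the browser decodes the page, which is after htmlentities() has already run.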

In order to avoid this type of vulnerability, it's best to always be explicit about the character encoding:

  <?php

  header('Content-Type: text/html; charset=UTF-8');

  $html = array();

  $html['username'] = htmlentities($clean['username'], ENT_QUOTES, 'UTF-8');

  echo "<p>Welcome back, {$html['username']}.</p>";

  ?>
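One way to make the explicit encoding hard to forget is to route all HTML escaping through a single wrapper. The following is a minimal sketch; escape_html() is a hypothetical helper name, not a PHP built-in, and the UTF-8 default is an assumption that must match the charset sent in the Content-Type header.

```php
<?php

/**
 * Hypothetical helper: escape a string for HTML output with the
 * character encoding always stated explicitly.
 */
function escape_html(string $value, string $charset = 'UTF-8'): string
{
    // ENT_QUOTES escapes both double and single quotes.
    return htmlentities($value, ENT_QUOTES, $charset);
}

echo escape_html("<b>O'Reilly</b>"), "\n"; // &lt;b&gt;O&#039;Reilly&lt;/b&gt;
```

With these flags, htmlentities() returns an empty string when the input is not valid in the stated encoding, so malformed byte sequences are dropped rather than passed through to the browser.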


Character encoding is more important when escaping HTML than when escaping SQL. The HTML you output is interpreted by many different browsers. When you're communicating with a database, you're communicating with a particular database, and you control how it interprets characters.

It is both interesting and educational to see how character encoding inconsistencies can be problematic in the context of SQL. In order to provide an example, I'll demonstrate an SQL injection attack that is immune to addslashes(), because this function operates on individual bytes and therefore effectively assumes a single-byte encoding such as ISO-8859-1. For this demonstration, I'll use GBK, a multi-byte character encoding.

In GBK, 0xbf27 is not a valid multi-byte character, but 0xbf5c is. Interpreted as single-byte characters, 0xbf27 is 0xbf (¿) followed by 0x27 ('), and 0xbf5c is 0xbf (¿) followed by 0x5c (\).

The goal of many SQL injection attacks is to inject a single quote without it being escaped. If addslashes() is being used, this can seem impossible, because it inserts a backslash before every single quote. However, all an attacker must do is inject something like 0xbf27, because addslashes() modifies this to become 0xbf5c27, a valid multi-byte character followed by a single quote. In other words, a single quote can be injected, despite the escaping. This is because 0xbf5c is considered to be a single character.
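The byte-level effect described above can be checked without a database. This short sketch only prints the bytes before and after addslashes(); the variable names are my own.

```php
<?php

// The attacker's input: 0xbf followed by a single quote (0x27).
$input = "\xbf\x27";

// addslashes() works byte by byte, so it inserts 0x5c before the 0x27.
$escaped = addslashes($input);

echo bin2hex($input), ' -> ', bin2hex($escaped), "\n"; // bf27 -> bf5c27

// Read as GBK, the first two bytes (0xbf5c) form one character,
// which leaves the trailing 0x27 as a bare single quote.
$lead = substr($escaped, 0, 2);
$tail = substr($escaped, 2);
```

The backslash is still present in the output, but a GBK-aware reader consumes it as the second half of a character instead of treating it as an escape.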

In order to illustrate this further, I provided a concrete example in my blog that I want to share. If you want to try this yourself, make sure you're using GBK. You can do this in /etc/my.cnf:

  [client]

  default-character-set=GBK

You'll need a table called users:

  CREATE TABLE users (
      username VARCHAR(32) CHARACTER SET GBK,
      password VARCHAR(32) CHARACTER SET GBK,
      PRIMARY KEY (username)
  );

The following script mimics a situation where only addslashes() is used to escape the data being used in a query:

  <?php

  $mysql = array();

  $db = mysqli_init();
  $db->real_connect('localhost', 'myuser', 'mypass', 'mydb');

  /* SQL Injection Example */
  $_POST['username'] = chr(0xbf) .
                       chr(0x27) .
                       ' OR username = username /*';
  $_POST['password'] = 'guess';

  $mysql['username'] = addslashes($_POST['username']);
  $mysql['password'] = addslashes($_POST['password']);

  $sql = "SELECT *
          FROM users
          WHERE username = '{$mysql['username']}'
          AND password = '{$mysql['password']}'";

  $result = $db->query($sql);

  if ($result->num_rows) {
      /* Success */
  } else {
      /* Failure */
  }

  ?>

Despite the use of addslashes(), an attacker can log in successfully without knowing a valid username or password.

To avoid this type of vulnerability, use mysql_real_escape_string(), bound parameters, or any of the major database abstraction libraries.
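For the bound-parameter option, a minimal sketch using mysqli prepared statements might look like the following. It reuses the users table from the example above; the function name credentials_match() is hypothetical, and comparing a stored password directly is kept only to mirror the demonstration query.

```php
<?php

/**
 * Hypothetical helper: check a username and password pair against the
 * users table using bound parameters instead of string escaping.
 */
function credentials_match(mysqli $db, string $username, string $password): bool
{
    // The ? placeholders keep the data separate from the SQL text.
    $stmt = $db->prepare(
        'SELECT username FROM users WHERE username = ? AND password = ?'
    );
    if ($stmt === false) {
        throw new RuntimeException('Prepare failed: ' . $db->error);
    }

    // 'ss' declares both bound parameters as strings.
    $stmt->bind_param('ss', $username, $password);
    $stmt->execute();
    $stmt->store_result();

    $found = $stmt->num_rows > 0;
    $stmt->close();

    return $found;
}
```

It would be called as credentials_match($db, $_POST['username'], $_POST['password']) with an open connection. Because the values are sent separately from the query text, a byte sequence such as 0xbf27 is treated as data and cannot close the quoted string.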

This type of attack is possible with any character encoding where there is a valid multi-byte character that ends in 0x5c, because addslashes() can be tricked into creating a valid multi-byte character instead of escaping the single quote that follows.

Until Next Time…

Hopefully you appreciate the importance of character encoding consistency and will always indicate the character encoding in your htmlentities() calls, your Content-Type headers, and the like. If you're using MySQL, use mysql_real_escape_string() instead of addslashes(), or if at all possible, use bound parameters.

Until next time, be safe.