About the Author

Chris Shiflett

Hi, I’m Chris: entrepreneur, community leader, husband, and father. I live and work in Boulder, CO.


Input Filtering

  • Published in PHP Architect on 18 May 2004
  • Last Updated 18 May 2004

Welcome to another issue of Security Corner. This month's topic is input filtering, one of the cornerstones of web application security. Input filtering is the method by which you validate all incoming data and prevent any invalid data from being used by your application. It's very similar in theory to how water filtering works, where impurities in water are not allowed to pass.

This article covers a variety of issues, but unlike previous Security Corners, I will be focusing more on the theoretical aspects than the practical. Understanding why and where to filter is more important than understanding how. By the end, you should be able to better design your applications with security in mind.

Spoofed Form Submissions

When you write code that is expecting data from the client, you're usually processing a form. It's important to appreciate just how easily a form submission can be spoofed, so that you realize that absolutely nothing about the client's request can be blindly trusted.

Consider the following form located at example.org:

<form action="/receive.php" method="POST">
<select name="color">
    <option value="red">red</option>
    <option value="green">green</option>
    <option value="blue">blue</option>
</select>
<input type="submit" />
</form>

When a user selects red and submits the form, a request similar to the following is sent:

POST /receive.php HTTP/1.1
Host: example.org
Content-Type: application/x-www-form-urlencoded
Content-Length: 9
 
color=red

There are two common methods used to spoof such a form. One is to recreate the HTML markup for the form, using an absolute URL in place of the relative one:

<form action="http://example.org/receive.php" method="POST">
<input type="text" name="color" />
<input type="submit" />
</form>

An easy way to create such a form is to save the HTML from the real site and substitute URLs where appropriate (this task can be automated to make things easier). The fake form can reside anywhere, because the request will still be sent to example.org due to the absolute URL specified by the action attribute.

Whereas the real form restricts the color to one of three choices, this new form has no restrictions and makes it convenient for an attacker to practice submitting various values for color in an attempt to subvert your application. If the attacker types red into the text field and submits the form, the request will be exactly the same as in the previous example. Of course, the most important point is that, in both cases, the request is coming from the client. Thus, you have no control over what is sent and must make sure that the color is one of the expected values.
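One way to make sure the color is one of the expected values is to compare it against a fixed list. The following is only a minimal sketch: the function name valid_color() matches the one used later in this article, but its body here is an assumption, not a prescribed implementation.

```php
<?php

// Hypothetical implementation: returns true only when $value is one of
// the three colors offered by the real form.
function valid_color($value)
{
    $allowed = array('red', 'green', 'blue');

    // The third argument requests strict comparison, so only exact
    // string matches pass and PHP's loose type juggling is avoided.
    return is_string($value) && in_array($value, $allowed, true);
}

var_dump(valid_color('red'));     // bool(true)
var_dump(valid_color('purple'));  // bool(false)
var_dump(valid_color(array()));   // bool(false)
```

The check does not care whether the value came from the real form, a fake form, or a hand-typed request.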

A more direct method to spoof the form is to manually enter the POST request. If you telnet to port 80 (for standard HTTP) on the target host, you have complete flexibility. Here is an example from a standard shell prompt:

$ telnet example.org 80
Trying 192.0.34.166...
Connected to example.com.
Escape character is '^]'.
POST /receive.php HTTP/1.1
Host: example.org
Content-Type: application/x-www-form-urlencoded
Content-Length: 9
 
color=red
 
HTTP/1.1 404 Not Found
...

Because receive.php is a fictional resource, this generates a 404 response. I encourage you to try this with real forms, so you can appreciate the ease and power of this approach.
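The same request can be assembled by a script instead of typed by hand. The sketch below only builds the request text, with a value the real form never offers; build_post_request() is a hypothetical helper, and actually sending the text (shown in the trailing comment) is left out so the example stays self-contained.

```php
<?php

// Hypothetical helper that assembles the raw text of a POST request.
function build_post_request($host, $path, array $fields)
{
    // http_build_query() URL-encodes the fields, e.g. "color=red".
    $body = http_build_query($fields);

    $lines = array(
        "POST $path HTTP/1.1",
        "Host: $host",
        'Content-Type: application/x-www-form-urlencoded',
        'Content-Length: ' . strlen($body),
        '',      // blank line separates headers from body
        $body,
    );

    // HTTP uses CRLF line endings.
    return implode("\r\n", $lines);
}

$request = build_post_request('example.org', '/receive.php', array('color' => 'purple'));
echo $request, "\n";

// Sending it is a matter of writing the text to a socket, for example:
// $socket = fsockopen('example.org', 80);
// fwrite($socket, $request);
```

Nothing in the request reveals that no form was involved, which is why the server cannot rely on the form's restrictions.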

If the method of the form is GET rather than POST, the request resembles the following instead:

GET /receive.php?color=red HTTP/1.1
Host: example.org

As with the POST request, this can be spoofed manually or with a fake form. However, it is much easier to simply type the desired URL into a browser, so the other approaches are unnecessary. This additional convenience should not mislead you into believing that POST requests are more secure.

It should be clear that a dedicated attacker has complete control over the HTTP request that is processed by your application. In fact, it's best to not think of requests as being form submissions, since the use of a form is actually unnecessary.

Of course, the attacker could also try sending unexpected variables, but unless these are used, there is virtually no risk. This is a key point: as long as you filter the input that you use, you have a good design (implementation might be another matter). If you don't filter all input that you use, however, an attacker has an opportunity to compromise your application.
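To illustrate why unexpected variables are harmless when they are never used, here is a small sketch that keeps only the fields the application expects. The helper pick_expected() and the sample field names are assumptions made for this example.

```php
<?php

// Hypothetical helper: keep only the fields the application expects to use.
function pick_expected(array $input, array $expected)
{
    // array_flip() turns the list of names into keys so that
    // array_intersect_key() can discard every other entry.
    return array_intersect_key($input, array_flip($expected));
}

// Simulated request data; in a real script this would be $_POST.
$post = array('color' => 'red', 'admin' => '1', 'debug' => 'on');

$used = pick_expected($post, array('color'));
print_r($used);  // only the 'color' entry remains
```

Note that the remaining value is still tainted at this point. Discarding the extras narrows what has to be filtered; it does not replace the filtering itself.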

Register Globals

In PHP 4.2.0, the default setting for register_globals changed from On to Off. This change is regarded as one of the most controversial in PHP's history. There is also quite a bit of misinformation being spread about register_globals and its inherent insecurity. Most of this misinformation unjustly blames register_globals for poor programming.

Some people, myself included, argue that it is possible to develop secure PHP applications with register_globals enabled. This is absolutely true, although enabling it does heighten the security risk: a mistake is much more dangerous, and likely easier to exploit, when register_globals is enabled.

With register_globals enabled, it becomes necessary to filter or initialize all data prior to use, assuming it to be tainted otherwise, because any variable can potentially be overwritten by input. This is a good practice, even when register_globals is disabled.

A common example of a security vulnerability is the assumption that a variable cannot exist without being explicitly set in the code:

<?php
 
if (validate_user()) { 
    $validated = TRUE;
}
 
/* ... */
 
if ($validated) {
    /* Sensitive Activity */
}
 
?>

It is easy enough for an attacker to send validated=1 in the URL and bypass the second check (and anything else that relies on $validated). Of course, this is not possible with register_globals disabled, but it is also not possible with better coding practices: $validated should be initialized to FALSE.

With error_reporting set to a sufficiently high level (E_ALL will do the trick), this code generates a notice about an undefined variable. It is a good practice to always initialize variables (and to develop with error_reporting set to E_ALL to help catch yourself when you forget).
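The following sketch shows the diagnostic being raised. It collects messages with set_error_handler() rather than printing them, purely to make the effect easy to inspect; the check() function is a hypothetical stand-in for the flawed code above. The exact wording and severity of the message vary between PHP versions, so the example does not depend on either.

```php
<?php

error_reporting(E_ALL);

$messages = array();

// Collect diagnostics instead of printing them.
set_error_handler(function ($severity, $message) use (&$messages) {
    $messages[] = $message;
    return true;  // suppress PHP's default output
});

function check()
{
    // $validated is never initialized here, which is the mistake.
    if ($validated) {
        return 'sensitive';
    }

    return 'denied';
}

$result = check();
restore_error_handler();

print_r($messages);  // one message about the undefined variable
```

During development, a message like this is the reminder that an initialization was forgotten.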

Timing

How can initializing variables protect you? Consider a slight modification to the previous example:

<?php
 
$validated = FALSE;
 
if (validate_user()) {
    $validated = TRUE;
}
 
/* ... */
 
?>

With this code, it is impossible for $validated to be TRUE unless validate_user() returns TRUE (regardless of the register_globals setting). If register_globals is enabled, and this script is accessed with validated=1 in the URL, the sequence of events is as follows:

  1. Request with validated=1 in the URL is sent.
  2. $validated is created with a value of 1.
  3. Your code begins execution.
  4. $validated is set to FALSE.
  5. ...

This sequence shows that you are in complete control: by the time the first line of your code is executed, the user has finished sending the request and can do nothing more. Thus, as soon as you initialize a variable, you can be assured that the user cannot directly manipulate it. Use this to your advantage.
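The sequence of events can be simulated without enabling register_globals. In this sketch, extract() stands in for what register_globals would do before your code runs; it is an approximation for illustration only, and handle_request() and the always-failing validate_user() are assumptions of the example.

```php
<?php

// Stand-in for validate_user(); in this sketch the user is never valid.
function validate_user()
{
    return false;
}

// Hypothetical simulation of one request handled with register_globals on.
function handle_request(array $query, $initialize)
{
    // Steps 1-2: request data becomes variables before "your code" runs.
    extract($query);

    // Steps 3-4: your code begins and (optionally) initializes the variable.
    if ($initialize) {
        $validated = false;
    }

    if (validate_user()) {
        $validated = true;
    }

    // True means the sensitive activity would have been reached.
    return !empty($validated);
}

var_dump(handle_request(array('validated' => '1'), false));  // bool(true): check bypassed
var_dump(handle_request(array('validated' => '1'), true));   // bool(false): initialization wins
```

The only difference between the two calls is the initialization, which runs after the request data has already been injected.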

Where Is the Trust?

There has to be a certain amount of trust, or else your application can do nothing. The key is to understand where you are placing that trust. Never trust the client, as the mantra goes, but how can you be sure that you're not?

One way is to rely on the superglobal arrays such as $_GET, $_POST, and $_COOKIE to make the data's origin very clear in your code.

Another good practice is to initialize an array in which you store all data that is safe to be used. This can include data that the application generates itself as well as input from remote sources that has been proven valid.

Design

The culmination of all of the information presented thus far should be used in your application's design. If you fail to design with security in mind, you're doomed to be patching security holes for eternity. One primary concern needs to be input filtering, and a good design makes it easy for developers to distinguish safe data from potentially tainted data.

As mentioned in the previous section, keeping filtered data in a dedicated array gives you a helpful naming convention:

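<?php
 
/* Only data that has passed filtering is ever stored in $clean. */
$clean = array();
 
/* valid_color() is the application's own check (not shown here). */
if (isset($_POST['color']) && valid_color($_POST['color'])) {
    $clean['color'] = $_POST['color'];
}
 
?>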

A developer can get into the habit of assuming everything that's not in $clean is tainted. Good habits are valuable.

Another key to a successful design is to make certain that input filtering cannot be missed. Achieving this depends entirely upon your design, but if you initialize your variables and enforce a naming convention, any flaw in your design will cause a variable to be empty rather than have an arbitrary value set by an attacker.
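One possible shape for such a design is a single function that initializes every field and fills it only when a check passes. This is a rough sketch under my own assumptions: filter_request(), the validator closures, and the page field are all hypothetical names introduced for illustration.

```php
<?php

// Hypothetical single entry point for filtering. Every expected field is
// initialized to null first, so a missed or failed check leaves it empty.
function filter_request(array $input, array $validators)
{
    $clean = array();

    foreach ($validators as $name => $is_valid) {
        $clean[$name] = null;

        if (isset($input[$name]) && $is_valid($input[$name])) {
            $clean[$name] = $input[$name];
        }
    }

    return $clean;
}

// One check per field that the application uses.
$validators = array(
    'color' => function ($value) {
        return in_array($value, array('red', 'green', 'blue'), true);
    },
    'page' => function ($value) {
        return is_string($value) && ctype_digit($value);
    },
);

// Simulated request data; in a real script this would be $_POST.
$post = array('color' => 'red', 'page' => '12abc', 'extra' => 'x');

$clean = filter_request($post, $validators);
var_dump($clean);  // color is "red", page is NULL, extra is absent
```

Because the rest of the application reads only from $clean, a field that fails its check is empty rather than attacker-controlled.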

Until Next Time...

Input filtering is possibly the most important topic that I will cover here in Security Corner, and it is likely to be covered again (perhaps with more of a focus on practical implementations). If you design applications with a focus on how data enters the system and is validated, you're far less likely to experience an endless series of security holes.

It is easier to forgive a developer whose input filtering has weaknesses than one who completely fails to filter input at all. Hopefully you now understand the importance of this step and will never skip it.

Until next month, be safe.


9 comments

1. Gordon wrote:

A very cheap and cheerful way to quickly validate data is to use typecasting functions built into PHP such as intval(). For example:

$page_id = 0;

$page_id = intval ($_POST ['page_id']);

Tue, 01 Feb 2005 at 17:49:31 GMT


2. Chris Shiflett wrote:

If you abide by the rule that filtering should only be the process by which you inspect data to determine whether it's valid, then casting a supposed integer to an int violates this. It gives the same result in that you are guaranteed an integer when you're finished, but it's considered a bad practice to modify invalid data to make it valid.
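The difference is easy to see with a value that is not an integer. This is a small sketch; the sample input and the choice of ctype_digit() as the inspection step are assumptions made for illustration.

```php
<?php

// Sample input that is not a valid integer.
$input = '12abc';

// Casting silently turns invalid data into something that looks valid.
$cast = intval($input);

// Inspecting instead: accept the value only if it is already all digits.
$is_valid = is_string($input) && ctype_digit($input);

var_dump($cast);      // int(12)
var_dump($is_valid);  // bool(false)
```

Both approaches end with an integer or nothing, but only the second tells you that the request contained invalid data.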

Mon, 14 Mar 2005 at 20:29:56 GMT


3. Cyril Y. Kobets wrote:

$page_id = 0;

$page_id = intval ($_POST['page_id']);

There is no need for the first line. $page_id will have a value anyway.

Mon, 28 Mar 2005 at 06:50:59 GMT


4. James wrote:

Great site. Thank you :)

Mon, 31 Oct 2005 at 04:08:07 GMT


5. bob wrote:

for a beginner like me that was a really nice read... made me realise how bad i was about to fsck up.

thanks

Fri, 06 Jan 2006 at 02:09:24 GMT


6. Casper wrote:

Thanks. I've been searching for this for a while now. You explain it very well.

Sun, 08 Jan 2006 at 09:48:51 GMT


7. Andy wrote:

I always wondered how forms are spoofed, and after reading various articles on your site I am wiser. I am now implementing data filtering on all of my forms.

Thanks again for explaining this in a way that is easy to understand, well done.

Sat, 01 Apr 2006 at 20:30:45 GMT


8. Erik Bauffman wrote:

For PHP developers that use Firefox as their browser, the extensions "Tamper Data" & "Webdeveloper" can be a great help to test your forms.

Tamper Data: this extension gives you complete control over every request (GET/POST) that's being sent. You get text fields where you can fill out your own headers, POST vars, etc.

Webdeveloper: this has an option that enables you to convert GETs to POSTs, which makes it possible to simulate a POST by tampering with the URL. I like this method more, since my google sync doesn't like Tamper Data :)

Sat, 14 Apr 2007 at 21:32:47 GMT


9. Jonas Abrahamsson wrote:

I'm currently reconsidering my input filtering policies and find your articles very helpful.

I think Gordon's comment above is very interesting because that is exactly my approach: to convert whatever data is received to the right type of data.

I know some cases when it's not good to accept any kind of malformed data as input, because it gives an appearance that anything goes, which does not encourage strictness, and when something goes wrong it's hard to know exactly what, because there are no error messages. But it also leads to a kind of robustness: the application will work no matter what the input data is. It's a kind of multi-level robustness if used when data is sent within the application.

However, I look mainly from the perspective of the users of my applications. I find users are often frustrated by the application telling them, for example, that they have a bad char at the end of a phone number. "Why could not the computer just do it itself rather than tell me to do it?" Most people entering wrong data and seeing it change and pass will think of the function as 'smart', and too strict rules about the data input are not very user-friendly.

In cases when the action can't be undone, I usually provide the users with a confirmation where they see the data they provided, and then they have the opportunity to find data that has changed from what they entered (assuming they entered something that wasn't allowed, but they thought it was).

The approach is a bit like Google's 'Did you mean' (which btw has degraded my spelling, another downside I acknowledge).

I had not heard of the rule, but I kind of had an itch of it being there because of how most people code.

What is your take on the issue? Are there any pitfalls I've missed?

And do you know about any further reading about this issue?

Tue, 01 Apr 2008 at 18:57:11 GMT

