Thanks very much to everyone who participated in this year's PHP Advent Calendar. The entire calendar is available at the following URL:
http://shiflett.org/blog/2007/dec
For reference, the complete list of entries is below. (See also Chris Cornutt's list and Sean Coates's calendar.)
Sean Coates
Elizabeth Naramore (Writing Code is Like Doing the Dishes (5 Reasons Why Documenting Your Code Makes You a Better Coder))
Sebastian Bergmann
James McGlinn
Cal Evans (Five Resources Every PHP Developer Should Know About)
Davey Shafik (APIs, UIs, and Other Underused Acronyms)
Elizabeth Smith (SPL to the Rescue)
Matthew Weier O'Phinney (Don't Reinvent the Wheel)
Ivo Jansch (Design Patterns)
Chris Cornutt
Ben Ramsey
Ed Finkler
Terry Chay (Filter Input; Escape Output: Security Principles and Practice)
David Sklar (Timing and Profiling)
Paul Reinheimer (Channels and Output)
Jeff Moore (What We Can Learn about Software Development from a Failing Restaurant)
Ilia Alshanetsky
Christian Wenz (WSDL Despite PHP 5)
Marcus Börger
Adam Trachtenberg (User-Defined Functions in SQLite)
Luke Welling (Following the Big Dogs on Web Application Security)
Derick Rethans
Jay Pipes
Nate Abele
Coordinating this turned out to be a lot of work, but I hope to do it again next year. There are lots of people in the PHP community who have something useful to share, and one reason to continue putting this calendar together each year is to get some original content published in December, a month when many people get busy, and blogs go dormant. With a little bit of prodding, we all hopefully learned a little more than we would have otherwise, and the people who were gracious enough to share something deserve our thanks.
Happy holidays, everyone. See you in 2008.
Today's entry is provided by Nate Abele.

- Name
- Nate Abele
- Blog
- cake.insertdesignhere.com
- Biography
- Nate Abele of OmniTI has been a core developer of the CakePHP web framework for over two years. He is known in some circles as the Johnny Cash (or "Man in Black") of the PHP community.
- Location
- New York, New York
Today, dear readers, I offer you no lofty words of wisdom, no dishwasher analogies, and no deep thoughts to ponder. What I can offer is a piece of simple, practical advice that you can use today, and some bits of code that I humbly submit to you below. Without further ado, let's begin.
Late last year, Chris and I (mostly Chris), came up with a clever defense-in-depth strategy for session security over a couple of beers. (We were sober most of the time; why do you ask?) The basic idea is that when dealing with session security, you want to do the best you can to ensure that the authenticated user you started talking to is the same user you're talking to now. You're probably familiar with at least some of the various forms of session attacks, so I won't bore you with the details.
There are all sorts of details in each request that you can potentially use to help be sure that you're still talking to the same user. Examples include the user's IP address and the User-Agent header. Hopefully you know that most of these details aren't reliable, because proxies and various other factors can change these things over the course of a legitimate user's session, causing you to potentially mistake a good guy for a bad guy. Common wisdom is that you can therefore never use such information.
The solution might not be black and white. If the user's IP address has been consistent over the last 50 requests, is it reasonable to assume that it's going to be the same for the next request? If it's suddenly not the same, then it seems reasonable to treat the request with a modicum of suspicion. To mitigate the risk, all you need to do is ask the user to re-authenticate. This can be as simple as prompting for the password again, a minor inconvenience for legitimate users, but a big hinderance to attackers.
In order to implement this simple system, you need two things:
A piece of information (key) you believe can be consistent, at least for a reasonably large portion of your users. (For example, an IP address.)
A threshold, after which you believe you can rely on this piece of information to be consistent in future requests.
The threshold is the important part; it's what makes this idea work. The threshold can be either a number of requests or an interval of time, but after this threshold has been met, the key is considered trustworthy, and any changes to its value will raise a red flag. Note that it's important to properly tune your threshold rules so that they're appropriate for your application, your users, and the piece of information you're tracking. Analyze your log files, and find and a sensible balance.
While the concept itself is simple, and can be implemented in any number of ways, I've created it as a component for CakePHP that you can use as follows:
<?php
class AccountsController extends AppController {
var $components = array('Trending');
function beforeFilter () {
$this->Trending->track(array('User-Agent' => 20,
'Net:!' => '+30 minutes'));
}
function index() {
}
function edit($id = null) {
}
}
?>
This example sets up two rules: one for the User-Agent header, and one for the IP address. In the context of domain-specific language I've constructed for these rules, : denotes a type of matching other than HTTP headers (e.g., Net), and ! means an exact match. So, this example only establishes a trend if the IP address is an exact match during a 30-minute period. For specifying thresholds, use strings to represent time (relative to now), and integers to represent a number of requests. So, this example establishes another trend if the User-Agent header remains consistent for 20 requests.
If either of these values change after the threshold is met, the trending component is going to to simply blackhole the request by sending back an error page and preventing any further execution. However, it has a callback property which can be used for more graceful handling, such as redirecting to a login page.
For reference, the current implementation is provided in Listing 1.
Listing 1:
<?php
uses('security');
class TrendingComponent extends Object {
var $components = array('Session', 'Security', 'RequestHandler');
var $rules = array();
var $failures = array();
var $callback = null;
function track($rules) {
$_rules = array();
foreach ($rules as $criteria => $rule) {
if (is_int($criteria)) {
$criteria = $rule;
$rule = 20;
}
$rule = is_array($rule) ? $rule : array('threshold' => $rule);
$key = $criteria;
if (strpos($criteria, ':') === false) {
$criteria = 'HTTP_' . up(r(array('-', ' '), '_', $criteria));
} elseif (strpos($criteria, 'Net:') === 0) {
} else {
}
$rule['key'] = $key;
$_rules[$criteria] = $rule;
}
$this->rules = $_rules;
}
function startup(&$controller) {
$this->failures = $this->incrementAndInvalidate();
if (!empty($this->failures)) {
if (!empty($this->callback)) {
$controller->{$this->callback}();
} else {
$this->Security->blackHole($controller, 'login');
}
}
}
function incrementAndInvalidate() {
$failures = array();
foreach ($this->rules as $header => $rule) {
$cur = $this->_config($header);
if (strpos($header, ':') === false) {
$hashValue = Security::hash(Configure::read('Security.salt') . env($header));
if (isset($cur['value']) && $this->thresholdMet($header)) {
if ($cur['value'] != $hashValue) {
$failures[] = $header;
}
} else {
$cur = $this->increment($rule, $cur, $hashValue);
}
} else {
list($type, $match) = explode(':', $header, 2);
switch ($type) {
case 'Net':
$value = $this->RequestHandler->getClientIP();
if (!isset($cur['value'])) {
$cur['value'] = $this->_getNetworkAddressPattern($match, $value);
}
if (preg_match($cur['value'], $value)) {
if (!$this->thresholdMet($header)) {
$cur = $this->increment($rule, $cur, $value);
}
} else {
if (!$this->thresholdMet($header)) {
$cur = $this->increment($rule, null, $cur['value']);
} else {
$failures[] = $header;
}
}
break;
}
}
$this->_config($header, $cur);
}
return $failures;
}
function increment($rule, $state, $value) {
$state = is_array($state) ? $state : array();
$state['value'] = isset($state['value']) ? $state['value'] : $value;
if (is_int($rule['threshold'])) {
$state['count'] = ($state['value'] == $value) ? $state['count'] + 1 : 0;
} else {
if (empty($state['count'])) {
$state['count'] = strtotime($rule['threshold']);
}
}
return $state;
}
function thresholdMet($rule) {
if (!isset($this->rules[$rule])) {
foreach ($this->rules as $key => $val) {
if ($val['key'] == $rule) {
$rule = $key;
break;
}
}
}
if (!isset($this->rules[$rule])) {
return false;
}
$config = $this->rules[$rule];
$cur = $this->_config($rule);
if (!$cur['count']) {
return false;
}
$base = strtotime(date('Y') . '-1-1');
if ($cur['count'] < $base && $config['threshold'] < $base) {
return ($cur['count'] >= $config['threshold']);
} else {
return ($cur['count'] < strtotime('now'));
}
return false;
}
function _config($key, $state = null) {
$key = r(array('.', '!', ':', '{', '}', '(', ')'), '_', $key);
if ($state === null) {
$state = $this->Session->read("Session.History.{$key}");
return (empty($state) ? array('count' => 0) : $state);
} else {
return $this->Session->write("Session.History.{$key}", $state);
}
}
function _getNetworkAddressPattern($pattern, $base) {
$result = '.*';
if ($pattern == '!') {
$result = preg_quote($base, '/');
}
return '/' . $result . '/';
}
}
?>
Future plans for the component include scoping by network address class and additional flagging options, but there are plenty of other things you can do such as integrating an IP-to-coordinate system, and making the threshold geography-based. There are numerous of other possibilities as well, but the important thing is to start thinking more broadly about your security strategies. Find patterns in your data and learn to use them to your advantage.
Happy holidays!
Today's entry is provided by Jay Pipes.

- Name
- Jay Pipes
- Blog
- jpipes.com
- Biography
- Jay Pipes is the North American Community Relations Manager at MySQL. Coauthor of Pro MySQL (Apress, 2005), Jay regularly assists software developers in identifying how to make the most effective use of MySQL. He has given sessions on performance tuning at the MySQL Users Conference, RedHat Summit, NY PHP Conference, OSCON, SCALE, and Ohio LinuxFest, amongst others. He lives in Columbus, Ohio, with his wife, Julie, and his four animals. In his abundant free time, when not being pestered by his two needy cats and two noisy dogs, he daydreams in PHP code and ponders the ramifications of
__clone().
- Location
- Columbus, Ohio
Recently, I've been busy getting the program for the MySQL Conference and Expo finalized (it's a big job!), and I wanted to take some time off
from the incessant stream of emails to speakers and sponsors to send a little gift to the blogosphere. My gift for the PHP Advent Calendar is
two completely random tips for you PHP and MySQL developers out there trying to squeeze performance out of your schemata and code. So, without further ado, here's my two random MySQL holiday tips.
Storing and Querying IPv4 Addresses
Many people aren't aware that any IPv4 address, commonly written using the dotted quad notation, are actually unsigned 32-bit integers. People are used to seeing the dotted quad notation, which contains 4 separate integers with values from 0 to 255 separated by dots. Together, the integers represent the class of the network within which the host machine is located. For instance, on my laptop, sitting here at home on my local (class C) network, I see that my IP address is 192.168.0.2.
When application developers want to store an IP address in a database, I typically see a column definition such as the following:
CREATE TABLE users (
user_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
/* More Columns */
ip_address CHAR(15) NOT NULL,
INDEX (ip_address)
);
This makes sense, since the total amount of space possibly needed by a dotted quad notation is 15 characters (3 dots and 4 integers with a max
of 3 characters per integer). However, the dotted quad notation is merely a textual representation of an unsigned integer that makes it easier for us to remember the IP address. When we store data in our schemata, though, we want to squeeze as many records into a single block of memory (or disk) as we possibly can. An INT UNSIGNED data type in MySQL needs 4 total bytes of storage, whereas the CHAR(15) needs 15 bytes of storage. If we store IPv4 addresses as unsigned integers, we can store four times as many records in an index block.
MySQL comes with two functions that translate between dotted quad and the internal unsigned integer representation of an IP address. The two
functions are INET_ATON() and INET_NTOA(). The former takes the dotted quad notation and converts it into an unsigned integer. The latter does the reverse.
Using these two functions, you can both store and retrieve IPv4 addresses easily. To store, do the following:
INSERT
INTO users (user_id, ip_address)
VALUES (NULL, INET_ATON('192.168.0.2'));
Now the IP address is stored as an unsigned integer. SELECTing from the table shows us this:
mysql> SELECT *
-> FROM users;
+---------+------------+
| user_id | ip_address |
+---------+------------+
| 1 | 3232235522 |
+---------+------------+
1 row in set (0.00 sec)
3232235522 isn't exactly friendly to the eyes. To convert, we use the INET_NTOA() function:
mysql> SELECT user_id, INET_NTOA(ip_address) as ip_address
-> FROM users;
+---------+-------------+
| user_id | ip_address |
+---------+-------------+
| 1 | 192.168.0.2 |
+---------+-------------+
1 row in set (0.00 sec)
To retrieve a range of users that have IP addresses in, my local network for instance, I can use:
mysql> SELECT *
-> FROM users
-> WHERE ip_address
-> BETWEEN INET_ATON('192.168.0.1') AND INET_ATON('192.168.0.255');
+---------+------------+
| user_id | ip_address |
+---------+------------+
| 1 | 3232235522 |
+---------+------------+
1 row in set (0.00 sec)
Tobias Asplundh and I did some benchmarks of storing IPv4 addresses as unsigned integers instead of CHAR(15) for last year's MySQL conference and found about an 8% performance difference searching small to medium-sized ranges on just one million records. So, it's worth the
small effort to streamline your schemata and use the appropriate data types for IPv4 addresses.
The Worst-Named MySQL Status Variable Ever
Anyone who's ever seen me speak at conferences on performance tuning MySQL knows that I am fairly glib about my pet peeves with MySQL (and
other things!). My biggest pet peeve used to be the old configuration variable log_long_format, which actually meant log any
query not using indexes to the slow query log. Luckily, this configuration variable was renamed in MySQL 5.0.12 to log_queries_not_using_indexes. (Imagine that.)
Luckily for log_long_format, a new variable has taken its place on my pet peeves list: the status variable Qcache_free_blocks.
Before I get into why this little monster of a status variable is so, well, monstrous, here's a little background on the MySQL query cache.
Introduced way back in MySQL 4.0.1, the query cache stores a hash of the SELECT query issued against the server and the actual result set of data returned by the SELECT query. Barring any modifications to the underlying tables, a subsequent request for the exact same SELECT statement will not need to be optimized, analyzed, or even hit the storage engine layer. Instead, the query cache will simply return the pre-packaged result set directly to the requesting client. The way the query cache is structured is essentially a linked list of these result sets, stored in blocks. When an underlying table is modified, blocks containing queries that reference the modified table are marked as dirty, to be flushed out of the query cache at a later time.
For read-heavy applications, the query cache can be a significant performance boost. More on mixed and write-heavy applications later.
Now back to the little monster status variable.
The status variables which begin with Qcache represent counters that the MySQL query cache keeps in regard to the health and hit ratios of stored MYSQL_RESULTs. So, looking at the following output, what would you expect the Qcache_free_blocks variable to mean?
mysql> SHOW STATUS LIKE 'Qcache%';
+-------------------------+----------+
| Variable_name | Value |
+-------------------------+----------+
| Qcache_free_blocks | 22087 |
| Qcache_free_memory | 64887904 |
| Qcache_hits | 23945162 |
| Qcache_inserts | 8200434 |
| Qcache_lowmem_prunes | 658819 |
| Qcache_not_cached | 8052109 |
| Qcache_queries_in_cache | 34818 |
| Qcache_total_blocks | 91967 |
+-------------------------+----------+
8 rows in set (0.01 sec)
If I were a betting man — and those of you whom I have played poker with know that I am — I would bet money that the Qcache_free_blocks meant the number of free blocks in the query cache available to store stuff in. Right?
Wrong. The Qcache_free_memory actually does mean what it looks like; it is the amount of bytes available to store results within the query cache. However, Qcache_free_blocks means the number of blocks within the query cache that are fragmented and need to be cleaned up. Read that again to make sure you understand it. Even the MySQL
documentation doesn't understand its purpose.
What exactly do the above statistics actually tell you? (By the way, the above is from one of the MySQL.com domain's primary DB servers.)
It tells you:
There are a total of 91,967 blocks in the query cache, of which 22,087 are marked as containing results that are no longer valid. In
this case, more than 25% of the blocks contain invalid result sets. (That's bad.)
How do you relieve the query cache of this fragmentation? You could issue a FLUSH QUERY CACHE and wait a bit. FLUSH TABLES would also do it. The point is to make you aware that the tricky Qcache_free_blocks variable makes it seem as if all is good with the query cache, when in fact, it isn't!
The Qcache_lowmem_prunes counter variable is 685,819. This means that the query cache has had to prune dirty, old,
or invalid blocks more than 680 thousand times. Take a look at how long this server has been up and running:
mysql> SHOW STATUS LIKE 'Upt%';
+---------------+---------+
| Variable_name | Value |
+---------------+---------+
| Uptime | 6449178 |
+---------------+---------+
1 row in set (0.00 sec)
So, the server has been up and running for 6,449,178 seconds, or about 75 days. If you divide the seconds by the lowmem prunes, you see that the
query cache is having prune itself about once every 9 seconds. (That's bad.)
Finally, find out the hit ratio for the MySQL query cache on this server. Only be concerned with the SELECT queries issued against the server, since those are the only ones that can be stored in the query cache. How do you get the number of SELECTs issued?
mysql> SHOW GLOBAL STATUS LIKE 'Questions%';
+---------------+-----------+
| Variable_name | Value |
+---------------+-----------+
| Questions | 271117447 |
+---------------+-----------+
1 row in set (0.00 sec)
There have been a total of 271,117,447 queries on this server since going online. Compare this number with the Qcache_hits status variable of 23,945,162. So, approximately 1 in 10 queries against the database is handled directly by the query cache. Is this good? Not particularly, but there is a catch. The server I have been showing is a master server, so it's handling the write load of the MySQL.com domain. The slave servers handle much of the read load.
In this scenario, it may be best to simply switch off the query cache on this master server due to the heavy fragmentation and lowmem pruning.
By contrast, the MySQL Forge 2.0 server, which is read-heavy, shows the following information:
mysql> SHOW GLOBAL STATUS LIKE 'Qcache%';
+-------------------------+-------+
| Variable_name | Value |
+-------------------------+-------+
| Qcache_free_blocks | 45 |
| Qcache_free_memory | 34744 |
| Qcache_hits | 79095 |
| Qcache_inserts | 43276 |
| Qcache_lowmem_prunes | 7396 |
| Qcache_not_cached | 20955 |
| Qcache_queries_in_cache | 6539 |
| Qcache_total_blocks | 13517 |
+-------------------------+-------+
8 rows in set (0.00 sec)
mysql> SHOW GLOBAL STATUS LIKE 'Questions%';
+---------------+--------+
| Variable_name | Value |
+---------------+--------+
| Questions | 201050 |
+---------------+--------+
1 row in set (0.00 sec)
Here you see about 1 in every 3 queries against the database being served directly from the query cache.
The bottom line is this: investigate the query cache status variables to ensure:
Your cache is not badly fragmented.
The query cache hit ratio is decent. (If it isn't, you might consider turning off the query cache entirely.)
Happy holidays, y'all!
Today's entry is provided by Derick Rethans. Today also happens to be Derick's birthday, so I hope you'll join me in wishing him a very happy birthday. (Because I'm a little late posting this, and Derick lives in Norway, I'm afraid this is a belated birthday wish. Sorry, Derick!)

- Name
- Derick Rethans
- Blog
- derickrethans.nl
- Biography
- Derick Rethans has contributed in a number of ways to the PHP project, including the
mcrypt, date, and filter extensions; bug fixes; additions; and leading the QA team. He now works as project leader for the eZ components project for eZ systems A.S. In his spare time he likes to work on Xdebug, watch movies, travel, and practice photography.
- Location
- Skien, Telemark, Norway
This might not seem like a useful gem to most of you, but it has a coolness factor that I hope you appreciate. I am a geek and engineer, and I
like to know how things work. Because I deal with lots of PHP things, I want to know how PHP works. So, I spend lots of time figuring this out while working on Xdebug, but that doesn't always go deep enough.
So, some years ago, I starting hacking on a little tool called VLD. The original goal was to turn this into an encoder, but as I don't really care about encoding PHP files, it never made it that far.
What does it do? For each script, function, class, and method, this extension shows the internal execution units that represent your PHP code. There are a couple of things that you have to do before VLD shows you any output. First, of course, you have to install it. Follow the instructions, and add extension=vld.so to your php.ini file. If all is well, there should be a VLD section in your phpinfo() output. Because VLD outputs all of the opcodes to standard error, it's not really useful to run it through Apache; it is more suited to run from the command line.
Let's see what it does for the following script:
<?php
$a = 42;
if ($a < 50) {
for ($b = 0; $b < $a; $b++) {
echo sqrt($b), "\n";
}
} else {
echo "The value $a is too high.\n";
}
?>
After running this with the following command:
php -dvld.active=1 -dvld.verbosity=0 example1.php
You see output like this:
filename: /tmp/example1.php
function name: (null)
number of ops: 23
compiled vars: !0 = $a, !1 = $b
line # op fetch ext return operands
-------------------------------------------------------------
3 0 ASSIGN !0, 42
5 1 IS_SMALLER ~1 !0, 50
2 JMPZ ~1, ->15
6 3 ASSIGN !1, 0
4 IS_SMALLER ~3 !1, !0
5 JMPZNZ 9 ~3, ->14
6* POST_INC ~4 !1
7* FREE ~4
8* JMP ->4
7 9 SEND_VAR !1
10 DO_FCALL 1 'sqrt'
11 ECHO $5
12 ECHO '%0A'
8 13 JMP ->6
9 14 JMP ->21
10 15* INIT_STRING ~6
16* ADD_STRING ~6 ~6, 'The+value+'
17* ADD_VAR ~6 ~6, !0
18* ADD_STRING ~6 ~6, '+is+too+high.%0A'
19* PRINT ~7 ~6
20* FREE ~7
13 21* RETURN 1
22* ZEND_HANDLE_EXCEPTION
For every executable unit (script, function, method), it generates this type of output, showing the filename and function/method name, the number of opcodes, the IDs of compiled variables, and the opcodes (execution units) themselves.
By playing with the verbosity, you can control what kind of information VLD displays. A verbosity of 1 will add code path analysis to the output, showing which parts of the code can be executed (indicated by the * after the opcode #). A verbosity of 4 shows all possible information VLD can gather about the execution units. You can also instruct VLD not to execute the script you feed
to PHP. Simply add the -dvld.execute=0 statement to the command line.
Interpreting this data is non-trivial, but I am sure you can figure it out. As a hint, !0 is a compiled variable, ~1 is a temporary value and ->15 is a jump instruction. In case you have questions, feel free to send me an email.
One last warning (just in case the big red warning on the site is not enough): VLD cannot be used to decode encoding files. Please do not ask me questions about this.
Today's entry, provided by Luke Welling, is entitled Following the Big Dogs on Web Application Security.

- Name
- Luke Welling
- Blog
- lukewelling.com
- Biography
- Luke Welling is from Melbourne, Australia, but currently lives near Washington, DC, where he ekes out a living as a security nerd at OmniTI. He sees lots of good PHP and bad PHP, and tries to write more good than bad. Over the last decade, he has applied PHP in many places where it was intended, and in many places where it was never meant to go. With his wife Laura, he wrote the bestselling book PHP and MySQL Web Development and often speaks about PHP at conferences and user groups. His hobbies include riding his horses and sticking Splayds in toasters, although he has not yet attempted to do both at once.
- Location
- Washington, District of Columbia
At this time of year, people are apt to get all warm and sentimental, right up until their first trip to a mall on a Saturday when they go back to hating their fellow man and instituting an "If Amazon don't sell it, you're not getting it" policy on gift giving. December is very important to retail, and very important to retail sites.
I remember some good advice I read a long time ago. Vincent Flanders & Michael Willis in Web Pages That Suck suggested you "follow the big dogs." In other words, copy Amazon. Their reasoning was sound. You will likely get it wrong on your first try, you can't afford to run usability studies of your own, and don't want to spend months and numerous iterations getting it right. Learning from other people's mistakes is always less embarrassing than learning from your own.
I have had to paraphrase here, because I opted to recycle nearly all my old books rather than ship them half way around the world. Had I wanted to check the accuracy of my quote, it would have cost me one cent to buy a second-hand copy of that book.
While the long-term relevance of most of the advice in old computer books is fairly accurately reflected by that valuation, it was good advice in 1998. If you were embarking on an ecommerce venture at a time when there was a shortage of people who knew what they were doing, best practice conventions were not settled, and innovation was rapid, there were worse philosophies you could have than "What Would Amazon Do?"
The same idea is popular today, and for the same reason. There is always a shortage of people who really know what they are doing, so there are plenty of people making decisions by asking "What Would Google/Amazon/Microsoft/eBay/PayPal/Flickr/Yahoo/YouTube/Digg/Facebook Do?" If you are in a space where nobody really knows the best way yet, copying the segment leader is a low risk, low talent shortcut to making mainly good decisions, even if does mean you are always three months behind.
The idea does not apply well to web application security. There are two main reasons for this: first, the big dogs make plenty of mistakes, and second, good security is invisible.
You might notice mistakes, you might read about exploited vulnerabilities, and you might notice PR-based attempts at the illusion of security, but you probably don't notice things quietly being done well.
Common big dog mistakes include:
- Inviting People to Click Links in Email Messages
- You would think that, as one of the most popular phishing targets out there, PayPal would not want to encourage people to click links in emails. Yet, if you sign up for a PayPal account, the confirmation screen requests that you do exactly that:

- Stupid Validation Rules
- We all want ways to reject bad data, but it is usually not easy to define hard and fast rules to recognize it, even for data with specific formatting. Everybody wants a simple regex to ensure email addresses are well formed. Unfortunately, to permit any email that would be valid according to RFC 2822, a simple one is not going to cut it. As a result, many people add validation that is broken and reject some real addresses. Most are not as stupid as the one AOL used to have for signing up for AIM, which insisted that all email addresses end in
.com, .net, .org, .edu, or .mil, but many will reject + and other valid non-alphanumeric characters in the local part of an address (the bit before the @).
- Stupid Censorship Systems
- Simple keyword-based censorship always annoys people. Eventually, somebody named Woodcock is going to turn up. Xbox Live is infamous for rejecting gamertags and mottos after validating them against an extensive list of "inappropriate" words. Going far beyond comedian George Carlin's notorious Seven Dirty Words, there is a list of about 2,700 words that are supposedly banned. By the time you add your regular seven, all possible misspellings thereof, most known euphemisms for body parts, racial epithets, drug-related terms, Microsoft brand names, Microsoft competitors' brand names, terms that sound official, and start heading off into foreign languages, you end up catching a lot of innocent phrases.
- Broken HTML Filtering
- Stripping all HTML from user-submitted content and safely displaying the result is often done badly, but is not that difficult. On the other hand, allowing some HTML formatting as user input, but disallowing "dangerous" parts is not an easy problem, especially if you are trying to foster an ecosystem of third party developers. The MySpace Samy worm worked not because MySpace failed to filter input, but because of a series of minor exploits that, combined, allowed arbitrary JavaScript. Once you choose to allow CSS, so that users can add what passes for style on MySpace, it becomes very hard to limit people to only visual effects. eBay has had less well-known problems with a similar cause, but without a dramatic replicating worm implementation. Earlier this year, scammers were placing large transparent
divs over their listings, so that any click on the page triggered a mailto or loaded a page of their own. I could not see examples today, so I assume they have fixed the specific vector, but giving users a great deal of freedom to format the content that they upload makes ensuring that content is safe for others to view very difficult.
- Stupidly-Long URLs
- The big dogs love long, complicated URLs.
https://chat.bankofamerica.com/hc/LPBofA2/?visitor=&mses
sionkey=&cmd=file&file=chatFrame&site=LPBofA2&channel=we
b&d=1185830684250&referrer=%28engage%29%20https%3A//site
key.bankofamerica.com/sas/signon.do%3F%26detect%3D3&sess
ionkey=H6678674785673531985-3590509392420069059K35197612
Letting people get used to that sort of garbage from sites that they should be able to trust, you can't really be surprised when normal people can't tell the difference between an XSS attack hidden in URL-encoded JavaScript and a real, valid, safe URL. Even abnormal people who can decode a few common URL encodings in their heads are not really scrolling across the hidden nine tenths of the address bar to look at all that.
- Looking for Simple Solutions
- Security is not one simple problem, or even a set of simple problems, so looking for simple solutions such as the proposed
.bank TLD is rarely helpful. This is not helped by the vendor-customer nature of much of the computer industry. The idea that you can write a check to somebody and a problem goes away is very compelling; buy a more expensive domain name, or a more expensive Extended Validation Certificate, or run an automated software scan to meet PCI compliance and you might sleep more soundly at night, but many users already don't understand the URL and other clues that their browser provides them. Giving more subtle clues to them is unlikely to help. Displaying a GIF in the corner of your web page bragging about your safety might create the illusion of security and might well help sales, but it won't actually help safety on its own.
You can't follow the public example of the big dogs. They still make some dumb decisions, they still make the small mistakes that allow the CSRF and XSS exploits that are endemic, and they are often not very responsive to disclosures. If a major site makes 99 good security decisions and one bad one, you won't notice the 99. Unfortunately, with security, you are still far better off seeing how others have been exploited and critically evaluating what they say they should be doing, rather than trying to watch what they actually are doing.
Oh, and remember to stay away from malls on weekends in December.
Today's entry, provided by Adam Trachtenberg, is entitled User-Defined Functions in SQLite.

- Name
- Adam Trachtenberg
- Blog
- trachtenberg.com
- Biography
- Adam Trachtenberg is the Senior Manager of Platform Evangelism and Disruptive Innovation at eBay, where he preaches the gospel of the eBay platform to developers and businessmen around the globe. He's the author of Upgrading to PHP 5 and coauthor of PHP Cookbook, both published by O'Reilly Media. Adam lives in San Francisco, California, which he wishes was closer to the office.
- Location
- San Francisco, California
SQLite is a database that's bundled with PHP 5. Unlike most other databases, SQLite is not a separate application; it's an extension that reads from and writes to regular files.
Although the name SQLite hints at a less than full-featured product, besides the usual INSERTs and SELECTs, SQLite also boasts transactions, subselects, and triggers.
My favorite SQLite feature is the one that allows you to write your own SQL functions. Because, in addition to all the built-in SQL functions, such as lower() and upper(), you can extend SQLite to include your own functions that you write in PHP.
These are known as user-defined functions, or UDFs for short. With a UDF, you embed logic into SQLite and avoid doing it yourself in PHP. Thus, you can take advantage of all of the features inherent in a database, such as sorting and finding distinct entries.
UDFs are good for chopping up strings so you can perform nonstandard collations and groupings. For example, you want to sort through a list
of URLs, maybe from a referrer log file, and create a list of unique hostnames sorted alphabetically. So, http://example.com/directory/index.html and http://example.com/page.html would both map to one entry: http://example.com/.
To do this in PHP, you need to retrieve all the URLs, process them inside your script, and then sort them. Plus, somewhere in all that, you need to remove the duplicates. If it weren't for that pesky URL-conversion process, this could all be done in SQL using DISTINCT and ORDER BY.
With a UDF like the following, you foist all that hard work back onto SQLite where it belongs:
<?php
$db = sqlite_open('/www/support/log.db');
$sql = 'CREATE TABLE access_log(url);';
$urls = array('http://example.com/directory/index.html',
'http://example.com/page.html');
foreach ($urls as $url) {
$sql .= "INSERT
INTO access_log
VALUES ('$url');";
}
sqlite_query($db, $sql);
function url2host($url) {
$parts = parse_url($url);
return "{$parts['scheme']}://{$parts['host']}/";
}
sqlite_create_function($db, 'host', 'url2host', 1);
$r = sqlite_query($db, 'SELECT
DISTINCT host(lower(url))
AS clean_host
FROM access_log
ORDER BY clean_host;');
while ($row = sqlite_fetch_array($r)) {
echo "{$row['clean_host']}\n";
}
?>
As expected, this outputs the following:
http://example.com/
To use a UDF, you first write a regular function in PHP. The function's arguments are what you want to pass in during the SELECT,
and the function should return a single value. The url2host() function takes a URL, calls the built-in PHP function parse_url() to break the URL into its component parts, and returns a string containing the scheme (http) and the host. So, http://example.com/directory/index.html gets broken apart into many pieces. http is stored in $parts['scheme'], and example.com goes in $parts['host']. This creates a return value of http://example.com/.
The next step is to register url2host() with SQLite using sqlite_create_function(). This function takes four arguments:
The Database Handle
The SQLite Function Name of Your Choice
The Function's Name in PHP
The Number of Expected Arguments
The last argument is optional, but if you know for certain that your function only accepts a specific number of arguments, providing this information helps SQLite optimize things behind the scenes. In this example, the SQL function is host(), while the PHP function is url2host(). These names can be the same; they're different here to make the distinction between them clear.
Now you can use host() inside any SQL query that use the same database connection. The SQL above SELECTs host(lower(url)) AS clean_host. This takes the URL stored in the url column, converts it to lowercase, and calls the UDF host().
The function is not permanently registered with the database, and it goes away when you close the database. If you want to use it when you
reopen the database, you must reregister it. Also, the function is only registered for that database; if you open up a new database using
sqlite_connect(), you need to call sqlite_create_function() again.
The returned string is then named AS clean_host; this lets you refer to the results later on in the SQL query as well as access the value in PHP using that name. Since you're still in SQLite, you can take advantage of this to sort the list using ORDER BY clean_host. This sorts the results in alphabetical order.
There are plenty of other things you can do with UDFs, but hopefully this is enough to pique your interest and get you started.
Today's entry is provided by Marcus Börger.

- Name
- Marcus Börger
- Blog
- marcus-boerger.de
- Biography
- Marcus Börger is a specialist in C, C++, databases, UML, XML, and of course PHP. To the PHP community, he is also known as helly. As a core developer, he contributes a lot to PHP and focuses on the new OO features of PHP 5 and Zend Engine 2. Marcus has been hacking around on all sorts of things for over 15 years, and being an avid snowboarder, he happily accepted an offer from Google to work in their Zürich office. He can often be found snatching gummy bears from his favorite Italian pizzeria and taking photographs.
- Location
- Zürich, Switzerland
Before I joined Google, a Swiss engineer had an idea to create an internal service at Google that generates graphs from simple URLs. Shortly thereafter, the service became quite popular internally. During this time, use of the service was restricted; a key was required. This past summer, we decided to open the service to the general public, hoping a few people would like the idea as much as we did. There is now an easy, keyless service that generates various types of charts such as line charts, bar charts, pie charts, and more.
Consider the following example chart:

This chart is generated from the following URL:
http://chart.apis.google.com/chart?...mp;chd=s:foobar
This URL consists of the base URL (http://chart.apis.google.com/chart?) followed by the query string:
The chart type (cht=p3) is a 3D pie chart.
The chart size (chs=200x100) is 200 pixels wide and 100 pixels high.
The chart title (chtt=Hello+World) is Hello World.
The chart data (chd=s:foobar) is the simple string foobar. Choices for encoding are simple (s), extended (e), and text (t). The encoding is separated from the data by a colon.
Incoming data must be scaled to the range of the encoding, so you need to provide a way to do this. When using simple encoding, the API accepts 62 values:
Uppercase A (0) through Z (25)
Lowercase a (26) through z (51)
Digits 0 (52) through 9 (61)
You can indicate a missing value with an underscore (_), and you can separate data sets with a comma (,).
The following function (simple_encoding()) takes care of the encoding for you:
<?php
function simple_encoding(array $data, $min = 0, $max = 61) {
$codes = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' .
'abcdefghijklmnopqrstuvwxyz' .
'0123456789';
$min = (float)$min;
$max = (float)$max;
$diff = ($max - $min) / 61;
$result = '';
foreach ($data as $value) {
if (is_null($data)) {
$result .= '_';
} else {
$value = (int)(((float)$value - $min) / $diff);
if ($value < 0 || $value > 61) {
$result .= '_';
} else {
$result .= $codes[$value];
}
}
}
return $result;
}
var_dump(simple_encoding(array(-10, 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100), 20));
var_dump(simple_encoding(array(-10, 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100), 20, 100));
?>
This produces the following:
string(12) "___AOds7____"
string(12) "___AHPWemt19"
More about the API can be found at chart.apis.google.com.
Today's entry, provided by Christian Wenz, is entitled WSDL Despite PHP 5.

- Name
- Christian Wenz
- Blog
- hauser-wenz.de/blog/
- Biography
- Christian Wenz got hooked on PHP when he introduced it to one of the largest web sites back in early '99, and he hasn't looked back since.
- Location
- Munich, Germany
We are working a lot with web services these days. One of the things I like best, from a technical perspective, is the WSDL standard. WSDL is used to provide information about a SOAP web service: operation, data types, messages, location, and the like. Using WSDL, consuming a web service is very easy. Most technologies support an implementation of the proxy pattern and can automatically create a local proxy object from a WSDL description. Here's a quick example that shows how PHP 5's very own SOAP extension works with WSDL:
<?php
$c = new SoapClient('/path/to/wsdl');
$result = $c->myWebServiceMethod($arg1, $arg2);
?>
Calling myWebServiceMethod() on the local client object instructs the SOAP extension to call the remote method of the same name and parse the SOAP data coming back from the service. Unfortunately, the SOAP extension does not support automatic WSDL creation for a SOAP service, whereas most competing technologies do. One of the beauties of PHP is that the language is not strongly typed, which makes WSDL generation more difficult. However, there are several ways to generate WSDL with PHP 5, with a little help.
Userland Code
One of the oldest web services libraries for PHP is NuSOAP. The original site has not been updated in over three years, but NuSOAP remains in active development on SourceForge. When creating a NuSOAP service, call the configureWSDL() method, and provide the data types for input arguments and the return value when calling register(). Here's a short example (sans the actual business logic of the service, which resides in myWebServiceMethod()):
<?php
function myWebServiceFunction($inputValue) {
}
require 'nusoap.php';
$soap = new soap_server();
$soap->configureWSDL('NuSOAPService', 'http://hauser-wenz.de/');
$server->wsdl->schemaTargetNamespace = 'http://soapinterop.org/xsd/';