About the Author

Chris Shiflett

Chris Shiflett is an author and speaker who leads the web application security practice at OmniTI.


Formatting and Highlighting PHP Code Listings

For the impatient, here's a direct link to the example that highlights itself:

http://shiflett.org/code/highlight.php

As I mentioned in the previous post, shiflett.org is being redesigned and redeveloped from the ground up. (Nope, it's not finished yet; you'll know it when you see it.) One of the things I want to improve is commenting. This blog has been getting a lot of comments, and I really appreciate that. (Thanks!) Since the topics I talk about (PHP, MySQL, etc.) are technical, I want to let you add formatted code listings to your comments.

I've been playing with this tonight. Feel free to follow along as I go. The first thing you want to do is create an ordered list from the code you want to format ($code in these examples). This provides line numbers, among other things:

<?php
 
/* HTML Output */
$html = array();
 
/* Normalize Newlines */
$code = str_replace(array("\r\n", "\r"), "\n", $code);
$code = preg_replace("!\n\n\n+!", "\n\n", $code);
 
$lines = explode("\n", $code);
 
/* Output Listing */
echo "<ol class=\"code\">\n";
foreach ($lines as $line) {
    if ($line === '') {
        $html['line'] = '&#160;';
    } else {
        $html['line'] = htmlentities($line, ENT_QUOTES, 'UTF-8');
    }
 
    echo "  <li><code>{$html['line']}</code></li>\n";
}
echo "</ol>\n";
 
?>

In order to make <code> tags preserve whitespace, you can add this to your CSS:

code {
    white-space: pre;
}

Pretty easy, right? Now that you have a good foundation, you can start to improve it. First, add class="even" to every other list item:

<?php
 
foreach ($lines as $key => $line) {
    if ($line === '') {
        $html['line'] = '&#160;';
    } else {
        $html['line'] = htmlentities($line, ENT_QUOTES, 'UTF-8');
    }
 
    if ($key % 2) {
        echo "    <li class=\"even\"><code>{$html['line']}</code></li>\n";
    } else {
        echo "    <li><code>{$html['line']}</code></li>\n";
    }
}
 
?>

This lets you add a subtle background color to the even rows, making the code easier to read:

ol.code li.even {
    background:#f3f3f0;
}
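
If you'd rather build the listing as a string than echo it piece by piece, the steps so far can be wrapped in a function. This is only a sketch; the code_listing() name is my own assumption for illustration and isn't used anywhere else in this post:

```php
<?php

/**
 * Hypothetical helper (the name is an assumption, not part of the post):
 * returns the ordered list as a string instead of echoing it.
 */
function code_listing($code)
{
    // Normalize newlines: CRLF and lone CR both become LF.
    $code = str_replace(array("\r\n", "\r"), "\n", $code);
    // Collapse runs of three or more newlines down to two.
    $code = preg_replace("!\n\n\n+!", "\n\n", $code);

    $out = "<ol class=\"code\">\n";
    foreach (explode("\n", $code) as $key => $line) {
        // Blank lines get a non-breaking space so the list item keeps its height.
        $text = ($line === '')
            ? '&#160;'
            : htmlentities($line, ENT_QUOTES, 'UTF-8');
        // Keys start at 0, so odd keys are the even-numbered lines.
        $class = ($key % 2) ? ' class="even"' : '';
        $out .= "  <li{$class}><code>{$text}</code></li>\n";
    }

    return $out . "</ol>\n";
}
```

Calling echo code_listing($code); should then produce the same markup as the loop above.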

The next step is to add syntax highlighting. This is a bit more involved, but only if you're picky. (I am.) You can use token_get_all() and loop through the tokens yourself, or you can use highlight_string() and try to clean up its output. I have chosen the latter.
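
For comparison, here's a rough sketch of what the token_get_all() route might look like. It isn't the approach used in the rest of this post, and both the highlight_tokens() name and the coarse token-to-class mapping are assumptions for illustration:

```php
<?php

/**
 * Hypothetical sketch of the token_get_all() route. The function name and
 * the coarse token-to-class mapping are assumptions for illustration.
 */
function highlight_tokens($code)
{
    // Only a handful of token types are mapped; everything else is "default".
    $classes = array(
        T_COMMENT                  => 'comment',
        T_DOC_COMMENT              => 'comment',
        T_CONSTANT_ENCAPSED_STRING => 'string',
        T_INLINE_HTML              => 'html',
        T_ECHO                     => 'keyword',
        T_IF                       => 'keyword',
        T_ELSE                     => 'keyword',
        T_FUNCTION                 => 'keyword',
        T_RETURN                   => 'keyword',
    );

    $html = '';
    foreach (token_get_all($code) as $token) {
        if (is_array($token)) {
            // Array tokens are (id, text, line number).
            list($id, $text) = $token;
            $class = isset($classes[$id]) ? $classes[$id] : 'default';
        } else {
            // Punctuation such as ";" or "(" arrives as a plain string.
            $text = $token;
            $class = 'keyword';
        }

        $text = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

        // Leave pure whitespace unwrapped to keep the markup small.
        $html .= (trim($text) === '')
            ? $text
            : "<span class=\"{$class}\">{$text}</span>";
    }

    return $html;
}
```

The upside is that every span opens and closes around a single token. The downside is that you own the whole mapping yourself, which is a big part of why highlight_string() is tempting.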

You avoid some of the cleanup by using this idea I got from Wez:

<?php
 
ini_set('highlight.comment', 'comment');
ini_set('highlight.default', 'default');
ini_set('highlight.keyword', 'keyword');
ini_set('highlight.string', 'string');
ini_set('highlight.html', 'html');
 
$code = highlight_string($code, TRUE);
 
?>

This gets rid of colors and uses meaningful names instead, but it leaves behind plenty of ugliness. If you're like me, the first thing you want to do is get rid of the extra crap that highlight_string() adds to the front and end of the string:

<?php
 
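/* 33 = length of '<code><span style="color: html">' plus a newline;
   -15 drops the trailing '</span>', a newline, and '</code>'. */
$code = substr($code, 33, -15);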
 
?>

If you're using PHP 4, this is going to be different. You can do something more clever to accommodate both. I didn't.
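
If you do want something that copes with both, one option is to match the wrapper with a regular expression instead of counting characters. This is a sketch rather than what I'm using; the strip_highlight_wrapper() name is an assumption, and it only knows about the <span> and <font> wrappers mentioned in this post, so other PHP versions may emit markup it doesn't recognize:

```php
<?php

/**
 * Hypothetical helper: strips the outer wrapper from highlight_string()
 * output without hard-coded offsets. It assumes a <code><span ...> or
 * <code><font ...> wrapper; anything else is returned unchanged.
 */
function strip_highlight_wrapper($html)
{
    // Leading "<code>", the outermost <span ...> or <font ...> tag, and a newline.
    $html = preg_replace('!^<code><(?:span|font)[^>]*>\n?!', '', $html);
    // Trailing "</span>" or "</font>", optional whitespace, and "</code>".
    $html = preg_replace('!</(?:span|font)>\s*</code>\s*$!', '', $html);

    return $html;
}
```

On output shaped like the example above, this should leave the same string that substr($code, 33, -15) does.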

A simple replacement can turn inline styles into classes:

<?php
 
$code = str_replace('<span style="color: ', '<span class="', $code);
 
?>

If you're using PHP 4, you're going to need to do this for <font> tags instead, but it's the same basic idea.

Might as well turn &nbsp; back into a space, &amp; into &#38;, and <br /> back into a newline while you're at it:

<?php
 
$code = str_replace('&nbsp;', ' ', $code);
$code = str_replace('&amp;', '&#38;', $code);
$code = str_replace('<br />', "\n", $code);
 
?>

Now you can put the pieces together, but there's one more obstacle to overcome. The highlight_string() function closes a <span> tag just before opening the next one, sometimes several lines later. This can yield output that looks like this:

<li><code><span class="comment">...</code></li>
<li><code>...</code></li>
<li><code>...</span></code></li>

You want it to look more balanced, like this:

<li><code><span class="comment">...</span></code></li>
<li><code><span class="comment">...</span></code></li>
<li><code><span class="comment">...</span></code></li>

Feel free to solve this one on your own. (Solving this almost made me wish I had used token_get_all() instead of highlight_string().) If you're interested in seeing my solution, I've got an example that highlights itself, complete with a document type, styles, and everything else needed to make it validate as XHTML 1.0 Strict. (View source if you want to really appreciate the XHTML goodness.)
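
If you'd like a starting point, here's one possible approach. It is not necessarily the one used in the linked example, and the balance_spans() name is an assumption: keep a stack of the <span> tags that are still open, reopen them at the start of each line, and close them again at the end.

```php
<?php

/**
 * Hypothetical sketch (not the solution from the linked example): split
 * cleaned-up highlight markup into lines whose <span> tags are balanced.
 */
function balance_spans($code)
{
    $open = array();   // stack of <span ...> tags that are still open
    $lines = array();

    foreach (explode("\n", $code) as $line) {
        // Reopen whatever was still open at the end of the previous line.
        $balanced = implode('', $open) . $line;

        // Walk the tags on this line, in order, to update the stack.
        preg_match_all('!<span[^>]*>|</span>!', $line, $matches);
        foreach ($matches[0] as $tag) {
            if ($tag === '</span>') {
                array_pop($open);
            } else {
                $open[] = $tag;
            }
        }

        // Close everything that is still open at the end of this line.
        $lines[] = $balanced . str_repeat('</span>', count($open));
    }

    return $lines;
}
```

Each element of the returned array can then go into its own list item. A blank line inside a multi-line comment comes back as an empty pair of tags, so it still needs the &#160; treatment from earlier.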

Thanks to Jon Tan for the styles and colors. He's the accessibility, usability, standards, and design expert who's helping with the new site.

I'll probably be making some minor improvements to this code before using it in production on the new site. If you notice any bugs or can think of any improvements, please leave a comment. Thanks!

About This Post

Formatting and Highlighting PHP Code Listings was posted on Thu, 26 Oct 2006 at 22:39:28 GMT.

9 Comments

1. David said:

This is something I have been working on for a while so I am glad to find a site like yours talking about it.

Now, after downloading your example, I see that there is a problem with your code (though it might not be one to some people).

Your highlighter wraps everything in code, not just PHP/XHTML code.

The same way php.net handles it in the manual.

So how would you stop the highlighter from going beyond "< ?php" or "< code >" tags?

Otherwise you might as well clean the data then just run

<?php
 
function highlight_php($code) {
    $code = '<div class="php">'. highlight_string($code, true). '</div>';
    return $code;
}
 
?>

Also, it seems that even doing this will allow attacks, so why not just go with BBCode, since something like http://htmlpurifier.org/ is over 350 KB? And yet, other than removing all HTML code, is it the only thing that works?

Fri, 25 May 2007 at 18:39:36 GMT


2. Chris Shiflett said:

Hi David,

I see that there is a problem with your code (though it might not be one to some people). Your highlighter wraps everything in code, not just PHP/XHTML code.

As you'll note with your own comment, not everything is enclosed in <code> tags. Only the code is.

So how would you stop the highlighter from going beyond <?php or <code> tags?

More information is available at the following URL:

http://shiflett.org/blog/2007/mar/a...-preventing-xss

Regarding your last statement, BBCode does nothing to improve security, and Paul's comment is about strip_tags().

Hope that helps.

Fri, 25 May 2007 at 18:57:16 GMT


3. David said:

As you'll note with your own comment, not everything is enclosed in <code> tags. Only the code is.

Hmm... well, the code in your example must be missing something, because while the code that processes your comments can tell code from text, the highlighter can't (or at least I am missing something). I have tried a couple of things and I can't get it to stop highlighting after (or before) the code.

<blockquote><p>More information is available at the following URL:</p></blockquote>

I read the whole page and all the comments, but I don't see anywhere that it talks about stopping the highlighter from highlighting everything.

Regarding your last statement, BBCode does nothing to improve security, and Paul's comment is about strip_tags().

But since browsers ignore BBCode, it seems like the most secure way to process input is to strip ALL HTML code from the text (thereby avoiding what preinheimer was talking about) and then use something like phpBB's BBCode processor to accomplish the same thing that the 350 KB htmlpurifier does.

Of course, this is assuming that all of the highlighting code examples I have seen on this site are incomplete (at least for a newbie like me) and we are still looking for a 99% secure way to process code, right? Or am I missing something?

Also, what about adding to your code using some kind of preg_replace() to clean out "style="color:#000;"", or "onClick="dothis()"", or "a href="javascript:alert('XSS')"" - would that work?

Fri, 25 May 2007 at 19:39:54 GMT


4. Chris Shiflett said:

Hi David,

Interestingly enough, I think you managed to reveal a bug in my code with your second use of the <blockquote> tag. I'll have to look into that.

Regarding how to distinguish code, I employ a style guide. Just as you did in your first comment, people who wish to include code in their comments use the <code> tag to do so. I just read my other post again, and I see that I don't explain this very well. Assuming you have the code highlighting method defined in a class, you can use a regular expression for the replacement:

<?php
 
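/* The /e modifier was removed in PHP 7; a callback does the same job. */
$html = preg_replace_callback('!^&lt;code&gt;((.|\n)*)&lt;\/code&gt;$!mU',
                              function ($matches) {
                                  return $this->code($matches[1], TRUE);
                              },
                              $html);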
 
?>

Hope that's a bit clearer.

But since browsers ignore BBCode, it seems like the most secure way to process input is to strip ALL HTML code from the text

If someone takes the time to comment on my blog, I think it would be rude for me to remove part of their comment, just because I'm too lazy to do the right thing. In the other post I keep referring to, I demonstrate a technique that's better than this in the first example. (It weighs in at just over 300 bytes, including comments.) The proper thing to do is escape the content for the appropriate context (and the appropriate character set).

Also, what about adding to your code using some kind of preg_replace() to clean out "style="color:#000;"", or "onClick="dothis()"", or "a href="javascript:alert('XSS')"" - would that work?

Depends on what you mean by work. If you want to keep people from talking about these things, then sure, it would work. If I did this, you wouldn't have been able to ask this question.

That's not what I want. :-)

Fri, 25 May 2007 at 21:30:53 GMT


5. David said:

Assuming you have the code highlighting method defined in a class, you can use a regular expression for the replacement:

That is what I was looking for :D

Your highlighter runs everything through it, but by placing it in a function I can use a regex to limit it to the text between the opening and closing code tags.

I also found another way to do it. Split the text up into different array elements and only run the "code" elements through the highlighter.

The rest of the code gets the old "htmlspecialchars" treatment and then, like you show in your other post (url not needed at this point), you can use more regex to only allow certain codes like "em".

Great, now I have something to play with ;)

Only the "preg_replace" you showed me didn't work, but that's fine; I just wanted the logic.

Thanks!

Fri, 25 May 2007 at 21:52:39 GMT


6. John Schulz said:

Hey Chris,

How have you been? ;)

What do you think of adding class="php" to your code tags?

You could do:

<code class="php">

<span class="keyword">if</span>

</code>

or:

<span class="php keyword">if</span>

The CSS would be:

.php .keyword, /* class php with descendant keyword */

.php.keyword { /* both php AND keyword in same tag */

/* your style */

}

I used the typical code and span markup (to make sure the example got through your Friggin' Sharks With Friggin' Laser Beams form processing) but the CSS doesn't care about the tags used, only the classes applied in them.

Then when you want to post examples of crappy Ruby || Perl || JavaScript you can use different styles for them.

Perhaps you already do this but not on the comments, I'm too lazy to look.

Later,

John

Sun, 17 Jun 2007 at 23:36:41 GMT


7. Tim Wood said:

First, I want to say thanks for a great code highlighting solution.

Also, for those of us using it for PHP files, I wanted to add a function to the processing that links function names to the manual on php.net:

function function_link($test_string) {
    $linked_string = '';
    //$manual = 'http://www.php.net/function.';
    $manual = 'http://www.php.net/';
    $linked_string = preg_replace(
        // Match a highlighted keyword
        '~([\w_]+)(\s*</span>)'.
        // Followed by a bracket
        '(\s*<span\s+class="' . $this->previous . '">\s*\()~m',
        // Replace with a link to the manual
        '<a href="' . $manual . '$1" target="_blank">$1</a>$2$3', $test_string);
    return $linked_string;
  }

Didn't know if that would be of interest to anyone

Wed, 29 Aug 2007 at 11:18:43 GMT


8. karixma said:

How can I display code without line numbers?

Fri, 30 Nov 2007 at 11:18:26 GMT


9. dyron said:

By not using <ol>, or with CSS like:

ol { list-style: none; }

Tue, 09 Jun 2009 at 10:20:44 GMT


Style Guide

Line breaks are converted to paragraphs. Also use:

  • <a href="" title="">text</a>
  • <em>text</em>
  • <blockquote><p>text</p></blockquote>
  • <code>
    <?php
    if ($foo) {
        $foo = TRUE;
    }
    ?>
    </code>

Note: <code> can be used inline (e.g. in paragraphs) or in a block as shown. Include whitespace and newlines in blocks.
