Jens Roland's Profile

Last 10 Comments

I spent my afternoon Googling for solutions to this, and most of them are either too heavyweight (HTML Tidy, HTML Purifier) or ridiculously lacking. I found one regex-powered snippet that did the job quite decently, and after some heavy nitpicking and rewriting, it works like a charm.

It is *not* foolproof, but it does some pretty advanced HTML parsing & indenting, as well as rudimentary (and far from complete) JavaScript indenting.

<blockquote><p>Optional settings:

* $indent variable to define what type of tabbing to use (tabs/spaces/etc.)

* $no_indent array of tags that you don't want to indent

* $no_linebreak array of tags that you don't want to linebreak, i.e. 'inline' tags

</p></blockquote>
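In the function below, these three settings are plain variables set at the top of the function body. If you would rather pass them in from outside, one possible approach is to merge caller overrides with the same defaults. This is only a sketch: the helper name clean_html_options() and the array keys are assumptions of mine, not part of the snippet.

```php
<?php
// Hypothetical helper, not part of the original snippet: merges caller
// overrides with the defaults that clean_html_code() sets internally.
function clean_html_options(array $overrides = array())
{
    $defaults = array(
        // What to use for one level of indentation (tab, spaces, etc.)
        'indent'       => "\t",
        // Tags that should not indent
        'no_indent'    => array('html', 'head', 'body', 'script'),
        // Tags that should not linebreak, i.e. 'inline' tags
        'no_linebreak' => array('a', 'b', 'em', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'i', 'span', 'strong', 'title'),
    );
    // With string keys, array_merge() lets the later (caller) values win
    return array_merge($defaults, $overrides);
}

// Example: two-space indentation, everything else left at the defaults
$options = clean_html_options(array('indent' => '  '));
```

The function itself would still need to be changed to read from such an array; the sketch only shows the defaults-plus-overrides part.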

Also, it automatically handles self-closing XHTML tags, stand-alone HTML tags, trailing whitespace and a bunch of obscure special cases.
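To make "handles" a bit more concrete: lines of these kinds are printed at the current indent level without changing it. The sketch below reuses the patterns from the function further down to show which kind a given line counts as. The helper name describe_line() and its return labels are made up for illustration, and closing tags and JavaScript brackets are left out.

```php
<?php
// Hypothetical helper, for illustration only: reports why a line would keep
// the current indent level, using the same patterns as clean_html_code().
function describe_line($line)
{
    if (preg_match("/<(.+)\/>/", $line)) {
        return 'self-closing';   // e.g. <br />
    }
    if (preg_match("/<!(.*)>/", $line)) {
        return 'declaration';    // e.g. <!DOCTYPE html>
    }
    if (preg_match("/<[^\/](.*)>/", $line) && preg_match("/<\/(.*)>/", $line)) {
        return 'paired';         // opening AND closing tag on the same line
    }
    if (preg_match("/<(link|meta|base|br|img|hr)(.*)>/", $line)) {
        return 'stand-alone';    // HTML tags that never get a closing tag
    }
    return 'other';              // everything else is left to the indenting logic
}

foreach (array('<br />', '<!DOCTYPE html>', '<p>text</p>', '<img src="x.png">', '<div>') as $line) {
    echo $line . ' => ' . describe_line($line) . "\n";
}
```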

Anyway, here it is, take it for a spin if you like ;)

<?php
 
/**
 * Indents and removes blank lines in HTML code
 * Created by Jens Roland, 2009
 * (adapted from a snippet by JonHoo @ http://snippets.dzone.com/)
 * 
 * The code is not 100% foolproof, but close enough, and surprisingly fast
 * 
 * Known gotchas:
 * - Two closing JavaScript brackets on the same line will only count as
 *   one if there are any non-whitespace characters between them
 * - If a JavaScript line containing an opening bracket happens
 *   to have a bracketed expression later in the same line, the next line
 *   will not be indented
 * - If a line contains two opening tags with non-whitespace content
 *   between them, and their corresponding closing tags are on separate
 *   lines below, or don't have non-whitespace content between them,
 *   the two opening tags will only increase the indent by one, but the
 *   two closing tags will decrease the indent by two. The inverse can
 *   also happen
 * 
 * A lot more could be done to make this even better, but I'd rather not
 * sacrifice more performance for a detail only source-snoopers will see.
 * 
 */
function clean_html_code($uncleanhtml)
{
    // Set wanted indentation
    $indent = "\t";
    // Set tags that should not indent
    $no_indent = array ('html', 'head', 'body', 'script');
    // Set tags that should not linebreak
    $no_linebreak = array ('a', 'b', 'em', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'i', 'span', 'strong', 'title');
    /* STRIP SUPERFLUOUS WHITESPACE */
    // Remove all indentation
    $uncleanhtml = preg_replace("/[\r\n]+[\s\t]+/", "\n", $uncleanhtml);
    // Remove all trailing space
    $uncleanhtml = preg_replace("/[\s\t]+[\r\n]+/", "\n", $uncleanhtml);
    // Remove all blank lines
    $uncleanhtml = preg_replace("/[\r\n]+/", "\n", $uncleanhtml);
    /* INSERT LINE SEPARATORS */
    // Separate 'whitespace-adjacent' tags with newlines, unless they are a pair
    $fixed_uncleanhtml = preg_replace("/>[\s\t]*</", ">\n<", $uncleanhtml);
    $fixed_uncleanhtml = preg_replace("/((<[a-zA-Z]>)|(<[^\/][^>]*[^\/>]>))\n(<\/)/U", "\${1}\${4}", $fixed_uncleanhtml);
    // Separate closing JavaScript brackets with newlines
    $fixed_uncleanhtml = preg_replace("/\}[\s\t]*\}/", "}\n}", $fixed_uncleanhtml);
    /* FIX 'HANGING' TAGS */
    // Insert newlines before 'hanging' closing tags (ie. <p>\nSome text</p>\n)
    $fixed_uncleanhtml = preg_replace("/(\n[^<\n]*[^<\n\s\t])[\s\t]*(<\/[^>\n]+>[^\n]*\n)/U", "\${1}\n\${2}", $fixed_uncleanhtml);
    // Insert newlines after 'hanging' opening tags (ie. <p>Some text\n</p>)
    $fixed_uncleanhtml = preg_replace("/((<[a-zA-Z]>)|(<[^\/][^>]*[^\/]>))[\s\t]*([^\s\t(<\/)\n][^(<\/)\n]*\n)/", "\${1}\n\${4}", $fixed_uncleanhtml);
    /* HANDLE THE NO_LINEBREAK LIST */
    // Remove newlines after opening tags from our no_linebreak list (unless they are self-closing)
    $fixed_uncleanhtml = preg_replace("/(<(" . implode('|', $no_linebreak) . ")((\s*>)|(\s[^>]*[^\/]>)))\n/U", "\${1}", $fixed_uncleanhtml);
    // Remove newlines before closing tags from our no_linebreak list
    $fixed_uncleanhtml = preg_replace("/\n(<\/(" . implode('|', $no_linebreak) . ")[\s\t]*>)/U", "\${1}", $fixed_uncleanhtml);
    /* OK, READY TO INDENT */
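    $uncleanhtml_array = explode("\n", $fixed_uncleanhtml);
    // Collects the indented lines, using the same keys as $uncleanhtml_array
    $cleanhtml_array = array();
    // Start with no indentation
    $indentlevel = 0;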
    foreach ($uncleanhtml_array as $uncleanhtml_key=>$currentuncleanhtml)
    {
        $replaceindent = "";
        // Sets the indentation from current indentlevel
        for ($o = 0; $o < $indentlevel; $o++)
        {
            $replaceindent .= $indent;
        }
        // If self-closing tag, simply apply indent
        if (preg_match("/<(.+)\/>/", $currentuncleanhtml))
        {
            $cleanhtml_array[$uncleanhtml_key] = $replaceindent.$currentuncleanhtml;
        }
        // If doctype declaration, simply apply indent
        else if (preg_match("/<!(.*)>/", $currentuncleanhtml))
        {
            $cleanhtml_array[$uncleanhtml_key] = $replaceindent.$currentuncleanhtml;
        }
        // If opening AND closing tag on same line, simply apply indent
        else if (preg_match("/<[^\/](.*)>/", $currentuncleanhtml) && preg_match("/<\/(.*)>/", $currentuncleanhtml))
        {
            $cleanhtml_array[$uncleanhtml_key] = $replaceindent.$currentuncleanhtml;
        }
        // If closing HTML tag AND not a tag from the no_indent list, or a closing JavaScript bracket (with no opening bracket on the same line), decrease indentation and then apply the new level
        else if ((preg_match("/<\/(.*)>/", $currentuncleanhtml) && !preg_match("/<\/(".implode('|', $no_indent).")((>)|(\s.*>))/", $currentuncleanhtml)) || preg_match("/^\}{1}[^\{]*$/", $currentuncleanhtml))
        {
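            // Never drop below zero, so a stray closing tag cannot skew the lines after it
            $indentlevel = max(0, $indentlevel - 1);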
            $replaceindent = "";
            for ($o = 0; $o < $indentlevel; $o++)
            {
                $replaceindent .= $indent;
            }
            $cleanhtml_array[$uncleanhtml_key] = $replaceindent.$currentuncleanhtml;
        }
        // If opening HTML tag AND not a stand-alone tag AND not a tag from the no_indent list, or opening JavaScript bracket (with no closing bracket first), increase indentation and then apply new level
        else if ((preg_match("/<[^\/](.*)>/", $currentuncleanhtml) && !preg_match("/<(link|meta|base|br|img|hr)(.*)>/", $currentuncleanhtml) && !preg_match("/<(" . implode('|', $no_indent) . ")((>)|(\s.*>))/", $currentuncleanhtml)) || preg_match("/^[^\{\}]*\{[^\}]*$/", $currentuncleanhtml))
        {
            $cleanhtml_array[$uncleanhtml_key] = $replaceindent.$currentuncleanhtml;
            $indentlevel++;
            $replaceindent = "";
            for ($o = 0; $o < $indentlevel; $o++)
            {
                $replaceindent .= $indent;
            }
        }
        // If both a closing and an opening JavaScript bracket (like in a condensed else clause), decrease indentation on this line only
        else if (preg_match("/^[^\{\}]*\}[^\{\}]*\{[^\{\}]*$/", $currentuncleanhtml))
        {
            $indentlevel--;
            $replaceindent = "";
            for ($o = 0; $o < $indentlevel; $o++)
            {
                $replaceindent .= $indent;
            }
            $cleanhtml_array[$uncleanhtml_key] = $replaceindent.$currentuncleanhtml;
            // Reset indent to previous level
            $indentlevel++;
            $replaceindent .= $indent;
        }
        // Else, only apply indentation
        else
        {
            $cleanhtml_array[$uncleanhtml_key] = $replaceindent.$currentuncleanhtml;
        }
    }
    // Return single string separated by newline
    return implode("\n", $cleanhtml_array);
}
 
?>
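If you want to see what the whitespace-stripping stage does on its own, here is a small sketch that runs just the first three preg_replace() calls from the function above. The wrapper name strip_superfluous_whitespace() is an assumption for illustration; the patterns are copied unchanged.

```php
<?php
// Hypothetical wrapper around the first three preg_replace() calls of
// clean_html_code(); the function name is an assumption for illustration.
function strip_superfluous_whitespace($html)
{
    // Remove all indentation (whitespace that follows a line break)
    $html = preg_replace("/[\r\n]+[\s\t]+/", "\n", $html);
    // Remove all trailing space (whitespace that precedes a line break)
    $html = preg_replace("/[\s\t]+[\r\n]+/", "\n", $html);
    // Collapse runs of line breaks, which also drops blank lines
    return preg_replace("/[\r\n]+/", "\n", $html);
}

$messy = "<div>  \n\n   <p>Hi</p>   \n</div>";
echo strip_superfluous_whitespace($messy);
```

For this input the result is three lines, `<div>`, `<p>Hi</p>` and `</div>`, with no indentation, trailing spaces or blank lines left. Whitespace at the very start of the string is not touched, since the first pattern only matches after a line break.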

Posted in /blog/2005/oct/php-stripping-newlines.

Tue, 13 Jan 2009 at 14:41:14


Stats

  • Member Since: 13 Jan 2009
  • Comments: 1

Web Site

Jens.Roland.myopenid.com
