I spent my afternoon Googling for a solution to this, and most of what I found is either too heavyweight (HTML Tidy, HTML Purifier) or ridiculously lacking. I found one regex-powered snippet that did the job quite decently, and after some heavy nitpicking and rewriting, it works like a charm.
It is *not* foolproof, but it does some pretty advanced HTML parsing & indenting, as well as rudimentary (but far from complete) JavaScript indenting.
<blockquote><p>Optional settings:
* $indent variable to define what type of tabbing to use (tabs/spaces/etc.)
* $no_indent array of tags that you don't want to indent
* $no_linebreak array of tags that you don't want to linebreak, i.e. 'inline' tags
</p></blockquote>
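To give a feel for how the $no_linebreak setting is used, here is a minimal sketch that runs only the two no_linebreak regexes from the listing on a tiny input. The shortened tag list and the sample HTML are assumptions made up for the demo; the patterns themselves are copied from the function.

```php
<?php
// Shortened, illustrative list (the real default list in the function is longer)
$no_linebreak = array('a', 'b', 'em');

// Sample input, invented for this demo: every tag sits on its own line
$html = "<p>\n<em>\ntext\n</em>\n</p>";

// Remove newlines after opening tags from the list (unless self-closing)
$result = preg_replace("/(<(" . implode('|', $no_linebreak) . ")((\s*>)|(\s[^>]*[^\/]>)))\n/U", "\${1}", $html);

// Remove newlines before closing tags from the list
$result = preg_replace("/\n(<\/(" . implode('|', $no_linebreak) . ")[\s\t]*>)/U", "\${1}", $result);

// The <em> pair is pulled back onto one line; <p> is not in the list and keeps its breaks
echo $result, "\n";
```

Adding a tag name to the array simply extends the regex alternation, so the list can be tuned without touching the patterns.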
Also, it automatically handles self-closing XHTML tags, stand-alone HTML tags, trailing whitespace and a bunch of obscure special cases.
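As a small illustration of the whitespace handling, the sketch below applies just the three "strip superfluous whitespace" regexes from the function to a messy snippet. The input string is made up for the demo; nothing else here is new.

```php
<?php
// Invented sample: trailing spaces, CRLF line ends, a blank line, indentation, a trailing tab
$html = "<ul>   \r\n\r\n    <li>One</li>\t\n\n</ul>";

// Remove all indentation
$result = preg_replace("/[\r\n]+[\s\t]+/", "\n", $html);
// Remove all trailing space
$result = preg_replace("/[\s\t]+[\r\n]+/", "\n", $result);
// Remove all blank lines
$result = preg_replace("/[\r\n]+/", "\n", $result);

// Each tag now sits on its own line with no leading or trailing whitespace
echo $result, "\n";
```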
Anyway, here it is, take it for a spin if you like ;)
<?php
/**
 * Indents and removes blank lines in HTML code
 * Created by Jens Roland, 2009
 * (adapted from a snippet by JonHoo @ http://snippets.dzone.com/)
 *
 * The code is not 100% foolproof, but close enough, and surprisingly fast
 *
 * Known gotchas:
 * - Two closing Javascript brackets on the same line will only count as
 *   one, if there are any non-whitespace characters between them
 * - If a Javascript line containing an opening bracket happens
 *   to have a bracketed expression later in the same line, the next line
 *   will not be indented
 * - If a line contains two opening tags with non-whitespace content
 *   between them, and their corresponding closing tags are on separate
 *   lines below, or don't have non-whitespace content between them,
 *   the two opening tags will only increase the indent by one, but the
 *   two closing tags will decrease the indent by two. The inverse can
 *   also happen
 *
 * A lot more could be done to make this even better, but I'd rather not
 * sacrifice more performance for a detail only source-snoopers will see.
 *
 */
function clean_html_code($uncleanhtml)
{
    // Set wanted indentation
    $indent = "\t";

    // Set tags that should not indent
    $no_indent = array ('html', 'head', 'body', 'script');

    // Set tags that should not linebreak
    $no_linebreak = array ('a', 'b', 'em', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'i', 'span', 'strong', 'title');

    /* STRIP SUPERFLUOUS WHITESPACE */

    // Remove all indentation
    $uncleanhtml = preg_replace("/[\r\n]+[\s\t]+/", "\n", $uncleanhtml);

    // Remove all trailing space
    $uncleanhtml = preg_replace("/[\s\t]+[\r\n]+/", "\n", $uncleanhtml);

    // Remove all blank lines
    $uncleanhtml = preg_replace("/[\r\n]+/", "\n", $uncleanhtml);

    /* INSERT LINE SEPARATORS */

    // Separate 'whitespace-adjacent' tags with newlines, unless they are a pair
    $fixed_uncleanhtml = preg_replace("/>[\s\t]*</", ">\n<", $uncleanhtml);
    $fixed_uncleanhtml = preg_replace("/((<[a-zA-Z]>)|(<[^\/][^>]*[^\/>]>))\n(<\/)/U", "\${1}\${4}", $fixed_uncleanhtml);

    // Separate closing Javascript brackets with newlines
    $fixed_uncleanhtml = preg_replace("/\}[\s\t]*\}/", "}\n}", $fixed_uncleanhtml);

    /* FIX 'HANGING' TAGS */

    // Insert newlines before 'hanging' closing tags (ie. <p>\nSome text</p>\n)
    $fixed_uncleanhtml = preg_replace("/(\n[^<\n]*[^<\n\s\t])[\s\t]*(<\/[^>\n]+>[^\n]*\n)/U", "\${1}\n\${2}", $fixed_uncleanhtml);

    // Insert newlines after 'hanging' opening tags (ie. <p>Some text\n</p>)
    $fixed_uncleanhtml = preg_replace("/((<[a-zA-Z]>)|(<[^\/][^>]*[^\/]>))[\s\t]*([^\s\t(<\/)\n][^(<\/)\n]*\n)/", "\${1}\n\${4}", $fixed_uncleanhtml);

    /* HANDLE THE NO_LINEBREAK LIST */

    // Remove newlines after opening tags from our no_linebreak list (unless they are self-closing)
    $fixed_uncleanhtml = preg_replace("/(<(" . implode('|', $no_linebreak) . ")((\s*>)|(\s[^>]*[^\/]>)))\n/U", "\${1}", $fixed_uncleanhtml);

    // Remove newlines before closing tags from our no_linebreak list
    $fixed_uncleanhtml = preg_replace("/\n(<\/(" . implode('|', $no_linebreak) . ")[\s\t]*>)/U", "\${1}", $fixed_uncleanhtml);

    /* OK, READY TO INDENT */
    $uncleanhtml_array = explode("\n", $fixed_uncleanhtml);

    // Sets no indentation
    $indentlevel = 0;

    foreach ($uncleanhtml_array as $uncleanhtml_key=>$currentuncleanhtml) {
        $replaceindent = "";

        // Sets the indentation from current indentlevel
        for ($o = 0; $o < $indentlevel; $o++) {
            $replaceindent .= $indent;
        }

        // If self-closing tag, simply apply indent
        if (preg_match("/<(.+)\/>/", $currentuncleanhtml)) {
            $cleanhtml_array[$uncleanhtml_key] = $replaceindent.$currentuncleanhtml;
        }
        // If doctype declaration, simply apply indent
        else if (preg_match("/<!(.*)>/", $currentuncleanhtml)) {
            $cleanhtml_array[$uncleanhtml_key] = $replaceindent.$currentuncleanhtml;
        }
        // If opening AND closing tag on same line, simply apply indent
        else if (preg_match("/<[^\/](.*)>/", $currentuncleanhtml) && preg_match("/<\/(.*)>/", $currentuncleanhtml)) {
            $cleanhtml_array[$uncleanhtml_key] = $replaceindent.$currentuncleanhtml;
        }
        // If closing HTML tag AND not a tag from the no_indent list, or a
        // closing JavaScript bracket (with no opening bracket on the same
        // line), decrease indentation and then apply the new level
        else if ((preg_match("/<\/(.*)>/", $currentuncleanhtml) && !preg_match("/<\/(".implode('|', $no_indent).")((>)|(\s.*>))/", $currentuncleanhtml)) || preg_match("/^\}{1}[^\{]*$/", $currentuncleanhtml)) {
            $indentlevel--;
            $replaceindent = "";
            for ($o = 0; $o < $indentlevel; $o++) {
                $replaceindent .= $indent;
            }
            $cleanhtml_array[$uncleanhtml_key] = $replaceindent.$currentuncleanhtml;
        }
        // If opening HTML tag AND not a stand-alone tag AND not a tag from
        // the no_indent list, or opening JavaScript bracket (with no closing
        // bracket first), increase indentation and then apply new level
        else if ((preg_match("/<[^\/](.*)>/", $currentuncleanhtml) && !preg_match("/<(link|meta|base|br|img|hr)(.*)>/", $currentuncleanhtml) && !preg_match("/<(" . implode('|', $no_indent) . ")((>)|(\s.*>))/", $currentuncleanhtml)) || preg_match("/^[^\{\}]*\{[^\}]*$/", $currentuncleanhtml)) {
            $cleanhtml_array[$uncleanhtml_key] = $replaceindent.$currentuncleanhtml;
            $indentlevel++;
            $replaceindent = "";
            for ($o = 0; $o < $indentlevel; $o++) {
                $replaceindent .= $indent;
            }
        }
        // If both a closing and an opening JavaScript bracket (like in a
        // condensed else clause), decrease indentation on this line only
        else if (preg_match("/^[^\{\}]*\}[^\{\}]*\{[^\{\}]*$/", $currentuncleanhtml)) {
            $indentlevel--;
            $replaceindent = "";
            for ($o = 0; $o < $indentlevel; $o++) {
                $replaceindent .= $indent;
            }
            $cleanhtml_array[$uncleanhtml_key] = $replaceindent.$currentuncleanhtml;

            // Reset indent to previous level
            $indentlevel++;
            $replaceindent .= $indent;
        }
        // Else, only apply indentation
        else {
            $cleanhtml_array[$uncleanhtml_key] = $replaceindent.$currentuncleanhtml;
        }
    }

    // Return single string separated by newline
    return implode("\n", $cleanhtml_array);
}
?>
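For readers who want the gist of the indentation pass without the special cases, here is a much-reduced sketch. The name indent_lines_sketch() and its signature are hypothetical, not part of the function above. It reuses the same opening-tag and closing-tag tests but ignores doctypes, self-closing and stand-alone tags, the no_indent list and JavaScript brackets, and unlike the original it clamps the level at zero.

```php
<?php
// Hypothetical, simplified illustration of the indent-level bookkeeping.
// Input is an array of lines that already hold one tag (or tag pair) each.
function indent_lines_sketch(array $lines, $indent = "\t")
{
    $level = 0;
    $out = array();

    foreach ($lines as $line) {
        // Same two tests the full function uses for opening and closing tags
        $has_open  = preg_match("/<[^\/](.*)>/", $line) === 1;
        $has_close = preg_match("/<\/(.*)>/", $line) === 1;

        if ($has_close && !$has_open) {
            // Closing tag only: step back out before emitting the line
            $level = max(0, $level - 1);
            $out[] = str_repeat($indent, $level) . $line;
        } elseif ($has_open && !$has_close) {
            // Opening tag only: emit at the current level, then step in
            $out[] = str_repeat($indent, $level) . $line;
            $level++;
        } else {
            // Tag pair on one line, or plain text: keep the current level
            $out[] = str_repeat($indent, $level) . $line;
        }
    }

    return $out;
}

// Usage with two spaces as the indent string
$result = indent_lines_sketch(array("<div>", "<p>Hello</p>", "</div>"), "  ");
echo implode("\n", $result), "\n";
```

Most of the extra branches in the real function exist to handle the cases this sketch skips.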