PHP Advent Calendar Day 3

Today's entry is provided by Sebastian Bergmann.

Name: Sebastian Bergmann
Blog: sebastian-bergmann.de
Biography: Sebastian Bergmann is a long-time contributor to various PHP projects, including PHP itself. He is the developer of PHPUnit and offers consulting, training, and coaching services to help enterprises improve the quality assurance process for their PHP-based software projects.
Location: Siegburg, Germany

Where do most bugs hide in a software project? A small script written in PHP can help us answer this question by mining a version control repository for the relevant information. This assumes, of course, that you use version control software to manage your project, that you write bug fix commit messages in a consistent format, and that each bug fix commit touches only the source code files relevant to that fix.

So, let us assume that we are using Subversion to manage our project's source code, and that we use messages such as "Fix #2204." when a bug fix is committed. We also assume that this script has filesystem access to the Subversion repository. We start with some configuration (repository location) and variable initialization:

<?php
 
// Configure the repository location.
$repository = '/var/svn/phpunit';
 
$paths      = array();
$repository = realpath($repository);
 
?>

The first step is to look for all commits made to the repository for which the commit message matches our bug fix format. The svn log command can help us here. It shows log messages from the repository and does so, optionally, in XML format. PHP's SimpleXML extension provides a very simple and easily usable toolset to parse XML.

In our script, we use shell_exec() to run the svn log --xml command on our repository. The generated XML is then loaded via simplexml_load_string() into an object that we can iterate.

<?php
 
// escapeshellarg() keeps paths with spaces or shell metacharacters intact.
$log = simplexml_load_string(
    shell_exec(
        sprintf('svn log --xml %s', escapeshellarg('file://' . $repository))
    )
);
 
?>
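
To see what the loaded object looks like without a repository at hand, the following sketch parses a hand-written sample in the shape of svn log --xml output. The revisions, authors, dates, and messages are invented for illustration; only the element names (log, logentry, msg) and the revision attribute follow Subversion's XML log format.

```php
<?php

// Hand-written sample in the shape of `svn log --xml` output.
// Revisions, authors, dates, and messages are made up for illustration.
$xml = <<<XML
<log>
  <logentry revision="3">
    <author>alice</author>
    <date>2007-12-01T10:00:00.000000Z</date>
    <msg>Fix #2204.</msg>
  </logentry>
  <logentry revision="2">
    <author>bob</author>
    <date>2007-11-30T09:00:00.000000Z</date>
    <msg>Add new assertion.</msg>
  </logentry>
</log>
XML;

$log = simplexml_load_string($xml);

// Collect revision => message for every log entry.
$messages = array();

foreach ($log->logentry as $logentry) {
    // Attributes are read with array syntax, child elements as properties.
    $revision            = (int)$logentry['revision'];
    $messages[$revision] = (string)$logentry->msg;
}

foreach ($messages as $revision => $message) {
    printf("r%d: %s\n", $revision, $message);
}
```

Casting to (int) and (string) matters here: without the casts, the values stay SimpleXMLElement objects and cannot be used as array keys.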

For each revision that matches our search criteria, we use the svnlook changed command to get the paths that were changed in that particular revision.

<?php
 
foreach ($log->logentry as $logentry) {
    $attributes = $logentry->attributes();
    $revision   = (int)$attributes['revision'];
    $message    = (string)$logentry->msg;
 
    if (preg_match('/Fix #([0-9]+)/i', $message, $matches)) {
        $ticket = (int)$matches[1];
 
        $changedPaths = explode(
            "\n",
            shell_exec(
                sprintf(
                    'svnlook changed -r %d %s',
                     $revision,
                     escapeshellarg($repository)
                )
            )
        );
 
        unset($changedPaths[count($changedPaths) - 1]);
 
        foreach ($changedPaths as $changedPath) {
            $changedPath = substr($changedPath, 4);
 
            // Appending with [] creates the per-path array on first use.
            $paths[$changedPath][] = array(
                'revision' => $revision,
                'ticket'   => $ticket
            );
        }
    }
}
 
?>
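
Subversion can also report the changed paths itself: svn log --verbose --xml adds a paths element with one path child per changed file to every logentry, which would make the separate svnlook call unnecessary. The sketch below assumes that format and runs against a hand-written sample rather than a real repository; the revisions, ticket numbers, and file names are invented.

```php
<?php

// Hand-written sample in the shape of `svn log --verbose --xml` output.
// All revisions, tickets, and paths are made up for illustration.
$xml = <<<XML
<log>
  <logentry revision="12">
    <paths>
      <path action="M">/trunk/Framework/Assert.php</path>
      <path action="M">/trunk/Framework/TestCase.php</path>
    </paths>
    <msg>Fix #2204.</msg>
  </logentry>
  <logentry revision="11">
    <paths>
      <path action="M">/trunk/ChangeLog</path>
    </paths>
    <msg>Update the changelog.</msg>
  </logentry>
  <logentry revision="10">
    <paths>
      <path action="M">/trunk/Framework/Assert.php</path>
    </paths>
    <msg>Fix #2190.</msg>
  </logentry>
</log>
XML;

$log   = simplexml_load_string($xml);
$paths = array();

foreach ($log->logentry as $logentry) {
    // Skip commits whose message does not reference a ticket.
    if (!preg_match('/Fix #([0-9]+)/i', (string)$logentry->msg, $matches)) {
        continue;
    }

    // Record the revision and ticket for every path changed in this commit.
    foreach ($logentry->paths->path as $path) {
        $paths[(string)$path][] = array(
            'revision' => (int)$logentry['revision'],
            'ticket'   => (int)$matches[1]
        );
    }
}

foreach ($paths as $path => $fixes) {
    printf("%d: %s\n", count($fixes), $path);
}
```

Because the paths come straight from the log, this variant would only need read access to the repository URL, not to the repository's files on disk.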

For each source code file that was changed at least once as part of a bug fix, we maintain an array of the revisions and ticket numbers involved. In the end, we use uasort() to sort that array and print the source code files in descending order of the number of bug fixes that touched them.

<?php
 
uasort($paths, 'cmp');
 
foreach ($paths as $path => $data) {
    printf("%4d: %s\n", count($data), $path);
}
 
function cmp($a, $b)
{
    $a = count($a);
    $b = count($b);
 
    if ($a == $b) {
        return 0;
    }
 
    return ($a > $b) ? -1 : 1;
}
 
?>
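
As a quick check of the ordering, the comparison can be tried on a small hand-made $paths array. The file names and fix counts below are invented, and the callback is written as a closure named $byFixCount (a hypothetical name) instead of the cmp() function, so it needs PHP 5.3 or later.

```php
<?php

// Invented sample: path => list of fixes that touched it.
$paths = array(
    'trunk/Util/Filter.php'      => array(
        array('revision' => 10, 'ticket' => 1)
    ),
    'trunk/Framework/Assert.php' => array(
        array('revision' => 11, 'ticket' => 2),
        array('revision' => 12, 'ticket' => 3),
        array('revision' => 15, 'ticket' => 5)
    ),
    'trunk/TextUI/Command.php'   => array(
        array('revision' => 13, 'ticket' => 4),
        array('revision' => 14, 'ticket' => 4)
    )
);

// Sort descending by number of fixes; uasort() keeps the path keys.
$byFixCount = function ($a, $b) {
    return count($b) - count($a);
};

uasort($paths, $byFixCount);

foreach ($paths as $path => $data) {
    printf("%4d: %s\n", count($data), $path);
}
```

On this sample, Assert.php is listed first with 3 fixes, followed by Command.php with 2 and Filter.php with 1.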

This entry shows you how easy it is to parse XML data with PHP in order to solve a problem that might look hard at first glance: mining a code repository for data to map past bugs to source code files. The resulting ranking of the most bug-prone source code files is a good basis for deciding which parts of your code base need more tests.

If this got you interested in quality assurance for PHP projects, you might be interested in the PHPUnit and phpUnderControl projects.

About this post

PHP Advent Calendar Day 3 was posted on Mon, 03 Dec 2007.

9 comments

1.Sebastian Bergmann said:

I just discovered that svn log --verbose --xml includes the changed paths information in the XML logfile. This means that the call to svnlook is not necessary and the script does not need local access to the repository.

Tue, 04 Dec 2007 at 06:49:48 GMT


2.Uzi said:

This script is really pointless because no one names their subversion commits with names like "Fix #2244"

Besides, PHP is not C. PHP coders don't normally use functions like printf() because we can avoid them.

Tue, 04 Dec 2007 at 10:26:56 GMT


3.Jamie L said:

Well, the "Fix #XXX" has been popularized recently by Project Management portals like Trac which create handy shortcut links, but yes I'm sure there are very few projects (other than a handful of respected Open Source Projects eg. PHPUnit) where the developers will do a single commit per bug fix and adhere to a standard naming convention for these commits.

Tue, 04 Dec 2007 at 10:41:07 GMT


4.Jamie L said:

But perhaps therein lies the tip for today :)

"Standardize your bug fixing conventions, and thou shalt get statistics"

Tue, 04 Dec 2007 at 10:43:38 GMT


5.Sebastian Bergmann said:

Most of the companies I visited this year adhere to a standard such as the one mentioned in the posting for bugfix commit messages.

And as Jamie mentions, every project that uses Trac is likely to use the format used in the script to get the benefit of a Trac feature.

Tue, 04 Dec 2007 at 12:00:57 GMT


6.Lars Strojny said:

Trac comes with two pretty cool scripts which help to enforce the "single bugfix per commit" rule. trac-pre-commit-hook checks whether the commit message includes something like "fixes #123", "closes #123", "refs #123" and trac-post-commit-hook changes the related ticket accordingly (closes it when a fix is committed and references the commit, when it is just referenced). You can find the scripts here: http://trac.edgewall.org/browser/trunk/contrib

Tue, 04 Dec 2007 at 12:51:47 GMT


7.Sean Coates said:

We use the "fix #123" "fixes #234" "see #456" "re 567" notation extensively, internally. It makes trac much nicer to work with, and svn's event hooks are just awesome.

S

Tue, 04 Dec 2007 at 20:14:14 GMT


8.Olle Jonsson said:

Thanks Sebastian, for a dip into what SimpleXML holds. The "wet-finger-in-the-air" metric that this script gives is very neat. Is the code copy-pasteable in full anywhere?

Give something (follow a strict convention), get something (greppable, mineable datasets). Or as your Dad would've said "Quid pro quo".

And, +1: That Trac postcommit hook revolutionized the usage of atomic commits at my workplace, too.

Wed, 05 Dec 2007 at 08:54:25 GMT


9.Sebastian Bergmann said:

The current version of the full script can be found here.

Thu, 06 Dec 2007 at 08:23:48 GMT

