SoftwareProcess.es Wiki - Abram Hindle: WhiteSpace

Ranking Changes by their Indentation is Analogous to Ranking Changes by their Complexity

Abstract: Maintainers often face the daunting task of wading through a collection of both new and old revisions, trying to ferret out those that warrant detailed inspection. Perhaps the most obvious way to rank revisions is by lines of code (LOC); this technique has the advantage of being both simple and fast. However, most revisions are quite small, and so we would like a way of distinguishing between simple and complex changes of equal size. Classical complexity metrics, such as Halstead’s and McCabe’s, could be used, but they are hard to apply to code fragments of different programming languages. We propose a language-independent approach to ranking revisions based on the indentation of their code fragments. We use the statistical moments of indentation as a lightweight, revision- and diff-friendly metric to proxy classical complexity metrics. We found that ranking revisions by the variance and summation of indentation was very similar to ranking revisions by traditional complexity measures, since these measures correlate with both Halstead and McCabe complexity; this was evaluated against the CVS histories of 278 active and popular SourceForge projects. Thus, we conclude that measuring indentation alone can serve as a cheap and accurate proxy for computing the code complexity of revisions.
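As a rough illustration of the idea in the abstract, the sketch below measures the leading whitespace of each non-blank line in a code fragment and summarises it by sum, mean, and variance, then ranks fragments by one of those values. It is not the code from the tool listed under Software: the function names are hypothetical, and counting a tab as 8 columns and skipping blank lines are assumptions made for this example.

```python
import statistics

TAB_WIDTH = 8  # assumption: one tab counts as 8 columns


def indentation(line, tab_width=TAB_WIDTH):
    """Return the width of the leading whitespace of one line."""
    expanded = line.expandtabs(tab_width)
    return len(expanded) - len(expanded.lstrip(" "))


def indentation_stats(fragment):
    """Summarise the indentation of every non-blank line in a fragment."""
    depths = [indentation(l) for l in fragment.splitlines() if l.strip()]
    if not depths:
        return {"lines": 0, "sum": 0, "mean": 0.0, "variance": 0.0}
    return {
        "lines": len(depths),
        "sum": sum(depths),
        "mean": statistics.fmean(depths),
        # population variance: the fragment is the whole population here
        "variance": statistics.pvariance(depths),
    }


def rank_revisions(revisions, key="variance"):
    """Order (name, fragment) pairs from highest to lowest value of `key`."""
    scored = [(name, indentation_stats(text)[key]) for name, text in revisions]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    flat = "a = 1\nb = 2\nc = 3\n"
    nested = "for x in xs:\n    if x:\n        y(x)\n"
    for name, score in rank_revisions([("flat", flat), ("nested", nested)]):
        print(name, score)
```

Both fragments are three lines long, so LOC cannot tell them apart, while the variance of indentation places the nested fragment first.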

Papers:

A draft of our journal paper on measuring indentation and correlating it with complexity: http://churchturing.org/x/journal.20090204.041229.Feb.04.2009.pdf
"From Indentation Shapes to Code Structures", Abram J. Hindle, Michael W. Godfrey, and Richard C. Holt. Proc. of the 8th IEEE Intl. Working Conference on Source Code Analysis and Manipulation (SCAM 2008), 28–29 September 2008, Beijing, China. [Acceptance rate: 23/61 or 38% for full papers]
"Reading Beside the Lines: Indentation as a Proxy for Complexity Metrics", by Abram J. Hindle, Michael W. Godfrey, and Richard C. Holt. Proc. of the 2008 IEEE Intl. Conference on Program Comprehension (ICPC-08), June 2008, Amsterdam. [Acceptance rate: 20/57 or 35% for full papers]

Software:

This software can read source files, diffs, and RCS files. It can measure indentation and calculate McCabe's Cyclomatic Complexity (MCC) and Halstead's complexity metrics for C, C++, Java, Perl, PHP, and Python.
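To give a feel for what measuring the indentation of a diff can involve, here is a minimal sketch that pulls the added lines out of a unified diff and reports their leading-whitespace widths. It is an independent example, not an excerpt from the distribution above: the names are hypothetical, only the unified diff format is handled, and a tab is assumed to count as 8 columns.

```python
import re

# Matches hunk headers such as "@@ -1,3 +1,6 @@"; a missing count means 1.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")


def added_lines(unified_diff):
    """Return the lines a unified diff adds, with the leading '+' removed."""
    added = []
    old_left = new_left = 0  # lines still expected in the current hunk
    for line in unified_diff.splitlines():
        if old_left <= 0 and new_left <= 0:
            match = HUNK_HEADER.match(line)
            if match:
                old_left = int(match.group(1) or 1)
                new_left = int(match.group(2) or 1)
            continue  # file headers and any other text outside hunks
        if line.startswith("+"):
            added.append(line[1:])
            new_left -= 1
        elif line.startswith("-"):
            old_left -= 1
        elif line.startswith("\\"):
            pass  # "\ No newline at end of file" marker
        else:  # context line, present in both old and new file
            old_left -= 1
            new_left -= 1
    return added


def added_indentation(unified_diff, tab_width=8):
    """Return the indentation width of every non-blank added line."""
    widths = []
    for line in added_lines(unified_diff):
        if line.strip():
            expanded = line.expandtabs(tab_width)
            widths.append(len(expanded) - len(expanded.lstrip(" ")))
    return widths
```

Counting down the line totals from each hunk header, rather than testing for a `+++` prefix, keeps file headers from being mistaken for added lines and keeps added lines that themselves begin with `++` from being dropped.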
View the source code at http://churchturing.org/x/whitespace-dist/
Download the program at http://churchturing.org/x/whitespace-dist.20090412.tar.gz

Data:

This data is unclean; you will have to derive what is represented from the source code and the example header files. If you use it, please cite my journal paper:

@article{hindle09sciprog,
title = "Reading beside the lines: Using indentation to rank revisions by complexity",
journal = "Science of Computer Programming",
volume = "74",
number = "7",
pages = "414 - 429",
year = "2009",
note = "Special Issue on Program Comprehension (ICPC 2008)",
issn = "0167-6423",
doi = "10.1016/j.scico.2009.02.005",
url = "http://www.sciencedirect.com/science/article/B6V17-4VT14CM-1/2/e0e0ddda7661dc0b291216e2025cc9e4",
author = "Abram Hindle and Michael W. Godfrey and Richard C. Holt",
keywords = "Indentation, Complexity, McCabe, Halstead, Metrics",
abstract = "
Maintainers often face the daunting task of wading through a collection of both new and old revisions, trying to ferret out those that warrant detailed inspection. Perhaps the most obvious way to rank revisions is by lines of code (LOC); this technique has the advantage of being both simple and fast. However, most revisions are quite small, and so we would like a way of distinguishing between simple and complex changes of equal size. Classical complexity metrics, such as Halstead's and McCabe's, could be used, but they are hard to apply to code fragments of different programming languages. We propose a language-independent approach to ranking revisions based on the indentation of their code fragments. We use the statistical moments of indentation as a lightweight, revision- and diff-friendly metric to proxy classical complexity metrics. We found that ranking revisions by the variance and summation of indentation was very similar to ranking revisions by traditional complexity measures, since these measures correlate with both Halstead and McCabe complexity; this was evaluated against the CVS histories of 278 active and popular SourceForge projects. Thus, we conclude that measuring indentation alone can serve as a cheap and accurate proxy for computing the code complexity of revisions."
}

More summary data files of the indentation results: http://swag.uwaterloo.ca/~ahindle/whitespace-stuff.tar.gz
The extracted per-column indentation data of the source code: http://swag.uwaterloo.ca/~ahindle/whitespace-stuff-outs.tar.gz . Here's a possible header for those CSV files: http://softwareprocess.es/y/metricsmapping
Mirror of the repositories of many projects (please repair the URL before downloading): http : / / swag dot uwaterloo dot ca/ahindle/repomirror.sfmirror.20071213.tar.gz (15 GB)
Mirror of the raw data: http : / / softwareprocess dot es/z/whitespace-data-raw.tar.gz (10 GB)
- SHA-1 sum: e42eee777b11ad0beda043183a2a3ecf0095e1d3
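Since the archives are large, it may be worth checking a download against the SHA-1 sum before unpacking it. The sketch below is one way to do that with Python's standard library; it reads the file in chunks so a multi-gigabyte archive is never held in memory at once. The function names and the local file name are hypothetical, and which archive the listed sum belongs to is not stated here, so treat the pairing in the usage lines as an assumption.

```python
import hashlib

# The SHA-1 sum listed above.
EXPECTED_SHA1 = "e42eee777b11ad0beda043183a2a3ecf0095e1d3"


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal SHA-1 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected=EXPECTED_SHA1):
    """Compare a file's SHA-1 digest with the expected value, ignoring case."""
    return sha1_of_file(path) == expected.strip().lower()


if __name__ == "__main__":
    # Hypothetical local file name; adjust to wherever the archive was saved.
    archive = "whitespace-data-raw.tar.gz"
    print("checksum OK" if matches_checksum(archive) else "checksum MISMATCH")
```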