

Diggit: Automated Code Review via Software Repository Mining

Robert Chatley
Imperial College London
180 Queen's Gate
London, UK
rbc@imperial.ac.uk

Lawrence Jones
GoCardless Ltd
338-346 Goswell Road
London, UK
lawrence@gocardless.com

Abstract—We present Diggit, a tool to automatically generate code review comments, offering design guidance on prospective changes, based on insights gained from mining historical changes in source code repositories. We describe how the tool was built and tuned for use in practice as we integrated Diggit into the working processes of an industrial development team. We focus on the developer experience, the constraints that had to be met in adapting academic research to produce a tool that was useful to developers, and the effectiveness of the results in practice.

Index Terms—software maintenance; data mining;

I. INTRODUCTION

Peer code review is a well established practice amongst development teams aiming to produce high quality software, in both open source and commercial environments [1]. The way that code review is most commonly carried out today is that the reviewer is presented with just a snapshot of the proposed change, without the historical context of how this code has changed over time. Most commonly they see a diff giving a before/after comparison with the preceding version. We show that tooling can support and improve code review by automatically extracting relevant information about historical changes and trends from the version control system, and presenting these as part of the review.

Institutional memory is codified in version control systems. By building analysis tools on top of this that integrate into a developer's workflow, we are able to automatically provide guidance for a more junior developer, or someone new to a codebase. This paper reports on the experiences of a commercial development team using our tool, but we also see potential for use in open source projects, where there may commonly be a large number of contributors proposing individual patches and changes, compared to a relatively small community of core maintainers who may need to review these changes.

We aimed to integrate our analysis tools as seamlessly as possible into developers' regular workflow. We have therefore developed tools that integrate with the GitHub pull request¹ and review flow. When a pull request is made, our analysis engine runs, and our code review bot comments on the pull request. In order to provide timely feedback, we need to ensure that we can perform our analysis in a relatively short time-box.

¹ https://help.github.com/articles/about-pull-requests/

When developing their static analysis tool Infer [2], Facebook examined how quickly results needed to be returned in order for their developers to pay attention to them. They determined that a window of 10 minutes was the maximum that they could allow for static analysis to run before developers would give up waiting and move on. Based on this, we have also taken this 10 minute timebox as a benchmark for our tools, which informed our decisions when selecting and tuning analysis algorithms.

Another observation from Facebook's work on Infer was that developers had a very low tolerance for false positives. As soon as their tool began warning about potential bugs that turned out not to be real problems, developers began to ignore the tool entirely. We therefore paid great attention to false positive rates and the relevance of generated comments.

This paper presents Diggit, a tool for running automated analysis to produce code review comments. Diggit will comment when a) past modifications suggest files are missing from proposed changes, b) trends suggest that code within the current change would benefit from refactoring, c) edited files display growing complexity over successive revisions. We show that repository analysis using a well-chosen algorithm can provide automated, high quality feedback in a timely fashion, and allows us to present results in a context where immediate action can be taken. We evaluate the effectiveness of these methods when used by an industrial team.

Our priority when building Diggit was to produce a tool that would be useful in practice. To encourage adoption we needed to reduce the barrier to entry, making it easy for developers to integrate our tool into their existing processes. This resulted in us paying a lot of attention to aspects like authentication, creating a smooth setup experience for new users, integration with existing tools, and the general user experience – things that research projects might often place less emphasis on. The Diggit tool is now available as open source software².

² https://github.com/lawrencejones/diggit

II. RELATED CHANGE SUGGESTION
Working on large codebases can be disorientating, even for seasoned developers [3], [4]. Making a change is often not just a case of adding new code, but also integrating with and adapting existing parts of the system, test suites, configuration, documentation etc. New developers especially may struggle if they have yet to become familiar with the conventions, patterns and idioms used in a particular project. Even for developers relatively familiar with a codebase, it is easy to miss things. For example, perhaps in a particular codebase when a change is made in module X, it is normally required that a system test in module Y is updated too, but our developer has neglected this, either by accident or because they did not know about this relationship. Or it may be the case that the developer has correctly changed the relevant file, but simply forgotten to add it to the current commit. Diggit's related change suggestion analysis aims to suggest files that may have been omitted from the current commit. It mines common file groupings based on temporal coupling [5] from a codebase's Git history. Such suggestions act as both a safeguard and learning tool for developers as they work within a project, raising awareness of file coupling and the nature of idiomatic changes.

Fig. 1. Suggesting files commonly changed together.

We mine association rules from a given codebase's version control history. Similar analysis has previously been performed in work such as ROSE [6] and TARMAQ [7]. Here we do not aim to present a novel algorithm, but to describe the forces at play when a tool based on research was developed to integrate into the working processes of an industrial team.

For our implementation we investigated a number of possible algorithms for mining association rules and generating suggestions. We started with the Apriori algorithm, together with the Apriori-TID optimisation, as described by Agrawal [8]. We compared this with an implementation of the FP-Growth algorithm described by Han et al. [9], with optimisations as discussed by Borgelt [10] to improve the memory usage.

We had two goals in selecting and tuning an analysis algorithm. We wanted to provide feedback in a timely manner, but also to maintain a low false-positive rate. Our tool will be of no use if suggestions are simply ignored by developers due to latency or inaccuracy. Requiring high confidence might mean that for many projects, especially those without a long revision history, we simply do not have enough data in the repository to produce many suggestions. Allowing a lower confidence may well mean that we produce erroneous suggestions. For a compelling developer experience it was important that we come up with appropriate confidence values.

There is not room in this paper to present a full analysis of our performance experiments, or how parameters such as minimum support were tuned, but full details can be found in the accompanying technical report [11]. As a summary, FP-Growth produced results orders of magnitude faster than Apriori in our experiments, whilst still giving useful results. Diggit therefore uses FP-Growth for its analysis.
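To make the mechanics concrete, the sketch below (in Python, rather than Diggit's own Ruby implementation) shows the kind of co-change mining described above: each commit is treated as a transaction of changed files, and rules of the form "when A changes, B usually changes too" are kept only if they meet minimum support and confidence thresholds. It deliberately mines only pairwise rules rather than full FP-Growth itemsets, and the helper names and threshold values are illustrative assumptions, not Diggit's actual parameters.

```python
# Simplified sketch of co-change rule mining (illustration only).
# Diggit uses FP-Growth over full itemsets; here we mine pairwise rules
# to show how support and confidence thresholds come into play.
import subprocess
from collections import Counter
from itertools import permutations

def commit_transactions(repo_path, max_commits=1000):
    """Return one set of changed file paths per commit (a "transaction")."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", f"-{max_commits}",
         "--name-only", "--pretty=format:__COMMIT__"],
        capture_output=True, text=True, check=True).stdout
    transactions = []
    for chunk in log.split("__COMMIT__"):
        files = {line for line in chunk.splitlines() if line.strip()}
        if files:
            transactions.append(files)
    return transactions

def mine_pairwise_rules(transactions, min_support=0.01, min_confidence=0.75):
    """Yield (antecedent, consequent, support, confidence) rules."""
    n = len(transactions)
    file_counts, pair_counts = Counter(), Counter()
    for files in transactions:
        file_counts.update(files)
        pair_counts.update(permutations(sorted(files), 2))
    for (a, b), together in pair_counts.items():
        support = together / n
        confidence = together / file_counts[a]
        if support >= min_support and confidence >= min_confidence:
            yield a, b, support, confidence

def suggest_missing_files(rules, changed_files):
    """Suggest files that usually accompany the files in this change."""
    return {b for a, b, _, conf in rules
            if a in changed_files and b not in changed_files}
```

A suggestion would then be raised for any consequent file that the mined rules predict but the current pull request does not touch.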
III. CODE QUALITY TRENDS

Fig. 2. Highlighting code quality trends over successive changes.

Agile methods are prevalent in industrial software development, and as such it is expected that a codebase will be changed frequently over time. In such an environment it is common for a codebase to deteriorate with age [12], especially if there is no sustained effort to improve code quality through continuous refactoring [13].

The continuous application of small improvements helps developers to maintain hygiene standards in a codebase, and to prevent the accumulation of technical debt that may make future changes difficult and possibly economically unviable. Developers rarely introduce significant design problems all at once – or at least if they do, we hope that the process of code review should catch them before the change is integrated. More difficult to detect is when there is a gradual trend of things getting worse over time. Agile teams often favour a culture of collaborative code ownership [14], so it may well be that every developer on a team changes many different areas of a codebase during their work on a system, but may not have a long term engagement with any particular area of the code. They may make a minor change to an existing class or method that adds some new functionality, a small addition to an existing foundation. If every change makes the code just a tiny bit worse, then we may have the problem of the proverbial "boiled frog", where we do not notice until it is too late.

To help with this, we added the detection of code quality trends to Diggit. When a change is made, Diggit analyses trends in this particular area of code over previous revisions, and generates comments suggesting that the developer might want to consider code quality at this point. The aim is to provide feedback "just in time" to encourage developers to refactor before the work on this particular area of code is completed. This is in contrast to an offline review tool that may be able to highlight areas of a codebase that might benefit from refactoring, should the team ever get around to it as a separate maintenance activity.
The first quality trend that Diggit analyses is based upon Feathers' observations on refactoring diligence [15]. Feathers analyses repository data to give a summary profile, for example revealing that there are 135 methods that increased in size the last time they were changed, 89 methods that increased in size the last two times they were changed, and so on. High numbers of methods that are consistently expanded are indicative of a lack of diligence in refactoring.

In generating code review comments, we do not generate a profile for the whole codebase (as Feathers does), but trace back through the history of the code in the current change, and look for consecutive increases in method length.
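The following sketch (again in Python, with an assumed helper that has already extracted a {method: length} map per revision of the file under review) illustrates this trend check: a method is flagged when its length has grown on each of its last few modifications. The window of three matches the trial threshold discussed later; everything else is an illustrative assumption.

```python
# Sketch of the refactoring-diligence trend check: flag methods whose
# length increased on each of their last few modifications. Assumes
# `method_lengths_by_revision` is a list, oldest to newest, of
# {method_name: line_count} maps built from a language-aware parser.
def consecutive_growth(method_lengths_by_revision, method, window=3):
    """True if `method` grew in each of its last `window` modifications."""
    history = [lengths[method]
               for lengths in method_lengths_by_revision
               if method in lengths]
    recent = history[-(window + 1):]          # need window + 1 samples
    if len(recent) < window + 1:
        return False
    return all(later > earlier
               for earlier, later in zip(recent, recent[1:]))

def diligence_comments(method_lengths_by_revision, window=3):
    """Yield review comments for methods showing sustained growth."""
    latest = method_lengths_by_revision[-1]
    for method in latest:
        if consecutive_growth(method_lengths_by_revision, method, window):
            yield (f"`{method}` has grown in each of its last {window} "
                   f"changes; consider refactoring before extending it.")
```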
We created a similar analysis module to highlight trends in computational complexity, using a method based on measuring whitespace and indentation [16] to give an approximation of code complexity whilst preserving some language independence. It is not always the case that complex code needs to be simplified – some modules implement complex algorithms, but are closed and need no further change. The more problematic case is when we have code of high complexity that changes often. Therefore, detecting increases in complexity at the point of change (and review) allows us to highlight a combination of high (or increasing) complexity and frequent change, which may indicate a hotspot in the codebase where refactoring would be likely to be beneficial. The tool can generate comments about trends either over the last n revisions, whenever those changes occurred, or over a period of time. In our trials, a threshold of n = 3 was used to trigger a warning.
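A minimal sketch of such an indentation-based proxy is shown below. The tab width, the weighting of indentation levels, and the n = 3 trigger are illustrative; Diggit's exact formulation may differ.

```python
# Sketch of a language-independent complexity proxy: total indentation
# depth over non-blank lines, in the spirit of the whitespace metric
# cited above. The tab width and n = 3 trigger are assumptions.
def indentation_complexity(source, tab_width=4):
    """Approximate complexity as the total indentation depth of a file."""
    total = 0
    for line in source.splitlines():
        if not line.strip():
            continue                      # ignore blank lines
        expanded = line.expandtabs(tab_width)
        total += (len(expanded) - len(expanded.lstrip(" "))) // tab_width
    return total

def rising_complexity(revisions, n=3):
    """True if complexity rose across each of the last n revisions.

    `revisions` is the file's source text per revision, oldest first.
    """
    scores = [indentation_complexity(src) for src in revisions]
    recent = scores[-(n + 1):]
    return (len(recent) == n + 1 and
            all(b > a for a, b in zip(recent, recent[1:])))
```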
IV. USE IN PRACTICE

To explore whether Diggit is a useful tool in practice, we studied its use with the development team at GoCardless. GoCardless is a company that runs a platform facilitating electronic payments. They have a development team comprising approximately 20 developers. The company already has a development process that involves manual code review managed through GitHub pull requests. The team used Diggit on one of their core services, for historical reasons known simply as gocardless. This is the main repository for the GoCardless API, with around 150k lines of Ruby code and contributions from over 50 developers over the lifetime of the project. Development typically proceeds at a rate of around ten pull requests a day. Diggit was set to analyse pull requests on the gocardless repository, at first in a hidden mode, so that we could see what it would do and tune the parameters, and then in a live mode, where it commented on the developers' pull requests as part of their normal review process.

We asked the developers to provide feedback on comments that had been useful to them during review, as well as those that were not helpful, so that we could further improve the system. One example where the automation worked very well is a change made by a developer from the GoCardless support team, who wanted to modify a schema description in response to a comment from a customer. The developer who made this change, Lewis, makes infrequent changes to gocardless and consequently forgot to rebuild the schema files after changing the schemata³.

³ https://github.com/interagent/prmd

Fig. 3. Diggit correctly highlighting a forgotten change.

The pull request in Figure 3 was created by user lewisblackwood, and we see that the first comment to appear is from the Diggit bot, suggesting that schema-secret.json and schema-private.json were likely missing from this change.
Examining the analyses in Diggit's database revealed that Lewis subsequently rebuilt the schema and pushed a new commit to the pull request. As this problem was now resolved, when Diggit ran over later commits to the same pull request (before it was merged), the analysis produced no comments. This is exactly the pattern that we would expect if a developer updates their change to resolve a problem that Diggit reports.

In Figure 3, we can see a correlation between the comment by jacobpgn and the comment that Diggit generated. The correctness of the analysis is further highlighted by the comment from greysteil, a member of the technical leadership team, noting Diggit's accuracy in this case.

Fig. 6. Comment occurrence on pull request 8322 of gocardless (some filenames anonymised).

More evidence of Diggit's accuracy was pull request 8322 for gocardless, which was a large refactoring across 24 files. The initial push modified 14 files, triggering several file suggestion and complexity warnings from Diggit. The developer continued to refine this change over 6 revisions before it was approved and merged into the master branch. The table in Figure 6 shows how Diggit commented on each of those revisions, and how by the end of the process all the comments Diggit made were resolved. It is interesting to note that in the second push, two warnings from the first revision were resolved, but four new ones were triggered. These are file suggestions for aml/checkers which were triggered when checker_a.rb was added to the diff, and then subsequently fixed. The comment about the increased complexity of attach_check_to_last_verification.rb was matched by a manual review comment (Figure 4) that suggested moving the change out into a new action, hence splitting the code into a larger number of simpler components.

Fig. 4. Manual reviews addressing growing complexity.

V. USER FEEDBACK

As well as gathering data from Diggit and the GoCardless code repository to see what analysis was generated and what changes followed, we also asked the GoCardless developers to provide qualitative feedback and used this to refine the tool. One issue that GoCardless experienced with the file suggestions was false positives caused by links between two files where changing file A would require a developer to modify file B, but modifying B would not require modifications to A. This issue was raised after an automated tool created pull requests upgrading each dependency of gocardless, with Diggit commenting (Figure 5) on pull requests that did not change the Gemfile, only the Gemfile.lock.

Fig. 5. Not all of Diggit's suggestions are helpful.

As greysteil comments in Figure 5, the association between Gemfile and Gemfile.lock is significant when a change to Gemfile leaves Gemfile.lock untouched. Unfortunately the reverse is not useful. In Ruby projects the Gemfile lists the required libraries, but the Gemfile.lock records the particular versions of these libraries, so adding a new library requires a change to both, but upgrading a version only requires a change to the lock file. Increasing Diggit's confidence threshold could prevent these warnings, but would reduce the overall recall. Also, tuning the parameters to vary confidence requires specialist knowledge of the mining algorithm. Most GoCardless developers working with the tool preferred to treat it as a black box, and instead to specify individual exceptions to the analysis rules using an ignore file. This allowed them to filter out false positives.
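The paper does not specify the format of this ignore file, so the sketch below is only one plausible scheme: a hypothetical .diggit-ignore file listing suppressed antecedent -> consequent pairs, against which mined rules are filtered before comments are posted.

```python
# Hypothetical ignore-file handling: Diggit's real format is not given in
# the paper, so this assumes one rule per line such as
#   Gemfile.lock -> Gemfile   # lock-only bumps should not warn
def load_ignored_rules(path=".diggit-ignore"):
    ignored = set()
    try:
        with open(path) as handle:
            for line in handle:
                line = line.split("#", 1)[0].strip()   # allow comments
                if "->" in line:
                    a, b = (part.strip() for part in line.split("->", 1))
                    ignored.add((a, b))
    except FileNotFoundError:
        pass
    return ignored

def filter_suggestions(rules, ignored):
    """Drop mined rules that the team has explicitly suppressed."""
    return [(a, b, sup, conf) for a, b, sup, conf in rules
            if (a, b) not in ignored]
```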
GoCardless also highlighted a few spurious results from the complexity analysis reporter. In some cases minor alterations to files were causing warning comments about increasing complexity, despite the change being isomorphic. Rubocop⁴, the prevalent Ruby linter, suggests that method parameters be aligned on following lines when a single-line method call would exceed the set line length. This often leads to method calls where the code is formatted such that the parameters are broken onto the next line. The whitespace-based analysis often detected this additional indentation, resulting in a large (but misleading) increase in reported complexity.

⁴ https://github.com/bbatsov/rubocop

Reducing the sensitivity of our analysis to these stylistic issues required the tool to have a greater understanding of the code. This demanded language-specific tooling (which we had initially been trying to avoid, in order to remain language agnostic), but once this decision was taken we could perform more detailed analysis. We changed Diggit's analysis engine to use a plugin mechanism that allows language-specific analyses, keyed against particular file extensions, and used an ABC complexity measure [17] for Ruby files, making use of the abstract syntax tree. This complexity metric is unaffected by formatting changes, and so although we lost a little generality, we reduced false positives.
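As an illustration of why an AST-based metric sidesteps these formatting issues, the sketch below computes an ABC-style score, sqrt(A^2 + B^2 + C^2) over assignments, branches (calls) and conditions, for Python source using its standard ast module. Diggit's actual implementation works on Ruby syntax trees, and the exact node classification here is an assumption rather than the metric Diggit ships.

```python
# Illustration of an ABC-style complexity measure over an abstract syntax
# tree, in Python rather than Diggit's Ruby. The point is that wrapping a
# call across lines leaves the AST, and therefore the score, unchanged.
import ast
import math

def abc_score(source):
    """Return the ABC magnitude sqrt(A^2 + B^2 + C^2) for Python source."""
    tree = ast.parse(source)
    assignments = branches = conditions = 0
    for node in ast.walk(tree):
        if isinstance(node, (ast.Assign, ast.AugAssign, ast.AnnAssign)):
            assignments += 1
        elif isinstance(node, ast.Call):
            branches += 1                      # calls counted as branches
        elif isinstance(node, (ast.If, ast.While, ast.Compare,
                               ast.BoolOp, ast.Try)):
            conditions += 1
    return math.sqrt(assignments ** 2 + branches ** 2 + conditions ** 2)

# Reformatting a call over several lines does not change the score:
one_line = "total = compute(a, b, c)"
wrapped = "total = compute(\n    a,\n    b,\n    c)"
assert abc_score(one_line) == abc_score(wrapped)
```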
Diggit has a 21% comment ratio at GoCardless: it produces a comment on approximately 1 in every 5 pull requests processed. Anecdotal feedback from the team showed that this amount of feedback felt about right to them. They were aware of the tool doing something useful, without it overwhelming their existing review process.

VI. EVALUATION AND CONCLUSION

We evaluated the effectiveness of Diggit's comments by comparing Diggit's suggested actions for a given pull request to those observed after manual code reviews. We explored this correlation by running analysis on twenty software projects (inside and outside GoCardless), looking at pull requests with manual reviews. We consider a Diggit comment to be effective if it is generated for one revision within a pull request, but not in subsequent revisions (the problem has been fixed). If we see this fix pattern for a pull request when the developers only had access to the manual review comments (with Diggit running in hidden mode), we infer that Diggit is automatically generating similar feedback to what a human would give.
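A sketch of this resolve criterion is shown below, assuming each revision's Diggit comments have been reduced to a set of hashable keys (for example reporter, file, message); treating any comment still present in the final revision as unresolved matches the "resolved prior to merging" rates reported in Table I.

```python
# Sketch of the "resolved" criterion used in the evaluation: a comment
# counts as resolved if Diggit raised it on some revision of a pull
# request but no longer raises it by the final revision before merge.
def resolve_rate(comments_by_revision):
    """comments_by_revision: one set of comment keys per revision, oldest first."""
    if not comments_by_revision:
        return 0, 0, 0.0
    raised = set().union(*comments_by_revision)   # every comment ever made
    unresolved = comments_by_revision[-1]         # still flagged at the end
    resolved = raised - unresolved
    total = len(raised)
    return len(resolved), total, (len(resolved) / total if total else 0.0)

# Example: two comments fixed, one still open at merge time -> 2/3 resolved.
assert resolve_rate([{"a", "b"}, {"b", "c"}, {"c"}]) == (2, 3, 2 / 3)
```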
TABLE I
DIGGIT COMMENT RESOLVE RATES.

Reporter                 Resolve Rate   Resolved   Total
Refactoring Diligence    38%            24         64
Complexity               44%            38         87
Change Suggestion        59%            104        176
Overall                  51%            166        327
Our results (Table I) support the conclusion that the same issues Diggit highlights are being tackled during manual review. On average, over 50% of the comments Diggit makes are resolved prior to merging of the pull request, suggesting a strong correlation between comments made by human reviewers and Diggit's analysis. Change suggestion has the highest resolve rate, with 59% of suggestions fixed before the pull request is merged. Complexity and refactoring diligence comments also show strong resolve rates, both seeing over a third of comments resolved before merge.

A criticism of resolve rate is that it only approximates developers taking action on analysis suggestions. Comments made on files that were later removed from the change would be counted as resolved, for example. Conversely, sometimes comments suggest taking action that is subsequently addressed in a separate change, which our statistics would miss. Overall these results indicate a general agreement between Diggit's comments and the actions taken in code review, but show that there is still room for improvement.

Looking specifically at GoCardless projects, 65% of the comments generated were taken as actionable by the developers and resulted in a fix. This higher rate may be due to the ability to suppress false positives on a project-specific basis.

Given these results we believe that Diggit demonstrates the potential for analysis tools to support code review, although we note the particular attention that needs to be paid to practical issues in order to integrate tools into an existing development process and have engineers engage with them. Diggit provides an example of successfully harnessing techniques developed in research, applying them to historical data already accumulated by the vast majority of industrial teams, and using this to help developers improve code quality by providing timely automated feedback on proposed changes.

REFERENCES

[1] S. McIntosh, Y. Kamei, B. Adams, and A. E. Hassan, "The impact of code review coverage and code review participation on software quality: A case study of the Qt, VTK, and ITK projects," in Proceedings of the 11th Working Conference on Mining Software Repositories, ser. MSR 2014. New York, NY, USA: ACM, 2014, pp. 192–201. [Online]. Available: http://doi.acm.org/10.1145/2597073.2597076
[2] C. Calcagno and D. Distefano, "Infer: An automatic program verifier for memory safety of C programs," in Proceedings of the Third International Conference on NASA Formal Methods, ser. NFM'11. Berlin, Heidelberg: Springer-Verlag, 2011, pp. 459–465. [Online]. Available: http://dl.acm.org/citation.cfm?id=1986308.1986345
[3] S. Elliott Sim and R. C. Holt, "The ramp-up problem in software projects: A case study of how software immigrants naturalize," in Proceedings of the 20th International Conference on Software Engineering, ser. ICSE '98. Washington, DC, USA: IEEE Computer Society, 1998, pp. 361–370. [Online]. Available: http://dl.acm.org/citation.cfm?id=302163.302199
[4] L. M. Berlin, "Beyond program understanding: A look at programming expertise in industry," Empirical Studies of Programming, vol. 93, no. 744, pp. 6–25, 1993.
[5] A. Tornhill, Your Code as a Crime Scene: Use Forensic Techniques to Arrest Defects, Bottlenecks, and Bad Design in Your Programs. Frisco, TX: Pragmatic Bookshelf, 2015.
[6] T. Zimmermann, A. Zeller, P. Weissgerber, and S. Diehl, "Mining version histories to guide software changes," IEEE Transactions on Software Engineering, vol. 31, no. 6, pp. 429–445, 2005.
[7] T. Rolfsnes, S. Di Alesio, R. Behjati, L. Moonen, and D. W. Binkley, "Generalizing the analysis of evolutionary coupling for software change impact analysis," in 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), vol. 1. IEEE, 2016, pp. 201–212.
[8] R. Agrawal, R. Srikant et al., "Fast algorithms for mining association rules," in Proc. 20th Int. Conf. Very Large Data Bases (VLDB), vol. 1215, 1994, pp. 487–499.
[9] J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without candidate generation," in ACM SIGMOD Record, vol. 29, no. 2. ACM, 2000.
[10] C. Borgelt, "An implementation of the FP-growth algorithm," in Proceedings of the 1st International Workshop on Open Source Data Mining: Frequent Pattern Mining Implementations. ACM, 2005, pp. 1–5.
[11] L. Jones, "Diggit: Mining source code repositories for developer insights," Imperial College London, Tech. Rep., 2016. [Online]. Available: http://www.imperial.ac.uk/computing/prospective-students/distinguished-projects/ug-prizes/archive/
[12] D. L. Parnas, "Software aging," in Proceedings of the 16th International Conference on Software Engineering, ser. ICSE '94. Los Alamitos, CA, USA: IEEE Computer Society Press, 1994, pp. 279–287. [Online]. Available: http://dl.acm.org/citation.cfm?id=257734.257788
[13] M. Fowler and K. Beck, Refactoring: Improving the Design of Existing Code, ser. Object Technology Series. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1999.
[14] K. Beck, Extreme Programming Explained: Embrace Change. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 2000.
[15] M. Feathers, "Detecting refactoring diligence," Dec. 2014.
[16] A. Hindle, M. W. Godfrey, and R. C. Holt, "Reading beside the lines: Indentation as a proxy for complexity metric," in 2008 16th IEEE International Conference on Program Comprehension, June 2008, pp. 133–142.
[17] J. Fitzpatrick, "Applying the ABC metric to C, C++, and Java," 1997.
