We grouped the 244 primary studies into five themes. In the remainder of this section, we summarize their main contributions.
4.2.1 Support Systems for Code Reviews.
This theme includes primary studies that contribute to solutions to support the MCR process, such as review prioritization, review automation, and reviewer recommendation.
Reviewer recommendations. A majority of papers on this theme focus on proposing tools to recommend reviewers and validate their approaches using historical data extracted from open source projects.
Most approaches recommend code reviewers based on the similarity between files modified or reviewed by each developer and the files of a new pull request (path similarity) [
87,
106,
122,
144,
145,
158,
187,
198,
204,
222,
230,
238,
242,
268,
271,
273,
275,
279]. Some studies include other predictors such as previous interactions between submitter and potential reviewers [
144,
145,
187,
273,
275,
285], pull request content similarity [
145,
211,
268,
275], contribution to similar files [
122,
158,
231,
274], review linkage graphs [
132], and developer activeness in a project [
144,
145,
158,
211,
271]. Another popular predictor to recommend code reviewers is the similarity between the content of previous and new pull requests [
145,
152,
166,
268,
269,
270,
275,
276].
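To illustrate the intuition behind these similarity-based predictors, the following sketch ranks candidate reviewers by the Jaccard similarity between the file paths of a new pull request and the paths each candidate has previously reviewed. The data model and example values are hypothetical simplifications; the cited approaches use richer path tokenizations, time decay, and additional predictors.

```python
# Minimal sketch of path-similarity reviewer recommendation.
# Hypothetical data model; not a reimplementation of any cited approach.

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two sets of file paths."""
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend_reviewers(new_pr_files, review_history, top_k=3):
    """Rank candidates by overlap between the new pull request's files
    and the files each candidate reviewed in the past."""
    new_files = set(new_pr_files)
    scores = {
        reviewer: jaccard(new_files, set(past_files))
        for reviewer, past_files in review_history.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy usage with invented reviewers and paths.
history = {
    "alice": ["src/auth/login.py", "src/auth/token.py"],
    "bob": ["docs/readme.md", "src/ui/button.py"],
}
print(recommend_reviewers(["src/auth/token.py", "src/auth/session.py"], history))
```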
In one study, the authors used participants’ preferences in review assignment [
263], while in another study, the authors combined the metadata of pull requests with the metadata associated with potential reviewers [
92]. Another study focuses on detecting and removing systematic labeling bias to improve prediction [
235]. Another interesting direction is to focus on recommending reviewers who will ensure knowledge distribution across the code base [
86,
176,
207]. Finally, some studies have included balancing review workload as an objective [
43,
49,
86,
230].
In relation to how the predictors are used to recommend code reviewers, many employ traditional approaches (e.g., cosine similarity), while some use machine learning techniques, such as Random Forest [
92], Naive Bayes [
92,
235], Support Vector Machines [
144,
276], Collaborative Filtering [
87,
230], and Deep Neural Networks [
222,
274], or model reviewer recommendation as an optimization problem [
43,
86,
187,
207,
211].
The performance of the identified approaches varies a lot and is often measured using Accuracy [
92,
122,
144,
204,
230,
238,
270], Precision and Recall [
106,
145,
166,
187,
204,
222,
275,
276,
279,
285], or Mean Reciprocal Rank [
106,
144,
230,
231,
235,
268,
279]. Of the identified studies, only a few [
49,
158,
198,
230] have evaluated code reviewer recommendation tools in live environments. Instead, the majority of the studies measures performance (accuracy, precision, recall, and mean reciprocal rank) by comparing the actual list of reviewers present in historical data with the list of developers recommended by their respective approaches.
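As a rough illustration of this offline evaluation, the following sketch computes top-k accuracy and mean reciprocal rank from such comparisons; the example inputs are invented.

```python
# Offline evaluation sketch: compare ranked recommendations with the reviewers
# recorded in historical data (invented example data).

def top_k_accuracy(recommended, actual, k=3):
    """Fraction of reviews where at least one actual reviewer appears in the top k."""
    hits = sum(1 for rec, act in zip(recommended, actual) if set(rec[:k]) & set(act))
    return hits / len(actual)

def mean_reciprocal_rank(recommended, actual):
    """Average of 1/rank of the first correctly recommended reviewer."""
    reciprocal_ranks = []
    for rec, act in zip(recommended, actual):
        rank = next((i + 1 for i, r in enumerate(rec) if r in act), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

recommended = [["alice", "bob", "carol"], ["dave", "erin", "alice"]]
actual = [["bob"], ["alice"]]
print(top_k_accuracy(recommended, actual, k=3))   # 1.0
print(mean_reciprocal_rank(recommended, actual))  # (1/2 + 1/3) / 2 ≈ 0.42
```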
One study focuses on identifying factors that should be accounted for when recommending reviewers [
223]; the identified factors include the number of files and commits in a pull request, the pull requester's profile, previous interactions between contributors, previous experience with related code, and ownership of the modified code.
Finally, only two studies evaluate whether reviewer recommendation really adds any value [
158,
230], with mixed results.
Understanding the code changes that need to be reviewed. Refactoring changes the code structure to improve testability, maintainability, and quality without changing its behavior. Supporting the review of such changes has been the focus of refactoring-aware tools. Refdistiller aims at detecting behavior-changing edits in refactorings [
46]. The tool uses two techniques: (a) a template-based checker that finds missing edits and (b) a refactoring separator that finds extra edits that may change a program’s behavior. In a survey of 35 developers, the authors found that differentiating between refactored and behavior-changing code would be useful, making reviews more efficient and correct. ReviewFactor is a tool able to detect both manual and automated refactorings [
111]. The evaluation of the tool showed that it can detect behavior-changing refactorings with high precision (92%) and recall (94%). RAID [
79] aims at reducing the reviewers’ cognitive effort by automatically detecting refactorings and visualizing information relevant for the refactoring to the reviewer. In a field experiment, professional developers reduced the number of lines they had to review for move and extraction refactorings. CRITICS is an interactive approach to review systematic code changes [
280]. It allows developers to find changes similar to a specified template, detecting potential mistakes. The evaluation indicates that (a) six engineers who used the tool would like to have it integrated in their review environment, and (b) the tool can improve reviewer productivity, compared to a regular diffing tool. A study at Microsoft proposes a solution to automatically cluster similar code changes [
58]. The clusters are then submitted for review. A preliminary user study suggests that the understanding of code did indeed improve when the changes were clustered. Similarly, SIL identifies similar code changes in a pull request and flags potential inconsistent or overlooked changes [
51]. In an inspection of 453 code changes in open source projects, it was found that up to 29% of the changes are composite, i.e., address different concerns [
234]. Decomposing large pull requests into cohesive change sets could therefore be valuable. A controlled experiment study found that change decomposition leads to fewer wrongly reported issues [
94]. ChgCutter [
118] provides an interactive environment that allows reviewers to decompose code changes into atomic modifications that can be reviewed and tested individually. Professional developers suggested in an interview study that the approach helps to understand complex changes. CoRA automatically decomposes commits in a pull request into clusters of related changes (e.g., bug fixes, refactorings) and generates concise descriptions that allow users to better understand the changes [
260]. Another idea to reduce review effort is to prioritize code that is likely to exhibit issues. One approach is to train a Convolutional Neural Network with old review comments and source code features to identify code fragments that require review [
229]. Similarly, CRUSO classifies code to be reviewed by identifying similar code snippets on StackOverflow and analyzing the corresponding comments and metadata, leveraging crowd knowledge [
148,
217].
Other research looked into the order in which changed files should be presented to the reviewer to achieve an effective review process [
63]. The study proposes the following main principle for the ordering: related change parts should be grouped as closely as possible. Another contribution to improve the understanding of changed code suggests identifying the “salient” class, i.e., the class in which the main change was made and that affects changes in other dependent classes [
136]. The authors hypothesize that reviews could be more efficient if the salient class would be known, making the logic of the commit easier to understand. A preliminary evaluation (questionnaire-based) with 14 participants showed that the knowledge about the salient class improves the understanding of a commit. A follow-up study with a broader evaluation confirms these results [
137]. A similar idea is implemented in BLIMP tracer, which inspects the impact of changes on a file level rather than on a class level [
264]. The tool was evaluated with 45 developers, and it improved speed and accuracy of identifying the artifacts that are impacted by a code change. SEMCIA was developed to reduce noise in change impact analysis and uses semantic rather than syntactic relationships. This approach reduces false positives by up to 53% and reduces the change impact sets considerably [
121]. MultiViewer is a code change review assistance tool that calculates metrics to better understand the change effort, risk, and impact of a change request [
257]. The approach implemented in the tool GETTY goes a step further, aiming at providing meaningful change summaries by identifying change invariants through analyzing code differences and test run results [
173]. With GETTY, reviewers can determine if a set of code changes has produced the desired effect. The approach was evaluated with the participation of 18 practitioners. The main finding was that GETTY substantially modified the review process to a hypothesis-driven process that led to better review comments.
Another direction of research for improving code understanding for reviews uses visualization of information. For example, ViDI supports visual design inspection and code quality assessment [
245]. The tool uses static code analysis reports to identify critical areas in code, displays the evolution of the amount of issues found in a review session, and allows the reviewer to inspect the impact of code changes. Git Thermite focuses on structural changes made to source code [
214]. The tool analyzes and visualizes metadata gathered from GitHub, code metrics of the modified files, and static source code analysis of the changes in pull requests. DERT aims at complementing line-based code diff tools with a visual representation of structural changes, similarly to UML but in a dynamic manner, allowing the reviewer to see an overview as well as details of the code change [
57]. Similarly, STRIFFS visualizes code changes in an UML class diagram, providing the reviewer an overview [
103]. CHANGEVIZ allows developers to inspect method calls/declarations related to the reviewed code without switching context, helping to understand a change and its impact [
110]. OPERIAS focuses on the problem of understanding how particular changes in code relate to changes to test cases [
186]. The tool visualizes source code differences and a change’s coverage impact. Finally, a tool was developed to improve the review process of visual programming languages (such as Petri nets) [
202]. It supports the code review of visual programming languages, similarly to what is already possible with textual programming languages.
Meyers et al. [
175] developed a dataset and proposed a Natural Language–based approach to identify potential problems and elicit an active response from the colleague responsible for modifying the target code. They trained a classifier to identify acted-upon comments with good prediction performance (AUC = 0.85).
Monitoring review performance and quality. González-Barahona et al. [
116] have proposed to quantitatively study the MCR process, based on traces left in software repositories. Without having access to any code review tool, they analyzed changelog files, commit records, and attachments and flags in Bugzilla records to monitor the size of the review process, the involved effort, and process delay. A similar study focused on review messages wherein changes are first reviewed through communication in a mailing list [
141]. They developed a series of metrics to characterize code review activity, effort, and delays [
142], which are also provided through a dashboard that shows the evolution of the review process over time [
140]. Another study looked at code reviews managed with Gerrit and proposed metrics to measure velocity and quality of reviews [
161]. Similar metrics, such as code churn, modified source code files, and program complexity, were used to analyze reviewer effort and contribution in the Android open source project [
177]. Other tools to analyze Gerrit review data are ReDa, which provides reviewer, activity, and contributor statistics [
243], and Bicho, which models code reviews as information from an issue tracking system, allowing users to query review statistics with standard SQL queries [
115]. Finally, Codeflow Analytics aggregates and synthesizes code review metrics (over 200) [
72].
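As a simplified illustration of such trace-based monitoring, the following sketch derives two basic metrics (review duration and comments per review) from a list of review records; the record format is hypothetical and far coarser than the metric suites and dashboards proposed in the cited studies.

```python
# Sketch of trace-based review monitoring (hypothetical record format).
from datetime import datetime
from statistics import mean

reviews = [  # invented example records
    {"opened": "2023-01-02", "closed": "2023-01-05", "comments": 4},
    {"opened": "2023-01-03", "closed": "2023-01-03", "comments": 1},
]

def duration_days(record):
    fmt = "%Y-%m-%d"
    opened = datetime.strptime(record["opened"], fmt)
    closed = datetime.strptime(record["closed"], fmt)
    return (closed - opened).days

print("mean review duration (days):", mean(duration_days(r) for r in reviews))
print("mean comments per review:", mean(r["comments"] for r in reviews))
```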
Determining the usefulness of code reviews. A study of three projects developed a taxonomy of review comments [
163,
164]. After training a classifier and categorizing 147K comments, they found that inexperienced contributors tend to produce code that passes tests while still containing issues, and external contributors break project conventions in their early contributions. In another study, Rahman et al. [
205] analyzed the usefulness of 1,116 review comments (a manual process that others have also attempted to automate [
192]) in a commercial system. They marked a comment as useful if it triggered a code change within its vicinity (up to 10 lines) and analyzed features of the review comment pertaining to its content and author. The results indicate that useful comments share more vocabulary with the changed code, contain relevant code elements, and are written by more experienced reviewers. Similarly, another study found that experienced reviewers are capable of pointing out broader issues than inexperienced ones [
131]. The study concluded that reviewer experience and patch characteristics such as commits with large and widespread modifications drive the number of comments and words in a comment [
131]. A study investigated the use of existing comments in code reviews [
225]. The study concluded that when the existing code review comment is about a type of bug, participants are more likely to find another occurrence of this type of bug. However, existing comments can also lead to availability bias [
225].
A study of 2,817 review comments found that only about 14% of comments are related to software design, of which 73% provided suggestions to address the concerns, indicating that they were useful [
278]. Another study investigated the characteristics of useful code reviews by interviewing seven developers [
78]. The study found that the most useful comments identify functional issues or scenarios where the reviewed code fails, and suggest API usage, design patterns, or coding conventions. “Useless” comments ask how an implementation works, praise code, or point to work needed in the future. Armed with this knowledge, the researchers trained a classifier that achieved 86% precision and 93% recall in identifying useful comments. Applying the classifier to 1.5M review comments, they found that (a) reviewer experience with the code base correlates with the usefulness of comments, suggesting that reviewer selection is crucial, (b) the smaller the changeset, the more useful the comments, and (c) a comment usefulness density metric can be used to pinpoint areas where code reviews are ineffective (e.g., configuration and build files). Criticism of these purely statistical, syntactic approaches arose because they do not analyze the actual meaning of comments [
100].
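To make this kind of purely syntactic heuristic concrete, the vicinity rule used by Rahman et al. [205] can be sketched as follows, assuming a hypothetical representation of comment positions and changed lines; the original study additionally analyzed the content of comments and the experience of their authors.

```python
# Sketch of the "usefulness" heuristic: a review comment counts as useful if a
# later code change touches a line within +/-10 lines of the commented line
# (hypothetical data representation).

def is_useful(comment_line: int, changed_lines: set, vicinity: int = 10) -> bool:
    return any(abs(comment_line - line) <= vicinity for line in changed_lines)

changed = {118, 119, 240}       # lines modified in the follow-up revision
print(is_useful(112, changed))  # True: line 118 is within 10 lines of the comment
print(is_useful(300, changed))  # False: no nearby change
```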
Managing code reviews. Before code hosting platforms, such as Github, became popular, researchers investigated how to provide support for reviews in IDEs. SeeCode integrates with Eclipse and provides a distributed review environment with review meetings and comments [
220]. Similarly, ReviewClipse supports a continuous post-commit review process [
70]. Scrub combines regular peer reviews with reports from static source code analyzers in a stand-alone application [
133]. Java Sniper is a web-based, collaborative code reviewing tool [
282]. All these early tools have been outlived by modern code hosting and reviewing infrastructure services such as GitHub, GitLab, BitBucket, Review Board, and Gerrit. However, while these platforms provide basic code reviewing functionalities, research has also looked at improving the reviewing process in different ways [
62]. For example, Dürschmid [
96] suggested continuous code reviews that allow anyone to comment on code they are reading or reusing, e.g., from libraries. Developers can then push questions and comments to upstream authors from within their IDE without context switching. Fistbump is a collaborative review platform built on top of GitHub, providing an iteration-oriented review process that makes it easier to follow rationale and code changes during the review [
147].
Fairbanks [
104] has proposed the use of
Design by Contract (DBC) and highlighted how it can improve the way software development teams do code reviews. When both the code author and the code reviewer agree on the goal of writing code with clear contracts, they can look out for the DBC practices being followed (or not) in the code being reviewed. The author lists a few DBC examples that can be used by software development teams.
Balachandran and Vipin [
56] proposed changes in the developer code review workflow to leverage online clone detection to identify duplicate code during code review. They evaluated their approach through a developer survey and learned that the proposed workflow change would increase the usage of clone detection tools and could reduce accidental clones.
Hasan et al. [
123] developed and evaluated an approach to measure the effectiveness of developers when doing code reviews. They defined a set of metrics and developed a model to measure code review usefulness. Their approach improved the state of the art by ~25%. They conducted a survey with participants from Samsung and learned that the respondents found their approach useful.
Optimizing the order of reviews. Code reviewers often need to prioritize which changes they should focus on reviewing first. Many studies propose to base the review decision on the likelihood that a particular change will eventually be accepted/merged [
53,
105,
213]. Fan et al. [
105] proposed an approach based on Random Forest. They evaluated their approach using data from three open source projects and learned that their approach is better than a random guess. In addition to the acceptance probability, Azeem et al. [
53] also considered the probability that a code integrator will review/respond to a code review request. They rank the code review requests based on both the acceptance and response probabilities, which are calculated using machine learning models. They evaluated their approach using data from open source projects and obtained solid results. Saini and Britto [
213] developed a Bayesian Network to predict acceptance probability. The acceptance probability is combined with other aspects, such as task type and the presence of merge conflicts, to order the list of code review requests associated with a developer. They evaluated their approach using both historical data and user feedback (both from Ericsson). They learned that their approach has good prediction performance and was deemed useful by the users.
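To make such a combination concrete, the sketch below orders review requests by a composite score built from a predicted acceptance probability, the presence of merge conflicts, and the task type; the weights and fields are invented for illustration and do not reproduce the model reported by Saini and Britto [213].

```python
# Hypothetical prioritization sketch: combine a predicted acceptance probability
# with other request attributes (weights are invented for illustration).

def priority(request):
    score = request["p_accept"]            # e.g., produced by a trained model
    if request["has_merge_conflict"]:
        score -= 0.3                       # conflicts likely delay the review
    if request["task_type"] == "bugfix":
        score += 0.1                       # give bug fixes a slight boost
    return score

requests = [
    {"id": 1, "p_accept": 0.8, "has_merge_conflict": True,  "task_type": "feature"},
    {"id": 2, "p_accept": 0.6, "has_merge_conflict": False, "task_type": "bugfix"},
]
for request in sorted(requests, key=priority, reverse=True):
    print(request["id"], round(priority(request), 2))  # request 2 (0.7) before request 1 (0.5)
```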
PRioritizer is a pull request prioritization approach that, similarly to a priority inbox, sorts requests that require immediate attention to the top [
253]. Users of the system reported that they liked the prioritization but miss insights on the rationale for the particular pull request ordering.
Other studies looked especially at the historic proneness of files to defects to direct review efforts. A study suggests combining bug-proneness, estimated review cost, and the state of a file (newly developed or changed) to prioritize files to review [
47]. The evaluation, performed on two open source projects, indicates that the approach leads to more effective reviews. A similar approach attempts to classify files as potentially defective, based on historic instances of detected faults and features on code ownership, change request, and source code metrics [
170].
Some studies focus on predicting the time to review [
281,
284]. Zhao et al. [
284] focused on using time to review to prioritize review requests. They employed a learning-to-rank approach to recommend review requests that can be reviewed quickly and evaluated it through a survey with GitHub code reviewers, who acknowledged the usefulness of the approach. Zhang et al. [
281] employed Hidden Markov Chains to predict the review time of code changes. They evaluated their approach using data from open source projects, with promising results.
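A pointwise approximation of this idea is sketched below: a regression model predicts review time from a few invented patch features, and pending requests are ordered by the prediction. The features, data, and model choice are hypothetical; the cited studies use a learning-to-rank formulation and Hidden Markov Chains, respectively.

```python
# Pointwise sketch: rank review requests by predicted review time.
# Invented features and data; not the models used in the cited studies.
from sklearn.ensemble import RandomForestRegressor

# features: [changed_files, changed_lines, author_prior_prs]
X_train = [[2, 40, 30], [15, 900, 2], [5, 120, 10], [1, 10, 50]]
y_train = [4, 72, 12, 2]  # review time in hours
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

pending = {"PR-101": [3, 60, 25], "PR-102": [12, 700, 1]}
ranked = sorted(pending, key=lambda pr: model.predict([pending[pr]])[0])
print(ranked)  # requests predicted to be quicker to review come first
```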
Wang et al. [
261] aimed at supporting review request prioritization by identifying duplicated requests. To do so, they consider a set of features, including the time when review requests are created. They developed a machine learning model to classify whether a code review request is duplicated. They validated their approach using data from open source projects, obtaining mixed results.
Automating code reviews. Gerede and Mazan [
112] have proposed to train a classifier that predicts whether a change request is likely to be accepted or not; knowing in advance that a change request is likely to be rejected would reduce the review effort, as such changes would not even reach the reviewing stage. They found that change requests submitted by inexperienced developers and involving many reviewers are the most likely to be rejected. In the same line of research, Li et al. [
162] used Deep Learning to predict a change’s acceptance probability. Their approach, called DeepReview, outperformed traditional single-instance approaches.
Review Bot uses multiple static code analysis tools to check for common defect patterns and coding standard violations to create automated code reviews [
55]. An evaluation with seven developers found that they agreed with 93% of the automatically generated comments, likely owing to the inconsistent adoption of coding standards, violations of which made up the majority of the identified defects. Similarly, Singh et al. [
221] studied the overlap of static analyzer findings with reviewer comments in 92 pull requests from GitHub. Of 274 comments, 43 overlapped with static analyzer warnings, indicating that 16% of the review workload could have been reduced with automated review feedback.
A series of studies investigated the effect of bots on code reviewing practice. Wessel et al. [
266] conducted a survey to investigate how software maintainers see code review bots. They identified that the survey participants would like enhancements in the feedback bots provide to developers, along with more help from bots to reduce the maintenance burden for developers and enforce code coverage. A follow-up study [
267], in which 21 practitioners were interviewed, identified distracting and overwhelming noise caused by review bots as a recurrent problem that affects human communication and workflow. However, a quantitative analysis [
265] of 1,194 software projects from GitHub showed that review bots increase the number of monthly merged pull requests. It also showed that after the adoption of review bots, the time to review and reject pull requests decreased, while the time to accept pull requests was unaffected. Overall, bots seem to have a positive effect on code reviews, and countermeasures to reduce noise, as discussed by Wessel et al. [
267], can even improve that effect.
The automated code review tool CFar has been used in a production environment, resulting in (a) enhanced team collaboration, as analysis comments were discussed; (b) improved productivity, as the tool freed developers from providing feedback about shallow bugs; and (c) improved code quality, since the flagged issues were acted upon. In addition, (d) the automatic review comments were found useful by the 98 participating developers [
127].
Recently, researchers have invested in using Deep Learning to automate code reviews [
52,
162,
218]. Some studies have focused on identifying the difference between different code revisions [
52,
218], while Tufano et al. [
244] focused on providing an end-to-end solution, from identifying code changes to providing review comments. Finally, Hellendoorn et al. [
126] evaluated if it is feasible at all to automate code reviews by developing a Deep Learning–based approach to identify the location of comments. They concluded that just this simple task is very challenging, indicating that a lot of research is still required before fully automated code review becomes a reality.
Analyzing sentiments, attitudes, and intentions in code reviews. Understanding review comments in greater detail could lead to systems that support reviewers in both formulating and interpreting the intentions of code reviews. A study on Android code reviews investigated the communicative goals of questions stated in reviews [
98], identifying five different intentions: suggestions, requests for information, attitudes and emotion, description of a hypothetical scenario, and rhetorical questions. A study at Microsoft showed that the type of a change intent can be used to predict the effort for a code review [
262]. A study on the Chromium project found that code reviews with lower inquisitiveness, higher sentiment (positive or negative), and lower syntactic complexity were more likely to miss vulnerabilities in the code [
180].
Several studies investigated how sentiments are expressed in code reviews [
42,
101,
134]. SentiCR flags comments as positive, neutral, or negative with 83% accuracy [
42] and was later compared to classifiers developed for the software engineering context. Surprisingly, it was outperformed by Senti4SD [
80]. The same investigation found that contributors often express their sentiment in code reviews and that negative and controversial reviews lead to a longer review completion time [
102]. A study at Google investigated interpersonal conflict and found, in a survey, that 26% of respondents have negative experiences with code reviews at least once a month [
101]. Furthermore, they found that the number of review rounds and the reviewing and shepherding time have a high recall but low precision in predicting negative experiences. Other research has focused on nonverbal physiological signals, such as electrodermal activity, stress levels, and eye movement, to measure affect during code reviews. These signals were associated with increased typing duration and could be used in the future to convey emotional state to improve the communication in code reviews that are typically conducted without direct interaction [
256]. A study categorized incivility in open source code review discussions [
107]. The results indicate that more than half (66.66%) of the non-technical emails included uncivil features. Frustration, name calling, and impatience are the most frequent features in uncivil emails. The study also concluded that sentiment analysis tools cannot reliably capture incivility. In a study of six open source projects, men expressed more sentiments (positive/negative) than women [
196].
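As a rough illustration of the sentiment classification tooling discussed above, the sketch below trains a bag-of-words classifier on a handful of invented review comments; tools such as SentiCR [42] and Senti4SD [80] are trained on much larger, software-engineering-specific datasets with more elaborate preprocessing.

```python
# Toy sentiment classifier for review comments (invented training data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

comments = [
    "Nice refactoring, much cleaner now",
    "This is a mess, did you even test it?",
    "Please rename this variable for clarity",
    "Great catch, thanks!",
]
labels = ["positive", "negative", "neutral", "positive"]

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
classifier.fit(comments, labels)
print(classifier.predict(["Thanks, this looks great", "Why is this so sloppy?"]))
```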
Code reviews on touch-enabled devices. Müller et al. [
179] have proposed to use multi-touch devices for collaborative code reviews in an attempt to make the review process more desirable. The approach provides visualizations, for example, to illustrate code smells and metrics. Other researchers have compared reviews performed on the desktop and on mobile devices [
108]. In an experiment, they analyzed 2,500 comments produced by computer science students and found that (a) the reviewers on the mobile device found as many defects as the ones on the desktop and (b) the mobile reviewers seemed to pay more attention to details.
Other solutions. Some primary studies propose initial proof-of-concept approaches for different purposes: to automatically classify commit messages as clean or buggy [
160], to eliminate stagnation of reviews and to optimize the average duration of open reviews [
255], to use interactive surfaces for collaborative code reviews [
200], and to link code reviews to code changes [
188]. In addition, a study [
167] compares two reviewer recommendation algorithms, concluding that recommendation models need to be trained for a particular project to perform optimally.
Links to tools and databases reported in the primary studies: We extracted the links to tools and databases reported in the primary studies that provide solutions to support modern code reviews. Only a few primary studies provide links to the proposed tools or to the databases used in the studies. Most of the proposed solutions support reviewer recommendation; however, only 2 of these 36 solutions provided links to the tools, and seven primary studies provided links to the databases they used. Links to tools and databases were reported most frequently (17/28) by the primary studies that provide support for understanding the changes that need to be reviewed. The complete list of links to the tools and databases, along with the purpose of each link, is available in our online repository [
6].
4.2.2 Human and Organizational Factors.
This theme includes primary studies that investigate the reviewer and/or contributor as a subject, for example, reviewer experience and social interactions. Studies that address human factors (e.g., experience) and organizational factors (e.g., social networks) are categorized into this theme.
Review performance and reviewers’ age and experience. The most investigated topic in this theme is the relation of reviewers’ age and experience to review performance. Studies found that reviewer expertise is a good indicator of code quality [
90,
154,
155,
174,
208,
240]. In addition, studies found that reviewers’ experience [
154] and developers’ experience [
66,
157] influence code review outcomes such as review time and patch acceptance or rejection. A study investigated human factors (review workload and social interactions) that relate to reviewers’ participation decisions [
210]. The results suggest that human factors play a relevant role in the reviewer participation decision. Another study investigated if age affects reviewing performance [
181]. The study, which compared students in their 20s and 40s, showed no difference based on age or development experience. Finally, there is some early work on harvesting reviewer experience through crowdsourcing the creation of rules and suggestions [
139].
Review performance and reviewers’ reviewing patterns and focus. Eye tracking has been used in several studies to investigate how developers review code. Researchers found that a particular eye movement, the scan pattern, is correlated with defect detection speed [
67,
216,
252]. The more time the developer spends on scanning, the more efficient the defect detection [
216]. Based on these results, researchers have also stipulated that reviewing skill and defect detection capability can be deduced from eye movement [
83]. Studies compared the review patterns of different types of programmers [
124,
138]. A study compared novice and experienced programmers and, based on their eye movements and reading strategies, concluded that experienced programmers grasped and processed information faster and with less effort [
124]. When comparing the eye-tracking results based on gender, a study found that men fixated more frequently, while women spent significantly more time analyzing pull request messages and author pictures [
138].
Review performance and reviewers’ workload. The impact of workload on code reviews has been investigated from two perspectives. First, a study found that workload (measured in pending review requests) negatively impacts review quality in terms of bug detection effectiveness [
155]. Second, a study spanning several open source projects found that workload (measured in concurrent and remaining review tasks) negatively impacts the likelihood that a reviewer accepts a new review invitation [
210].
Review performance and reviewers’ social interactions. Code reviews have been studied with different theoretical lenses on social interactions. A study used social network analysis to model reviewer relationships and found that the most active reviewers are at the center of peer review networks [
272]. Another study used the snowdrift game to model the motivations of developers participating in code reviews [
153]. They describe two motivations: (i) a reviewer has a motive of choosing a different action (review, not review) from the other reviewer, and (ii) a reviewer cooperates with other reviewers when the benefit of review is higher than the cost. A study found that past participation in reviews on a particular subsystem is a good predictor for accepting future review invitations [
210]. Similarly, another study looking at review dynamics found that the amount of feedback a patch has received is highly correlated with the likelihood that the patch is eventually voted to be accepted by the reviewer [
237].
Review performance and reviewers’ understanding of each other’s comments. A study on code reviews investigated if reviewers’ confusion can be detected by humans and if a classifier can be trained to detect reviewers’ confusion in review comments [
97]. The study concludes that while humans are quite capable of detecting confusion, automated detection is still challenging. Ebert et al. [
99] identified causes of confusion in the code: the presence of long or complex code changes, poor organization of work, dependency between different code changes, lack of documentation, missing code change rationale, and lack of tests. The study also identified the impact of confusion and strategies to cope with confusion.
Review performance and reviewers’ perception of code and review quality. A survey study conducted among reviewers identified factors that determine their perceived quality of code and code reviews [
154]. High-quality reviews provide clear and thorough feedback in a timely manner, by a peer with deep knowledge of the code base and strong personal and interpersonal qualities. Challenges to achieving high-quality reviews are of a technical (e.g., familiarity with the code) and personal (e.g., time management) nature.
The difference between core and irregular contributors and reviewers. Studies investigated the difference between core and irregular contributors and reviewers in terms of review requests, frequency, and speed [
65,
73,
75,
150,
199]. A study found that contributions from core developers were rejected faster (to speed up development), while contributions from casual developers were accepted faster (to welcome external contributions) [
65]. Similar observations were made in other studies [
75,
150], while Bosu and Carver [
73] found that top code contributors were also the top reviewers. A study explored different characteristics of the patches submitted to a company-owned OSS project and found that volunteers face 26 times more rejections than employees [
199]. In addition, patches submitted by volunteers wait, on average, 11 days for review, whereas patches from employees wait 2 days on average. Studies also investigated the acceptance likelihood of core and irregular contributors [
75,
125]. Bosu and Carver [
75] found that core contributors are more likely to have their changes accepted to the code base than irregular contributors. A potential explanation for this observation was found in another study [
125], showing that rejected code differs more from the project code (due to different code styles) than accepted code does. More experienced contributors submit code that is more compliant with the project’s code style. A study investigated the consequences of disagreement between reviewers who review the same patch [
130]. The study found that more experienced reviewers are more likely to have a higher level of agreement than less-experienced reviewers. A study investigating the career paths of contributors (from non-reviewer, i.e., developer, to reviewer, to core reviewer) found that (a) there is little movement between the population of developers and reviewers, (b) the turnover of core reviewers is high and occurs rapidly, (c) companies are interested in having core reviewers in their full-time staff, and (d) being a core reviewer seems to be helpful in achieving a full-time employment in a project [
254].
The effect of the number of involved reviewers on code reviews. A study found that the more developers are involved in the discussion of bugs and their resolution, the less likely the reviewers are to miss potential problems in the code [
155]. The same does not hold true for reviewer comments: Surprisingly, the studied data indicate that the more reviewers participate with comments on reviews, the more likely they are to miss bugs in the code they review. Another study also made a counter-intuitive observation: Files vulnerable to security issues tended to be reviewed by more people [
174]. One reported explanation is that reviewers get confused about what their role in the review is if there are many reviewers involved (diffusion of responsibility). Similar results were found in a study of a commercial application: The more reviewers are active, the less efficient the review and the lower the comment density [
95]. In a study including both open source and commercial projects, it was observed that it is general practice to involve two reviewers in a review [
209].
Information needs of reviewers in code reviews. A study identified the following information need categories: alternative solutions and improvements, correct understanding, rationale, code context, necessity, specialized expertise, and splitability of a change [
195]. The authors of the study find that some of the information needs can be satisfied by current tools and research results, but some aspects remain unsolved and need further investigation. Studies investigated the use of links in review comments [
143,
259]. A case study of the OpenStack and Qt projects indicated that the links provided in code review discussion served as an important resource to fulfill various information needs such as providing context and elaborating patch information [
259]. Jiang et al. [
143] found that 5.25% of pull requests in 10 popular open source projects have links. The authors conclude that pull requests with links have more comments, more commenters, and longer evaluation times. Similar results were found in a study of three open source projects [
258], where patches with links took longer to review. The study also finds that combining two features (i.e., textual content and file location) is effective in detecting patch linkages. Similarly, machine learning classifiers can be used to automate the detection of patch linkages [
132].
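A minimal sketch of combining the two features mentioned above (textual content and file location) could look as follows; the weighting and threshold are hypothetical, and the cited studies train classifiers on such features rather than applying a fixed rule.

```python
# Hypothetical two-feature sketch for detecting linked patches: textual
# similarity of descriptions plus overlap of touched files.

def token_jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def likely_linked(patch_a, patch_b, text_w=0.5, file_w=0.5, threshold=0.4):
    text_sim = token_jaccard(patch_a["description"], patch_b["description"])
    file_sim = token_jaccard(" ".join(patch_a["files"]), " ".join(patch_b["files"]))
    return text_w * text_sim + file_w * file_sim >= threshold

a = {"description": "fix token refresh race", "files": ["src/auth/token.py"]}
b = {"description": "follow-up fix for token refresh", "files": ["src/auth/token.py"]}
print(likely_linked(a, b))  # True for this invented example
```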
4.2.3 Impact of Code Reviews on Product Quality and Human Aspects (IOF).
This theme includes primary studies that investigate the impact of code reviews on artifacts, such as code and design, and on human aspects, such as attitude and understanding.
The impact of code reviews on defect detection or repair. A study showed that unreviewed commits are over twice as likely to introduce bugs as reviewed commits [
7]. Similarly, observations from another study show that both defect-prone and defective files tend to be reviewed less rigorously in terms of review intensity, participation, and time than non-defective files [
239].
Another study has investigated how code review coverage (the proportion of reviewed code of the total code), review participation (length and speed of discussions), and reviewer expertise affect post-release defects in large open source projects [
172]. The findings suggest that reviewer participation is a strong indicator of defect detection ability. While high code review coverage is important, it is even more important to monitor the participation of reviewers when making release decisions and to select reviewers with adequate expertise on the specific code. However, these findings could not be confirmed in a replication study [
159]. They found that review measures are neither necessary nor sufficient to create a good defect prediction model. The same conclusions were confirmed in a project of proprietary software [
219]. In their context, other metrics such as the proportion of in-house contributions, the measure of accumulated effort to improve code changes, and the rate of author self-verification contributed significantly to defect proneness [
219].
Defective conditional statements are often the source of software errors. A study [
248] found that negations in conditionals and implicit indexing in arrays are often replaced with function calls, suggesting that reviewers found that this change leads to more readable code.
The impact of code reviews on code quality. Studies were conducted to find the problems fixed by code reviews. A study concluded that 75% of the defects identified during code review are evolvability type defects [
201]. They also found that code review is useful in improving the internal software quality (through refactoring). Similarly, other studies [
68,
171] found that 75% of changes are related to evolvability and only 25% of changes are related to functionality.
Studies investigated the impact of code reviews on refactoring. A study on 10 Java OSS projects found that the most frequent changes in MCR commits concern code structure (e.g., refactorings and reorganizations) and software documentation [
194]. An investigation of 1,780 reviewed code changes from 6 systems pertaining to two large open source communities found that refactoring is often mixed with other changes such as adding a new feature [
191]. In addition, developers had an explicit intent to refactor in only 31% of the reviews that involved refactoring [
191]. An empirical study on refactoring-inducing pull requests found that 58.3% presented at least one refactoring edit induced by code review [
88]. In addition, Beller et al. [
68] found that 78–90% of the triggers for code changes are review comments. The remaining 10–22% are “undocumented.” Another study showed that reviewed commits have significantly higher readability and lower complexity. However, no conclusive evidence was reported on coupling [
7].
A study of OpenStack patches found that patches show higher code conformance after being reviewed than when they were first submitted [
228]. An investigation of the impact of code review on coding convention violations found that convention violations disappear after code reviews; however, only a minority of the violations were removed because they were flagged in a review comment [
119]. A comparison of the cost required to produce quality programs using code reviews and pair programming showed that code review costs 28% less than pair programming [
233].
The impact of code reviews on detection or fixes of security issues. According to a study [
77], code review leads to the identification and fixing of different vulnerability types. The experience of reviewers regarding vulnerability issues is an important factor in finding security-related problems, as a study indicates [
174]. Another large study [
236] also has similar findings. The results indicate that code review coverage reduces the number of security bugs in the investigated projects. A study looked into the language used in code reviews to find whether linguistic characteristics could explain developers missing a vulnerability [
180]. The study found that code reviews with lower inquisitiveness (fewer questions per sentence), higher positive or negative sentiment, lower cognitive load, and higher assertions are more likely to miss a vulnerability. A study investigated the security issues identified through code reviews in an open source project [
93]. They found that 1% of reviewers’ comments concern security issues. Language-specific issues (e.g., C++ issues and buffer overflows) and domain-specific ones (e.g., Cross-Site Scripting) are frequently missed security issues, and initial evidence indicates that reviews conducted by more than two reviewers are more successful at finding security issues. Another online study on freelance developers’ code review process has similar findings, indicating that developers did not focus on security in their code reviews [
91]. However, the results showed that prompting for finding security issues in code reviews significantly affects developers’ identification of security issues. A study identified factors relevant to the successful identification of security issues in code reviews [
197]. The results indicate that the probability of identifying security issues decreases with an increase in review factors such as the number of a reviewer’s prior reviews and the number of review comments authored on a file during the current review cycle. In addition, the probability of identifying security issues increases with review time, the number of mutual reviews between the code author and a reviewer, and a reviewer’s number of prior reviews of the file.
The impact of code reviews on software design. A study [
178] found that high code review coverage can help to reduce the incidence of anti-patterns such as Blob, Data Class, Data Clumps, Feature Envy, and Code Duplication in software systems. In addition, a lack of participation (length and speed of discussions) during code reviews is associated with a higher occurrence of certain code anti-patterns. Similarly, a study [
184] specifically looked for occurrences of review comments related to five code smells (Data Clumps, Duplicate Code, Feature Envy, Large Class, and Long Parameter List) and found that the code review process did identify these code smells. An empirical study of code smells in code reviews in the two most active OpenStack projects (Nova and Neutron) found that duplicated code, bad naming, and dead code are the most frequently identified smells in code reviews [
120]. Another investigation of 18,400 reviews and 51,889 revisions found that 4,171 of the reviews led to architectural changes, 731 of which were significant changes [
189].
The impact of code reviews on design degradation is investigated in two studies [
246,
247]. A study on code reviews in OSS projects found that certain code review practices such as long discussions and reviewers’ disagreements can lead to design degradation [
247]. To prevent design degradation, it is important to detect design impactful changes in code reviews. A study found that technical features (code change, commit message, and file history dimensions) are more accurate than social ones in predicting (un)impactful changes [
246].
The impact of code reviews on teams’ understanding of the code under review. An interview study [
227] found that code reviews help to improve the team’s understanding of the code under review. In addition, code review may be a valuable addition to pair programming, particularly for newly established teams [
227]. Similarly, a survey of developers and a study of email threads found that developers find code review dialogues useful for understanding design rationales [
232]. Another survey of developers [
76] found that code reviews help in knowledge dissemination. A survey of reviewers also found that code review promotes collective code ownership [
69]. However, Caulo et al. [
82] were not able to capture the positive impact of code review on knowledge translation among developers. The authors attribute the negative results to fallacies in their experiment design and notable threats to validity.
The impact of code reviews on peer impression in terms of trust, reliability, perception of expertise, and friendship. A survey of open source contributors [
74] found that there is a high level of trust, reliability, and friendship between open source software projects’ peers who have participated in code review for some time. Peer code review helped most in building a perception of expertise between code review partners [
74]. Similarly, another survey [
76] found that the quality of the code submitted for review helps reviewers form impressions about their teammates, which can influence future collaborations.
The impact of code reviews on developers’ attitude and motivation to contribute. An analysis of two years of code reviews showed that review feedback has an impact on contributors becoming long-term contributors [
185]. Specific feedback such as “Incomplete fix” and “Sub-optimal solution” might encourage contributors to continue to work in open source software projects [
185]. Similarly, a very large study found that negative feedback has a significant impact on developers’ attitude [
215]. Developers might not contribute again after receiving negative feedback, and this impact increases with the size of the project [
215].
4.2.4 Modern Code Review Process Properties (CRP).
This theme includes primary studies investigating how and when reviews should be conducted and characteristics such as review benefits, motivations, challenges, and best practices.
When should code reviews be performed? Research shows that code reviews in large open source software projects are done at short intervals [
208,
209]. In particular, large and formal organizations can benefit from creating overlap between developers’ work, which produces invested reviewers, and from increasing review frequency [
208].
What are the benefits of code reviews besides finding defects? A study on large open source software projects found that code reviews act as a group problem-solving activity. Code reviews support team discussions of defect solutions [
208,
209]. The analysis of over 100,000 peer reviews found that reviews also enable developers and passive listeners to learn from the discussion [
208,
209]. A similar observation was made in a survey of 106 practitioners, where, besides knowledge sharing, the development of cognitive empathy was identified as a benefit of code reviews [
89].
How are review requests distributed? Research found that reviews distributed via broadcast (e.g., mailing list) were twice as fast as unicast (e.g., Jira). However, reviews distributed via unicast were more effective in capturing defects [
48]. In the same investigation, code reviewers reported that a unicast review allows them to comment on specific code, visualize changes, and have less traffic of patches circulating among reviewers. However, new developers learn the code structure faster with frequent circulation of patches among those who subscribe to broadcast reviews.
Efficiency and effectiveness of code reviews compared to team walkthroughs. Team walkthroughs are often used in safety-critical projects but come with additional overhead. In a study of a project that developed an airport operational database, the MCR process was compared with a walkthrough process [
71]. The authors suggest adopting MCR to ensure coverage while adapting the formality to the criticality of the item under review.
Over-the-shoulder (OTS) reviews are synchronous code reviews where the author leads the analysis. A study experimentally compared OTS with tool-assisted (TA), asynchronous code reviews. It was found that OTS generates higher-quality comments about more important issues and better supports knowledge transfer, while TA generates more comments [
146].
Mentioning peers in code review comments. A study explored the use of @-mentions, a social media tool, in pull requests [
283]. The main findings were that @-mentions are used more frequently in complex pull requests and lead to shorter delays in handling pull requests. Another study investigated which socio-technical attributes of developers are able to predict @-mentions. It found that a developer’s visibility, expertise, and productivity are associated with @-mentions, while, contrary to intuition, responsiveness is not [
149]. Generalizing the idea of @-mentions, other researchers investigated which information objects stakeholders refer to in pull request discussions. Building taxonomies of reference and expression types, they found that source code elements are most often referred to, even though the studied platform (GitHub) does not provide any support for creating such links (in contrast to references to people or issue reports) [
84].
Test code reviews. Observations on code reviews found that the discussions on test code are related to testing practices, coverage, and assertions. However, test code is not discussed as much as production code [
224]. When reviewing test code, developers face challenges such as lack of testing context, poor navigation support (between test and production code), unrealistic time constraints imposed by management, and poor knowledge of good reviewing and testing practices by novice developers [
224]. Test-driven code review is the practice of reviewing test code before production code; it was studied in a controlled experiment and a survey [
226]. It was found that the practice changes neither review comment quality nor the overall number of identified issues. However, more test issues were identified at the expense of maintainability issues in production code. Furthermore, the survey found that reviewing tests was perceived as having low importance and lacking tool support.
Decision-making in the code review process. The review process and the resulting artifacts are an important source of information for the integration decision of pull requests. In a qualitative study limited to two OSS projects, it was found that the most frequent reason for rejection is unnecessary functionality [
117]. In a quantitative study of 4.8K GitHub repositories and 1M comments, it was found that there are proportionally more comments, participants, and comment exchanges in rejected than in accepted pull requests [
114]. Another aspect of decision-making in code reviews is multi-tasking. It was observed that reviewers participating simultaneously in several pull requests (which happens in 62% of the 1.8M studied pull requests) increase the resolution latency [
135]. MCR processes often contain a voting mechanism that informs the integrator about the community consensus about a patch. The analysis of a project showed that integrators use patch votes only as a reference and decide in 40% of the cases against the simple majority vote [
129]. Still, patches that receive more negative than positive votes are likely to be rejected.
Comparison of pre-commit and post-commit reviews. In change-based code reviews, one has the choice to perform either pre-commit or post-commit reviews. Researchers have created and validated a simulation model, finding that there are no differences between the two approaches in most cases [
59]. In some cases, post-commit reviews were better regarding cycle time and quality, whereas pre-commit reviews had better review efficiency.
Strategies for merging pull requests. A survey of developers and analysis of data from a commercial project found that pull request size, the number of people involved in the discussion of a pull request, author experience, and their affiliation are significant predictors of review time and merge decisions [
156]. It was found that developers determine the quality of a pull request by the quality of its description and complexity and the quality of the review process by the feedback quality, test quality, and the discussion among developers [
156].
Motivations, challenges, and best practices of the code review process. Several studies have been conducted to investigate benefits and challenges of modern code reviews. An analysis found that improving code, finding defects, and sharing knowledge were the top three of nine identified benefits associated with code reviews [
169]. Similar studies identified knowledge sharing [
34,
89], history tracking, gatekeeping, and accident prevention as benefits of code reviews [
34]. Challenges such as receiving timely feedback, review size, and managing time constraints were identified as the top three of 13 identified challenges [
3,
169]. Challenges such as geographical and organizational distance, misuse of tone and power, unclear review objectives and context were also identified [
34]. In the context of refactoring, a survey found that changes are often not well documented, making it difficult for reviewers to understand the intentions and implications of refactorings [
45]. The best practices for code authors include writing small patches, describing and motivating changes, selecting appropriate reviewers, and being receptive toward reviewers’ feedback [
169]. The code reviewers should provide timely and constructive feedback through effective communication channels [
169]. Code reviews are a well-established practice in open source development (FOSS). An interview study [
44] set out to understand why code review works in FOSS communities and found that (1) negative feedback is embraced as a positive opportunity for improvement and should be neither reduced nor eliminated, (2) the ethic of passion and care creates motivation and resilience to rejection, and (3) both intrinsic (altruism and enjoyment) and extrinsic (reciprocity, reputation, employability, learning opportunities) motivations are important. Another study proposes a catalog of MCR anti-patterns that describe reviewing behaviors or process characteristics that are detrimental to the practice: confused reviewers, divergent reviewers, low review participation, shallow review, and toxic review [
85]. Preliminary results from studying a small sample (100) of code reviews show that 67% contain at least one anti-pattern.
4.2.5 Impact of Software Development Processes, Patch Characteristics, and Tools on Modern Code Reviews (ION).
This theme includes primary studies investigating the impact of processes (such as continuous integration), patch characteristics (such as change size and descriptions), and tools (e.g., static analyzers) on modern code reviews.
The impact of static analyzers on the code review process. A study on six open source projects analyzed which defects are removed by code reviews and are also detected by static code analyzers [
193]. In addition, a study [
249] found that the issues raised by a coding style checker can improve patch authors’ coding style, helping them avoid the same type of issues in subsequent patch submissions. However, warnings from static analyzers can be irrelevant for a given project or development context. To address this issue, a study [
250] proposed a coding convention checker that detects project-specific patterns. While most of the produced warnings would not be flagged in a review, addressing defects regarding imports, regular expressions, and type resolutions before the patch submission would indeed reduce the reviewing effort. Through an experiment [
128], it was found that the use of a symbolic execution debugger to identify defects during the code review process is effective and efficient compared to a plain code-based view. Another study [
168] proposed a static analyzer for extracting first-order logic representations of API directives that reduces the code review time.
The impact of gamification elements on the code review process. Gamification mechanisms for peer code review are proposed in a study [
251]. However, an experiment with gamification elements in the code review process found that there is no impact of gamification on the identification of defects [
151].
The impact of continuous integration on the code review process. Experiments with 26,516 automated build entries reported that successfully passed builds are more likely to improve code review participation and frequent builds are likely to improve the overall quality of the code reviews [
203]. Similar findings were confirmed in a study [
277] that found that passed builds have a higher chance of being merged than failed ones. On the impact of CI on code reviews, a study [
81] found that on average CI saves up to one review comment per pull request.
The impact of code change descriptions on the code review process. Interviews with industrial and OSS developers concluded that providing motivations for code changes along with a description of what is changed reduces the reviewer burden [
206]. Similarly, an analysis of OSS projects found that a short patch description can lower the likelihood of attracting reviewers [
241].
The impact of code change size on the code review process. An investigation of a large commercial project with 269 repositories found that as patch size increases, reviewers become less engaged and provide less feedback [
172]. An interview study with industrial and OSS developers found that code changes that are properly sized are more reviewable [
206]. The size of patches negatively affects the review response time, as observed in a study on code reviews [
66], and reduces the number of review comments [
165] and code review effectiveness, as shown in a study of an OSS project [
64]. Similarly, an analysis of more than 100,000 peer reviews in open source projects recommends that changes to be reviewed should be small, independent, and complete [
95].
The impact of commit history coherence on the code review process. An interview study with industrial and OSS project developers found that commits with self-explanatory and meaningful messages are easier to review [
206]. In addition, interviewees suggest that the ratio of commits in a change to the number of files changed should not be high [
206].
The impact of review participation history on the code review process. An analysis of three OSS projects found that the likelihood of attracting reviewers is higher when past changes to the modified files were reviewed by at least two reviewers [
241]. Prior patches that had few reviewers tend to be ignored [
241]. Another study, looking at reviews from two OSS projects, found that more active reviewers have faster response times [
66].
The impact of fairness on the code review process. Fairness, in general, refers to making decisions and allocating resources in a way that is fair to both the individuals and the group. A study [
113] in an OSS project investigated different fairness aspects and recommends, besides the common aspects of fairness such as politeness and precise and constructive feedback, to (a) distribute reviews fairly and (b) establish a clear procedure for how reviews are performed. A study [
109] investigated how contributions from different countries are treated. The study found that developers from countries with low human development face rejection the most. From the perspective of bias, a study [
182] investigated the benefits of anonymous code reviews. The results indicate that while anonymity reduces bias, it is sometimes possible to identify the reviewer, and there are some practical disadvantages, such as not being able to discuss directly with the reviewer. The study recommends providing a possibility to reveal the reviewer when required. Another qualitative study [
183] found that there may be perceptible race bias in the acceptance of pull requests. Similarly, a study investigated the impact of gender, human, and machine bias in response time and acceptance rate [
138]. The results indicate that gender identity has a significant effect on response time, and that all participants spend less time evaluating the pull requests of women and are less likely to accept the pull requests of machines.
The impact of rebasing operations on the code review process. An in-depth, large-scale empirical investigation of the code review data of 11 software systems, 28,808 code reviews, and 99,121 revisions found that rebasing operations are carried out in an average of 75.35% of code reviews, and that 34.21% of these operations tend to tamper with the reviewing process [
190].