

Code Reviewer Recommendation for Architecture Violations: An Exploratory Study

Ruiyin Li
School of Computer Science, Wuhan University, Wuhan, China
Department of Mathematics and Computing Science, University of Groningen, Groningen, The Netherlands
ryli_cs@whu.edu.cn

Peng Liang∗
School of Computer Science, Wuhan University, Wuhan, China
Hubei Luojia Laboratory, Wuhan, China
liangp@whu.edu.cn

Paris Avgeriou
Department of Mathematics and Computing Science, University of Groningen, Groningen, The Netherlands
p.avgeriou@rug.nl

ABSTRACT
Code review is a common practice in software development and is often conducted before code changes are merged into the code repository. A number of approaches for automatically recommending appropriate reviewers have been proposed to match such code changes to pertinent reviewers. However, such approaches are generic, i.e., they do not focus on specific types of issues during code reviews. In this paper, we propose an approach that focuses on architecture violations, one of the most critical types of issues identified during code review. Specifically, we aim at automating the recommendation of code reviewers, who are potentially qualified to review architecture violations, based on reviews of code changes. To this end, we selected three common similarity detection methods to measure the file path similarity of code commits and the semantic similarity of review comments. We conducted a series of experiments on finding the appropriate reviewers through evaluating and comparing these similarity detection methods in separate and combined ways with the baseline reviewer recommendation approach, RevFinder. The results show that the common similarity detection methods can produce acceptable performance scores and achieve a better performance than RevFinder. The sampling techniques used in recommending code reviewers can impact the performance of reviewer recommendation approaches. We also discuss the potential implications of our findings for both researchers and practitioners.

CCS CONCEPTS
• Software and its engineering → Designing software; • General and reference → Empirical studies.

KEYWORDS
Code Review, Reviewer Recommendation, Architecture Violation

ACM Reference Format:
Ruiyin Li, Peng Liang, and Paris Avgeriou. 2023. Code Reviewer Recommendation for Architecture Violations: An Exploratory Study. In Proceedings of the International Conference on Evaluation and Assessment in Software Engineering (EASE ’23), June 14–16, 2023, Oulu, Finland. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3593434.3593450

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
EASE ’23, June 14–16, 2023, Oulu, Finland
© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 979-8-4007-0044-6/23/06...$15.00
https://doi.org/10.1145/3593434.3593450

1 INTRODUCTION
Code review is widely employed in modern software development and is recognized as a valuable and effective practice at all stages of the development life cycle [2]. Active participation of developers in code review decreases defects, improves the software quality, and facilitates knowledge sharing through rich communication among reviewers [2, 24]. Over the last decade, several tools have been widely used in both industry and open-source communities to make the code review process more effective, such as Phabricator¹, Review-Board², and Gerrit³. Although such tools provide automated techniques to support the code review process, there are still a significant number of human factors that can influence code review activities, such as unqualified reviewers, response delays, and overloaded review workload [3, 7, 24].

At the heart of the human-related issues lies the process of matching code to reviewers: authors who submit new code patches to a code review system need to invite (or the system can assign) reviewers to manually check the uploaded code fragments based on the reviewers’ expertise and past experience with reviews; this may be a labor-intensive and time-consuming task, especially for large projects [31]. Previous studies [4, 7, 24] found that effective code review requires a significant amount of effort from reviewers who thoroughly understand the submitted code. However, inappropriate code reviewers might hinder the review process, delay the incorporation of a code change into a code base, and slow down the development process. Such problems arise from misunderstanding or simply lacking knowledge of the intention or effect of code changes [8]. A proper recommendation of code reviewers can help reduce delays and speed up development by finding appropriate reviewers who are more familiar with and spend less time reviewing the submitted code fragments [3, 26].

There exist a number of code reviewer recommendation approaches in the literature (see Section 2.2). While these approaches can be effective, they are all generic in terms of the issues that reviewers focus on. In this work, we focus on a particular type of issue: architecture violations. While architecture violations are one of the most frequently identified types of architecture issues during code review, they are not effectively covered by existing techniques and tools [18]. If code fragments with architecture violations are merged into the code base, it will increase the risk of architecture erosion [17, 18] and gradually degrade architecture sustainability and stability [27].

The goal of this work is to offer an automated recommendation of code reviewers, who are potentially qualified to review architecture violations. More specifically, we aim at recommending potential code reviewers who have knowledge on architecture violations, through analyzing textual content of the review comments and the file paths of the reviewed code changes. Consequently, our approach is not limited to specific programming languages. This can act in a complementary way to a regular code review: a final check by reviewers who are knowledgeable in architecture violations can act as a quality gate to avoid code changes with architecture violations being merged into the code base.

Our proposed approach is novel in terms of mining semantic information in review comments from code reviewers, based on common similarity detection methods. To validate our approach, we conducted a series of experiments on 547 code review comments related to architecture violations from four Open-Source Software (OSS) projects (i.e., Nova, Neutron, Qt Base, and Qt Creator). The results show that the employed similarity detection methods can produce acceptable performance scores (i.e., values of top-k accuracy and mean reciprocal rank metrics) and achieve a better performance than the baseline approach, RevFinder [26]. We managed to further explore the performance of the proposed approach on our dataset, by using fixed sampling instead of incremental sampling.

The main contributions of our work are:
• We explored the possibility of using common similarity detection methods for recommending code reviewers who have awareness of architecture violations.
• We conducted experiments to evaluate and compare the performance of three similarity detection methods with the baseline approach RevFinder on four OSS projects.
• We shared the source code and dataset of our work [15] to encourage further research on code reviewer recommendation for architecture issues.

The remainder of this paper is structured as follows: Section 2 describes the background regarding performing code review in Gerrit, code reviewer recommendation, and architecture violations. Section 3 elaborates on the research questions and study design. Section 4 presents the results of the research questions and discusses their implications. Section 5 clarifies the threats to validity and limitations of this study. Section 6 reviews the related work and Section 7 concludes this study with future directions.

¹ https://www.phacility.com/
² https://www.reviewboard.org/
³ https://www.gerritcodereview.com/

2 BACKGROUND

2.1 Code Review Process in Gerrit
Code review refers to the process of inspecting source code, which is a critical activity during development and can help to improve software quality [2]. The workflow of code review varies slightly between different platforms, e.g., the pull-request workflow in GitHub is different than code review in Gerrit. Gerrit is a commonly used platform for coordinating code review activities and facilitating traceable code reviews for git-based software development. In this work, we collected code review data from four OSS projects of two large communities, OpenStack and Qt (see Section 3.2), both of which use Gerrit to conduct the code review process. We briefly elaborate on the code review process with Gerrit (see Figure 1).

[Figure 1: An overview of code review process in Gerrit. Labels in the figure: developer, commits, sanity check, repository, invited reviewer, review comments, approve (merged), reject (abandoned).]

A developer can submit new code patches or modify the original code fragments fetched from the repository through revisions; both take the form of commits. Gerrit then creates a page to record the submitted commits after conducting automated tasks, like a sanity check. Other developers of the project will be invited as reviewers to inspect the submitted commits and offer feedback (i.e., code review comments on the commits) to the developer. Such a review cycle will continue until the reviewers either approve the submitted code (status “Merged”) or reject it (status “Abandoned”). In this process, we argue that one of the code reviewers should have awareness of architecture violations and provide a review on what they are and how to fix them. Such a code review that is focused on architecture violations would be complementary to regular code reviews.

2.2 Code Reviewer Recommendation
Expert recommendation is a common area in software engineering [25], and code reviewer recommendation is a typical application of expert recommendation. In recent years, many approaches have been proposed for recommending code reviewers in the literature [12, 23, 26, 29, 30]; these are briefly introduced below based on the categories in previous studies [19, 31].

Heuristic-based approaches include problem-solving and practical methods. For example, the heuristic approaches, such as ReviewBot [3], cHRev [30], and RevFinder [26], calculate heuristic scores through building expertise models to measure candidate reviewers’ expertise.

Machine learning-based approaches usually utilize data-driven machine learning techniques (e.g., Support Vector Machine (SVM) [12]) and genetic algorithms (e.g., Indicator-Based Evolutionary Algorithm (IBEA) [7], NSGA-II [23]) to recommend reviewers. Such approaches are based on a series of features, such as patches and bug report information.


Hybrid approaches combine different approaches (e.g., machine system community [31], i.e., Top-k Accuracy and Mean Reciprocal
learning [12], graph structure [29], genetic algorithm [7, 23]) for Rank.
recommending reviewers. For example, Xia et al. [28] developed a RQ2: How does the performance of the proposed similarity
hybrid incremental approach TIE (a Text mIning and filE location- detection methods compare against existing code reviewer rec-
based approach) to recommend reviewers through measuring tex- ommendation approaches?
tual content (i.e., multinomial Naive Bayes, a text mining technique) With this RQ, we want to compare the similarity detection meth-
and file path similarity (i.e., a VSM-based approach). ods with an existing approach using the artifacts that we collected
Different approaches utilize different types of artifacts to recom- in our study. Specifically, there are a number of code reviewer
mend code reviewers. According to the recent literature review by recommendation approaches [19, 25, 31], which can be compared
Çetin et al. [31], most of the studies use pull request history (e.g., against the proposed approach. To be able to make the comparison,
changes lines, paths of changed files and titles), and some studies the source code of these approaches must be publicly available in
also use code review history (including comments made by pull order to reproduce them.
requests and reviews). Our approach can be regarded as a heuristic- RQ3: Do the sampling techniques affect the performance of
based approach, and is based on both the review comments and file the proposed code reviewer recommendation approach?
paths of reviewed code changes. It differs from existing approaches As mentioned in a recent literature review [31], various sampling
as it focuses specifically on architecture violations. methods were used in code review recommendation. However, there
are no studies that investigate whether sampling methods (see Sec-
2.3 Architecture Violations tion 3.3) can impact the performance of reviewer recommendation
During software evolution, architecture erosion can degrade the approaches, and which sampling techniques can achieve relatively
stability and sustainability of system architecture due to increas- better performance. By answering this RQ, we aim at providing
ing changes and accumulated architecture violations [16, 17, 27]. empirical evidence about the influence of sampling techniques on
Architecture violations are the most common and prominent type reviewer recommendation performance.
of architecture erosion symptoms; various architecture violations
have been investigated in the literature [17]. Architecture violations 3.2 Data Collection
manifest in various ways: structural inconsistencies, violations of Projects and Code Review Comments. The original dataset used
design decisions, violations of design principles, violations of ar- in this study is from our previous work [14]. Through a series of
chitecture patterns, violations of API specification, etc. Previous tasks (e.g., keywords search, manual identification and labeling of
studies on architecture violations have focused on analyzing history architecture violation related review comments), we collected 606
versions of source code. For example, Brunet et al. [5] carried out a review comments on code changes and commit messages related to
longitudinal study to analyze the evolution of 19 bi-weekly versions architecture violations. In our work, we focused on recommending
of four OSS projects, by examining the life cycle and location of reviewers who have awareness of architecture violations regarding
architecture violations and comparing them to the intended archi- code changes, since one of the purposes of reviewer recommenda-
tecture. Maffort et al. [20] proposed an approach based on defined tion is to help selecting reviewers for code changes. Therefore, we
heuristics that can rapidly raise architecture violation warnings. further extracted 547 review comments from the original dataset
In contrast to the studies focusing on detecting architecture viola- [14] that are only related to code changes on architecture viola-
tions in source code, we aim at finding reviewers who can review tions [15]. The dataset contains the code review comments from
architecture violations during code review, regardless of the type four OSS projects, including Nova and Neutron from the Open-
of architecture (e.g., micro-services, layered architecture). Stack community4 , as well as Qt Base and Qt Creator from the Qt
community5 . As shown in Table 1, our dataset from the four OSS
3 RESEARCH METHODOLOGY projects includes code review comments regarding architecture
violations in eight years from June 2012 to December 2020. The
3.1 Research Questions
review comments are related to various architecture violations (e.g.,
RQ1: Can common similarity detection methods be effectively violations of design decisions, design principles, and architecture
used in recommending code reviewers for architecture viola- patterns), and were made by more than 200 reviewers. The items
tions? in the dataset contain review ID and patch information, including
This study aims at proposing an approach for the automated change_id, patch, file_url, line, and comment. The scripts and
recommendation of code reviewers who are knowledgeable on dataset of this work are available in [15].
architecture violations. To this end, we propose to use similarity
measurement, as this technique is commonly used to process tex- 3.3 Recommendation Approach
tual artifacts like code reviews [7, 10, 21]. With this RQ, we want to
Problem Statement. Since the artifacts we collected are from
investigate whether common similarity measurement techniques
the OSS projects that use Gerrit as the code review tool, we take
(i.e., Jaccard coefficient, adapted Hamming distance, cosine sim-
such projects as examples to formulate our approach. A software
ilarity) can indeed be useful for recommending code reviewers
project S contains a set of m developers D = {𝑑 1 , . . . , 𝑑𝑚 } and n
based on the review comments and file paths of the reviewed code
code reviewers R = {𝑟 1 , . . . , 𝑟𝑛 }, and includes a set of j source code
changes related to architecture violations (see Section 3.3). Specifi-
cally, we plan to evaluate the performance of similarity detection 4 https://www.openstack.org/
methods using metrics widely adopted by the recommendation 5 https://www.qt.io/

44
EASE ’23, June 14–16, 2023, Oulu, Finland Li et al.

Table 1: Details of the selected projects used in our work are often used in previous studies (e.g., [22, 29]) to measure the
semantic similarity of textual artifacts. Cosine similarity can be
Project Time Period Files1 Comments2 Reviewers3 utilized to determine lexical similarity between two entities repre-
Neutron 2013/11 - 2020/08 111 149 64 sented by two vectors of words. In our case, cosine similarity is used
Nova 2013/01 - 2020/08 126 206 67 to measure the semantic similarity of review comments regarding
Qt Base 2012/12 - 2020/12 124 139 48 architecture violations, as shown in Equation (2):
Qt Creator 2012/06 - 2020/11 49 53 25
1 Files: Code change files
vi · vj
2 Comments: Code review comments on architecture
Cos_Similarity(Ci, Cj ) = (2)
violations |vi | vj
3 Reviewers: Code reviewers of the code change files

where 𝐶𝑖 and 𝐶 𝑗 represent two code review comments, and 𝑣𝑖


and 𝑣 𝑗 denote their corresponding vectors. To generate vectors, we
files F = {𝑓1 , . . . , 𝑓 𝑗 } and a set of k code review comments C = adopted a pre-trained Word2vec model, which was trained based on
{𝑐 1 , . . . , 𝑐𝑘 }. In general, R is used to represent a set of candidate over 15 GB of textual data from Stack Overflow posts that contain a
code reviewers for code changes. Each reviewer 𝑟𝑖 has their own plethora of textual expressions and words in software engineering
expertise on certain source code file 𝑓𝑖 , and has a review comment domain [9]. The higher the similarity score, the closer the two
𝑐𝑖 on the corresponding commit. vectors that represent the two review comments.
In such a project, each new commit (i.e., code changes that are Regarding the Jaccard coefficient, as shown in Equation (1), X and
not yet merged into the code base) could be reviewed by a number Y represent two review comments in a set of tokens (i.e., tokenized
of invited (or assigned) code reviewers. Our proposal is to generate words). The more common tokens between X and Y, the higher
a list of recommended reviewers, in which each prospective re- similarity score of the two review comments. Note that, before we
viewer has a matched score representing their expertise. The higher calculated the semantic similarity by cosine similarity and Jaccard
the expertise score of a reviewer, the greater the probability for coefficient, we applied the following four pre-processing steps:
this reviewer to be recommended to review the commit. As men-
(1) Tokenization. The process of tokenization is to break a stream
tioned in Section 2.2, our reviewer recommendation approach is
of text into words, punctuation, and other meaningful ele-
based on the reviewer’s expertise, which is extracted and calculated
ments called tokens.
from historical commits (e.g., review comments and file paths) and
(2) Noise removal. Noise data usually does not contain valuable
commonly used in previous studies [6, 7, 10, 13, 26]. The input of
semantic information, and we therefore removed punctua-
our recommendation approach is the past commit files, including
tion, numbers, and special characters (e.g., “\”, “*”);
file paths, review comments regarding architecture violations, and
(3) Stop words removal. Stop words occur commonly but do not
the corresponding reviewers.
add valuable information to differentiate different text, such
Similarity Calculation. To answer RQ1 and present the exper-
as “the”, “are”, and “is”, which can be removed.
tise of reviewers, we chose cosine similarity, Jaccard coefficient, and
(4) Capitalization conversion. We converted all the text to lower
adapted Hamming distance to measure the similarity of file paths
case, which can help to maintain the consistency of word
and the semantic similarity of review comments on architecture
form and avoid recounting the words.
violations. In terms of the similarity of file paths, Jaccard coefficient
and adapted Hamming distance are two common methods used Reviewer Recommendation. For a new code change that has
to measure the similarity between file paths of code changes; they been commented by reviewers but has not been merged into the
are considered efficient similarity measures and widely adopted in code base, we aim at recommending code reviewers who are po-
previous studies (e.g., [7, 10, 21]). Jaccard coefficient is calculated tentially aware of architecture violations through measuring the
to measure similarity, as shown in Equation (1): similarity of the file paths and review comments of the reviewed
code changes.
|X ∩ Y| We ranked the candidate code reviewers through calculating the
Jac_Similarity(X, Y) = (1)
|X ∪ Y| reviewer scores using the file path similarity and the semantic simi-
where X and Y represent two entities whose similarity needs to larity of historical commits (i.e., file paths and review comments of
be measured. Here, to measure the file path similarity, X and Y the reviewed code changes). This includes File Path similarity by Jac-
represent two file paths (i.e., the sets of tokens of file paths), and the card Coefficient (FP_JC), File Path similarity by adapted Hamming
more common tokens between the file paths X and Y, the higher Distance (FP_HD), Review Comment semantic similarity by Cosine
similarity of the two file paths. Similarity (RC_CS), and Review Comment semantic similarity by
In addition, the similarity between file paths can be also cal- Jaccard Coefficient (RC_JC).
culated by the adapted Hamming distance (i.e., similarity score = Given a new code change file f, we extracted its file path 𝑓𝑛𝑒𝑤 and
Hamming distance for the same length strings + difference in length review comment 𝑐𝑛𝑒𝑤 , and then calculated the above-mentioned
of the two strings). If two file paths have the same paths, then the similarity scores between the current code change and each past
similarity score returns 1, otherwise it returns the reciprocal score code change, including the past file path 𝑓𝑝𝑎𝑠𝑡 and review comment
of the adapted Hamming distance of the two file paths. 𝑐 𝑝𝑎𝑠𝑡 . For example, FP_JS(𝑓𝑛𝑒𝑤 , 𝑓𝑝𝑎𝑠𝑡 ) calculates the file path sim-
In terms of the semantic similarity of code review comments, co- ilarity score between 𝑓𝑛𝑒𝑤 and the file path 𝑓𝑝𝑎𝑠𝑡 of a past code
sine similarity and Jaccard coefficient are used to measure the se- change by Jaccard coefficient. Similarly, RC_CS(𝑐𝑛𝑒𝑤 , 𝑐 𝑝𝑎𝑠𝑡 ) calcu-
mantic similarity between code review comments. The two methods lates the semantic similarity score between 𝑐𝑛𝑒𝑤 and the review

45
Code Reviewer Recommendation for Architecture Violations EASE ’23, June 14–16, 2023, Oulu, Finland

comment 𝑐 𝑝𝑎𝑠𝑡 of a past review comment by cosine similarity. Then, our approach. However, we were only able to do that for one ap-
the scores are assigned to the associated reviewers, respectively. proach, namely RevFinder [26]. The rest of the approaches had
In other words, each reviewer has four candidate similarity scores two main issues: (1) they require additional information that is not
by using the four similarity detection methods. By calculating the readily available (e.g., review workload); and (2) they do not make
similarity scores in separate and combined ways, a reviewer rec- their source code or datasets available; some approaches did share
ommendation list can be generated based on the sorted reviewers parts of the source code, but still they can not be reproduced. This
along with their scores. makes it difficult or even impossible to compare and evaluate our
Sampling and Validation. Sampling refers to the sampling approach with these approaches.
techniques for constructing the expertise model, and validation de- Therefore, we reproduced one baseline approach, RevFinder, on
notes the process of testing the performance and effectiveness of our collected dataset. RevFinder is a file path-based approach, which
certain sampling techniques. Unfortunately, most code reviewer is a specific expertise-based approach and supports recommending
recommendation studies did not provide detailed information and reviewers by measuring the file path similarity of commits. Specifi-
empirical validation on the sampling techniques for constructing cally, when there is a new commit, developers who have reviewed
their expertise models [31], and it is nontrivial to explore the effec- or engaged in similar revisions (i.e., with similar file paths) are
tiveness of the sampling and evaluation techniques with the purpose likely to be recommended. Previous studies (e.g., [7, 13, 19, 23, 30])
of providing such empirical validation. Therefore, to answer RQ3, using various artifacts (e.g., pull-requests, historical issues) usually
we planned to investigate whether and to what extent the sampling compared their approaches with RevFinder [26], since file path is a
techniques can impact the performance of the proposed code re- common feature of various artifacts related to code review.
viewer recommendation approach. According to a recent literature
review on code reviewer recommendation [31], incremental sam- 3.5 Evaluation Metrics
pling and fixed sampling are the two most popular sampling and To evaluate the similarity detection methods and the baseline ap-
validation techniques (see Figure 2), and have been commonly used proach, we adopted two of the most prevalent metrics used in
in code reviewer recommendation studies [31]. previous studies [31]: Top-k Accuracy and Mean Reciprocal Rank.
Since the code review data is temporal data, all prior studies We denote a code reviewer as r and a code reviewer set as R.
organized their dataset chronologically [31]. Thus, we also sorted Top-k Accuracy measures the percentage of code reviews for
our dataset in a chronological order. The incremental sampling which an approach can properly recommend the true code review-
technique takes the historical review data as the input of the exper- ers within the top-k positions in a ranked list of recommended code
tise model through increasing the sample number in each iteration reviewers. In other words, this accuracy is regarding the ratio of
(i.e., each step in Figure 2). The final performance of the recom- the number of correctly recommended reviewer r (i.e., isCorrect(r,
mendation approach is the average performance value of all the Top-k)) in the total number of reviewers of a ranked list of recom-
steps. In terms of the incremental sampling, we set four steps in mended reviewers. isCorrect(r, Top-k) returns 1 if there is at least
this work, and we took 10% of the new sample as the validation one top-k reviewer r who actually reviewed the code, otherwise,
set in each step. In terms of the fixed sampling, we employed a isCorrect(r, Top-k) returns 0 which means a wrong recommendation.
fixed percentage of the test set by randomly sampling with 10% in The higher the top-k accuracy value, the better the recommenda-
previous studies (e.g., [1]) of the dataset of the four projects. tion performance. By following the previous studies in Section 2.2,
we set the k values of 1, 3, 5, and 10.
Sorted chronologically
1 ∑︁
Dataset
Top-k Accuracy = isCorrect(r, Top-k) (3)
1 2 3 4 5 6 7 8 · · · · · · N-3 N-2 N-1 N |R|
r∈R
Mean Reciprocal Rank (MRR) calculates an average of re-
Step 1 Sample Valida�on
ciprocal ranks of correct code reviewers in a recommendation list.
Given a set of reviewers R, MRR can be calculated by Equation (4).
Incremental Step 2 Sample Valida�on
Sampling
rank(r) returns the value of the rank of the first correct reviewer
in the recommendation list for reviewer r. The value of 𝑟𝑎𝑛𝑘1 (𝑟 )
···

Step n Sample Valida�on returns 0 if there is no one who actually reviewed the code in the
recommendation list. Ideally, an approach that can provide a perfect
Fixed
1 2 3 4 5 6 7 8 · · · · · · N-3 N-2 N-1 N
ranking should achieve an MRR value of 1. Generally, the higher
Sampling the MRR value, the better the recommendation approach is.
Sample Valida�on
1 ∑︁ 1
MRR(R) = (4)
Figure 2: Overview of incremental and fixed sampling |R| rank(r)
r∈R

4 RESULTS AND DISCUSSION


3.4 Baseline Approach 4.1 RQ1: Effectiveness of Our Approach
To answer RQ2, we needed to reproduce existing expertise-based To answer RQ1, we evaluated the performance of the similarity
approaches, such as [7, 13, 26, 29], in order to compare them against detection methods for recommending code reviewers. We used
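As a hedged illustration of the two file path measures defined in Section 3.3, the sketch below computes the Jaccard coefficient of Equation (1) over path tokens and the adapted Hamming similarity over characters. Splitting paths on "/" and comparing strings character by character are assumptions, since the text does not fix the tokenization; the function names are hypothetical and this is not the released code of [15].

```python
def path_tokens(path: str) -> set[str]:
    """Split a file path into a set of components (assumed tokenization)."""
    return {part for part in path.split("/") if part}


def jaccard_similarity(x: set[str], y: set[str]) -> float:
    """Equation (1): |X ∩ Y| / |X ∪ Y|; defined here as 0.0 for two empty sets."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def adapted_hamming_similarity(a: str, b: str) -> float:
    """Return 1.0 for identical paths, else the reciprocal of the adapted distance."""
    if a == b:
        return 1.0
    # Count mismatching characters over the positions both strings share ...
    mismatches = sum(ca != cb for ca, cb in zip(a, b))
    # ... and add the difference in length of the two strings.
    distance = mismatches + abs(len(a) - len(b))
    return 1.0 / distance


new_path = "nova/compute/manager.py"   # hypothetical new code change
past_path = "nova/compute/api.py"      # hypothetical past code change
fp_jc = jaccard_similarity(path_tokens(new_path), path_tokens(past_path))
fp_hd = adapted_hamming_similarity(new_path, past_path)
print(f"FP_JC = {fp_jc:.2f}, FP_HD = {fp_hd:.2f}")
```

Because two different strings always have an adapted distance of at least 1, the reciprocal stays within (0, 1], which keeps FP_HD on the same scale as FP_JC.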

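To make the four pre-processing steps and Equation (2) of Section 3.3 more concrete, here is a minimal sketch. It deliberately swaps the pre-trained Word2vec embeddings [9] for plain term-count vectors so that it runs without external models; this changes the scores (they become lexical rather than semantic) but not the cosine formula. The stop-word list, function names, and sample comments are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; a real list would be longer).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def preprocess(comment: str) -> list[str]:
    """Lower-case, tokenize, and drop noise and stop words."""
    # Keeping only alphabetic runs tokenizes the text and removes
    # punctuation, numbers, and special characters in one pass.
    tokens = re.findall(r"[a-z]+", comment.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def cosine_similarity(u: Counter, v: Counter) -> float:
    """Equation (2): (u . v) / (|u| |v|), or 0.0 if either vector is empty."""
    dot = sum(u[term] * v[term] for term in u.keys() & v.keys())
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def jaccard_similarity(x: set[str], y: set[str]) -> float:
    """Equation (1) applied to the token sets of two review comments."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


c_new = preprocess("The API layer is calling the DB layer directly!")
c_past = preprocess("This breaks layering: the API must not call the DB.")
rc_cs = cosine_similarity(Counter(c_new), Counter(c_past))
rc_jc = jaccard_similarity(set(c_new), set(c_past))
print(f"RC_CS = {rc_cs:.2f}, RC_JC = {rc_jc:.2f}")
```

With term counts, "layer" and "layering" do not match at all, which is the kind of gap that embedding-based vectors are meant to narrow.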
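The scoring step of Section 3.3, in which similarity scores are assigned to the associated reviewers and the reviewers are then sorted, could look roughly like the following. Summing a reviewer's scores over all of their past reviews is an assumption, because the text does not state how several scores for the same reviewer are aggregated; `rank_reviewers`, `fp_jc`, and the sample history are hypothetical.

```python
from collections import defaultdict
from typing import Callable


def fp_jc(path_a: str, path_b: str) -> float:
    """File path similarity by Jaccard coefficient over '/'-separated tokens."""
    a, b = set(path_a.split("/")), set(path_b.split("/"))
    return len(a & b) / len(a | b)


def rank_reviewers(
    new_path: str,
    history: list[tuple[str, str]],
    similarity: Callable[[str, str], float],
) -> list[tuple[str, float]]:
    """Score each past reviewer against the new change and sort descending."""
    scores: dict[str, float] = defaultdict(float)
    for reviewer, past_path in history:
        # Assumption: a reviewer's score is the sum over their past reviews.
        scores[reviewer] += similarity(new_path, past_path)
    # Highest score first; ties are broken by name to keep the order stable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical review history: (reviewer, file path of the reviewed change).
history = [
    ("alice", "nova/compute/manager.py"),
    ("bob", "nova/network/api.py"),
    ("alice", "nova/compute/api.py"),
    ("carol", "neutron/agent/l3/router.py"),
]
ranking = rank_reviewers("nova/compute/rpcapi.py", history, fp_jc)
for reviewer, score in ranking:
    print(f"{reviewer}: {score:.2f}")
```

Passing the similarity function as a parameter makes it straightforward to plug in any of the four measures, or a function that adds several of them for the combined settings.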

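Equations (3) and (4) of Section 3.5 can be sketched as below. Each code review contributes one ranked recommendation list and one set of reviewers who actually reviewed the change. The optional cutoff `k` for MRR is an assumption, suggested by Table 2 pairing an MRR value with each Top-k column; the function names and toy data are hypothetical.

```python
from typing import Optional


def top_k_accuracy(
    recommended: list[list[str]], actual: list[set[str]], k: int
) -> float:
    """Equation (3): share of reviews with a true reviewer in the top k."""
    hits = sum(
        1 for recs, truth in zip(recommended, actual)
        if truth.intersection(recs[:k])  # isCorrect(r, Top-k) == 1
    )
    return hits / len(recommended)


def mean_reciprocal_rank(
    recommended: list[list[str]],
    actual: list[set[str]],
    k: Optional[int] = None,
) -> float:
    """Equation (4): mean of 1/rank of the first correct reviewer (0 if none)."""
    total = 0.0
    for recs, truth in zip(recommended, actual):
        # recs[:None] keeps the whole list, so k=None means no cutoff.
        for position, reviewer in enumerate(recs[:k], start=1):
            if reviewer in truth:
                total += 1.0 / position
                break
    return total / len(recommended)


# Toy data: three reviews, their ranked recommendations and true reviewers.
recommended = [
    ["alice", "bob", "carol"],
    ["bob", "carol", "alice"],
    ["carol", "alice", "bob"],
]
actual = [{"alice"}, {"alice"}, {"dave"}]
print(top_k_accuracy(recommended, actual, k=1))
print(top_k_accuracy(recommended, actual, k=3))
print(mean_reciprocal_rank(recommended, actual))
```

With a cutoff of 1, the reciprocal rank of a hit is always 1, so MRR and Top-1 accuracy coincide, which matches the identical Top-1 and MRR columns in Table 2.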
Table 2: Top-k (1, 3, 5, 10) accuracy and MRR results of the selected similarity detection methods on four OSS projects

Project                        Neutron                                               | Nova
Similarity detection method    Top-1  MRR    Top-3  MRR    Top-5  MRR    Top-10 MRR  | Top-1  MRR    Top-3  MRR    Top-5  MRR    Top-10 MRR
FP_JC                          0.20   0.20   0.33   0.27   0.33   0.27   0.33   0.27 | 0.10   0.10   0.14   0.12   0.14   0.12   0.24   0.13
FP_HD                          0.20   0.20   0.33   0.26   0.33   0.26   0.33   0.26 | 0.10   0.10   0.14   0.12   0.14   0.12   0.29   0.14
RC_CS                          0.07   0.07   0.20   0.12   0.30   0.12   0.27   0.13 | 0.00   0.00   0.00   0.00   0.05   0.01   0.19   0.03
RC_JC                          0.20   0.20   0.20   0.20   0.27   0.21   0.33   0.22 | 0.05   0.05   0.14   0.10   0.14   0.10   0.24   0.11
Average                        0.17   0.17   0.27   0.21   0.31   0.22   0.32   0.22 | 0.06   0.06   0.11   0.09   0.12   0.09   0.24   0.10
FP_JC + FP_HD                  0.27   0.27   0.33   0.29   0.33   0.30   0.33   0.30 | 0.10   0.10   0.14   0.12   0.14   0.12   0.29   0.14
FP_JC + RC_CS                  0.20   0.20   0.20   0.20   0.27   0.22   0.27   0.22 | 0.00   0.00   0.00   0.00   0.00   0.00   0.19   0.02
FP_JC + RC_JC                  0.27   0.27   0.33   0.26   0.33   0.29   0.33   0.29 | 0.05   0.05   0.05   0.05   0.19   0.08   0.33   0.09
FP_HD + RC_CS                  0.13   0.13   0.13   0.13   0.27   0.16   0.27   0.16 | 0.00   0.00   0.00   0.00   0.05   0.01   0.19   0.03
FP_HD + RC_JC                  0.00   0.00   0.07   0.03   0.07   0.03   0.20   0.05 | 0.00   0.00   0.05   0.02   0.19   0.06   0.24   0.06
RC_CS + RC_JC                  0.00   0.00   0.07   0.03   0.07   0.03   0.20   0.05 | 0.00   0.00   0.05   0.02   0.10   0.03   0.19   0.04
Average                        0.15   0.15   0.19   0.16   0.22   0.17   0.27   0.18 | 0.03   0.03   0.05   0.04   0.11   0.05   0.24   0.06
FP_HD + RC_CS + RC_JC          0.20   0.20   0.27   0.23   0.27   0.23   0.27   0.23 | 0.00   0.00   0.10   0.03   0.10   0.03   0.19   0.04
FP_JC + RC_CS + RC_JC          0.20   0.20   0.20   0.20   0.27   0.22   0.27   0.22 | 0.00   0.00   0.05   0.02   0.10   0.03   0.19   0.03
FP_JC + FP_HD + RC_JC          0.27   0.27   0.33   0.29   0.33   0.29   0.33   0.28 | 0.05   0.05   0.14   0.10   0.19   0.10   0.33   0.12
FP_JC + FP_HD + RC_CS          0.20   0.20   0.20   0.20   0.27   0.21   0.27   0.21 | 0.00   0.00   0.00   0.00   0.05   0.01   0.19   0.03
Average                        0.22   0.22   0.25   0.23   0.29   0.24   0.29   0.24 | 0.01   0.01   0.07   0.04   0.11   0.04   0.23   0.06
FP_JC + FP_HD + RC_CS + RC_JC  0.20   0.20   0.20   0.20   0.27   0.22   0.27   0.22 | 0.00   0.00   0.10   0.03   0.10   0.03   0.24   0.05

Project                        Qt Base                                               | Qt Creator
Similarity detection method    Top-1  MRR    Top-3  MRR    Top-5  MRR    Top-10 MRR  | Top-1  MRR    Top-3  MRR    Top-5  MRR    Top-10 MRR
FP_JC                          0.00   0.00   0.14   0.05   0.14   0.05   0.36   0.08 | 0.00   0.00   0.20   0.07   0.40   0.11   0.40   0.11
FP_HD                          0.00   0.00   0.00   0.00   0.21   0.05   0.29   0.06 | 0.20   0.20   0.20   0.20   0.20   0.20   0.40   0.23
RC_CS                          0.00   0.00   0.00   0.00   0.21   0.05   0.29   0.06 | 0.00   0.00   0.20   0.10   0.60   0.19   0.80   0.21
RC_JC                          0.07   0.07   0.21   0.14   0.21   0.14   0.21   0.14 | 0.20   0.20   0.40   0.27   0.60   0.32   0.60   0.32
Average                        0.02   0.02   0.09   0.05   0.19   0.07   0.29   0.09 | 0.10   0.10   0.25   0.16   0.45   0.21   0.55   0.22
FP_JC + FP_HD                  0.00   0.00   0.14   0.06   0.14   0.05   0.43   0.15 | 0.20   0.20   0.20   0.20   0.40   0.25   0.40   0.25
FP_JC + RC_CS                  0.14   0.14   0.14   0.14   0.21   0.16   0.29   0.17 | 0.20   0.20   0.40   0.30   0.40   0.30   0.80   0.36
FP_JC + RC_JC                  0.00   0.00   0.07   0.04   0.29   0.10   0.43   0.09 | 0.40   0.40   0.40   0.40   0.40   0.40   0.40   0.40
FP_HD + RC_CS                  0.00   0.00   0.14   0.05   0.21   0.07   0.29   0.08 | 0.20   0.20   0.40   0.30   0.40   0.30   0.80   0.35
FP_HD + RC_JC                  0.07   0.07   0.21   0.13   0.29   0.15   0.36   0.16 | 0.00   0.00   0.20   0.07   0.20   0.07   0.40   0.10
RC_CS + RC_JC                  0.07   0.07   0.21   0.13   0.29   0.15   0.36   0.16 | 0.00   0.00   0.20   0.07   0.20   0.07   0.40   0.10
Average                        0.05   0.05   0.15   0.09   0.24   0.11   0.36   0.14 | 0.17   0.17   0.30   0.22   0.33   0.23   0.53   0.26
FP_HD + RC_CS + RC_JC          0.07   0.07   0.14   0.11   0.29   0.14   0.29   0.14 | 0.20   0.20   0.40   0.30   0.40   0.30   0.80   0.36
FP_JC + RC_CS + RC_JC          0.07   0.07   0.14   0.11   0.21   0.13   0.29   0.14 | 0.40   0.40   0.40   0.40   0.40   0.40   0.80   0.46
FP_JC + FP_HD + RC_JC          0.07   0.07   0.14   0.10   0.21   0.11   0.43   0.14 | 0.40   0.40   0.40   0.40   0.40   0.40   0.40   0.40
FP_JC + FP_HD + RC_CS 0.14 0.14 0.21 0.17 0.21 0.17 0.29 0.18 0.40 0.40 0.40 0.40 0.40 0.40 0.80 0.45
Average 0.09 0.09 0.16 0.12 0.23 0.14 0.33 0.15 0.35 0.35 0.40 0.38 0.40 0.38 0.70 0.42
FP_JC + FP_HD + RC_CS + RC_JC 0.07 0.07 0.21 0.14 0.21 0.14 0.29 0.15 0.40 0.40 0.40 0.40 0.40 0.40 0.80 0.22

two similarity detection methods to measure the similarity of file paths, as well as two similarity detection methods to measure the similarity of code review comments, that is, FP_JC, FP_HD, RC_CS, and RC_JC (see Section 3.3). Table 2 presents the performance of the similarity detection methods and their combinations.

Firstly, we evaluated the performance of the individual similarity detection methods, as shown in the four top rows per project in Table 2. The grey cells indicate the best performance metrics. The results show that the similarity detection methods yield varying results on different projects. For example, the performance of FP_JC and FP_HD on Neutron and Nova is better (with a higher top-k accuracy and MRR) than on Qt Base and Qt Creator. Secondly, we evaluated the performance of the combinations of similarity detection methods on the four projects (rows 6-17 per project in Table 2). For the combinations of two similarity detection methods, the mixed similarity detection method of FP_JC and FP_HD achieves the best performance on the Neutron and Nova projects with 0.33 accuracy at top-10 recommendation, and the mixed similarity detection method of FP_JC and RC_JC achieves the best top-5 (i.e., 0.33, 0.19, 0.29, and 0.40) and top-10 accuracy (i.e., 0.33, 0.33, 0.43, and 0.40) on the four projects. For the combinations of three similarity detection methods, the mixed similarity detection method of FP_JC, FP_HD, and RC_JC achieves the best accuracy and MRR on the four projects. In addition, we find that the performance of the combination of four similarity detection methods does not improve significantly when compared to the mixed approaches of three similarity detection methods.

Considering the average results of the similarity detection methods, we find that mixing three similarity detection methods can achieve a slightly higher top-k accuracy and MRR on Neutron, Nova, and Qt Base, and a significantly better performance on Qt Creator when compared to mixing two methods. However, combining four similarity detection methods yields no obvious performance improvement. In general, the results show that combining three similarity detection methods achieves relatively the best top-k accuracy and MRR on the four projects.
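To make the idea of mixing similarity detection methods concrete, the following Python sketch computes a Jaccard coefficient over file path components and over review comment tokens, then merges the two scores with a weighted average. Several things here are assumptions rather than the authors' implementation: reading the "JC" suffix in FP_JC and RC_JC as the Jaccard coefficient, the tokenization, the weighted average as the combination rule, and the example paths and comments, which are hypothetical.

```python
from typing import Dict, Iterable, List


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Jaccard coefficient |A intersect B| / |A union B|; 0.0 if both are empty."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def path_tokens(path: str) -> List[str]:
    """Split a file path into its non-empty components (assumed tokenization)."""
    return [part for part in path.split("/") if part]


def comment_tokens(text: str) -> List[str]:
    """Lower-case whitespace tokenization of a review comment (assumed)."""
    return text.lower().split()


def combined_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-method similarity scores (illustrative rule)."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight == 0:
        return 0.0
    return sum(weights[name] * value for name, value in scores.items()) / total_weight


# Hypothetical inputs: a pending change compared with one historical review.
fp_score = jaccard(path_tokens("neutron/db/api.py"),
                   path_tokens("neutron/db/models/base.py"))
rc_score = jaccard(comment_tokens("this breaks the layering between api and db"),
                   comment_tokens("layering violation between api and db modules"))

scores = {"FP_JC": fp_score, "RC_JC": rc_score}
equal = combined_score(scores, {"FP_JC": 1.0, "RC_JC": 1.0})
comment_heavy = combined_score(scores, {"FP_JC": 1.0, "RC_JC": 3.0})
print(fp_score, rc_score, equal, comment_heavy)
```

Exposing the weights as a parameter is a design choice of this sketch: it leaves room for giving individual methods different influence on the final score instead of treating all of them equally.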


Discussion of RQ1: The experiment results in Table 2 indicate that the selected similarity detection methods can produce acceptable performance scores (MRR values between 0.06 and 0.36) on code reviewer recommendation for architecture violations on the four projects, compared to the results (MRR values between 0.14 and 0.59) of related studies on generic reviewer recommendation (e.g., [7, 11]) with more reviewer candidates (which means potentially better performance due to the larger datasets). Besides, we observed that the similarity detection methods achieve varying performance on different OSS projects. One possible reason is that the effectiveness of a code reviewer recommendation approach can be influenced by project characteristics (e.g., size and type of project datasets), which aligns with the findings by Chen et al. [6]. In addition, the results show that combining three similarity detection methods based on file paths and semantic information can achieve the best performance of code reviewer recommendation. We did not observe a significant improvement when combining four similarity methods. We conjecture that this is because certain similarity detection methods might affect the final performance to varying degrees on our dataset and need to be assigned appropriate weights for their similarity values; this requires further investigation, e.g., optimizing the performance algorithmically.

The results indicate that it is still challenging to recommend code reviewers when specific issues (e.g., architecture violations) are involved, and the performance cannot always be significantly improved by combining more similarity detection methods.

Finding 1: The similarity detection methods and their combinations can produce acceptable performance scores, and achieve varying results on different projects. Combining three similarity methods can achieve the best performance of reviewer recommendation.

4.2 RQ2: Comparison of Recommendation Approaches
For RQ2, we measured the performance of the similarity detection methods through a comparison with the baseline reviewer recommendation approach, RevFinder [26]. As mentioned in Section 3.4, this was the only reviewer recommendation approach that we could reproduce. Specifically, we extracted the individual methods and the best-performing combinations of the similarity detection methods from Table 2, and we compared them with RevFinder on the four OSS projects. As shown in Figure 3, this includes the results of the four individual similarity detection methods, the combinations of two similarity detection methods (i.e., FP_JC + FP_HD and FP_JC + RC_JC), one combination of three similarity detection methods (i.e., FP_JC + FP_HD + RC_JC), and one combination of four similarity detection methods (i.e., FP_JC + FP_HD + RC_CS + RC_JC). Figure 3 uses a different scale for Qt Creator, to make the differences between the similarity detection methods easier to observe and compare. In terms of top-k accuracy, the four individual similarity detection methods and their combinations outperform RevFinder by approximately 4 times on Neutron; the combination of FP_JC and RC_JC and the combination of FP_JC, FP_HD, and RC_JC achieve a better top-k accuracy than RevFinder on Nova; RevFinder achieves a relatively better top-k accuracy than the four individual similarity detection methods on Qt Base. Nearly all the similarity detection methods and their combinations outperform RevFinder on Qt Creator.

Table 3 presents the average Mean Reciprocal Rank (MRR) of the aforementioned similarity detection methods, their combinations, and RevFinder. The results show that the individual RC_JC method can achieve a better MRR on the four projects than RevFinder. All the mixed similarity detection methods can achieve a higher MRR than RevFinder on three of the four projects (except for Nova with mixing two and three similarity detection methods). Overall, the similarity detection methods and their combinations outperform RevFinder in the majority of the cases.

Table 3: Average MRR results by the selected similarity detection methods compared with RevFinder

Project                          Neutron   Nova   Qt Base   Qt Creator
RevFinder                        0.03      0.04   0.13      0.06
FP_JC                            0.25      0.12   0.05      0.07
FP_HD                            0.24      0.12   0.03      0.21
RC_CS                            0.11      0.01   0.03      0.13
RC_JC                            0.21      0.09   0.12      0.28
FP_JC + FP_HD                    0.29      0.12   0.07      0.23
FP_JC + RC_JC                    0.28      0.07   0.06      0.40
FP_JC + FP_HD + RC_JC            0.28      0.09   0.11      0.40
FP_JC + FP_HD + RC_CS + RC_JC    0.21      0.03   0.13      0.36

Discussion of RQ2: According to the results in Figure 3 and Table 3, we find that RevFinder does not perform as well as claimed in the original work [26] when it runs on our dataset related to specific issues (i.e., architecture violations). One possible reason could be that RevFinder recommends reviewers only by comparing the file path similarity without considering the semantic similarity of related textual artifacts. Another potential reason is that the dataset in our work, specifically its size and type (review comments on architecture violations), may impact the performance of reviewer recommendation. Besides, the performance of RevFinder on the four projects also partially confirms the finding of RQ1, that is, project characteristics can impact the effectiveness of reviewer recommendation approaches.

Finding 2: The selected similarity detection methods and their combinations achieve a better performance than RevFinder in the majority of the cases.

4.3 RQ3: Comparison of Sampling Methods
To answer RQ3, we used the incremental sampling technique to construct the expertise model and evaluated the performance of the selected similarity detection methods and their combinations on our dataset, as described in Section 3.3. We used the same performance metrics and the baseline approach mentioned in Section 3.4 and Section 3.5. Due to space limitations, we only present the top-k accuracy of the best-performing similarity detection methods and their combinations in Table 4.

According to the results in Table 4, when using the fixed sampling technique, almost all the top-k accuracy of the similarity detection methods and their combinations have better performance


[Figure 3 consists of four panels (Neutron, Nova, Qt Base, Qt Creator). x-axis: k = 1, 3, 5, 10; y-axis: top-k accuracy, ranging from 0 to 0.5 for Neutron, Nova, and Qt Base and from 0 to 1 for Qt Creator. Series: RevFinder, FP_JC, FP_HD, RC_CS, RC_JC, FP_JC + FP_HD, FP_JC + RC_JC, FP_JC + FP_HD + RC_JC, FP_JC + FP_HD + RC_CS + RC_JC.]

Figure 3: Performances of Top-k accuracy of mixed similarity detection methods compared to RevFinder

Table 4: Top-k accuracy by the selected similarity detection methods compared with RevFinder

Project Neutron Nova


Sampling Fixed Sampling Incremental Sampling Fixed Sampling Incremental Sampling
k 1 3 5 10 1 3 5 10 1 3 5 10 1 3 5 10
RevFinder 0.000 0.067 0.067 0.200 0.000 0.000 0.002 0.014 0.000 0.048 0.190 0.238 0.001 0.008 0.014 0.018
FP_JC 0.200 0.333 0.333 0.333 0.003 0.012 0.013 0.014 0.095 0.143 0.143 0.238 0.001 0.016 0.021 0.026
FP_HD 0.200 0.333 0.333 0.033 0.003 0.008 0.013 0.024 0.095 0.143 0.143 0.286 0.001 0.015 0.018 0.022
RC_CS 0.067 0.200 0.300 0.267 0.003 0.005 0.006 0.016 0.000 0.000 0.048 0.190 0.001 0.011 0.013 0.022
RC_JC 0.200 0.200 0.267 0.333 0.005 0.018 0.024 0.024 0.048 0.143 0.143 0.238 0.001 0.010 0.015 0.016
FP_JC + FP_HD 0.267 0.333 0.333 0.333 0.003 0.008 0.015 0.016 0.095 0.143 0.143 0.286 0.001 0.015 0.018 0.023
FP_JC + RC_JC 0.267 0.333 0.333 0.333 0.008 0.012 0.013 0.014 0.048 0.048 0.190 0.333 0.005 0.016 0.019 0.024
FP_JC + FP_HD + RC_JC 0.267 0.333 0.333 0.333 0.008 0.012 0.012 0.016 0.048 0.143 0.190 0.333 0.003 0.018 0.019 0.024
FP_JC + FP_HD + RC_CS + RC_JC 0.200 0.200 0.267 0.267 0.005 0.008 0.017 0.017 0.000 0.095 0.095 0.238 0.004 0.015 0.015 0.024
Project Qt Base Qt Creator
Sampling Fixed Sampling Incremental Sampling Fixed Sampling Incremental Sampling
k 1 3 5 10 1 3 5 10 1 3 5 10 1 3 5 10
RevFinder 0.071 0.214 0.286 0.357 0.004 0.022 0.025 0.038 0.000 0.200 0.400 0.400 0.003 0.006 0.011 0.033
FP_JC 0.048 0.143 0.143 0.238 0.019 0.028 0.030 0.032 0.000 0.200 0.400 0.400 0.007 0.017 0.021 0.039
FP_HD 0.000 0.000 0.214 0.286 0.013 0.013 0.013 0.026 0.200 0.200 0.200 0.400 0.012 0.012 0.012 0.029
RC_CS 0.000 0.000 0.214 0.286 0.003 0.005 0.021 0.036 0.000 0.200 0.600 0.800 0.003 0.003 0.009 0.009
RC_JC 0.071 0.214 0.214 0.214 0.010 0.016 0.023 0.033 0.200 0.400 0.600 0.600 0.000 0.000 0.000 0.013
FP_JC + FP_HD 0.000 0.071 0.286 0.429 0.015 0.015 0.015 0.024 0.200 0.200 0.400 0.400 0.004 0.009 0.019 0.033
FP_JC + RC_JC 0.400 0.400 0.400 0.400 0.014 0.016 0.020 0.025 0.400 0.400 0.400 0.400 0.004 0.004 0.015 0.033
FP_JC + FP_HD + RC_JC 0.071 0.143 0.214 0.429 0.015 0.016 0.018 0.024 0.400 0.400 0.400 0.400 0.004 0.012 0.018 0.033
FP_JC + FP_HD + RC_CS + RC_JC 0.071 0.214 0.214 0.286 0.015 0.017 0.022 0.040 0.400 0.400 0.400 0.800 0.009 0.009 0.013 0.013

scores than when using the incremental sampling technique. For example, for Neutron, the combination of FP_JC, FP_HD, and RC_JC achieves top-k accuracy of around 0.267, 0.333, 0.333, and 0.333 for k = 1, 3, 5, 10, respectively. In comparison, this combination achieves top-k accuracy of 0.008, 0.012, 0.012, and 0.016 for k = 1, 3, 5, 10 when using incremental sampling. Moreover, the baseline approach, RevFinder, also shows such performance differences between the fixed and incremental sampling techniques. Similar observations are also valid for the MRR results of the rest of the recommendation approaches listed in Table 4. In general, compared to using incremental sampling, the similarity detection methods and their combinations achieve a better performance when using fixed sampling across all metrics.

Discussion of RQ3: The results of RQ3 show that all the approaches are sensitive to sampling techniques; it is clearly observed that all approaches can achieve a significantly higher recommendation performance when using the fixed sampling technique. One possible reason could be that the size of the dataset used to construct expertise models can influence the accuracy of reviewer recommendation. When small samples are taken as the historical review data, the first several iterations of the incremental sampling process may have relatively low performance and thus decrease the final average top-k accuracy. This conjecture is consistent with the recent findings by Hu et al. [11] that the investigated code reviewer approaches are sensitive to training data on evaluation metrics.

Finding 3: Sampling techniques can impact the performance of code reviewer recommendation approaches. In our work, using the fixed sampling technique to construct an expertise model can achieve a significantly better performance compared to the incremental sampling technique.

4.4 Implications

4.4.1 Implications for researchers. Establish explicit standards for research on code reviewer recommendation. When a large number of code changes are submitted, it is necessary to automate reviewer recommendation to speed up development iterations and ensure quality code reviews. In this work, we encountered certain issues that may hinder further research on code reviewer


recommendation. For example, few existing studies chose to share their artifacts (e.g., datasets and source code). Moreover, most of the studies on reviewer recommendation are purely academic [31] and lack support and validation from the industry. Besides, prior studies often use different metrics (e.g., precision, recall, and F-measure) and datasets (e.g., issues and pull-requests) in their experiments, which makes it harder to compare the performance among different approaches. Thus, it is necessary to establish explicit standards, such as promoting open science, validation in industry, and standardized metrics and datasets.

Refine the existing approaches and focus on specific issues during code review. The existing approaches for code reviewer recommendation should be refined through more empirical studies. For example, developer turnover is quite common during OSS development, but is rarely considered in current studies. In addition, employing hybrid methods to recommend reviewers by combining different approaches (see Section 2.2) may be promising and worth exploring in the future. Moreover, compared to general code reviewer recommendation, recommending reviewers who have awareness of specific types of issues, such as architecture violations, is important to detect and solve these issues during code review. In this work, we conducted an exploratory study that attempts to find appropriate reviewers who have knowledge of architecture violations based on historical commits related to architecture violations. It is worth investigating other types of issues (e.g., code smells, cyclic dependencies), architectural or otherwise, in recommending reviewers with pertinent knowledge.

4.4.2 Implications for practitioners. Apply and validate in industry projects. Considering the characteristics of the open-source communities, the code reviewer recommendation approaches might have different performance on industrial projects; this is also pointed out in the study by Chen et al. [6]. Practitioners can employ reviewer recommendation approaches in their projects by taking the associated project characteristics into consideration (e.g., constructing project-specific models). More empirical validation of these approaches from industrial projects is encouraged to consolidate the findings on their performance. Moreover, there has been little research in code reviewer recommendation [31], and there is still a lack of industrial tools for recommending code reviewers. More collaborations between academia and industry are indispensable to devise dedicated tools (e.g., plug-ins or bots) for existing code review systems like Gerrit.

Optimize code reviewer recommendation approaches. In this work, our reviewer recommendation approach is based on the similarity of file paths and the semantic similarity of review comments. However, there are certain realistic factors that should be considered for practical software development, such as workload, availability, and developer turnover. Therefore, practitioners should pay more attention to the above-mentioned factors when optimizing the existing approaches for code reviewer recommendation. For example, they could periodically generate an updated list of candidate reviewers and add weights to the reviewers’ availability.

5 THREATS TO VALIDITY
In this section, we discuss the threats to the validity of this study, which may affect the results of our study.

Construct Validity: The main threat to construct validity in this study concerns the performance metrics (i.e., Top-k accuracy and MRR) used in our work. This threat is partially mitigated in our study, as the chosen metrics are widely adopted in existing code reviewer recommendation studies [31]. Besides, we shared our dataset and source code [15] to facilitate the replication of our study and future research.

Reliability: The threats to reliability stem from how researchers potentially influence the study implementation. Possible threats in this study might come from the experimental settings (e.g., sampling percentage and iteration steps) and the reliability of measures (e.g., results of the similarity detection methods). As described in Section 3.3, we used a consistent and reproducible process to conduct sampling and validation on our dataset. Besides, to compare the effectiveness of our approach, we used the baseline approach, RevFinder, which is a common baseline used in related studies.

External Validity: The threats to external validity pertain to the generalizability of our results. In this study, the experiment results were produced based on the code review data of four OSS projects in Gerrit. Therefore, our results may not generalize to commercial projects or open-source projects on other platforms (e.g., GitHub). Our future work will try to explore commercial and GitHub projects with rich code review data to better generalize the results of our approach.

6 RELATED WORK
Code reviewer recommendation has been gaining increasing attention in software engineering research in recent years, but there is still a lack of tools for recommending code reviewers [31]. Balachandran [3] proposed automated reviewer recommendation through the ReviewBot tool, which aims at improving the review quality in an industrial context through automated static code analysis. Thongtanunam et al. [26] proposed an expertise-based approach, RevFinder, based on file path similarity; their assumption is that files with similar paths have close functionality and the associated reviewers are likely to have related experience. Zanjani et al. [30] developed the cHRev approach that considers the review history including review number and review time. The cHRev approach can build an expertise model based on historical code changes and then recommend relevant peer reviewers. Yu et al. [29] provided a reviewer recommendation approach by building a social network named Comment Networks, which can capture common interests in social activities between contributors and reviewers, and then rank reviewers based on historical comments and the generated comment networks. Similarly, Kong et al. [13] proposed the Camp approach based on collaboration networks along with reviewers’ expertise from pull requests and file paths.

Compared with the previous studies, our study specifically focuses on semantic information in review comments on architecture violations. We aim at recommending code reviewers who have awareness of architecture violations and can have a final check on the pending code changes (i.e., not yet merged into the code base) that may potentially lead to architecture violations; this complements other reviewer recommendation approaches that can be used in combination with our approach to find pertinent (generic) code reviewers.


7 CONCLUSIONS
When a large number of code changes are submitted to a code review system like Gerrit, it is more efficient to find suitable code reviewers through automated reviewer recommendation compared to manually assigning reviewers. In this paper, we conducted an exploratory study to recommend qualified reviewers who have awareness of architecture violations, as a promising and feasible way to detect and prevent architecture erosion through code review.

Our study is the first attempt to explore the possibility of using similarity detection methods to recommend code reviewers on architecture violations. We evaluated the selected similarity detection methods and compared them with the baseline approach, RevFinder. The results show that the similarity detection methods and their combinations can produce acceptable performance, and the combined similarity detection methods outperform the baseline approach across most performance metrics on our dataset. Besides, we found that different sampling techniques used to build expertise models can impact the performance of code reviewer recommendation approaches, and the fixed sampling technique outperforms the incremental sampling technique on our dataset.

In the future, we plan to further optimize our reviewer recommendation approach (e.g., improve the performance through hybrid approaches discussed in Section 2.2) on larger datasets concerning architecture issues from diverse OSS projects and commercial systems (e.g., explore the possibility in a cross-project scenario).

ACKNOWLEDGMENTS
This work is funded by NSFC with No. 62172311 and the Special Fund of Hubei Luojia Laboratory.

REFERENCES
[1] Wisam Haitham Abbood Al-Zubaidi, Patanamon Thongtanunam, Hoa Khanh Dam, Chakkrit Tantithamthavorn, and Aditya Ghose. 2020. Workload-aware reviewer recommendation using a multi-objective search-based approach. In Proceedings of the 16th ACM International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE). ACM, Virtual, USA, 21–30.
[2] Alberto Bacchelli and Christian Bird. 2013. Expectations, outcomes, and challenges of modern code review. In Proceedings of the 35th International Conference on Software Engineering (ICSE). IEEE, San Francisco, USA, 712–721.
[3] Vipin Balachandran. 2013. Reducing human effort and improving quality in peer code reviews using automatic static analysis and reviewer recommendation. In Proceedings of the 35th International Conference on Software Engineering (ICSE). IEEE, San Francisco, CA, USA, 931–940.
[4] Amiangshu Bosu, Jeffrey C Carver, Christian Bird, Jonathan Orbeck, and Christopher Chockley. 2016. Process aspects and social dynamics of contemporary code review: Insights from open source development and industrial practice at Microsoft. IEEE Transactions on Software Engineering 43 (2016), 56–75.
[5] Joao Brunet, Roberto Almeida Bittencourt, Dalton Serey, and Jorge Figueiredo. 2012. On the evolutionary nature of architectural violations. In Proceedings of the 19th Working Conference on Reverse Engineering (WCRE). IEEE, 257–266.
[6] Qiuyuan Chen, Dezhen Kong, Lingfeng Bao, Chenxing Sun, Xin Xia, and Shanping Li. 2022. Code Reviewer Recommendation in Tencent: Practice, Challenge, and Direction. In Proceedings of the 44th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). ACM, 115–124.
[7] Moataz Chouchen, Ali Ouni, Mohamed Wiem Mkaouer, Raula Gaikovina Kula, and Katsuro Inoue. 2021. WhoReview: A multi-objective search-based approach for code reviewers recommendation in modern code review. Applied Soft Computing 100 (2021), 106908.
[8] Emre Doğan, Eray Tüzün, K Ayberk Tecimer, and H Altay Güvenir. 2019. Investigating the validity of ground truth in code reviewer recommendation studies. In Proceedings of the 13th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). IEEE, Gothenburg, Sweden, 1–6.
[9] Vasiliki Efstathiou, Christos Chatzilenas, and Diomidis Spinellis. 2018. Word embeddings for the software engineering domain. In Proceedings of the 15th International Conference on Mining Software Repositories (MSR). ACM, 38–41.
[10] Mikołaj Fejzer, Piotr Przymus, and Krzysztof Stencel. 2018. Profile based recommendation of code reviewers. Journal of Intelligent Information Systems 50, 3 (2018), 597–619.
[11] Yuanzhe Hu, Junjie Wang, Jie Hou, Shoubin Li, and Qing Wang. 2020. Is there a "golden" rule for code reviewer recommendation?: An experimental evaluation. In Proceedings of the 20th IEEE International Conference on Software Quality, Reliability and Security (QRS). IEEE, Macau, China, 497–508.
[12] Jing Jiang, Jia-Huan He, and Xue-Yuan Chen. 2015. Coredevrec: Automatic core member recommendation for contribution evaluation. Journal of Computer Science and Technology 30, 5 (2015), 998–1016.
[13] Dezhen Kong, Qiuyuan Chen, Lingfeng Bao, Chenxing Sun, Xin Xia, and Shanping Li. 2022. Recommending Code Reviewers for Proprietary Software Projects: A Large Scale Study. In Proceedings of the 29th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 630–640.
[14] Ruiyin Li, Peng Liang, and Paris Avgeriou. 2022. Warnings: Violation symptoms indicating architecture erosion. arXiv preprint, https://arxiv.org/abs/2212.12168 (2022).
[15] Ruiyin Li, Peng Liang, and Paris Avgeriou. 2023. Replication Package. https://doi.org/10.5281/zenodo.7292880.
[16] Ruiyin Li, Peng Liang, Mohamed Soliman, and Paris Avgeriou. 2021. Understanding architecture erosion: The practitioners’ perceptive. In Proceedings of the 29th IEEE/ACM International Conference on Program Comprehension (ICPC). IEEE, Madrid, Spain, 311–322.
[17] Ruiyin Li, Peng Liang, Mohamed Soliman, and Paris Avgeriou. 2022. Understanding software architecture erosion: A systematic mapping study. Journal of Software: Evolution and Process 34, 3 (2022), e2423.
[18] Ruiyin Li, Mohamed Soliman, Peng Liang, and Paris Avgeriou. 2022. Symptoms of architecture erosion in code reviews: A study of two OpenStack projects. In Proceedings of the 19th IEEE International Conference on Software Architecture (ICSA). IEEE, Honolulu, Hawaii, USA, 24–35.
[19] Jakub Lipcak and Bruno Rossi. 2018. A large-scale study on source code reviewer recommendation. In Proceedings of the 44th Euromicro Conference on Software Engineering and Advanced Applications (SEAA). IEEE, Prague, Czech Republic, 378–387.
[20] Cristiano Maffort, Marco Tulio Valente, Ricardo Terra, Mariza Bigonha, Nicolas Anquetil, and André Hora. 2016. Mining architectural violations from version history. Empirical Software Engineering 21, 3 (2016), 854–895.
[21] Ali Ouni, Raula Gaikovina Kula, and Katsuro Inoue. 2016. Search-based peer reviewers recommendation in modern code review. In Proceedings of the 32nd IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, Raleigh, NC, USA, 367–377.
[22] Mohammad Masudur Rahman, Chanchal K Roy, and Raula G Kula. 2017. Predicting usefulness of code review comments using textual features and developer experience. In Proceedings of the 14th IEEE/ACM International Conference on Mining Software Repositories (MSR). IEEE, 215–226.
[23] Soumaya Rebai, Abderrahmen Amich, Somayeh Molaei, Marouane Kessentini, and Rick Kazman. 2020. Multi-objective code reviewer recommendations: Balancing expertise, availability and collaborations. Automated Software Engineering 27, 3 (2020), 301–328.
[24] Shade Ruangwan, Patanamon Thongtanunam, Akinori Ihara, and Kenichi Matsumoto. 2019. The impact of human factors on the participation decision of reviewers in modern code review. Empirical Software Engineering 24, 2 (2019), 973–1016.
[25] Emre Sülün, Eray Tüzün, and Uğur Doğrusöz. 2021. RSTrace+: Reviewer suggestion using software artifact traceability graphs. Information and Software Technology 130 (2021), 106455.
[26] Patanamon Thongtanunam, Chakkrit Tantithamthavorn, Raula Gaikovina Kula, Norihiro Yoshida, Hajimu Iida, and Ken-ichi Matsumoto. 2015. Who should review my code? A file location-based code-reviewer recommendation approach for modern code review. In Proceedings of the 22nd IEEE International Conference on Software Analysis, Evolution, and Reengineering (SANER). IEEE, 141–150.
[27] Colin C Venters, Rafael Capilla, Stefanie Betz, Birgit Penzenstadler, Tom Crick, Steve Crouch, Elisa Yumi Nakagawa, Christoph Becker, and Carlos Carrillo. 2018. Software sustainability: Research and practice from a software architecture viewpoint. Journal of Systems and Software 138 (2018), 174–188.
[28] Xin Xia, David Lo, Xinyu Wang, and Xiaohu Yang. 2015. Who should review this change?: Putting text and file location analyses together for more accurate recommendations. In Proceedings of the 31st IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, Bremen, Germany, 261–270.
[29] Yue Yu, Huaimin Wang, Gang Yin, and Tao Wang. 2016. Reviewer recommendation for pull-requests in GitHub: What can we learn from code review and bug assignment? Information and Software Technology 74 (2016), 204–218.
[30] Motahareh Bahrami Zanjani, Huzefa Kagdi, and Christian Bird. 2015. Automatically recommending peer reviewers in modern code review. IEEE Transactions on Software Engineering 42, 6 (2015), 530–543.
[31] H Alperen Çetin, Emre Doğan, and Eray Tüzün. 2021. A review of code reviewer recommendation studies: Challenges and future directions. Science of Computer Programming 208 (2021), 102652.
