Taming Google-Scale Continuous Testing
Abstract—Growth in Google's code size and feature churn rate has seen increased reliance on continuous integration (CI) and testing to maintain quality. Even with enormous resources dedicated to testing, we are unable to regression test each code change individually, resulting in increased lag time between code check-ins and test result feedback to developers. We report results of a project that aims to reduce this time by: (1) controlling test workload without compromising quality, and (2) distilling test results data to inform developers, while they write code, of the impact of their latest changes on quality. We model, empirically understand, and leverage the correlations that exist between our code, test cases, developers, programming languages, and code-change and test-execution frequencies, to improve our CI and development processes. Our findings show: very few of our tests ever fail, but those that do are generally "closer" to the code they test; certain frequently modified code and certain users/tools cause more breakages; and code recently modified by multiple developers (more than 3) breaks more often.

Keywords—software testing, continuous integration, selection.

I. INTRODUCTION

The decades-long successful advocacy of software testing for improving/maintaining software quality has positioned it at the very core of today's large continuous integration (CI) systems [1]. For example, Google's Test Automation Platform (TAP) system [2], responsible for CI of the vast majority of Google's 2 Billion LOC codebase—structured largely as a single monolithic code tree [3]—would fail to prevent regressions in Google's code without its testing-centric design. However, this success of testing comes with the cost of extensive compute cycles. In an average day, TAP integrates and tests—at enormous compute cost—more than 13K code projects, requiring 800K builds and 150 Million test runs.

Even with Google's massive compute resources, TAP is unable to keep up with the developers' code churn rate—a code commit every second on average—i.e., it is not cost effective to test each code commit individually. In the past, TAP tried to test each code change, but found that the compute resources were growing quadratically with two multiplicative linear factors: (1) the code submission rate, which (for Google) has been growing roughly linearly, and (2) the size of the test pool, which also has been growing linearly. This caused unsustainable demand for compute resources, hence TAP invented a mechanism to slow down one of the linear factors by breaking a TAP day into a sequence of epochs called milestones, each of which integrates and tests a snapshot of Google's codebase. That is, TAP's milestone strategy is to bundle a number of consecutive code commits together, and run (or cut) a milestone as frequently as possible given the available compute resources.

A milestone is typically cut every 45 minutes during peak development time, meaning that, in the best case, a developer who submitted code has to wait for at least one milestone before being notified of test failures. In practice, however, because the TAP infrastructure is large and complex, with multiple interconnected parts, designed to deal with large milestone sizes—as large as 4.2 million tests as selected using reverse dependencies on changed source files since the previous milestone—it is susceptible to additional delays caused by Out of Memory errors, machine failures, and other infrastructure problems. In our work, we have observed unacceptably large delays of up to 9 hours.

In this paper, we describe a project that had two goals for aiding developers and reducing test turnaround time. First, we wanted to reduce TAP's workload by avoiding frequently re-executing test cases that were highly unlikely to fail. For example, one of our results showed that of the 5.5 Million affected tests that we analyzed for a time period, only 63K ever failed. Yet, TAP treated all these 5.5 Million tests the same in terms of execution frequency. Valuable resources may have been saved and test turnaround time reduced, had most of the "always passing" tests been identified ahead of time and executed less frequently than the "more likely to fail" tests.

Our second goal was to distill TAP's test results data, and present it to developers as actionable items to inform code development. For example, one such item that our project yielded was "You are 97% likely to cause a breakage because you are editing a Java source file modified by 15 other developers in the last 30 days." Armed with such timely data-driven guidance, developers may take preemptive measures to prevent breakages, e.g., by running more comprehensive pre-submit tests, inviting a more thorough code review, adding test cases, and running static analysis tools.

Because we needed to deploy our results in an industry setting, our project faced a number of practical constraints, some due to resources and others stemming from Google's coding and testing practices that have evolved over years to deal with scale and maximize productivity.
First, Google's notion of a "test" is different from what we understand to be a "test case" or "test suite." Google uses the term "test target," which is essentially a buildable and executable code unit labeled as a test in a meta BUILD file. A test target may be a suite of JUnit test cases, or a single Python test case, or a collection of end-to-end test scripts. For our work, this meant that we needed to interpret a FAILED outcome of a test target in one of several ways, e.g., failure of a single JUnit test case that is part of the test target, or a scripted end-to-end test, or a single test case. Hence, we could not rely on obtaining a traditional fault matrix [4] that maps individual test cases to faults; instead, we had sequences of time-stamped test target outcomes. Moreover, the code covered by a test target needed to be interpreted as the union of all code elements covered by its constituent test cases. Again, we could not rely on obtaining a traditional coverage matrix [4] that maps individual test cases to elements they cover.

Second, we had strict timing and resource restrictions. We could not run a code instrumenter at each milestone on the massive codebase and collect code coverage numbers because this would impose too large an overhead to be practical. Indeed, just writing and updating the code coverage reports to disk in a timely manner would be an impossible task. We also did not have tools that could instrument the multiple programming languages (Java, C++, Go, Python, etc.) that form Google's codebase and produce results that were compatible across languages for uniform analysis. Moreover, the code churn rates would quickly render the code coverage reports obsolete, requiring frequent updates. The above two constraints meant that we could not rely on the availability of fault and coverage matrices, used by conventional regression test selection/prioritization approaches [5] that require exact mappings between code elements (e.g., statements [6], methods [7]), requirements [8], and test cases/suites.

Third, the reality of practical testing in large organizations is the presence of tests whose PASSED/FAILED outcome may be impacted by uncontrollable/unknown factors, e.g., the response time of a server; these are termed "flaky" tests [9] [10]. A flaky test may, for example, FAIL because a resource is unavailable/unresponsive at the time of its execution. The same test may PASS for the same code if it is executed at a different time, after the resource became available. Flaky tests exist for various reasons [11] [12] and it is impossible to weed out all flaky tests [13]. For our work, this meant that we could not rely on regression test selection heuristics such as "rerun tests that failed recently" [14] [15], as we would end up mostly re-running flaky tests [1].

Because of these constraints, we decided against using approaches that rely on fine-grained information per test case, e.g., exact mappings between test cases and code/requirements elements, or PASSED/FAILED histories. Instead, we developed an empirical approach, guided by domain expertise and statistical analysis, to model and understand the factors that cause our test targets to reveal breakages (transitions from PASSED-to-FAILED) and fixes (FAILED-to-PASSED). This approach also worked well with our goal of developing data-driven guidelines for developers because it yielded generalized, high-level relationships between our artifacts of interest.
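Concretely, because our raw data is a sequence of time-stamped test-target outcomes rather than a per-test-case fault matrix, breakages and fixes reduce to transitions within each target's outcome sequence. The sketch below (Python) illustrates that reduction; the record layout and every name in it are our own illustrative assumptions, not TAP's actual schema.

from typing import Iterable, List, Tuple

# Illustrative record: (timestamp, changelist_id, status), where status is
# one of "PASSED", "FAILED", "AFFECTED" (selected at that CL but not run).
Run = Tuple[int, int, str]

def transitions(history: Iterable[Run]) -> List[Tuple[int, int, str]]:
    """Return (from_cl, to_cl, kind) for each breakage or fix in one target's history."""
    events = []
    prev_cl, prev_status = None, None
    for ts, cl, status in sorted(history):       # order runs by timestamp
        if status not in ("PASSED", "FAILED"):   # skip runs with no verdict
            continue
        if prev_status == "PASSED" and status == "FAILED":
            events.append((prev_cl, cl, "breakage"))
        elif prev_status == "FAILED" and status == "PASSED":
            events.append((prev_cl, cl, "fix"))
        prev_cl, prev_status = cl, status
    return events

A target whose history contains no such transitions never contributes a breakage or fix signal, which, as the numbers below show, is by far the most common case.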
In particular, we modeled the relationships between our test targets and developers, code under test, and code-change and test-execution frequencies. We found that
• looking at the overall test history of 5.5 Million affected tests in a given time period, only 63K ever failed; the rest never failed even once.
• of all test executions we examined, only a tiny fraction (1.23%) actually found a test breakage (or a code fix) being introduced by a developer. The entire purpose of TAP's regression testing cycle is to find this tiny percent of tests that are of interest to developers.
• the ratio of PASSED vs. FAILED test targets per code change is 99:1, which means that test turnaround time may be significantly reduced if tests that almost never FAIL, when affected, are re-executed less frequently than tests that expose breakages/fixes.
• modeling our codebase as a code dependency graph, we found that test targets that are more than a distance of 10 (in terms of number of dependency edges) from the changed code hardly ever break.
• most of our files are modified infrequently (once or twice in a month), but those modified more frequently often cause breakages.
• certain file types are more prone to breakages.
• certain users/tools are more likely to cause breakages.
• files modified within a short time span by 3 (or more) developers are significantly more likely to cause breakages compared to 2 developers.
• while our code changes affect a large number of test targets, they do so with widely varying frequencies per target, and hence, our test targets need to be treated differently for test scheduling.

These findings have significant practical implications for Google, which is investing in continued research as well as applied techniques that have real, practical impact on developer productivity while reducing compute costs. In particular, we want to reduce the resources used in our CI system while not degrading the PASSED/FAILED signal provided to our developers. This research has shown that more than 99% of all tests run by the CI system pass or flake, and it has identified a first set of signals that will allow us to schedule fewer tests while retaining high probability of detecting real faults, using which we can improve the ratio of change (fault or fix) detection per unit of compute resource spent. The key to our success is to perform this reduction while simultaneously retaining near certainty of finding real program faults when they are inserted; this research enables that goal. Specifically, from this research we plan to expand the set of signals about when tests actually fail, and use that information to improve test selection, running fewer tests while retaining high confidence that faults will be detected. We then plan to feed these signals into a Machine Learning tool to produce a single signal for reducing the set of selected tests. We also plan to provide feedback to developers—prior to code submission—that certain types of changes are more likely to break and should be qualified and scrutinized more closely. For example, our data shows that a single file changed many times by …
In Section II, we describe how TAP works, and in Section III describe the nature of our data. In Section IV, we develop the hypotheses for our project and discuss results. We present related work in Section V, and conclude in Section VI.

… which in turn require 6 more packages. Even though the de- …
Fig. 5. Modeling Distance. (Dependency graph over nodes such as //ui/jface/tests/viewers/AllTests, //ui/jface/tests/layout, //ui/jface/tests/viewers/TreeViewerTest, //ui/jface/tests/viewers/TreeViewerColumnTest, //ui/jface/tests/viewers/interactive, //ui/jface/tests/viewers/interactive/AddElementAction.java, //ui/jface/tests/viewers/TestModel.java, and //ui/jface/tests/viewers/interactive/TestElement.java.)

It is this structure that TAP employs to compute the set of AFFECTED test targets for a given committed code change. Assuming that the file TestElement.java has changed in a CL, TAP uses a reverse dependency graph to compute all test targets that may be impacted by the change. In our example, the top-level node org/eclipse/platform/ui:ui_tests happens to be a test target, and hence, is added to the set of AFFECTED targets. Because Google's codebase is very large, the set of AFFECTED targets can get quite large. In our work, we have seen set sizes as large as 1.6 Million.
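As an illustration of this mechanism, the sketch below walks a reverse dependency graph outward from the changed files and collects every transitively dependent node that is labeled as a test target. The graph encoding and the names reverse_deps and is_test_target are assumptions made for the example, not TAP's internal representation.

from collections import deque
from typing import Dict, Iterable, Set

def affected_targets(changed_files: Iterable[str],
                     reverse_deps: Dict[str, Set[str]],
                     is_test_target: Set[str]) -> Set[str]:
    """Walk reverse dependency edges (file/target -> its dependents) and
    collect every reachable node that is labeled as a test target."""
    changed = list(changed_files)
    affected: Set[str] = set()
    seen = set(changed)
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        for dependent in reverse_deps.get(node, ()):
            if dependent in seen:
                continue
            seen.add(dependent)
            if dependent in is_test_target:
                affected.add(dependent)
            queue.append(dependent)
    return affected

In the Figure 5 example, starting from TestElement.java such a walk reaches ui_tests, which is therefore added to the AFFECTED set.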
We define the term MinDist as the shortest distance (in terms of number of directed edges) between two nodes in our dependency graph. In our example from Figure 5, the MinDist between ui_tests and TestElement.java is 5 (we write, in functional notation, MinDist(ui_tests, TestElement.java) = 5). In our work on Google's code repository, we have seen …

We are interested in MinDist values for test targets that transitioned from PASSED to FAILED (breakage) and FAILED to PASSED (fix) for a given change. We call these our edge targets; our developers are most interested in these edge targets as they provide information regarding fixes and breakages.

Because of the way Google reports test target results, in terms of test target outcome per CL, we need to define MinDist per CL and test target pair, instead of per file and test target pair.

Definition: For a given test target Tj and an affecting changelist CLi, we say that the relation MinDist(CLi, Tj) = n holds iff there exists a file F modified at CLi such that MinDist(Tj, F) = n. □

Note that MinDist(CLi, Tj) as defined above is a relation, not a function, i.e., MinDist(CLi, Tj) = n may hold for several values of n, determined by our original MinDist() function defined for a file and test target pair.

Next, we develop the MinDist relation for a specific test target Tj. Intuitively, this relation holds for all values returned by our original MinDist() function for all constituent files modified in every affecting CL.

Definition: For a test target Tj, we say the relation MinDist(Tj) = n holds iff there exists a changelist CLi that affects Tj and MinDist(CLi, Tj) = n also holds. □
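Both definitions can be read directly as code: MinDist between a test target and a file is a shortest-path length over the directed dependency edges, and MinDist(CLi, Tj) collects that value for every file modified in the changelist. The following sketch uses a plain adjacency-map graph and illustrative names; it is our reading of the definitions, not the implementation used at Google.

from collections import deque
from typing import Dict, Iterable, Optional, Set

def min_dist(target: str, file: str, deps: Dict[str, Set[str]]) -> Optional[int]:
    """Shortest number of directed dependency edges from a test target to a file
    (None if the target does not depend on the file)."""
    dist = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        if node == file:
            return dist[node]
        for dep in deps.get(node, ()):
            if dep not in dist:
                dist[dep] = dist[node] + 1
                queue.append(dep)
    return None

def min_dist_cl(target: str, changed_files: Iterable[str],
                deps: Dict[str, Set[str]]) -> Set[int]:
    """MinDist(CL, T) as a relation: one value per modified file that the
    target actually reaches in the dependency graph."""
    values = {min_dist(target, f, deps) for f in changed_files}
    return {v for v in values if v is not None}

MinDist(Tj) is then simply the collection of all such values over every changelist that affects Tj; the distributions discussed next are computed from those values.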
Given all the MinDist values for a test target Tj, we can compute the probability that MinDist(Tj) = x for all values of x. We show (Figure 7, smoothed for visualization) the probability distribution of one such test target Tj from our data. The plot shows that most (25%) of the MinDist values for Tj were 10, followed by 18, 21, and so on. There were none beyond 22 or lower than 7.

Fig. 7. MinDist Values for Tj plotted as a Smoothed Curve. (Probability vs. MinDist.)

We computed the same probabilities for all the test targets in our dataset. Aggregating them gave us the probability distribution of our entire population, as shown in Figure 8. As expected, this curve follows the trend shown in Figure 6.
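The per-target distributions behind Figures 7 and 8 are empirical histograms over the observed MinDist values. A minimal sketch, under the same illustrative assumptions as above:

from collections import Counter
from typing import Dict, Iterable

def mindist_distribution(mindist_values: Iterable[int]) -> Dict[int, float]:
    """Empirical probability that MinDist(T) = x, from all observed values for T."""
    counts = Counter(mindist_values)
    total = sum(counts.values())
    return {x: c / total for x, c in sorted(counts.items())}

# Hypothetical input shaped like Figure 7: the mode of the distribution is at 10.
# mindist_distribution([10, 10, 10, 18, 18, 21, 7, 9, 22])[10] -> 0.33...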
We are interested in our edge targets, so we should eliminate all non-edge test target information. Moreover, each CL describes multiple file modifications, which means that we will have multiple MinDist values per CL (one for each file) and test target pair; without loss of accuracy, we choose to retain only the smallest from this set. If we exclude all test targets except our edge targets, and retain only the smallest MinDist value, from the data of Figure 8 we see the distribution shown in Figure 9. This distribution is more pronounced between MinDist values 6 and 10.

Fig. 9. Probability Distribution of Our Edge Targets.

There are two sources of noise in our data of Figure 9. The first is due to the presence of flaky test targets, and the second is an artifact of how TAP cuts milestones, i.e., because TAP does not run each affected target at each CL, we have no way to pinpoint the cause of a breakage or fix. To eliminate the first noise source, we can filter flakes, ending up with a less noisy distribution shown in Figure 10. This distribution is much more focused at MinDist = 10. This shows that most of our non-flaky edge test targets have MinDists between 5 and 10.

Any of the N changelists between the PASSED and the FAILED outcome may have caused the breakage, which was eventually detected at a milestone build. Hence, our edge test targets have extra MinDist values that most likely have nothing to do with fixes and breakages. We can eliminate this noise by considering only those edge targets that have no AFFECTED outcomes between PASSED and FAILED (also FAILED to PASSED), i.e., where we know for sure the culprit CL for a breakage or fix. Examining only this subset of edge targets, 77% of the full set, gives us the distribution shown in Figure 11, clearly contained …
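This second filter amounts to keeping only those transitions with no merely-AFFECTED (selected but not run) changelist between the last PASSED and the first FAILED outcome (or vice versa), so the culprit CL is known for sure. A sketch, reusing the illustrative history layout from the earlier example:

from typing import Iterable, List, Tuple

Run = Tuple[int, int, str]  # (timestamp, changelist_id, status) -- illustrative

def unambiguous_transitions(history: Iterable[Run]) -> List[Tuple[int, str]]:
    """Return (culprit_cl, kind) only for breakages/fixes with no intervening
    AFFECTED-but-not-run changelists, so the culprit CL is unambiguous."""
    events = []
    prev_status, gap = None, False
    for ts, cl, status in sorted(history):
        if status == "AFFECTED":        # selected at this CL but not executed
            gap = True
            continue
        if prev_status == "PASSED" and status == "FAILED" and not gap:
            events.append((cl, "breakage"))
        elif prev_status == "FAILED" and status == "PASSED" and not gap:
            events.append((cl, "fix"))
        prev_status, gap = status, False
    return events

Per the data above, this unambiguous subset covers 77% of our edge targets, and for those targets each breakage or fix is attributable to a single CL.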
A file that is modified very frequently is more likely to be in these edge CLs. Because of the nature of our data, our granularity for code is a source code file. In our dataset, we saw files being modified as frequently as 42 times, and as little as once. We …
[Figure residue: Probability vs. MinDist (ticks 1–41); axis label "Number of Times File in CL"; annotations "MinDist = 10; 1,133,612" and "Area under MinDist = 10 = 58%".]