Software Engineering for Machine Learning: A Case Study
Saleema Amershi, Microsoft Research, Redmond, WA, USA (samershi@microsoft.com)
Andrew Begel, Microsoft Research, Redmond, WA, USA (andrew.begel@microsoft.com)
Christian Bird, Microsoft Research, Redmond, WA, USA (cbird@microsoft.com)
Robert DeLine, Microsoft Research, Redmond, WA, USA (rdeline@microsoft.com)
Harald Gall, University of Zurich, Zurich, Switzerland (gall@ifi.uzh.ch)
Abstract—Recent advances in machine learning have stimulated widespread interest within the Information Technology sector in integrating AI capabilities into software and services. This goal has forced organizations to evolve their development processes. We report on a study that we conducted on observing software teams at Microsoft as they develop AI-based applications. We consider a nine-stage workflow process informed by prior experiences developing AI applications (e.g., search and NLP) and data science tools (e.g., application diagnostics and bug reporting). We found that various Microsoft teams have united this workflow into preexisting, well-evolved, Agile-like software engineering processes, providing insights about several essential engineering challenges that organizations may face in creating large-scale AI solutions for the marketplace. We collected some best practices from Microsoft teams to address these challenges. In addition, we have identified three aspects of the AI domain that make it fundamentally different from prior software application domains: 1) discovering, managing, and versioning the data needed for machine learning applications is much more complex and difficult than other types of software engineering, 2) model customization and model reuse require very different skills than are typically found in software teams, and 3) AI components are more difficult to handle as distinct modules than traditional software components: models may be "entangled" in complex ways and experience non-monotonic error behavior. We believe that the lessons learned by Microsoft teams will be valuable to other organizations.

Index Terms—AI, software engineering, process, data

I. INTRODUCTION

Personal computing. The Internet. The Web. Mobile computing. Cloud computing. Nary a decade goes by without a disruptive shift in the dominant application domain of the software industry. Each shift brings with it new software engineering goals that spur software organizations to evolve their development practices in order to address the novel aspects of the domain.

The latest trend to hit the software industry is around integrating artificial intelligence (AI) capabilities based on advances in machine learning. AI broadly includes technologies for reasoning, problem solving, planning, and learning, among others. Machine learning refers to statistical modeling techniques that have powered recent excitement in the software and services marketplace. Microsoft product teams have used machine learning to create application suites such as Bing Search or the Cortana virtual assistant, as well as platforms such as Microsoft Translator for real-time translation of text, voice, and video, Cognitive Services for vision, speech, and language understanding for building interactive, conversational agents, and the Azure AI platform to enable customers to build their own machine learning applications [1]. To create these software products, Microsoft has leveraged its preexisting capabilities in AI and developed new areas of expertise across the company.

In this paper, we describe a study in which we learned how various Microsoft software teams build software applications with customer-focused AI features. For that, Microsoft has integrated existing Agile software engineering processes with AI-specific workflows informed by prior experiences in developing early AI and data science applications. In our study, we asked Microsoft employees how they worked through the growing challenges of daily software development specific to AI, as well as the larger, more essential issues inherent in the development of large-scale AI infrastructure and applications. With teams across the company having differing amounts of work experience in AI, we observed that many issues reported by newer teams dramatically drop in importance as the teams mature, while some remain as essential to the practice of large-scale AI. We have made a first attempt to create a process maturity metric to help teams identify how far they have come on their journeys to building AI applications.

As a key finding of our analyses, we discovered three fundamental differences in building applications and platforms for training and fielding machine-learning models compared with prior application domains. First, machine learning is all about data. The amount of effort and rigor it takes to discover, source, manage, and version data is inherently more complex and different than doing the same with software code. Second, building for customizability and extensibility of models requires teams not only to have software engineering skills but almost
always to have deep enough knowledge of machine learning to build, evaluate, and tune models from scratch. Third, it can be more difficult to maintain strict module boundaries between machine learning components than for software engineering modules. Machine learning models can be "entangled" in complex ways that cause them to affect one another during training and tuning, even if the software teams building them intended for them to remain isolated from one another.

The lessons we identified via studies of a variety of teams at Microsoft who have adapted their software engineering processes and practices to integrate machine learning can help other software organizations embarking on their own paths towards building AI applications and platforms.

In this paper, we offer the following contributions.
1) A description of how several Microsoft software engineering teams work, cast into a nine-stage workflow for integrating machine learning into application and platform development.
2) A set of best practices for building applications and platforms relying on machine learning.
3) A custom machine-learning process maturity model for assessing the progress of software teams towards excellence in building AI applications.
4) A discussion of three fundamental differences in how software engineering applies to machine-learning–centric components vs. previous application domains.

II. BACKGROUND

A. Software Engineering Processes

The changing application domain trends in the software industry have influenced the evolution of the software processes practiced by teams at Microsoft. For at least a decade and a half, many teams have used feedback-intense Agile methods to develop their software [2], [3], [4] because they needed to be responsive in addressing changing customer needs through faster development cycles. Agile methods have been helpful at supporting further adaptation, for example, the most recent shift to re-organize numerous teams' practices around DevOps [5], which better matched the needs of building and supporting cloud computing applications and platforms.1 The change to DevOps occurred fairly quickly because these teams were able to leverage prior capabilities in continuous integration and diagnostic-gathering, making it simpler to implement continuous delivery.

1 https://docs.microsoft.com/en-us/azure/devops/learn/devops-at-microsoft/

Process changes not only alter the day-to-day development practices of a team, but also influence the roles that people play. Fifteen years ago, many teams at Microsoft relied heavily on development triads consisting of a program manager (requirements gathering and scheduling), a developer (programming), and a tester (testing) [6]. These teams' adoption of DevOps combined the roles of developer and tester and integrated the roles of IT, operations, and diagnostics into the mainline software team.

In recent years, teams have increased their abilities to analyze diagnostics-based customer application behavior, prioritize bugs, estimate failure rates, and understand performance regressions through the addition of data scientists [7], [8], who helped pioneer the integration of statistical and machine learning workflows into software development processes. Some software teams employ polymath data scientists, who "do it all," but as data science needs to scale up, their roles specialize into domain experts who deeply understand the business problems, modelers who develop predictive models, and platform builders who create the cloud-based infrastructure.

B. ML Workflow

One commonly used machine learning workflow at Microsoft has been depicted in various forms across industry and research [1], [9], [10], [11]. It has commonalities with prior workflows defined in the context of data science and data mining, such as TDSP [12], KDD [13], and CRISP-DM [14]. Despite the minor differences, these representations have in common the data-centered essence of the process and the multiple feedback loops among the different stages. Figure 1 shows a simplified view of the workflow consisting of nine stages.

Fig. 1. The nine stages of the machine learning workflow. Some stages are data-oriented (e.g., collection, cleaning, and labeling) and others are model-oriented (e.g., model requirements, feature engineering, training, evaluation, deployment, and monitoring). There are many feedback loops in the workflow. The larger feedback arrows denote that model evaluation and monitoring may loop back to any of the previous stages. The smaller feedback arrow illustrates that model training may loop back to feature engineering (e.g., in representation learning).

In the model requirements stage, designers decide which features are feasible to implement with machine learning and which can be useful for a given existing product or for a new one. Most importantly, in this stage, they also decide what types of models are most appropriate for the given problem. During data collection, teams look for and integrate available datasets (e.g., internal or open source) or collect their own. Often, they might train a partial model using available generic datasets (e.g., ImageNet for object detection), and then use transfer learning together with more specialized data to
train a more specific model (e.g., pedestrian detection). Data cleaning involves removing inaccurate or noisy records from the dataset, a common activity in all forms of data science. Data labeling assigns ground-truth labels to each record. For example, an engineer might have a set of images on hand which have not yet been labeled with the objects present in the image. Most supervised learning techniques require labels to be able to induce a model. Other techniques (e.g., reinforcement learning) use demonstration data or environment rewards to adjust their policies. Labels can be provided by engineers themselves, by domain experts, or by crowd workers on online crowd-sourcing platforms.

Feature engineering refers to all activities that are performed to extract and select informative features for machine learning models. For some models (e.g., convolutional neural networks), this stage is less explicit and often blended with the next stage, model training. During model training, the chosen models (using the selected features) are trained and tuned on the clean, collected data and their respective labels. Then, in model evaluation, the engineers evaluate the output model on tested or safeguard datasets using pre-defined metrics. For critical domains, this stage might also involve extensive human evaluation. The inference code of the model is then deployed on the targeted device(s) and continuously monitored for possible errors during real-world execution.

For simplicity, the view in Figure 1 is linear; however, machine learning workflows are highly non-linear and contain several feedback loops. For example, if engineers notice that there is a large distribution shift between the training data and the data in the real world, they might want to go back and collect more representative data and rerun the workflow. Similarly, they may revisit their modeling choices made in the first stage if the problem evolves or if better algorithms are invented. While feedback loops are typical in Agile software processes, the peculiarity of the machine learning workflow is related to the amount of experimentation needed to converge to a good model for the problem. Indeed, the day-to-day work of an engineer doing machine learning involves frequent iterations over the selected model, hyper-parameters, and dataset refinement. Similar experimental properties have been observed in the past in scientific software [15] and hardware/software co-design [16]. This workflow can become even more complex if the system is integrative, containing multiple ML components which interact together in complex and unexpected ways [17].
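To make the shape of the nine stages and the experimentation loop concrete, the following is a minimal, hypothetical sketch in Python using scikit-learn and synthetic data. It illustrates only the structure of the workflow described above; it is not the pipeline of any Microsoft team, and every function and dataset in it is an assumption made for illustration.

```python
# Illustrative sketch of the nine-stage workflow in Figure 1 (assumed names,
# synthetic data). Each function stands in for one stage; the loop at the
# bottom mimics the feedback from evaluation back into training choices.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def collect_data():
    # Data collection: stand-in for gathering internal or open datasets.
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    return X, y

def clean_data(X, y):
    # Data cleaning: drop records with missing or non-finite values.
    mask = np.isfinite(X).all(axis=1)
    return X[mask], y[mask]

def engineer_features(X_train, X_test):
    # Feature engineering: fit the transform on training data only.
    scaler = StandardScaler().fit(X_train)
    return scaler.transform(X_train), scaler.transform(X_test)

def train_model(X, y, regularization):
    # Model training with one tunable hyper-parameter.
    return LogisticRegression(C=regularization, max_iter=1000).fit(X, y)

def evaluate_model(model, X, y):
    # Model evaluation on a held-out (safeguard) set with a pre-defined metric.
    return accuracy_score(y, model.predict(X))

X, y = clean_data(*collect_data())          # labels come with the synthetic data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
X_tr, X_te = engineer_features(X_tr, X_te)

# Experimentation loop: evaluation feeds back into training choices.
best = max((evaluate_model(train_model(X_tr, y_tr, c), X_te, y_te), c)
           for c in (0.01, 0.1, 1.0, 10.0))
print("best accuracy %.3f with C=%s" % best)  # deployment and monitoring would follow
```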
C. Software Engineering for Machine Learning

The need for adjusting software engineering practices in the recent era has been discussed in the context of hidden technical debt [18] and troubleshooting integrative AI [19], [20]. This work identifies various aspects of ML system architecture and requirements which need to be considered during system design. Some of these aspects include hidden feedback loops, component entanglement and eroded boundaries, non-monotonic error propagation, continuous quality states, and mismatches between the real world and evaluation sets. On a related line of thought, recent work also discusses the impact that the use of ML-based software has on the risk and safety concerns of ISO standards [21]. In the last five years, there have been multiple efforts in industry to automate this process by building frameworks and environments to support the ML workflow and its experimental nature [1], [22], [23]. However, ongoing research and surveys show that engineers still struggle to operationalize and standardize working processes [9], [24], [23]. The goal of this work is to uncover detailed insights into ML-specific best practices used by developers at Microsoft. We share these insights with the broader community, hoping that such take-away lessons will be valuable to other companies and engineers.

D. Process Maturity

Software engineers face a constantly changing set of platforms and technologies that they must learn to build the newest applications for the software marketplace. Some engineers learn new methods and techniques in school and bring them to the organizations they work for. Others learn new skills on the job or on the side, as they anticipate their organization's need for latent talent [25]. Software teams, composed of individual engineers with varying amounts of experience in the skills necessary to professionally build ML components and their support infrastructure, themselves exhibit varying levels of proficiency depending on their aggregate experience in the domain.

The software engineering discipline has long considered software process improvement as one of its vital functions. Researchers and practitioners in the field have developed several well-known metrics to assess it, including the Capability Maturity Model (CMM) [26] and Six Sigma [27]. CMM rates the software processes of organizations on five levels: initial (ad hoc processes), repeatable, defined, capable (i.e., quantitatively measured), and efficient (i.e., deliberate process improvement). Inspired by CMM, we build a first maturity model for teams building systems and platforms that integrate machine learning components.

III. STUDY

We collected data in two phases: an initial set of interviews to gather the major topics relevant to our research questions and a wide-scale survey about the identified topics. Our study design was approved by Microsoft's Ethics Advisory Board.

A. Interviews

Because the work practice around building and integrating machine learning into software and services is still emerging and is not uniform across all product teams, there is no systematic way to identify the key stakeholders on the topic of adoption. We therefore used a snowball sampling strategy, starting with (1) leaders of teams with mature use of machine learning (ML) (e.g., Bing), (2) leaders of teams where AI is a major aspect of the user experience (e.g., Cortana), and (3) people conducting company-wide internal training in AI and ML. As we chose informants, we picked a variety of teams
to get different levels of experience and different parts of the ecosystem (products with AI components, AI frameworks and platforms, AI created for external companies). In all, we interviewed 14 software engineers, largely in senior leadership roles. These are shown in Table I. The interviews were semi-structured and specialized to each informant's role. For example, when interviewing Informant I3, we asked questions related to his work overseeing teams building the product's architectural components.

TABLE I
THE STAKEHOLDERS WE INTERVIEWED FOR THE STUDY.

Id   Role                 Product Area        Manager?
I1   Applied Scientist    Search              Yes
I2   Applied Scientist    Search              Yes
I3   Architect            Conversation        Yes
I4   Engineering Manager  Vision              Yes
I5   General Manager      ML Tools            Yes
I6   Program Manager      ML Tools            Yes
I7   Program Manager      Productivity Tools  Yes
I8   Researcher           ML Tools            Yes
I9   Software Engineer    Speech              Yes
I10  Program Manager      AI Platform         No
I11  Program Manager      Community           No
I12  Scientist            Ads                 No
I13  Software Engineer    Vision              No
I14  Software Engineer    Vision              No

B. Survey

Based on the results of the interviews, we designed an open-ended questionnaire whose focus was on existing work practice, challenges in that work practice, and best practices (Figure 2). We asked about challenges both directly and indirectly by asking informants to imagine "dream tools" and improvements that would make their work practice better. We sent the questionnaire to 4195 members of internal mailing lists on the topics of AI and ML. 551 software engineers responded, giving us a 13.6% response rate. For each open-response item, between two and four researchers analyzed the responses through a card sort. Then, the entire team reviewed the card sort results for clarity and consistency.

1. Part 1
1.1. Background and demographics:
1.1.1. years of AI experience
1.1.2. primary AI use case*
1.1.3. team effectiveness rating
1.1.4. source of AI components
1.2. Challenges*
1.3. Time spent on each of the nine workflow activities
1.4. Time spent on cross-cutting activities
2. Part 2 (repeated for two activities where most time spent)
2.1. Tools used*
2.2. Effectiveness rating
2.3. Maturity ratings
3. Part 3
3.1. Dream tools*
3.2. Best practices*
3.3. General comments*

Fig. 2. The structure of the study's questionnaire. An asterisk indicates an open-response item.

Respondents were fairly well spread across all divisions of the company and came from a variety of job roles: Data and applied science (42%), Software engineering (32%), Program management (17%), Research (7%), and other (1%). 21% of respondents were managers and 79% were individual contributors, helping us balance out the majority manager perspective in our interviews.

In the next sections, we discuss our interview and survey results, starting with the range of AI applications developed by Microsoft, diving into best practices that Microsoft engineers have developed to address some of the essential challenges in building large-scale AI applications and platforms, showing how the perception of the importance of the challenges changes as teams gain experience building AI applications, and finally, describing our proposed AI process maturity model.

IV. APPLICATIONS OF AI

Many teams across Microsoft have augmented their applications with machine learning and inference, some in surprising domains. We asked survey respondents for the ways that they used AI on their teams. We card sorted this data twice, once to capture the application domain in which AI was being applied, and a second time to look at the (mainly) ML algorithms used to build that application.

We found AI is used in traditional areas such as search, advertising, machine translation, predicting customer purchases, voice recognition, and image recognition, but also saw it being used in novel areas, such as identifying customer leads, providing design advice for presentations and word processing documents, providing unique drawing features, healthcare, and improving gameplay. In addition, machine learning is being used heavily in infrastructure projects to manage incident reporting, identify the most likely causes for bugs, monitor fraudulent fiscal activity, and monitor network streams for security breaches.

Respondents used a broad spectrum of ML approaches to build their applications, from classification, clustering, dynamic programming, and statistics, to user behavior modeling, social networking analysis, and collaborative filtering. Some areas of the company specialized further; for instance, Search worked heavily with ranking and relevance algorithms along with query understanding. Many divisions in the company work on natural language processing, developing tools for entity recognition, sentiment analysis, intent prediction, summarization, machine translation, ontology construction, text similarity, and connecting answers to questions. Finance and Sales have been keen to build risk prediction models and do forecasting. Internal resourcing organizations make use of decision optimization algorithms such as resource optimization, planning, pricing, bidding, and process optimization.

The takeaway for us was that integration of machine learning components is happening all over the company, not just
on teams historically known for it. Thus, we could tell that we were not just hearing from one niche corner of the company, but in fact, we received responses from a broad range of perspectives spread throughout.

V. BEST PRACTICES WITH MACHINE LEARNING IN SOFTWARE ENGINEERING

In this section, we present our respondents' viewpoints on some of the essential challenges associated with building large-scale ML applications and platforms and how they address them in their products. We categorized the challenges by card sorting interview and survey free response questions, and then used our own judgment as software engineering and AI researchers to highlight those that are essential to the practice of AI on software teams.

A. End-to-end pipeline support

As machine learning components have become more mature and integrated into larger software systems, our participants recognized the importance of integrating ML development support into the traditional software development infrastructure. They noted that having a seamless development experience covering (possibly) all the different stages described in Figure 1 was important to automation. However, achieving this level of integration can be challenging because of the different characteristics of ML modules compared with traditional software components. For example, previous work in this field [18], [19] found that variation in the inherent uncertainty (and error) of data-driven learning algorithms and complex component entanglement caused by hidden feedback loops could impose substantial changes (even in specific stages) which were previously well understood in software engineering (e.g., specification, testing, and debugging, to name a few). Nevertheless, due to the experimental and even more iterative nature of ML development, unifying and automating the day-to-day workflow of software engineers reduces overhead and facilitates progress in the field.

Respondents report leveraging internal infrastructure in the company (e.g., AEther2) or building pipelines specialized to their own use cases. It is important to develop a "rock solid data pipeline, capable of continuously loading and massaging data, enabling engineers to try out many permutations of AI algorithms with different hyper-parameters without hassle." The pipelines created by these teams are automated, supporting training, deployment, and integration of models with the product they are a part of. In addition, some pipeline engineers indicated that "rich dashboards" showing the value provided to users are useful.

Several respondents develop openly available IDEs to enable Microsoft's customers to build and deploy their models (e.g., Azure ML for Visual Studio Code3 and Azure ML Studio4). According to two of our interviewees, the goal of these environments is to help engineers discover, gather, ingest, understand, and transform data, and then train, deploy, and maintain models. In addition, these teams customize the environments to make them easier to use by engineers with varying levels of experience. "Visual tools help beginning data scientists when getting started, but once they know the ropes and branch out, such tools may get in their way and they may need something else."

2 https://www.slideshare.net/MSTechCommunity/ai-microsoft-how-we-do-it-and-how-you-can-too
3 https://marketplace.visualstudio.com/items?itemName=ms-toolsai.vscode-ai
4 https://azure.microsoft.com/en-us/services/machine-learning-studio/
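As a hedged illustration of the kind of automation respondents describe in this section, the sketch below defines a single pipeline and sweeps it over many algorithm and hyper-parameter permutations using scikit-learn's Pipeline and GridSearchCV. The internal systems respondents mention (e.g., AEther) are only stand-ins here, and the data is synthetic.

```python
# Sketch: one pipeline definition, many permutations of algorithms and
# hyper-parameters, with a summary that a "rich dashboard" could render.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

pipeline = Pipeline([("scale", StandardScaler()), ("model", LogisticRegression())])

# Each dict is one family of permutations; GridSearchCV tries them all.
search_space = [
    {"model": [LogisticRegression(max_iter=1000)], "model__C": [0.01, 0.1, 1.0, 10.0]},
    {"model": [RandomForestClassifier(random_state=0)], "model__n_estimators": [50, 200]},
]

search = GridSearchCV(pipeline, search_space, cv=5, scoring="accuracy")
search.fit(X, y)

# Print a per-permutation summary; a team dashboard would visualize cv_results_.
for mean, params in zip(search.cv_results_["mean_test_score"],
                        search.cv_results_["params"]):
    print(f"{mean:.3f}  {params}")
print("best:", search.best_params_)
```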
B. Data availability, collection, cleaning, and management

Since many machine learning techniques are centered around learning from large datasets, the success of ML-centric projects often heavily depends on data availability, quality, and management [28]. Labeling datasets is costly and time-consuming, so it is important to make them available for use within the company (subject to compliance constraints). Our respondents confirm that it is important to "reuse the data as much as possible to reduce duplicated effort." In addition to availability, our respondents focus most heavily on supporting the following data attributes: "accessibility, accuracy, authoritativeness, freshness, latency, structuredness, ontological typing, connectedness, and semantic joinability." Automation is a vital cross-cutting concern, enabling teams to more efficiently aggregate data, extract features, and synthesize labelled examples. The increased efficiency enables teams to "speed up experimentation and work with live data while they experiment with new models."

Microsoft teams have found it necessary to blend data management tools with their ML frameworks to avoid the fragmentation of data and model management activities. A fundamental aspect of data management for machine learning is the rapid evolution of data sources. Continuous changes in data may originate either from (i) operations initiated by engineers themselves, or from (ii) incoming fresh data (e.g., sensor data, user interactions). Either case requires rigorous data versioning and sharing techniques, for example: "Each model is tagged with a provenance tag that explains with which data it has been trained on and which version of the model. Each dataset is tagged with information about where it originated from and which version of the code was used to extract it (and any related features)." This practice is used for mapping datasets to deployed models or for facilitating data sharing and reusability.
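The provenance practice quoted above can be made concrete with a small sketch. The tag structure, field names, and commit identifier below are hypothetical assumptions made for illustration, not a Microsoft schema.

```python
# Sketch of provenance tagging: every dataset records where it came from and
# which code extracted it; every model records which dataset it was trained on.
from dataclasses import dataclass, field
from datetime import datetime, timezone
import hashlib

@dataclass(frozen=True)
class DatasetTag:
    name: str
    origin: str               # e.g., a telemetry feed, sensor source, or upstream table
    extraction_code: str      # version of the code that extracted the data and features
    content_hash: str         # fingerprint of the records themselves

@dataclass(frozen=True)
class ModelTag:
    model_name: str
    model_version: str
    trained_on: DatasetTag
    trained_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

def fingerprint(rows):
    # Hash the serialized records so a changed dataset gets a new identity.
    return hashlib.sha256("\n".join(map(str, rows)).encode("utf-8")).hexdigest()[:12]

rows = [("user_1", 3.2, 1), ("user_2", 0.7, 0)]            # hypothetical records
data_tag = DatasetTag("clicks_daily", "telemetry_feed", "git:4f2a9c1", fingerprint(rows))
model_tag = ModelTag("click_ranker", "2.3.0", trained_on=data_tag)

# The tag pair answers "which data was this deployed model trained on?"
print(model_tag.trained_on.name, model_tag.trained_on.content_hash)
```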
C. Education and Training

The integration of machine learning continues to become more ubiquitous in customer-facing products; for example, machine learning components are now widely used in productivity software (e.g., email, word processing) and embedded devices (i.e., edge computing). Thus, engineers with traditional software engineering backgrounds need to learn how to work alongside the ML specialists. A variety of players within Microsoft have found it incredibly valuable to scaffold their engineers' education in a number of ways. First, the company hosts a twice-yearly internal conference on machine learning and data science, with at least one day devoted to introductions
to the basics of technologies, algorithms, and best practices. In addition, employees give talks about internal tools and the engineering details behind novel projects and product features, and researchers present cutting-edge advances they have seen at and contributed to academic conferences. Second, a number of Microsoft teams host weekly open forums on machine learning and deep learning, enabling practitioners to get together and learn more about AI. Finally, mailing lists and online forums with thousands of participants enable anyone to ask and answer technical and pragmatic questions about AI and machine learning, as well as frequently share recent results from academic conferences.

D. Model Debugging and Interpretability

Debugging activities for components that learn from data focus not only on programming bugs, but also on inherent issues that arise from model errors and uncertainty. Understanding when and how models fail to make accurate predictions is an active research area [29], [30], [31], which is attracting more attention as ML algorithms and optimization techniques become more complex. Several survey respondents and the larger Explainable AI community [32], [33] propose to use more interpretable models, or to develop visualization techniques that make black-box models more interpretable. For larger, multi-model systems, respondents apply modularization in a conventional, layered, and tiered software architecture to simplify error analysis and debuggability.

E. Model Evolution, Evaluation, and Deployment

ML-centric software goes through frequent revisions initiated by model changes, parameter tuning, and data updates, the combination of which has a significant impact on system performance. A number of teams have found it important to employ rigorous and agile techniques to evaluate their experiments. They developed systematic processes by adopting combo-flighting techniques (i.e., flighting a combination of changes and updates), including multiple metrics in their experiment score cards, and performing human-driven evaluation for more sensitive data categories. One respondent's team uses "score cards for the evaluation of flights and storing flight information: How long has it been flighted, metrics for the flight, etc." Automating tests is as important in machine learning as it is in software engineering; teams create carefully put-together test sets that capture what their models should do. However, it is important that a human remains in the loop. One respondent said, "we spot check and have a human look at the errors to see why this particular category is not doing well, and then hypothesize to figure out problem source."
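A sketch of what such an evaluation step might look like follows. The metric set, thresholds, and flight naming are illustrative assumptions rather than any team's actual score card or gating policy.

```python
# Sketch: a score card with multiple metrics on a curated safeguard test set,
# plus a list of misclassified examples set aside for human spot-checking.
from sklearn.metrics import accuracy_score, precision_score, recall_score

def score_card(model, X_test, y_test, flight_id):
    preds = model.predict(X_test)
    card = {
        "flight": flight_id,
        "accuracy": accuracy_score(y_test, preds),
        "precision": precision_score(y_test, preds),
        "recall": recall_score(y_test, preds),
    }
    # Keep indices of misclassified examples for a human to review.
    card["spot_check"] = [i for i, (p, t) in enumerate(zip(preds, y_test)) if p != t][:25]
    return card

def gate(card, thresholds={"accuracy": 0.90, "recall": 0.85}):
    # Deployment proceeds only if every tracked metric clears its threshold.
    return all(card[m] >= v for m, v in thresholds.items())

# Usage sketch (model and safeguard set assumed to exist):
#   card = score_card(model, X_safeguard, y_safeguard, "flight-0421")
#   if gate(card) is False, review card["spot_check"] with a human before shipping.
```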
Fast-paced model iterations require more frequent deployment. To ensure that system deployment goes smoothly, several engineers recommend not only automating the training and deployment pipeline, but also integrating model building with the rest of the software, using common versioning repositories for both ML and non-ML codebases, and tightly coupling the ML and non-ML development sprints and standups.

F. Compliance

Microsoft issued a set of principles around uses of AI in the open world. These include fairness, accountability, transparency, and ethics. All teams at Microsoft have been asked to align their engineering practices and the behaviors of fielded software and services in accordance with these principles. Respect for them is a high priority in software engineering and in AI and ML processes and practices. A discussion of these concerns is beyond the scope of this paper. To learn more about Microsoft's commitments to this important topic, please read about its approach to AI.5

5 https://www.microsoft.com/en-us/ai/our-approach-to-ai

G. Varied Perceptions

We found that as a number of product teams at Microsoft integrated machine learning components into their applications, their ability to do so effectively was mediated by the amount of prior experience with machine learning and data science. Some teams fielded data scientists and researchers with decades of experience, while others had to grow quickly, picking up their own experience and more-experienced team members along the way. Due to this heterogeneity, we expected that our survey respondents' perceptions of the challenges their teams faced in practicing machine learning would vary accordingly.

We grouped the respondents into three buckets (low, medium, and high), evenly divided by the number of years of experience respondents personally had with AI. First, we ranked each of the card-sorted categories of respondents' challenges divided by the AI experience buckets. This list is presented in Table II, initially sorted by the respondents with low experience with AI.

Two things are worth noticing. First, across the board, Data Availability, Collection, Cleaning, and Management is ranked as the top challenge by many respondents, no matter their experience level. We find similarly consistent rankings for issues around the categories of end-to-end pipeline support and collaboration and working culture. Second, some of the challenges rise or fall in importance as the respondents' experience with AI differs. For example, education and training is far more important to those with low experience levels in AI than to those with more experience. In addition, respondents with low experience rank challenges with integrating AI into larger systems higher than those with medium or high experience. This means that as individuals (and their teams) gain experience building applications and platforms that integrate ML, their increasing skills help shrink the importance of some of the challenges they perceive. Note, the converse also occurs. Challenges around tooling, scale, and model evolution, evaluation, and deployment are more important for engineers with a lot of experience with AI. This is very likely because these more experienced individuals are tasked with the more essentially difficult engineering tasks on their team; those with low experience are probably tasked with easier problems until they build up their experience.
TABLE II
THE TOP-RANKED CHALLENGES AND PERSONAL EXPERIENCE WITH AI. RESPONDENTS WERE GROUPED INTO THREE BUCKETS (LOW, MEDIUM, HIGH) BASED ON THE 33RD AND 67TH PERCENTILE OF THE NUMBER OF YEARS OF AI EXPERIENCE THEY PERSONALLY HAD (N=308). THE COLUMN Frequency SHOWS THE INCREASE/DECREASE OF THE FREQUENCY IN THE MEDIUM AND HIGH BUCKETS COMPARED TO THE LOW BUCKET. THE COLUMN Rank SHOWS THE RANKING OF THE CHALLENGES WITHIN EACH EXPERIENCE BUCKET, WITH 1 BEING THE MOST FREQUENT CHALLENGE.

Challenge                                                  | Frequency: Medium vs. Low | Frequency: High vs. Low | Rank: Low | Rank: Medium | Rank: High
Data Availability, Collection, Cleaning, and Management    |  -2%                      |  60%                    | 1         | 1            | 1
Education and Training                                     | -69%                      | -78%                    | 1         | 5            | 9
Hardware Resources                                         | -32%                      |  13%                    | 3         | 8            | 6
End-to-end pipeline support                                |  65%                      |  41%                    | 4         | 2            | 4
Collaboration and working culture                          |  19%                      |  69%                    | 5         | 6            | 6
Specification                                              |   2%                      |  50%                    | 5         | 8            | 8
Integrating AI into larger systems                         | -49%                      | -62%                    | 5         | 16           | 13
Education: Guidance and Mentoring                          | -83%                      | -81%                    | 5         | 21           | 18
AI Tools                                                   | 144%                      | 193%                    | 9         | 3            | 2
Scale                                                      | 154%                      | 210%                    | 10        | 4            | 3
Model Evolution, Evaluation, and Deployment                | 137%                      | 276%                    | 15        | 6            | 4
We also compared the overall frequency of each kind of challenge using the same three buckets of AI experience. Looking again at the top-ranked challenge, Data Availability, Collection, Cleaning, and Management, we notice that it was reported by low and medium experienced respondents at similar rates, but represented a lot more of the responses (60%) given by those with high experience. This also happened for challenges related to Specifications. However, when looking at Education and Training, Integrating AI into larger systems, and Education: Guidance and Mentoring, their frequency drops significantly from the rate reported by the low experience bucket to that reported by the medium and high buckets. We interpret this to mean that these challenges were less important to the medium and high experience respondents than to those with low experience levels. Thus, this table gives a big picture of both which problems are perceived as most important within each experience bucket, and which problems are perceived as most important across the buckets.

Finally, we conducted a logistic regression analysis to build a model that could explain the differences in frequency when controlling for personal AI experience, team AI experience, overall work experience, the number of concurrent AI projects, and whether or not a respondent had formal education in machine learning or data science techniques. We found five significant coefficients (a sketch of this kind of analysis follows the list):
• Education and Training was negatively correlated with personal AI experience with a coefficient of -0.18 (p < 0.02), meaning that people with less AI experience found this to be a more important issue.
• Educating Others was positively correlated with personal AI experience with a coefficient of 0.26 (p < 0.01), meaning that people with greater AI experience found this to be a more important issue.
• Tool issues are positively correlated with team AI experience with a coefficient of 0.13 (p < 0.001), meaning that as the team gains experience working on AI projects, the degree to which they rely on others' and their own tools goes up, making them think about their impact more often.
• End-to-end pipeline support was positively correlated with formal education (p < 0.01), implying that only those with formal education were working on building such a pipeline.
• Specifications were also positively correlated with formal education (p < 0.03), implying that those with formal education are the ones who write down the specifications for their models and engineering systems.
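The sketch below shows the form such an analysis might take, assuming one row per survey respondent and a 0/1 outcome per challenge category. The data frame is synthetic because the survey responses themselves are not available; the variable names are illustrative.

```python
# Sketch of a per-challenge logistic regression on experience covariates,
# using statsmodels with fabricated data standing in for the survey.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 308
df = pd.DataFrame({
    "mentioned_education": rng.integers(0, 2, n),  # did the respondent raise this challenge?
    "personal_ai_exp": rng.integers(0, 15, n),     # years
    "team_ai_exp": rng.integers(0, 15, n),
    "work_exp": rng.integers(0, 30, n),
    "n_projects": rng.integers(1, 6, n),
    "formal_education": rng.integers(0, 2, n),
})

model = smf.logit(
    "mentioned_education ~ personal_ai_exp + team_ai_exp + work_exp"
    " + n_projects + formal_education",
    data=df,
).fit(disp=False)

# A negative coefficient means the challenge is reported less often as that
# covariate grows (e.g., Education and Training vs. personal AI experience).
print(model.params)
print(model.pvalues)
```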
The lesson we learn from these analyses is that the kinds of issues that engineers perceive as important change as they grow in their experience with AI. Some concerns are transitory, related to one's position within the team and the accidental complexity of working together. Several others are more fundamental to the practice of integrating machine learning into software applications, affecting many engineers, no matter their experience levels. Since machine learning-based applications are expected to continue to grow in popularity, we call for further research to address these important issues.

VI. TOWARDS A MODEL OF ML PROCESS MATURITY

As we saw in Section V-G, we see some variance in the experience levels of AI in software teams. That variation affects their perception of the engineering challenges to be addressed in their day-to-day practices. As software teams mature and gel, they can become more effective and efficient in delivering machine learning-based products and platforms.

To capture the maturity of ML more precisely than using a simple years-of-experience number, we created a maturity model with six dimensions evaluating whether each workflow stage: (1) has defined goals, (2) is consistently implemented, (3) is documented, (4) is automated, (5) is measured and tracked, and (6) is continuously improved. The factors are loosely based on the concepts behind the Capability Maturity Model (CMM) [26] and Six Sigma [27], which are widely used in software development to assess and improve the maturity of software projects.
In the survey, we asked respondents to report the maturity for the two workflow stages that each participant spent the most time on (measured by number of hours they reported spending on each activity). Specifically, we asked participants to rate their agreement with the following statements S1..S6 (bold text was in the original survey) using a Likert response format from Strongly Disagree (1) to Strongly Agree (5):

S1: My team has goals defined for what to accomplish with this activity.
S2: My team does this activity in a consistent manner.
S3: My team has largely documented the practices related to this activity.
S4: My team does this activity mostly in an automated way.
S5: My team measures and tracks how effective we are at completing this activity.
S6: My team continuously improves our practices related to this activity.

We gathered this data for the stages that respondents were most familiar with because we found that they often specialize in various stages of the workflow. This question was intended to be lightweight so that respondents could answer easily, while at the same time accounting for the wide variety of ML techniques applied. Rather than being prescriptive (i.e., do this to get to the next maturity level), our intention was to be descriptive (e.g., how much automation is there in a particular workflow stage? how well is a workflow stage documented?). More work is needed to define maturity levels similar to CMM.

To analyze the responses, we defined an Activity Maturity Index (AMI) to combine the individual scores into a single measure. This index is the average of the agreement with the six maturity statements S1..S6. As a means of validating the Maturity Index, we asked participants to rate the Activity Effectiveness (AE) by answering "How effective do you think your team's practices around this activity are on a scale from 1 (poor) to 5 (excellent)?". The Spearman correlation between the Maturity Index and the Effectiveness was between 0.4982 and 0.7627 (all statistically significant at p < 0.001) for all AI activities. This suggests that the Maturity Index is a valid composite measure that can capture the maturity and effectiveness of AI activities.
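A worked sketch of the AMI computation and its validation follows. Only the formula comes from the text above (the AMI is the mean of the six Likert responses S1..S6, compared against AE with a Spearman correlation); the responses in the sketch are fabricated for illustration.

```python
# Sketch: compute the Activity Maturity Index (AMI) per respondent and check
# its association with the self-rated Activity Effectiveness (AE).
import numpy as np
from scipy.stats import spearmanr

# One row per respondent for a single workflow stage: [S1, S2, S3, S4, S5, S6].
responses = np.array([
    [5, 4, 4, 3, 4, 5],
    [2, 3, 2, 1, 2, 2],
    [4, 4, 5, 4, 3, 4],
    [3, 2, 3, 2, 2, 3],
])
ae = np.array([5, 2, 4, 3])          # self-rated effectiveness, 1 (poor) to 5 (excellent)

ami = responses.mean(axis=1)         # Activity Maturity Index per respondent
rho, p = spearmanr(ami, ae)
print("AMI:", ami, "Spearman rho=%.3f p=%.3f" % (rho, p))
```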
In addition to the Activity Maturity Index and Activity Effectiveness, we collected an Overall Effectiveness (OE) score by asking respondents the question "How effectively does your team work with AI on a scale from 1 (poor) to 5 (excellent)?" Having the AMI, AE, and OE measures allowed us to compare the maturity and effectiveness of different organizations, disciplines, and application domains within Microsoft, and identify areas for improvement. We plot one of these comparisons in Figure 3 and show the average overall effectiveness scores divided by nine of the most represented AI application domains in our survey. There are two things to notice. First, the spread of the y-values indicates that the OE metric can numerically distinguish between teams, meaning that some respondents feel their teams are at different levels of maturity than others. Second, an ANOVA and Scott Knott test show significant differences in the reported values, demonstrating the potential value of this metric to identify the various ML process maturity levels.

Fig. 3. The average overall effectiveness (OE) of a team's ML practices divided by application domain (anonymized). The y-axis labels have been elided for confidentiality. An ANOVA and Scott Knott test identified two distinct groups in the OE metric, labeled in black (A–F) and red (G–I).

We recognize that these metrics represent a first attempt at quantifying a process metric to enable teams to assess how well they practice ML. In future work, we will refine our instrument and further validate its utility.

VII. DISCUSSION

In this section, we synthesize our findings into three observations of some fundamental differences in the way that software engineering has been adapted to support past popular application domains and how it can be adapted to support artificial intelligence applications and platforms. There may be more differences, but from our data and discussions with ML experts around Microsoft, these three rose to prominence.

A. Data discovery and management

Just as software engineering is primarily about the code that forms shipping software, ML is all about the data that powers learning models. Software engineers prefer to design and build systems which are elegant, abstract, modular, and simple. By contrast, the data used in machine learning are voluminous, context-specific, heterogeneous, and often complex to describe. These differences result in difficult problems when ML models are integrated into software systems at scale.

Engineers have to find, collect, curate, clean, and process data for use in model training and tuning. All the data has to be stored, tracked, and versioned. While software APIs are described by specifications, datasets rarely have explicit schema definitions to describe the columns and characterize their statistical distributions. However, due to the rapid iteration involved in ML, the data schema (and the data) change frequently, even many times per day. When data is ingested
from large-scale diagnostic data feeds, if ML engineers want to change which data values are collected, they must wait for the engineering systems to be updated, deployed, and propagated before new data can arrive. Even "simple" changes can have significant impacts on the volume of data collected, potentially impacting applications through altered performance characteristics or increased network bandwidth usage.

While there are very well-designed technologies to version code, the same is not true for data. A given data set may contain data from several different schema regimes. When a single engineer gathers and processes this data, they can keep track of these unwritten details, but when project sizes scale, maintaining this tribal knowledge can become a burden. To help codify this information into a machine-readable form, Gebru et al. propose to use data sheets inspired by electronics to more transparently and reliably track the metadata characteristics of these datasets [34]. To compare datasets against each other, the Datadiff [35] tool enables developers to formulate viable transformation functions over data samples.

B. Customization and Reuse

While it is well understood how much work it takes to customize and reuse code components, customizing ML models can require much more. In software, the primary units of reuse are functions, algorithms, libraries, and modules. A software engineer can find the source code for a library (e.g., on GitHub), fork it, and easily make changes to the code, using the same skills they use to develop their own software.

Although fully-trained ML models appear to be functions that one can call for a given input, the reality is far more complex. One part of a model is the algorithm that powers the particular machine learning technique being used (e.g., SVM or neural nets). Another is the set of parameters that controls the function (e.g., the SVM support vectors or neural net weights) and is learned during training. If an engineer wants to apply the model on a similar domain as the data it was originally trained on, reusing it is straightforward. However, more significant changes are needed when one needs to run the model on a different domain or use a slightly different input format. One cannot simply change the parameters with a text editor. In fact, the model may require retraining, or worse, may need to be replaced with another model. Both require the software developer to have machine learning skills, which they may never have learned. Beyond that, retraining or rebuilding the model requires additional training data to be discovered, collected, and cleaned, which can take as much work and expertise as the original model's authors put in.
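The point above can be sketched in a few lines: reusing a trained model on a new domain is not a text edit of its parameters but an additional training step that needs new labeled data and ML skills. The two "domains" below are synthetic stand-ins, and the adaptation strategy (continued training from the learned parameters) is just one illustrative option.

```python
# Sketch: a model trained on domain A degrades on domain B; "reuse" means
# continuing training on freshly collected, cleaned, labeled domain-B data.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

X_a, y_a = make_classification(n_samples=2000, n_features=20, random_state=1)
X_b, y_b = make_classification(n_samples=2000, n_features=20, shift=2.0, random_state=2)

model = SGDClassifier(random_state=0)
model.fit(X_a, y_a)                                   # original domain
print("on A:", accuracy_score(y_a, model.predict(X_a)))
print("on B before adaptation:", accuracy_score(y_b, model.predict(X_b)))

# Continue training from the learned parameters on new-domain data, then re-evaluate.
for _ in range(5):
    model.partial_fit(X_b[:500], y_b[:500])
print("on B after adaptation:", accuracy_score(y_b, model.predict(X_b)))
```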
they use. Finally, interviews and surveys rely on self-selected
C. ML Modularity informants and self-reported data. Wherever appropriate, we
Another key attribute of engineering large-scale software stated that findings were our informants’ perceptions and
systems is modularity. Modules are separated and isolated opinions. This is especially true with this implementation
to ensure that developing one component does not interfere of our ML process maturity model, which triangulated its
with the behavior of others under development. In addition, measures against other equally subjective measures with no
software modularity is strengthened by Conway’s Law, which objective baseline. Future implementations of the maturity
makes the observation that the teams that build each com- model should endeavor to gather objective measures of team
ponent of the software organize themselves similarly to its process performance and evolution.
IX. CONCLUSION

Many teams at Microsoft have put significant effort into developing an extensive portfolio of AI applications and platforms by integrating machine learning into existing software engineering processes and by cultivating and growing ML talent. In this paper, we described the results of a study to learn more about the process and practice changes undertaken by a number of Microsoft teams in recent years. From these findings, we synthesized a set of best practices to address issues fundamental to the large-scale development and deployment of ML-based applications. Some reported issues were correlated with the respondents' experience with AI, while others were applicable to most respondents building AI applications. We presented an ML process maturity metric to help teams self-assess how well they work with machine learning and offer guidance towards improvements. Finally, we identified three aspects of the AI domain that make it fundamentally different from prior application domains. Their impact will require significant research efforts to address in the future.

REFERENCES

[1] M. Salvaris, D. Dean, and W. H. Tok, "Microsoft AI Platform," in Deep Learning with Azure. Springer, 2018, pp. 79–98.
[2] A. Begel and N. Nagappan, "Usage and perceptions of agile software development in an industrial context: An exploratory study," in First International Symposium on Empirical Software Engineering and Measurement (ESEM 2007), Sept 2007, pp. 255–264.
[3] A. Begel and N. Nagappan, "Pair programming: What's in it for me?" in Proc. of the Second ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, 2008, pp. 120–128.
[4] B. Murphy, C. Bird, T. Zimmermann, L. Williams, N. Nagappan, and A. Begel, "Have agile techniques been the silver bullet for software development at Microsoft?" in 2013 ACM/IEEE Intl. Symp. on Empirical Software Engineering and Measurement, Oct 2013, pp. 75–84.
[5] M. Senapathi, J. Buchan, and H. Osman, "DevOps capabilities, practices, and challenges: Insights from a case study," in Proc. of the 22nd International Conference on Evaluation and Assessment in Software Engineering 2018, 2018, pp. 57–67.
[6] T. D. LaToza, G. Venolia, and R. DeLine, "Maintaining mental models: A study of developer work habits," in Proc. of the 28th International Conference on Software Engineering, 2006, pp. 492–501.
[7] M. Kim, T. Zimmermann, R. DeLine, and A. Begel, "The emerging role of data scientists on software development teams," in Proc. of the 38th International Conference on Software Engineering, 2016, pp. 96–107.
[8] M. Kim, T. Zimmermann, R. DeLine, and A. Begel, "Data scientists in software teams: State of the art and challenges," IEEE Transactions on Software Engineering, vol. 44, no. 11, pp. 1024–1038, 2018.
[9] C. Hill, R. Bellamy, T. Erickson, and M. Burnett, "Trials and tribulations of developers of intelligent systems: A field study," in Visual Languages and Human-Centric Computing (VL/HCC), 2016 IEEE Symposium on. IEEE, 2016, pp. 162–170.
[10] "Machine learning workflow," https://cloud.google.com/ml-engine/docs/tensorflow/ml-solutions-overview, accessed: 2018-09-24.
[11] K. Patel, J. Fogarty, J. A. Landay, and B. Harrison, "Investigating statistical machine learning as a tool for software development," in Proc. of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2008, pp. 667–676.
[12] "The Team Data Science Process," https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/, accessed: 2018-09-24.
[13] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, "The KDD process for extracting useful knowledge from volumes of data," Communications of the ACM, vol. 39, no. 11, pp. 27–34, 1996.
[14] R. Wirth and J. Hipp, "CRISP-DM: Towards a standard process model for data mining," in Proc. 4th Intl. Conference on Practical Applications of Knowledge Discovery and Data Mining, 2000, pp. 29–39.
[15] J. E. Hannay, C. MacLeod, J. Singer, H. P. Langtangen, D. Pfahl, and G. Wilson, "How do scientists develop and use scientific software?" in Proc. of the 2009 ICSE Workshop on Software Engineering for Computational Science and Engineering. IEEE Computer Society, 2009, pp. 1–8.
[16] G. De Michell and R. K. Gupta, "Hardware/software co-design," Proc. of the IEEE, vol. 85, no. 3, pp. 349–365, 1997.
[17] D. Bohus, S. Andrist, and M. Jalobeanu, "Rapid development of multimodal interactive systems: A demonstration of platform for situated intelligence," in Proc. of the 19th ACM International Conference on Multimodal Interaction. ACM, 2017, pp. 493–494.
[18] D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, M. Young, J.-F. Crespo, and D. Dennison, "Hidden technical debt in machine learning systems," in NIPS, 2015.
[19] B. Nushi, E. Kamar, E. Horvitz, and D. Kossmann, "On human intellect and machine failures: Troubleshooting integrative machine learning systems," in AAAI, 2017, pp. 1017–1025.
[20] S. Andrist, D. Bohus, E. Kamar, and E. Horvitz, "What went wrong and why? Diagnosing situated interaction failures in the wild," in ICSR. Springer, 2017, pp. 293–303.
[21] R. Salay, R. Queiroz, and K. Czarnecki, "An analysis of ISO 26262: Using machine learning safely in automotive software," arXiv preprint arXiv:1709.02435, 2017.
[22] D. Baylor, E. Breck, H.-T. Cheng, N. Fiedel, C. Y. Foo, Z. Haque, S. Haykal, M. Ispir, V. Jain, L. Koc et al., "TFX: A TensorFlow-based production-scale machine learning platform," in Proc. of the 23rd ACM SIGKDD. ACM, 2017, pp. 1387–1395.
[23] V. Sridhar, S. Subramanian, D. Arteaga, S. Sundararaman, D. Roselli, and N. Talagala, "Model governance: Reducing the anarchy of production ML," in USENIX. USENIX Association, 2018, pp. 351–358.
[24] T. Wuest, D. Weimer, C. Irgens, and K.-D. Thoben, "Machine learning in manufacturing: Advantages, challenges, and applications," Production & Manufacturing Research, vol. 4, no. 1, pp. 23–45, 2016.
[25] J. Sillito and A. Begel, "App-directed learning: An exploratory study," in 6th International Workshop on Cooperative and Human Aspects of Software Engineering (CHASE), May 2013, pp. 81–84.
[26] C. Weber, B. Curtis, and M. B. Chrissis, The Capability Maturity Model: Guidelines for Improving the Software Process. Harlow: Addison-Wesley, 1994.
[27] M. Alexander, Six Sigma: The Breakthrough Management Strategy Revolutionizing the World's Top Corporations. Taylor & Francis, 2001.
[28] N. Polyzotis, S. Roy, S. E. Whang, and M. Zinkevich, "Data management challenges in production machine learning," in Proc. of the 2017 ACM SIGMOD, 2017, pp. 1723–1726.
[29] T. Kulesza, M. Burnett, W.-K. Wong, and S. Stumpf, "Principles of explanatory debugging to personalize interactive machine learning," in Proc. of the 20th International Conference on Intelligent User Interfaces, 2015, pp. 126–137.
[30] S. Amershi, M. Chickering, S. M. Drucker, B. Lee, P. Simard, and J. Suh, "ModelTracker: Redesigning performance analysis tools for machine learning," in Proc. of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 2015, pp. 337–346.
[31] B. Nushi, E. Kamar, and E. Horvitz, "Towards accountable AI: Hybrid human-machine analyses for characterizing system failure," in HCOMP, 2018, pp. 126–135.
[32] D. Gunning, "Explainable artificial intelligence (XAI)," Defense Advanced Research Projects Agency (DARPA), 2017.
[33] D. S. Weld and G. Bansal, "Intelligible artificial intelligence," arXiv preprint arXiv:1803.04263, 2018.
[34] T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. M. Wallach, H. D. III, and K. Crawford, "Datasheets for datasets," CoRR, vol. abs/1803.09010, 2018.
[35] C. Sutton, T. Hobson, J. Geddes, and R. Caruana, "Data diff: Interpretable, executable summaries of changes in distributions for data wrangling," in Proc. of the 24th ACM SIGKDD. ACM, 2018, pp. 2279–2288.
[36] C. R. B. de Souza, D. Redmiles, and P. Dourish, ""Breaking the Code", moving between private and public work in collaborative software development," in Proc. of the 2003 International ACM SIGGROUP Conference on Supporting Group Work, 2003, pp. 105–114.
[37] C. R. B. de Souza, D. Redmiles, L.-T. Cheng, D. Millen, and J. Patterson, "Sometimes you need to see through walls: A field study of application programming interfaces," in Proc. of the 2004 ACM Conference on Computer Supported Cooperative Work, 2004, pp. 63–71.
2018 IEEE International Conference on Software Testing, Verification and Validation Workshops
quality. The problems are that training data determines the logic of ML applications, and the results of ML application problems.

We classified survey targets into Academic Conferences, Magazines, and Communities. We targeted 16 academic conferences on artificial intelligence and software engineering, including 78 papers. We targeted 5 magazines, including 22 papers. The targeted conferences include, for example:
• ASE: IEEE/ACM International Conference on Automated Software Engineering

II. PROCEDURE OF SURVEY

We present the procedure of our survey in this section.

Fig. 1. ML Application and software testing research questions. (The figure shows the actors of an ML application, including users, an MLaaS back end, and enterprise, mobile, cloud, transaction, connectivity and process components; the messaging between them; and the attached software testing research questions, such as how to verify insufficient or biased data and how to verify the ML answer to unknown data.)
TABLE I
NUMBER OF ARTICLES

TABLE II
NUMBER OF TAGS

TABLE III
CORRESPONDENCE OF PROBLEMS AND TAGS

The seven tags are: Deep learning, Fault localization, Prediction, MLaaS, Multi-agent, Search-Based, and Model checking. The correspondence between problems and the surveyed papers is:
• How to verify the answer that MLaaS returns to unknown data: [80], [62], [40]
• How to verify insufficient or biased data, because data determine the logic of ML applications: [73], [53]
• How to verify the quality of end-to-end systems: [59], [79], [4], [42], [47]
• How to notify users of the confidence of answer correctness from the system: [63], [66]
... et al. presented Search-Based Software Testing, which is the use of a meta-heuristic optimizing search technique, such as a Genetic Algorithm, to automate or partially automate a testing task, for example, the automatic generation of test data [43]. Gay [42] proposed and assessed search-based generation of test suites that detect real faults.

7) Model Checking: Model checking is a traditional software engineering technique. Describing application models with ML allows model checking to verify those applications. Alechina et al. proposed a method for verifying the existence of resource-bounded coalition uniform strategies, so that properties of such systems can be verified automatically using model checking [47]. Tappler et al. presented a learning-based approach to detecting failures through model-based testing [40].

IV. DISCUSSION

Table III shows the articles that correspond to the problems and the top seven tags. For instance, Pei et al. [80] discussed corner cases in deep learning systems and a white-box testing method that uses such corner cases. A corner case is unknown data for an ML application; hence, their paper corresponds to that problem. We assigned corresponding papers to each problem in the same manner. As summarized in Section 3, we found several techniques or methods of software quality for ML applications. Some of these techniques correspond to the problems that we raised. However, the problems have not been completely solved; each paper covers only some of them. For instance, Pei et al. targeted deep learning systems but did not target other ML models such as k-means, decision trees, and support vector machines. ML applications consist of various ML models; therefore, software quality and testing techniques are required to verify ML applications.

V. CONCLUSION

ML applications such as face recognition, question answering, and sales analysis are now widespread. New software engineering approaches are required to solve the problems they raise. We presented a survey of software quality for ML applications to consider the quality of ML applications. From our survey we determined problems with ML applications and identified software engineering approaches and software testing research areas to solve these problems.

REFERENCES

[1] B. Wilson, "The Machine Learning Dictionary," 2012. [Online]. Available: http://www.cse.unsw.edu.au/ billw/mldict.html
[2] G. S. Novak Jr., "Artificial Intelligence Vocabulary," 2005. [Online]. Available: https://www.cs.utexas.edu/users/novak/aivocab.html
[3] B. Zarrieß and J. Claßen, "Decidable Verification of Golog Programs over Non-Local Effect Actions," pp. 1109-1115.
[4] M. Wooldridge, J. Gutierrez, P. Harrenstein, E. Marchioni, G. Perelli, and A. Toumi, "Rational Verification: From Model Checking to Equilibrium Checking," Proceedings of the 30th Conference on Artificial Intelligence (AAAI 2016), pp. 4184-4190, 2016.
[5] B. Bittner, M. Bozzano, A. Cimatti, and G. Zampedri, "Automated Verification and Tightening of Failure Propagation Models," Proceedings of the 30th Conference on Artificial Intelligence (AAAI 2016), pp. 907-913, 2016.
[6] M. Witbrock, "AI for Complex Situations: Beyond Uniform Problem Solving," 2017.
[7] V. Vanhoucke and G. B. Robotics, "'OK Google, fold my laundry s'il te plaît'."
[8] T. Sunahase, Y. Baba, and H. Kashima, "Pairwise HITS: Quality Estimation from Pairwise Comparisons in Creator-Evaluator Crowdsourcing Process," Proceedings of the 31st Conference on Artificial Intelligence (AAAI 2017), pp. 977-983, 2017.
[9] J. Goldsmith and E. Burton, "Why teaching ethics to AI practitioners is important," 31st AAAI Conference on Artificial Intelligence, AAAI 2017, pp. 110-114, 2017.
[10] A. N. Lam, A. T. Nguyen, H. A. Nguyen, and T. N. Nguyen, "Combining deep learning with information retrieval to localize buggy files for bug reports," Proceedings - 2015 30th IEEE/ACM International Conference on Automated Software Engineering, ASE 2015, pp. 476-481, 2016.
[11] M. White, M. Tufano, C. Vendome, and D. Poshyvanyk, "Deep learning code fragments for code clone detection," Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering - ASE 2016, pp. 87-98, 2016.
[12] X. Li, Y. Liang, H. Qian, Y.-Q. Hu, L. Bu, Y. Yu, X. Chen, and X. Li, "Symbolic Execution of Complex Program Driven by Machine Learning Based Constraint Solving," ASE, pp. 554-559, 2016.
[13] N. Li, Y. Lei, H. R. Khan, J. Liu, and Y. Guo, "Applying combinatorial test data generation to big data applications," Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering - ASE 2016, pp. 637-647, 2016.
[14] W. Zhang, X. Cao, R. Wang, Y. Guo, and Z. Chen, "Binarized Mode Seeking for Scalable Visual Pattern Discovery," CVPR 2017, pp. 3864-3872, 2017.
[15] J. Wu, J. B. Tenenbaum, and P. Kohli, "Neural Scene De-rendering," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 699-707, 2017.
[16] W. Treible, P. Saponaro, S. Sorensen, A. Kolagunda, M. O. Neal, B. Phelan, K. Sherbondy, and C. Kambhamettu, "CATS: A Color and Thermal Stereo Benchmark," CVPR, pp. 134-142, 2017.
[17] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb, "Learning from Simulated and Unsupervised Images through Adversarial Training," pp. 2242-2251, 2016.
[18] K. Sasaki, S. Iizuka, E. Simo-Serra, and H. Ishikawa, "Joint Gap Detection and Inpainting of Line Drawings," CVPR, pp. 5768-5776, 2017.
[19] P. Sangkloy, J. Lu, C. Fang, F. Yu, and J. Hays, "Scribbler: Controlling Deep Image Synthesis with Sketch and Color," pp. 6836-6845, 2016.
[20] K. Nakamura, S. Yeung, A. Alahi, and L. Fei-Fei, “Jointly on Software Testing, Verification and Validation, ICST 2016, pp. 69–79,
Learning Energy Expenditures and Activities using Egocentric 2016.
Multimodal Signals,” Cvpr, pp. 6817–6826, 2017. [Online]. Available: [40] M. Tappler, B. K. Aichernig, and R. Bloem, “Model-Based Testing IoT
http://vision.stanford.edu/pdf/nakamura2017cvpr.pdf Communication via Active Automata Learning,” Proceedings - 10th
[21] H. Le, T.-j. Chin, and D. Suter, “An Exact Penalty Method for Locally IEEE International Conference on Software Testing, Verification and
Convergent Maximum Consensus,” Cvpr2017, pp. 1888–1896, 1888. Validation, ICST 2017, pp. 276–287, 2017.
[22] I. Kokkinos, “UberNet: Training a ‘Universal’ Convolutional Neural [41] D. Pradhan, S. Wang, S. Ali, T. Yue, and M. Liaaen, “CBGA-
Network for Low-, Mid-, and High-Level Vision using Diverse Datasets ES: A Cluster-Based Genetic Algorithm with Elitist Selection for
and Limited Memory,” 2016. Supporting Multi-Objective Test Optimization,” Proceedings - 10th
[23] W. Kehl, F. Tombari, S. Ilic, and N. Navab, “Real-Time 3D Model IEEE International Conference on Software Testing, Verification and
Tracking in Color and Depth on a Single CPU Core,” 2017 IEEE Validation, ICST 2017, pp. 367–378, 2017.
Conference on Computer Vision and Pattern Recognition (CVPR), pp. [42] G. Gay, “The Fitness Function for the Job: Search-Based Generation
465–473, 2017. of Test Suites That Detect Real Faults,” Proceedings - 10th IEEE Inter-
[24] D. He, X. Yang, C. Liang, Z. Zhou, D. Kifer, C. L. Giles, and national Conference on Software Testing, Verification and Validation,
A. Ororbia, “Multi-scale FCN with Instance Aware Segmentation for ICST 2017, pp. 345–355, 2017.
Arbitrary Oriented Word Spotting In The Wild,” Cvpr 2017, no. 1, [43] P. McMinn, “Search-Based Software Testing: Past, Present and
2017. Future,” 2011 IEEE Fourth International Conference on Software
[25] C. Fan, J. Lee, M. Xu, K. K. Singh, Y. J. Lee, D. J. Crandall, and Testing, Verification and Validation Workshops, pp. 153–163, 2011.
M. S. Ryoo, “Identifying First-person Camera Wearers in Third-person [Online]. Available: http://ieeexplore.ieee.org/document/5954405/
Videos,” no. 1, pp. 4734–4742, 2017. [44] W. E. Wong, R. Gao, Y. Li, R. Abreu, and F. Wotawa, “A Survey
[26] Y. Chen and Y.-j. Liu, “Learning to Rank Retargeted Images,” Learn- on Software Fault Localization,” IEEE Transactions on Software En-
ing, pp. 3994–4002, 2017. gineering, vol. 42, no. 8, pp. 707–740, 2016.
[27] L. Castrejn, K. Kundu, R. Urtasun, and S. Fidler, “Annotating object [45] H. Narasimhan, S. Agarwal, and D. C. Parkes, “Automated mechanism
instances with a polygon-rnn,” in 2017 IEEE Conference on Computer design without money via machine learning,” IJCAI International Joint
Vision and Pattern Recognition (CVPR), July 2017, pp. 4485–4493. Conference on Artificial Intelligence, vol. 2016-Janua, pp. 433–439,
[28] D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba, “Network 2016.
dissection: Quantifying interpretability of deep visual representations,” [46] Y. F. Li, S. B. Wang, and Z. H. Zhou, “Graph quality judgement:
in 2017 IEEE Conference on Computer Vision and Pattern Recognition A large margin expedition,” IJCAI International Joint Conference on
(CVPR), July 2017, pp. 3319–3327. Artificial Intelligence, vol. 2016-Janua, pp. 1725–1731, 2016.
[29] X. Alameda-Pineda, A. Pilzer, D. Xu, N. Sebe, and E. Ricci, “Vi- [47] N. Alechina, M. Dastani, and B. Logan, “Verifying existence of
raliency: Pooling local virality,” in 2017 IEEE Conference on Computer resource-bounded coalition uniform strategies,” IJCAI International
Vision and Pattern Recognition (CVPR), July 2017, pp. 484–492. Joint Conference on Artificial Intelligence, vol. 2016-Janua, pp. 24–
[30] V. K. Adhikarla, M. Vinkler, D. Sumin, R. K. Mantiuk, K. Myszkowski, 30, 2016.
H. P. Seidel, and P. Didyk, “Towards a quality metric for dense light [48] S. Adriaensen and A. Nowé, “Towards a white box approach to
fields,” in 2017 IEEE Conference on Computer Vision and Pattern automated algorithm design,” IJCAI International Joint Conference on
Recognition (CVPR), July 2017, pp. 3720–3729. Artificial Intelligence, vol. 2016-Janua, pp. 554–560, 2016.
[31] S. Bosse, “Distributed machine learning with self-organizing mobile [49] V. Noroozi, L. Zheng, S. Bahaadini, S. Xie, and P. S. Yu, “SEVEN:
agents for earthquake monitoring,” Proceedings - IEEE 1st Interna- Deep SEmi-supervised verification networks,” IJCAI International
tional Workshops on Foundations and Applications of Self-Systems, Joint Conference on Artificial Intelligence, pp. 2571–2577, 2017.
FAS-W 2016, pp. 126–132, 2016. [50] P. Kouvaros and A. Lomuscio, “Verifying fault-tolerance in parame-
[32] M. Ribeiro, K. Grolinger, and M. A. Capretz, “MLaaS: terised multi-agent systems,” IJCAI International Joint Conference on
Machine Learning as a Service,” 2015 IEEE 14th Artificial Intelligence, pp. 288–294, 2017.
International Conference on Machine Learning and Applications [51] J. Kong and A. Lomuscio, “Model checking multi-agent systems
(ICMLA), no. c, pp. 896–902, 2015. [Online]. Available: against LDLK specifications,” IJCAI International Joint Conference
http://ieeexplore.ieee.org/document/7424435/ on Artificial Intelligence, pp. 1138–1144, 2017.
[33] B. Chen, D. Kumor, and E. Bareinboim, “Identification and Model [52] N. Gorogiannis and F. Raimondi, “A Novel Symbolic Approach to
Testing in Linear Structural Equation Models using Auxiliary Verifying Epistemic Properties of Programs ,” pp. 206–212, 2009.
Variables,” Proceedings of the 34th International Conference on [53] F. Belardinelli, L. Ibisc, and I. Toulouse, “Parameterised Verification
Machine Learning, vol. 70, pp. 757–766, 2017. [Online]. Available: of Data-aware Multi-agent Systems,” pp. 98–104, 2016.
http://proceedings.mlr.press/v70/chen17f.html [54] F. Belardinelli, L. Ibisc, A. Murano, and S. Rubin, “Verification of
[34] M. Harman, P. McMinn, J. De Souza, and S. Yoo, “Search Broadcasting Multi-Agent Systems against an Epistemic Strategy Logic
based software engineering: Techniques, taxonomy, tutorial,” Imperial College London,” pp. 91–97, 2014.
Search, vol. 2012, pp. 1–59, 2011. [Online]. Available: [55] O. Tripp, O. Weisman, and L. Guy, “Finding Your Way in the
http://discovery.ucl.ac.uk/1340709/ Testing Jungle: A Learning Approach to Web Security Testing,”
[35] N. Erman, V. Tufvesson, M. Borg, P. Runeson, and A. Ardö, “Navigat- Proceedings of the 2013 International Symposium on Software
ing information overload caused by automated testing - A clustering Testing and Analysis, pp. 347–357, 2013. [Online]. Available:
approach in multi-branch development,” 2015 IEEE 8th International http://doi.acm.org/10.1145/2483760.2483776
Conference on Software Testing, Verification and Validation, ICST 2015 [56] F. M. Kifetew, A. Panichella, A. De Lucia, R. Oliveto, and P. Tonella,
- Proceedings, 2015. “Orthogonal exploration of the search space in evolutionary test case
[36] R. Carbone, L. Compagna, A. Panichella, and S. E. Ponta, “Security generation,” Proceedings of the 2013 International Symposium on
threat identification and testing,” 2015 IEEE 8th International Con- Software Testing and Analysis - ISSTA 2013, p. 257, 2013. [Online].
ference on Software Testing, Verification and Validation, ICST 2015 - Available: http://dl.acm.org/citation.cfm?doid=2483760.2483789
Proceedings, 2015. [57] F. Howar, D. Giannakopoulou, and Z. Rakamarić, “Hybrid learning:
[37] S. F. Sun and A. Podgurski, “Properties of Effective Metrics for interface generation through static, dynamic, and symbolic analysis,”
Coverage-Based Statistical Fault Localization,” Proceedings - 2016 Proceedings of the 2013 International Symposium on Software Testing
IEEE International Conference on Software Testing, Verification and and Analysis - ISSTA 2013, p. 268, 2013. [Online]. Available:
Validation, ICST 2016, pp. 124–134, 2016. http://dl.acm.org/citation.cfm?doid=2483760.2483783
[38] K. Moran, M. Linares-Vasquez, C. Bernal-Cardenas, C. Vendome, and [58] I. Medeiros, N. Neves, and M. Correia, “DEKANT: A Static
D. Poshyvanyk, “Automatically Discovering, Reporting and Reproduc- Analysis Tool that Learns to Detect Web Application Vulnerabilities,”
ing Android Application Crashes,” Proceedings - 2016 IEEE Inter- Proceedings of the 14th ACM conference on Computer and
national Conference on Software Testing, Verification and Validation, communications security CCS 07, p. 529, 2007. [Online]. Available:
ICST 2016, pp. 33–44, 2016. http://portal.acm.org/citation.cfm?doid=1315245.1315311
[39] B. Marculescu, R. Feldt, and R. Torkar, “Using Exploration Focused [59] T.-D. B. Le, D. Lo, C. Le Goues, and L. Grunske, “A learning-to-rank
Techniques to Augment Search-Based Software Testing: An Experi- based fault localization approach using likely invariants,” Proceedings
mental Evaluation,” Proceedings - 2016 IEEE International Conference of the 25th International Symposium on Software Testing and
Analysis - ISSTA 2016, pp. 177–188, 2016. [Online]. Available: [80] K. Pei, Y. Cao, J. Yang, and S. Jana, “Deepxplore: Automated
http://dl.acm.org/citation.cfm?doid=2931037.2931049 whitebox testing of deep learning systems,” in Proceedings of the
[60] H. Spieker, A. Gotlieb, D. Marijan, and M. Mossige, “Reinforcement 26th Symposium on Operating Systems Principles, ser. SOSP ’17.
Learning for Automatic Test Case Prioritization and Selection in New York, NY, USA: ACM, 2017, pp. 1–18. [Online]. Available:
Continuous Integration,” Proceedings of 26th International Symposium http://doi.acm.org/10.1145/3132747.3132785
on Software Testing and Analysis (ISSTA’17), pp. 12—-22, 2017. [81] Amazon, “Amazon Alexa.” [Online]. Available:
[61] M. Santolucito, “Version Space Learning for Verification on Temporal https://developer.amazon.com/alexa
Differentials,” Proceedings of the 26th ACM SIGSOFT International [82] IBM, “With Watson Program,” 2018. [Online]. Available:
Symposium on Software Testing and Analysis, pp. 428–431, 2017. https://www.ibm.com/watson/with-watson/
[Online]. Available: http://doi.acm.org/10.1145/3092703.3098238 [83] R. G. Smith, “On the development of commercial expert systems,”
[62] D. S. Katz, “Understanding Intended Behavior using Models of Low- AI Magazine, vol. 5, no. 3, pp. 61–73, 1984. [Online]. Available:
Level Signals,” pp. 424–427. https://www.aaai.org/ojs/index.php/aimagazine/article/view/449
[63] J. Hotzkow, “Automatically Inferring and Enforcing User [84] S. Sievers, M. Ortlieb, and M. Helmert, “Efficient Implementation of
Expectations,” Proceedings of the 26th ACM SIGSOFT Pattern Database Heuristics for Classical Planning,” Symposium on
International Symposium on Software Testing and Analysis - Combinatorial Search, pp. 105–111, 2012.
ISSTA 2017, no. July, pp. 420–423, 2017. [Online]. Available: [85] A. Ramaswamy, B. Monsuez, and A. Tapus, “AI Dimensions in
http://dl.acm.org/citation.cfm?doid=3092703.3098236 Software Development for Human-Robot Interaction Systems,” pp.
[64] K. R. Varshney, “Engineering safety in machine learning,” 2016 Infor- 128–130, 2014.
mation Theory and Applications Workshop, ITA 2016, 2017. [86] D. S. Prerau, “Knowledge Acquisition in the Development of a Large
Expert System,” AI Magazine, vol. 8, no. 2, pp. 43–51, 1987.
[65] G. I. Webb and F. Petitjean, “A Multiple Test Correction for Streams
[87] G. Peter, “Knowledge-Based Software Engineering Conference,” 1992.
and Cascades of Statistical Hypothesis Tests,” Proceedings of the 22nd
[88] M. R. Lowry, “Software Engineering in the Twenty-First Century,” AI
ACM SIGKDD International Conference on Knowledge Discovery and
Magazine, vol. 13, no. 3, pp. 71–87, 1992.
Data Mining - KDD ’16, pp. 1255–1264, 2016. [Online]. Available:
[89] S. Giroux and N. Bier, “Cognitive Assistance to Meal Preparation:
http://dl.acm.org/citation.cfm?doid=2939672.2939775
Design, Implementation, and Assessment in a Living Lab,”
[66] M. T. Ribeiro, S. Singh, and C. Guestrin, “”why should i 2015 AAAI Spring . . . , pp. 01–25, 2015. [Online]. Available:
trust you?”: Explaining the predictions of any classifier,” in https://www.aaai.org/ocs/index.php/SSS/SSS15/paper/view/10329/10013
Proceedings of the 22Nd ACM SIGKDD International Conference [90] M. A. Cohen, F. E. Ritter, and S. R. Haynes, “Applying Software
on Knowledge Discovery and Data Mining, ser. KDD ’16. New Engineering to Agent Development,” pp. 25–44, 2010.
York, NY, USA: ACM, 2016, pp. 1135–1144. [Online]. Available: [91] S. Burton, K. Swanson, and L. Leonard, “Quality and Knowledge in
http://doi.acm.org/10.1145/2939672.2939778 Software Engineering,” AI Magazine, vol. 14, no. 4, pp. 43–50, 1993.
[67] R. Parekh, “Designing AI at Scale to Power Everyday Life,” Proceed- [92] R. Hollander, “Alibaba, Baidu, and Tencent, China’s powerhouses,
ings of the 23rd ACM SIGKDD International Conference on Knowledge focus on AI to surpass the US - Business Insider,” 2017. [Online].
Discovery and Data Mining - KDD ’17, pp. 27–27, 2017. [Online]. Available: http://www.businessinsider.com/alibaba-baidu-and-tencent-
Available: http://dl.acm.org/citation.cfm?doid=3097983.3105815 chinas-powerhouses-focus-on-ai-to-surpass-the-us-2017-11
[68] D. Baylor and E. Breck, “TFX: A TensorFlow-Based Production-Scale [93] A. Nordrum, “Automatic Speaker Verification Systems Can Be Fooled
Machine Learning Platform,” Kdd, pp. 1387–1395, 2017. by Disguising Your Voice - IEEE Spectrum,” 2017. [Online]. Avail-
[69] J. Van Der Herten, I. Couckuyt, D. Deschrijver, P. Demeester, and able: https://spectrum.ieee.org/tech-talk/telecom/security/automatic-
T. Dhaene, “Adaptive modeling and sampling methodologies for In- speaker-verification-systems-can-be-fooled-by-disguising-your-voice
ternet of Things applications,” Proceedings of the 18th Mediterranean [94] S. K. Moore, “DARPA Seeking AI That Learns
Electrotechnical Conference: Intelligent and Efficient Technologies and All the Time - IEEE Spectrum,” 2017. [Online].
Services for the Citizen, MELECON 2016, no. April, pp. 18–20, 2016. Available: https://spectrum.ieee.org/cars-that-think/robotics/artificial-
[70] D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, intelligence/darpa-seeking-ai-that-can-learn-all-the-time
V. Chaudhary, M. Young, and D. Dennison, “Hidden Technical [95] J. Hsu, “Deep Learning AI for NASA Powers Earth
Debt in Machine Learning Systems,” Nips, pp. 2494–2502, 2015. Robots - IEEE Spectrum,” 2017. [Online]. Available:
[Online]. Available: http://papers.nips.cc/paper/5656-hidden-technical- https://spectrum.ieee.org/automaton/robotics/artificial-intelligence/ai-
debt-in-machine-learning-systems.pdf startup-neurala-deep-learning-for-nasa-powers-earth-robots
[71] F. Yang, A. Ramdas, K. Jamieson, and M. J. Wainwright, “A framework [96] ——, “A New Way to Find Bugs in Self-Driving
for Multi-A(rmed)/B(andit) testing with online FDR control,” no. 3, AI Could Save Lives - IEEE Spectrum,” 2017. [On-
2017. [Online]. Available: http://arxiv.org/abs/1706.05378 line]. Available: https://spectrum.ieee.org/tech-talk/robotics/artificial-
[72] R. Sen, A. T. Suresh, K. Shanmugam, A. G. Dimakis, and S. Shakkot- intelligence/better-bug-hunts-in-selfdriving-car-ai-could-save-lives
tai, “Model-Powered Conditional Independence Test,” no. Nips, pp. [97] E. Ackerman, “After Mastering Singapore’s Streets, NuTonomy’s
1–11, 2017. Robo-taxis Are Poised to Take on New Cities - IEEE Spectrum,”
[73] P. Schulam and S. Saria, “Reliable Decision Support using 2016. [Online]. Available: https://spectrum.ieee.org/transportation/self-
Counterfactual Models,” no. Nips, 2017. [Online]. Available: driving/after-mastering-singapores-streets-nutonomys-robotaxis-are-
http://arxiv.org/abs/1703.10651 poised-to-take-on-new-cities
[74] J. Z. Liu and B. Coull, “Robust Hypothesis Test for Nonlinear Effect [98] N. Papernot, M. Abadi, Ú. Erlingsson, I. Goodfellow, and K. Talwar,
with Gaussian Processes,” Nips2017, vol. 2, no. Nips, pp. 1–10, 2017. “Semi-supervised Knowledge Transfer for Deep Learning from Private
[Online]. Available: http://arxiv.org/abs/1710.01406 Training Data,” no. 2015, pp. 1–16, 2016. [Online]. Available:
[75] H. C. L. Law, C. Yau, and D. Sejdinovic, “Testing and Learning http://arxiv.org/abs/1610.05755
on Distributions with Symmetric Noise Invariance,” no. Mmd, 2017. [99] T. Miyato, A. M. Dai, and I. Goodfellow, “Adversarial Training
[Online]. Available: http://arxiv.org/abs/1703.07596 Methods for Semi-Supervised Text Classification,” pp. 1–11, 2016.
[76] W. Jitkrittum, W. Xu, Z. Szabo, K. Fukumizu, and A. Gretton, “A [Online]. Available: http://arxiv.org/abs/1605.07725
Linear-Time Kernel Goodness-of-Fit Test,” no. Nips, pp. 0–3, 2017. [100] S. Huang, N. Papernot, I. Goodfellow, Y. Duan, and P. Abbeel,
[Online]. Available: http://arxiv.org/abs/1705.07673 “Adversarial Attacks on Neural Network Policies,” 2017. [Online].
[77] Z. Daniel Guo, P. S. Thomas, and E. Brunskill, “Using Available: http://arxiv.org/abs/1702.02284
Options and Covariance Testing for Long Horizon Off- [101] P. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and
Policy Policy Evaluation,” Advances in Neural Information D. Amodei, “Deep reinforcement learning from human preferences,”
Processing Systems 30 (NIPS 2017), no. Nips, 2017. [On- 2017. [Online]. Available: http://arxiv.org/abs/1706.03741
line]. Available: http://papers.nips.cc/paper/6843-using-options-and- [102] V. Cheung, J. Schneider, I. Sutskever, and G. Brockman,
covariance-testing-for-long-horizon-off-policy-policy-evaluation.pdf “Infrastructure for Deep Learning,” pp. 1–11, 2016. [Online].
Available: https://openai.com/blog/infrastructure-for-deep-learning/
[78] F. Cecchi and N. Hegde, “Adaptive Active Hypothesis Testing under
[103] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and
Limited Information,” no. d, pp. 1–9, 2017.
D. Mané, “Concrete Problems in AI Safety,” pp. 1–29, 2016.
[79] H. Assem, L. Xu, T. S. Buda, and D. O’Sullivan, “Machine learning
as a service for enabling Internet of Things and People,” Personal and
Ubiquitous Computing, vol. 20, no. 6, pp. 899–914, 2016.
Abstract—Several papers have recently reported on applying machine learning (ML) to the automation of software engineering (SE) tasks, such as project management, modeling and development.
However, there appear to be no approaches comparing how software engineers fare against machine-learning
algorithms as applied to specific software development tasks. Such a comparison is essential to gain insight
into which tasks are better performed by humans and which by machine learning and how cooperative work or
human-in-the-loop processes can be implemented more effectively. In this paper, we present an empirical study
that compares how software engineers and machine-learning algorithms achieve performance and reuse tasks. The
empirical study involves the synthesis of the control structure of an autonomous streetlight application. Our
approach consists of four steps. First, we solved the problem using machine learning to determine specific
performance and reuse tasks. Second, we asked software engineers with different domain knowledge levels to
provide a solution to the same tasks. Third, we compared how software engineers fare against
machine-learning algorithms when accomplishing the performance and reuse tasks based on criteria such as
energy consumption and safety. Finally, we analyzed the results to understand which tasks are better
performed by either humans or algorithms so that they can work together more effectively. Such an
understanding and the resulting human-in-the-loop approaches, which take into account the strengths and
weaknesses of humans and machine-learning algorithms, are fundamental not only to provide a basis for
cooperative work in support of software engineering, but also in other areas.
1 INTRODUCTION

tasks could be formulated as learning problems and approached in terms of learning algorithms."

However, there is a lack of approaches to compare how software engineers fare against machine-learning algorithms for specific software development tasks. This comparison is critical in order to evaluate which SE tasks are better performed by automation and which require human involvement or human-in-the-loop approaches [21], [22]. In practice, because there are no explicit comparisons between the tasks performed by engineers and automated procedures, including machine learning, it is often not clear when to use automation in a specific setting. For example, a Brazilian company acquired a software system to select petroleum exploration models automatically, but the engineers decided they could provide a better solution manually. However, when the manual solution was compared with the one provided automatically by the system, it became clear that the automated solution was better. This illustrates that a lack of comparisons makes it difficult to choose between a manual solution, an automated solution, or a combined human-in-the-loop approach.

This paper contains an empirical study [23] to compare how software engineers and machine-learning algorithms achieve performance and reuse tasks. The empirical study uses a case study involving the creation of a control structure for an autonomous streetlight application. The approach consists of four steps. First, the problem was solved using machine learning to achieve specific performance and reuse tasks. Second, we asked software engineers with different domain-knowledge levels to provide a solution to achieve the same tasks. Third, we compared how software engineers fare against machine-learning algorithms when accomplishing the performance and reuse tasks, based on criteria such as energy consumption and safety. Finally, the results were analyzed to understand which tasks are better performed by either humans or algorithms so that they can work together more effectively.

Such an understanding is essential in realizing novel human-in-the-loop approaches in which machine-learning procedures assist software developers in achieving tasks. Such human-in-the-loop approaches, which take into account the strengths and weaknesses of humans and machine-learning algorithms, are fundamental not only to provide a basis for cooperative work in software engineering, but also in other application areas.

This paper is organized as follows: Section 2 presents the empirical study, describing the research questions, the hypotheses and the objective of the study. Section 3 presents the method selected to collect our empirical data. Sections 4 and 5 present the experimental results. Section 6 presents the threats to the validity of our experiment. Section 7 presents the related work. The paper ends with concluding remarks and suggestions for future work.

1.1 Motivation

The theme of this paper, namely whether artificial intelligence such as machine learning can benefit software engineering, has been investigated since 1986, when Herbert A. Simon published a paper entitled "Whether software engineering needs to be artificially intelligent" [24]. In this paper, Simon discussed "the roles that humans now play versus the roles that could be taken over by artificial intelligence in developing computer systems." Notwithstanding, in 1993, Ian Sommerville raised the following question [25]: "What of the future - can Artificial Intelligence make a contribution to system engineering?" In this paper [25], Sommerville performed a literature review of applications of artificial intelligence to software engineering, and concluded that:

"the contribution of AI will be in supporting...activities that are characterized by solutions to problems which are neither right nor wrong but which are more or less appropriate for a particular situation...For example, requirements specification and analysis which involves extensive consultation with domain experts and in project management."

Several papers have since investigated the use of Machine Learning (ML) [26] in solving different software engineering (SE) tasks [5]-[20], [27]-[103]. These investigations include approaches to: i) project management [27]-[49], dealing with problems related to cost, time, quality prediction, and resource management; ii) defect prediction [50]-[84]; iii) requirements management, focusing on problems of classifying or representing requirements [85]-[88], or generating requirements [89]; iv) software development, such as code generation [20], [68], [90]-[96], synthesis [97]-[101], and code evaluation [102], [103].
Most of these papers present successful applications of machine learning in software engineering, showing that ML techniques can provide correct automatic solutions to some SE problems. However, very few papers discuss whether or not a domain expert could propose a manual solution more appropriate for the particular situation. "More appropriate" means a solution that provides better performance or increases another quality that is important to a particular application scenario, such as user preference [104]. For example, in the medical and aviation engineering fields, trust [105] in a solution provided to the end user is an important factor for a solution to be more appropriate. However, although many authors [106]-[109] have been promoting the use of neural networks [110] in medicine, Abbas et al. [105] and Castelvecchi [111] are among the few authors who questioned: "what is the trustworthiness of a prediction made by an artificial neural network?"

In other application scenarios, such as many of those related to the Internet of Things (IoT) [112], [113], numerous authors [93], [95], [101], [114] consider the reuse of a solution as an important quality. They agree that to achieve the goal of billions of things connected to the Internet over the next few years [112], it is necessary to find ways to reduce time to market. For example, it is desirable that the solution, or parts of the solution, to design autonomous streetlights [96] for a specific scenario could be reused to design streetlights for another scenario.

In particular, the Internet of Things has considerably increased the number of approaches that propose the use of machine learning to automate software development [93]-[96], [101], [115]. None of this research compares its results to experiments designed by IoT experts. For example, do Nascimento and Lucena [101], [116] developed a hybrid framework that uses learning-based and manual program synthesis for the Internet of Things (FIoT). They generated four instances of the framework [101], [108], [117], [118] and used learning techniques to synthesize the control structure automatically. These authors stated that the use of machine learning made the development of these applications feasible. However, they did not present any experiment without using learning techniques. In contrast, most of the solutions released for the Internet of Things, such as Apple's HomeKit [119] and Samsung SmartThings [120], consider a software developer synthesizing the control structure for each thing manually.

1.2 Objective

In this context, we decided to ask the following question: "How do software engineers compare with machine-learning algorithms?" To explore this question, we selected the Internet of Things as our application domain and then compared a solution provided by a skilled IoT professional with a solution provided by a learning algorithm with respect to performance and reuse tasks.

In short, Figure 1 depicts the theory [121] that we investigate in this paper. According to the theory, the variables that we intend to isolate and measure are the performance and reusability achieved from three kinds of solutions: i) solutions provided by learning techniques; ii) solutions provided by software engineers with IoT skills; and iii) solutions provided by software engineers without IoT skills.

To evaluate the relationship among these variables, we performed an empirical study using FIoT [101]. As shown in Figure 1, we raised four research questions (RQx) to investigate our theory's propositions (e.g., hypotheses (H-RQx)). We present these questions and hypotheses in Section 2. To collect and analyze our empirical data, we performed a controlled experiment. To perform this experiment, we reproduced the problem of synthesizing the control structure of autonomous streetlights using neuroevolution (i.e. "a learning algorithm which uses genetic algorithms to train neural networks" [122]) presented in [118]. Then, we invited 14 software engineers to provide a solution for the same problem using the same architecture and environment. Lastly, we compared the solution provided by the learning algorithm against the solutions provided by the software engineers. In this application of autonomous streetlights, we consider a "more appropriate" solution as one that presents a better performance in the main scenario [118] or can be satisfactorily reused in a new scenario, based on criteria such as minimal energy consumption and safety (that is, maximum visual comfort in illuminated areas).
Fig. 1. Theory [121]: Machine Learning can create solutions more appropriate than software engineers in the context of the Internet of Things. (The figure links the actor, machine-learning techniques, to the SE task of synthesizing and reusing the control structure of autonomous things.)

Fig. 3. Variables collected and set by streetlights: lighting (Dark/DIM/Light), presence (Yes/No), data collected from the closest streetlight (0.0/0.5/1.0), previous listening decision (Yes/No) and previous decision (Yes/No) are collected; the light decision (OFF/DIM/ON), the listening decision (Yes/No) and the wireless transmitter value (0.0/0.5/1.0) are set.

Fig. 4. The challenge: how does a streetlight make decisions based on collected data?

Each streetlight in the simulation has a microcontroller that is used to detect the proximity of a person and control the closest streetlight. A streetlight can change the status of its light to ON, OFF or DIM.

We provided pseudocode that considered all possible combinations of input variables. Then, participants decided how to set output variables according to the collected data. Part 2 of this pseudocode is depicted in Figure 5.

1. All documents that we prepared to explain this application scenario to participants are available at http://www.inf.puc-rio.br/~nnascimento/projects.html
2. The pseudocode that we provided to participants is available at:
Fig. 5. Small portion of the pseudocode of the decision module that was filled by participants.
Fig. 6. Small portion of the decision rules that were synthesized according to the learning-based approach.
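The pseudocode of Figure 5 and the synthesized rules of Figure 6 are only referenced as captions here. As a rough, hypothetical illustration of the kind of hand-written rule a participant could provide, the sketch below uses the variables of Fig. 3; the function and variable names, the 0.0/0.5/1.0 encoding of the lamp states, and the specific conditions are assumptions rather than the participants' actual code, and the broken-lamp behaviour mirrors the observation reported in Section 3.2.2.

```python
# A minimal, hypothetical sketch of one hand-written decision rule of the kind
# participants filled in; it is not the pseudocode of Fig. 5.

OFF, DIM, ON = 0.0, 0.5, 1.0  # lamp status encoding assumed from the 0.0/0.5/1.0 scale

def decide(ambient_light, person_nearby, received_signal, lamp_broken):
    """Return (lamp_status, keep_listening, transmitted_value) for one 1-second cycle."""
    if lamp_broken:
        # The text observes that broken lamps emit 0.5 from their wireless transmitters.
        return OFF, True, 0.5
    if ambient_light >= 1.0:
        # Daylight: no need to switch the lamp on or to signal neighbours.
        return OFF, True, 0.0
    if person_nearby or received_signal == 0.5:
        # Someone is close by, or a neighbour signalled: light up and warn the next lamp.
        return ON, True, 0.5
    if ambient_light == 0.5:
        return DIM, True, 0.0
    return OFF, True, 0.0
```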
Each participant provided a different solution. Therefore, we conducted the experiment by using each one. In addition, we also considered a "zeroed" solution, which always sets all values to zero. This zeroed solution is supposed to be the worst solution, since streetlights will always switch their lights to OFF.

3.2.2 The solution generated by a machine-learning algorithm

We compared the results from all of these approaches to the result produced using the machine-learning approach. As do Nascimento and Lucena explain in [118], the learning approach uses a three-layer feedforward neural network combined with an evolutionary algorithm to generate decision rules automatically. Figure 6 depicts some of the rules that were generated by the evolved neural network. The interested reader can consult more extensive papers [101], [118] or read Nascimento's dissertation [116] (chap. ii, sec. iii).

Based on the generated rules and the system execution, we observe that, using the solution provided by the neural network, only the streetlights with broken lamps emit "0.5" from their wireless transmitters. In addition, we also observed that a streetlight that is not broken switches its lamp ON if it detects a person's proximity or receives "0.5" from a wireless transmitter.

3.2.3 Scenario constraints

Before starting a solution, each participant should consider the following constraints:
• Do not take light numbering into account, since your solution may be used in different scenarios (see an example of a scenario in Figure 7).
• Three streetlights will go dark during the simulation.
• People walk along different paths starting at random departure points. Their role is to complete their routes by reaching a destination point. The number of people that finished their routes after the simulation ends, and the total time spent by people moving during their trip, are the most important factors for a good solution.
• A person can only move if his current and next positions are not completely dark. In addition, we also consider that people walk slowly if the place is partially devoid of light.
• The energy consumption also influences the solution evaluation.
• The energy consumption is proportional to the light status (OFF/DIM/ON).
• We also consider the use of the wireless transmitter to calculate energy consumption (if the streetlight emits something).

Figure 7 depicts the elements that are part of the application, namely streetlights, people, nodes and edges.

[Figure annotations: Execution: 12 seconds; a person moves from one point to another in one second or a second and a half; streetlights execute cycles of 1 second; lamp states ON, DIM, OFF; broken lamps are marked.]

This person will conclude his route before the simulation ends after 12 seconds.
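Section 3.2.2 above describes the learning approach only at a high level: a three-layer feedforward network whose decision rules are evolved by a genetic algorithm. The following is a minimal sketch of that kind of neuroevolution loop, assuming illustrative layer sizes, GA parameters, and a placeholder simulate() fitness function; it is not the FIoT implementation from [118].

```python
# A rough sketch of a neuroevolution loop: genome = network weights, fitness from simulation.
import numpy as np

rng = np.random.default_rng(0)
N_IN, N_HID, N_OUT = 5, 6, 3              # assumed layer sizes, loosely matching Fig. 3
GENOME = N_IN * N_HID + N_HID * N_OUT     # one genome encodes all connection weights

def forward(genome, x):
    """Evaluate the three-layer feedforward network encoded by `genome`."""
    w1 = genome[:N_IN * N_HID].reshape(N_IN, N_HID)
    w2 = genome[N_IN * N_HID:].reshape(N_HID, N_OUT)
    hidden = np.tanh(x @ w1)
    return np.tanh(hidden @ w2)           # outputs would later be discretised to OFF/DIM/ON etc.

def simulate(genome) -> float:
    """Placeholder for the streetlight simulation; returns a fitness score."""
    x = rng.uniform(0, 1, N_IN)           # stand-in for sensed inputs
    return float(forward(genome, x).sum())  # the real fitness mixes energy, trip time and completion

def evolve(pop_size=50, generations=100, elite=5, sigma=0.1):
    pop = rng.normal(0, 1, (pop_size, GENOME))
    for _ in range(generations):
        fitness = np.array([simulate(g) for g in pop])
        best = pop[np.argsort(fitness)[::-1][:elite]]                # keep the fittest genomes
        children = best[rng.integers(0, elite, pop_size - elite)]
        children = children + rng.normal(0, sigma, children.shape)   # Gaussian mutation
        pop = np.vstack([best, children])
    return pop[0]
```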
3.2.5 New Scenario: Unknown environment

The second step of the experiment consists of executing solutions from participants and the learning approach in a new scenario, but with the same constraints. This scenario, which is depicted in Figure 9, was not used by the learning algorithm and was not presented to participants.

The goal of this new part of the experiment is to verify if the decision module that was designed to control streetlights in the first scenario can be reused in another scenario.

Fig. 9. Simulating a new neighborhood.

In this new scenario, we also only started one person, who has point 18 (the yellow point) as departure and point 8 as target. As the scenario is larger, we established a simulation time of 30 seconds.

We executed the experiment 16 times, only changing the decision solution of the autonomous streetlights. In the first instance, we set all outputs to zero (the zeroed solution) during the whole simulation, which is supposed to be the worst solution. For example, streetlights never switch their lights ON. In the second instance, we executed the experiment using the best solution that was found by the learning algorithm, according to the experiment presented in [118]. Then, we executed the simulation for the solution provided by each one of the 14 participants³.

To provide a controlled experiment and be able to compare the different solutions, we started with only one person in the scenario and we manually set the parameters that were supposed to be randomly selected, such as the departure and target points and the broken lamps.

Each experiment execution consists of executing the simulated scenario three times: (i) night (environmental light is equal to 0.0); (ii) late afternoon (environmental light is equal to 0.5); and (iii) morning (environmental light is equal to 1.0). The main idea is to determine how the solution behaves during different parts of the day. Figure 10 depicts the percentage of energy that was spent according to the environmental light for each one of the 16 different solutions. As we described previously, we also considered the use of the wireless transmitter to calculate energy consumption. As expected, since streetlights using the zeroed decision never switch their lights ON and never emit any signal, the energy consumed using this solution is always zero. It is possible to observe that only the solutions provided by the learning algorithm and by the 5th and 11th participants do not expend energy when the environmental light is maximum. In fact, according to the proposed scenario, there is no reason to turn ON streetlights during the period of the day with maximum illumination.

Fig. 10. Scenario 1: Percentage of energy spent in different parts of the day according to the participant solutions.

3. All files that were generated during the development of this work, such as executable files and participants' solutions results, are available at http://www.inf.puc-rio.br/~nnascimento/projects.html
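As a hedged illustration of the energy accounting behind Figure 10 (consumption proportional to the lamp status, with an additional charge when the wireless transmitter emits), the sketch below accumulates energy over one run; the per-unit costs are assumptions, not values from the paper.

```python
# Energy is charged per 1-second cycle: the lamp at its status value (0.0/0.5/1.0),
# plus a fixed radio cost whenever the streetlight emits a non-zero signal.
def energy_spent(lamp_statuses, transmitted_values, lamp_cost=1.0, radio_cost=0.2):
    """Accumulate energy over one simulation run (one entry per 1-second cycle)."""
    lamp_energy = sum(status * lamp_cost for status in lamp_statuses)
    radio_energy = sum(radio_cost for v in transmitted_values if v > 0.0)
    return lamp_energy + radio_energy

# Example: a lamp that stays DIM for 12 cycles and emits twice.
print(energy_spent([0.5] * 12, [0.5, 0.0, 0.5] + [0.0] * 9))  # -> 6.0 + 0.4 = 6.4
```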
Figure 11 depicts the percentage of time that was spent by the unique person in each one of the simulations. As shown, the larger difference between solutions occurs at night. If the time is 100%, it means that the person did not complete the route; thus, the solution did not work.

Fig. 11. Scenario 1: Percentage of time spent by the person to conclude his route in different parts of the day, according to the participant solutions.

Besides presenting the results of the different solutions in different parts of the day, the best solution must be the one that presents the best result for the whole day. Thus, we calculated the average of each one of the parameters (energy, people, trip and fitness) that was achieved by the solutions in different parts of the day. Figure 12 depicts a common average. We also calculated a weighted average, taking into account the duration of the parts of the day (we considered 12 hours for the night period, 3 hours for dim and 9 hours for the morning), but the results were very similar.

Fig. 12. Scenario 1: Average of energy, trip and fitness calculated for the different parts of the day according to the participant solutions.

As shown in Figure 12, based on the fitness average, three participants, namely 3, 4 and 10, provided a solution slightly better than the solution provided by the learning algorithm. Five other participants provided a solution that works and the remaining six provided a solution that does not work. As explained earlier, we have been considering an incorrect solution as one in which the person did not finish the route before the simulation ends. Even increasing the simulation time did not allow the person to finish the route.

For each participant, we connect that solution's results with the participant's knowledge in the IoT domain, as shown in Table 3.

TABLE 3
Correlation between participants' expertise in the Internet of Things and their solution results.

Software Engineer | Experience with IoT Development (None/Low/Medium/High) | Solution Performance (Fitness Average) | Does the solution work?
1        | High   | 55.48 | Y
2        | None   | 26.99 | N
3        | High   | 62.88 | Y
4        | Low    | 62.49 | Y
5        | None   | 30.50 | N
6        | Low    | 51.09 | Y
7        | Medium | 54.37 | Y
8        | None   | 16.59 | N
9        | High   | 28.62 | N
10       | None   | 61.60 | Y
11       | None   | 29.67 | N
12       | Medium | 47.81 | Y
13       | None   | 30.32 | N
14       | Low    | 56.91 | Y
Learning | -      | 59.53 | Y
zeroed   | -      | 28.33 | N

We observe a significant difference between results from software engineers with any experience in IoT development and results from software engineers without experience in IoT development. Participant 10 is the only individual without knowledge of IoT who provided a solution that works, and participant 9 is the only individual with any knowledge of IoT who did not provide a working solution.

4.2 Hypothesis Testing

In this section, we investigate the hypotheses related to the solutions' performance evaluation (i.e. H-RQ1 and H-RQ3), as presented in subsection 2.2. Thus, we performed statistical analyses, as described by Peck and Devore [125], of the measures presented in Table 3. As shown in Table 4, we separated the results of the experiments into two groups: i) software engineers with IoT knowledge and ii) software engineers without IoT knowledge. Then, we calculated the mean and the standard deviation.
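A small sketch of this grouping step, using the fitness averages listed in Table 3; it reproduces the per-group mean and sample standard deviation reported in Table 4.

```python
# Group the Table 3 fitness averages by IoT experience and compute mean and sample std dev.
from statistics import mean, stdev

fitness = {  # participant id -> (IoT experience, fitness average), values from Table 3
    1: ("High", 55.48), 2: ("None", 26.99), 3: ("High", 62.88), 4: ("Low", 62.49),
    5: ("None", 30.50), 6: ("Low", 51.09), 7: ("Medium", 54.37), 8: ("None", 16.59),
    9: ("High", 28.62), 10: ("None", 61.60), 11: ("None", 29.67), 12: ("Medium", 47.81),
    13: ("None", 30.32), 14: ("Low", 56.91),
}

with_iot = [f for exp, f in fitness.values() if exp != "None"]
without_iot = [f for exp, f in fitness.values() if exp == "None"]

print(round(mean(with_iot), 2), round(stdev(with_iot), 2))        # 52.46 10.91
print(round(mean(without_iot), 2), round(stdev(without_iot), 2))  # 32.61 15.15
```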
TABLE 4
Data to perform the test statistic.

Variable                                 | n samples | Highest value | Mean (x̄) | Median | Standard deviation (σ) | Degrees of freedom (n-1) | t critical value (99%)
Software Engineers                       | 14        | 62.88         | 43.95     | 49.45  | 16.00                  | 13                       | 2.65
Software Engineers with IoT knowledge    | 8         | 62.88         | 52.46     | 54.92  | 10.91                  | 7                        | 3.00
Software Engineers without IoT knowledge | 6         | 61.60         | 32.61     | 30.00  | 15.15                  | 5                        | 3.37
Machine-learning based approach          | 1         | 59.53         | -         | -      | -                      | -                        | -
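Table 4 lists, for each group, the sample size, mean, standard deviation, degrees of freedom (n-1) and the t critical value at the 99% level. One plausible reading, assumed here only for illustration and not taken from the paper, is a one-sample t test of a group's mean fitness against the fitness achieved by the learning-based solution (59.53):

```python
# Sketch of the mechanics of a one-sample t test against the ML solution's fitness;
# whether this is the authors' exact procedure is an assumption.
import math

def one_sample_t(sample_mean, sample_std, n, reference):
    """t = (sample mean - reference) / (sample std / sqrt(n))"""
    return (sample_mean - reference) / (sample_std / math.sqrt(n))

ml_fitness = 59.53
# Software engineers without IoT knowledge (Table 4): mean 32.61, s 15.15, n = 6, t_crit(99%, df=5) = 3.37
t = one_sample_t(32.61, 15.15, 6, ml_fitness)
print(round(t, 2), abs(round(t, 2)) > 3.37)  # -4.35 True -> the difference exceeds the critical value
```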
the day (morning and late afternoon). As shown in Table 6, when considering the whole day, the machine-learning approach presented the best result. Because the average time for the trip was a little higher using the machine-learning approach, the difference in energy consumption between the two solutions is considerably higher.

TABLE 5
Using the same solution in a different environment - only at night.

TABLE 6
Using the same solution in a different environment - day average.

                       | Energy% | People% | Trip% | Fitness
Participant 12 Average | 50.52   | 100     | 38.14 | 56.90
Learning Average       | 8.46    | 100     | 46.29 | 68.83

5.1 Hypothesis Testing

In this section, we investigate the hypotheses related to the solutions' reuse evaluation, that is, H-RQ2 and H-RQ4, as presented in subsection 2.2. Their alternative hypotheses state that an ML-based approach improves the performance of autonomous things compared to solutions provided by software engineers, by software engineers with experience in IoT development, and by software engineers without experience in IoT development, respectively. We planned to perform a statistical analysis to evaluate these hypotheses. However, as depicted in Figure 15, in the new scenario, 0% of participants provided a result better than the result provided by the machine-learning solution. In addition, from the group of 14 engineers, only one participant, who has experience with IoT development, provided a solution that worked.

[Pie chart referenced as Figure 15: Work better than ML 0%; Work 7%; Bad 93%.]

6 DISCUSSION

In this section, we analyze the empirical experimental results to understand which tasks are better performed by humans and which by algorithms. This is important for selecting whether software engineers or machine learning can accomplish a specific task better.

In our empirical study, in which we have assessed performance and reuse tasks, we accepted three alternative hypotheses and rejected one:

Accepted:
1) An ML-based approach improves the performance of autonomous things compared to solutions provided by software engineers without experience with IoT development.
2) An ML-based approach increases the reuse of autonomous things compared to solutions provided by IoT expert software engineers.
3) An ML-based approach increases the reuse of autonomous things compared to solutions provided by software engineers without experience with IoT development.
against the results achieved by the world cham- line of investigation is: “Could a software engi-
pion in the game of Go. In [130], Silver et al. neer solve a specific development task better than
(2017) state that their program “achieved super- an ML algorithm?”. Indeed, it is fundamental to
human performance.” evaluate which tasks are better performed by en-
Whiteson et al. [122] indirectly performed this gineers or ML procedures so that they can work
comparison, by evaluating the use of three dif- together more effectively and also provide more
ferent approaches of the neuroevolution learning insight into novel human-in-the-loop machine-
algorithm to solve the same tasks: (i) coevolution, learning approaches to support SE tasks.
that is mostly unassisted by human knowledge; This paper appears to be the first to pro-
(ii) layered learning, that is highly assisted; and vide an empirical study comparing how soft-
(iii) concurrent layered learning, that is, a mixed approach. The authors state that their results “demonstrate that the appropriate level of human assistance depends critically on the difficulty of the problem.”

Furthermore, there is also a new approach in machine learning, called Automatic Machine Learning (Auto-ML) [100], which uses learning to set the parameters of a learning algorithm automatically. In the traditional approach, a software engineer with machine-learning skills is responsible for finding a good configuration of the algorithm parameters. Zoph and Le [100] present an Auto-ML-based approach to design a neural network that classifies images of a specific dataset. In addition, they compared their results with the previous state-of-the-art model, which had been designed by an expert ML engineer. According to Zoph and Le [100], their Auto-ML-based approach “can design a novel network architecture that rivals the best human-invented architecture in terms of test set accuracy.” Zoph and Le thereby also showed that a machine-learning technique is capable of beating a software engineer with ML skills at a specific software engineering task, although they do not discuss this point in their paper.
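To make the Auto-ML idea above concrete, the sketch below automates the search over a learning algorithm's parameters instead of relying on an engineer's manual tuning. It is only a minimal illustration: the dataset (scikit-learn's digits), the model, and the search space are assumptions chosen for exposition and are unrelated to the neural architecture search evaluated in [100].

# Illustrative sketch only: automated search over a learning algorithm's
# parameters, standing in for the manual tuning a software engineer would do.
# The dataset, model, and search space are assumptions for this example.
from scipy.stats import loguniform
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)

search = RandomizedSearchCV(
    estimator=MLPClassifier(max_iter=500, random_state=0),
    param_distributions={
        "hidden_layer_sizes": [(32,), (64,), (64, 32)],
        "alpha": loguniform(1e-5, 1e-1),            # L2 regularization strength
        "learning_rate_init": loguniform(1e-4, 1e-1),
    },
    n_iter=20,   # number of sampled configurations
    cv=3,        # 3-fold cross-validation per configuration
    random_state=0,
)
search.fit(X, y)
print("best configuration:", search.best_params_)
print("mean CV accuracy:", round(search.best_score_, 3))

Each sampled configuration is scored by cross-validation and the best one is kept, which is the essence of the automatic parameter setting described above; full Auto-ML systems additionally search over model families and network architectures.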
Our paper appears to be the first to provide an empirical study that investigates the use of machine-learning techniques to solve a problem in the field of Software Engineering by comparing the solution provided by an ML-based approach against solutions provided by software engineers.

9 CONCLUSION AND FUTURE WORK

Several researchers have proposed the use of machine-learning techniques to automate software engineering tasks. However, most of these approaches do not direct efforts toward asking whether ML-based procedures have higher success rates than current standard and manual practices. A relevant question in this potential scenario is how well software engineers and machine-learning algorithms carry out performance and reuse tasks. In brief, as a result of our experiment, we have found evidence that in some cases software engineers outperform machine-learning algorithms, and in other cases they do not. Further, as is typical in experimental studies, although we have designed and conducted the experiment carefully, there are always factors that can threaten the experiment’s validity. For example, such threats include the number and diversity of the software engineers involved in our experiment.

Understanding how software engineers fare against ML algorithms is essential to support new methodologies for developing human-in-the-loop approaches in which automated machine-learning procedures assist software developers in achieving their tasks. For example, such methodologies could define which agent (an engineer or an automated ML procedure) should execute a specific task in a software development setting. Based on this understanding, these methodologies can provide a basis for software engineers and machine-learning algorithms to cooperate more effectively in software engineering development.

Future work to extend the proposed experiment includes: (i) conducting further empirical studies to assess other SE tasks, such as design, maintenance, and testing; (ii) experimenting with other machine-learning algorithms, such as reinforcement learning and backpropagation; and (iii) using different criteria to evaluate task execution.

Possible tasks that could be investigated (see (i)) include programming tasks, in which case tasks performed by software development teams and by ML algorithms would be compared. For example, we could invite software developers from the team with the highest score in the last ACM International Collegiate Programming Contest [131], one of the most important programming championships in the world, to take part in this comparison. This competition evaluates the capability of software engineers to solve complex software problems, and competitors are ranked according to the number of correct solutions, the performance of the solutions, and the development time.

Another line of investigation could address the use of different qualitative or quantitative methodologies. For example, the task-execution comparison could rely on reference performances, such as the performance of highly successful performers [100], [129], [130]. This research can also be extended by proposing, based on the comparison between the performance of engineers and that of ML algorithms, a methodology for more effective task allocation. Such a methodology could, in principle, lead to more effective ways to allocate tasks such as software development in cooperative work involving humans and automated procedures. Human-in-the-loop approaches of this kind, which take into account the strengths and weaknesses of humans and of machine-learning algorithms, are fundamental to providing a basis for cooperative work in software engineering and possibly in other areas.
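As a purely hypothetical sketch of the task-allocation idea discussed above, the rule below routes a task to whichever agent (engineer or automated ML procedure) has the higher observed success rate within an effort budget. The data structure, the numbers, and the decision rule are illustrative assumptions, not a methodology defined or validated in this work.

# Hypothetical sketch: allocate a task to the agent (engineer or ML procedure)
# with the higher observed success rate among those fitting the effort budget.
# All names and numbers are placeholders for illustration.
from dataclasses import dataclass

@dataclass
class AgentStats:
    success_rate: float   # fraction of past tasks solved correctly
    avg_hours: float      # average effort per task

def allocate(engineer: AgentStats, ml_procedure: AgentStats,
             max_hours: float = 8.0) -> str:
    candidates = {"engineer": engineer, "ml_procedure": ml_procedure}
    feasible = {name: s for name, s in candidates.items()
                if s.avg_hours <= max_hours}
    if not feasible:          # nothing fits the budget: default to the engineer
        return "engineer"
    return max(feasible, key=lambda name: feasible[name].success_rate)

print(allocate(engineer=AgentStats(success_rate=0.70, avg_hours=6.0),
               ml_procedure=AgentStats(success_rate=0.85, avg_hours=1.5)))

In a real methodology, the statistics would come from controlled comparisons such as the experiment reported here, and the rule would likely weigh additional criteria (cost, risk, explainability) rather than success rate alone.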
ACKNOWLEDGMENTS

This work has been supported by the Laboratory of Software Engineering (LES) at PUC-Rio. Our thanks to CAPES, CNPq, FAPERJ, and PUC-Rio for their support through scholarships and fellowships. We would also like to thank the software engineers who participated in our experiment.

REFERENCES

[1] F. Brooks and H. Kugler, No silver bullet. April, 1987.
[2] R. S. Pressman, Software engineering: a practitioner’s approach. Palgrave Macmillan, 2005.
[3] Q. Zhang, “Software developments,” Engineering Automation for Reliable Software, p. 292, 2000.
[4] J. O. Kephart, “Research challenges of autonomic computing,” in Software Engineering, 2005. ICSE 2005. Proceedings. 27th International Conference on. IEEE, 2005, pp. 15–22.
[5] J. Mostow, “Foreword: What is AI? And what does it have to do with software engineering?” IEEE Transactions on Software Engineering, vol. 11, no. 11, p. 1253, 1985.
[6] D. Barstow, “Artificial intelligence and software engineering,” in Proceedings of the 9th International Conference on Software Engineering. IEEE Computer Society Press, 1987, pp. 200–211.
[7] D. Partridge, “Artificial intelligence and software engineering: a survey of possibilities,” Information and Software Technology, vol. 30, no. 3, pp. 146–152, 1988.
[8] L. C. Cheung, S. Ip, and T. Holden, “Survey of artificial intelligence impacts on information systems engineering,” Information and Software Technology, vol. 33, no. 7, pp. 499–508, 1991.
[9] D. Partridge, Artificial Intelligence in Software Engineering. Wiley Online Library, 1998.
[10] A. Van Lamsweerde and L. Willemet, “Inferring declarative requirements specifications from operational scenarios,” IEEE Transactions on Software Engineering, vol. 24, no. 12, pp. 1089–1114, 1998.
[11] G. D. Boetticher, “Using machine learning to predict project effort: Empirical case studies in data-starved domains,” in Model Based Requirements Workshop. Citeseer, 2001, pp. 17–24.
[12] F. Padberg, T. Ragg, and R. Schoknecht, “Using machine learning for estimating the defect content after an inspection,” IEEE Transactions on Software Engineering, vol. 30, no. 1, pp. 17–28, 2004.
[13] D. Zhang, “Applying machine learning algorithms in software development,” in Proceedings of the 2000 Monterey Workshop on Modeling Software System Structures, 2000, pp. 275–285.
[14] ——, “Machine learning in value-based software test data generation,” in Tools with Artificial Intelligence, 2006. ICTAI’06. 18th IEEE International Conference on. IEEE, 2006, pp. 732–736.
[15] D. Zhang and J. J. Tsai, Machine Learning Applications in Software Engineering. World Scientific, 2005, vol. 16.
[16] D. Zhang, “Machine learning and value-based software engineering: a research agenda,” in SEKE, 2008, pp. 285–290.
[17] T. M. Khoshgoftaar, “Introduction to the special issue on quality engineering with computational intelligence,” 2003.
[18] D. Zhang, “Machine learning and value-based software engineering,” in Software Applications: Concepts, Methodologies, Tools, and Applications. IGI Global, 2009, pp. 3325–3339.
[19] D. Zhang and J. J. Tsai, “Machine learning and software engineering,” in Tools with Artificial Intelligence, 2002 (ICTAI 2002). Proceedings. 14th IEEE International Conference on. IEEE, 2002, pp. 22–29.
[20] M. D. Kramer and D. Zhang, “GAPS: a genetic programming system,” in Computer Software and Applications Conference, 2000. COMPSAC 2000. The 24th Annual International. IEEE, 2000, pp. 614–619.
[21] A. Holzinger, M. Plass, K. Holzinger, G. C. Crişan, C.-M. Pintea, and V. Palade, “Towards interactive machine learning (iML): applying ant colony algorithms to solve the traveling salesman problem with the human-in-the-loop approach,” in International Conference on Availability, Reliability, and Security. Springer, 2016, pp. 81–95.
[22] A. Holzinger, “Interactive machine learning for health informatics: when do we need the human-in-the-loop?” Brain Informatics, vol. 3, no. 2, pp. 119–131, 2016.
[23] S. Easterbrook, J. Singer, M.-A. Storey, and D. Damian, “Selecting empirical methods for software engineering research,” Guide to Advanced Empirical Software Engineering, pp. 285–311, 2008.
[24] H. A. Simon, “Whether software engineering needs to be artificially intelligent,” IEEE Transactions on Software Engineering, no. 7, pp. 726–732, 1986.
[25] I. Sommerville, “Artificial intelligence and systems engineering,” Prospects for Artificial Intelligence: Proceedings of AISB’93, 29 March–2 April 1993, Birmingham, UK, vol. 17, p. 48, 1993.
[26] R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, Machine Learning: An Artificial Intelligence Approach. Springer Science & Business Media, 2013.
[27] A. Marchetto and A. Trentini, “Evaluating web applications testability by combining metrics and analogies,” in Information and Communications Technology, 2005. Enabling Technologies for the New Knowledge Society: ITI 3rd International Conference on. IEEE, 2005, pp. 751–779.
[28] S. Bouktif, F. Ahmed, I. Khalil, and G. Antoniol, “A novel composite model approach to improve software quality prediction,” Information and Software Technology, vol. 52, no. 12, pp. 1298–1311, 2010.
[29] Ł. Radliński, “A survey of Bayesian net models for software development effort prediction,” International Journal of Software Engineering and Computing, vol. 2, no. 2, pp. 95–109, 2010.
[30] W. Zhang, Y. Yang, and Q. Wang, “Handling missing data in software effort prediction with naive Bayes and EM algorithm,” in Proceedings of the 7th International Conference on Predictive Models in Software Engineering. ACM, 2011, p. 4.
[31] Ł. Radliński, “A framework for integrated software quality prediction using Bayesian nets,” Computational Science and Its Applications-ICCSA 2011, pp. 310–325, 2011.
[32] P. O. O. Sack, M. Bouneffa, Y. Maweed, and H. Basson, “On building an integrated and generic platform for software quality evaluation,” in Information and Communication Technologies, 2006. ICTTA’06. 2nd, vol. 2. IEEE, 2006, pp. 2872–2877.
[33] M. Reformat and D. Zhang, “Introduction to the special issue on: ‘Software quality improvements and estimations with intelligence-based methods’,” Software Quality Journal, vol. 15, no. 3, pp. 237–240, 2007.
[34] B. Twala, M. Cartwright, and M. Shepperd, “Applying rule induction in software prediction,” in Advances in Machine Learning Applications in Software Engineering. IGI Global, 2007, pp. 265–286.
[35] V. U. Challagulla, F. B. Bastani, and I.-L. Yen, “High-confidence compositional reliability assessment of SOA-based systems using machine learning techniques,” in Machine Learning in Cyber Trust. Springer, 2009, pp. 279–322.
[36] R. C. Veras, S. R. Meira, A. L. Oliveira, and B. J. Melo, “Comparative study of clustering techniques for the organization of software repositories,” in Hybrid Intelligent Systems, 2007. HIS 2007. 7th International Conference on. IEEE, 2007, pp. 372–377.
[37] I. Birzniece and M. Kirikova, “Interactive inductive learning service for indirect analysis of study subject compatibility,” in Proceedings of the BeneLearn, 2010, pp. 1–6.
[38] D. B. Hanchate, “Analysis, mathematical modeling and algorithm for software project scheduling using BCGA,” in Intelligent Computing and Intelligent Systems (ICIS), 2010 IEEE International Conference on, vol. 3. IEEE, 2010, pp. 1–7.
[39] Z. Xu and B. Song, “A machine learning application for human resource data mining problem,” Advances in Knowledge Discovery and Data Mining, pp. 847–856, 2006.
[40] J. Wen, S. Li, Z. Lin, Y. Hu, and C. Huang, “Systematic literature review of machine learning based software development effort estimation models,” Information and Software Technology, vol. 54, no. 1, pp. 41–59, 2012.
[41] E. Rashid, S. Patnayak, and V. Bhattacherjee, “A survey in the area of machine learning and its application for software quality prediction,” ACM SIGSOFT Software Engineering Notes, vol. 37, no. 5, pp. 1–7, 2012.
[42] H. A. Al-Jamimi and M. Ahmed, “Machine learning-based software quality prediction models: state of the art,” in Information Science and Applications (ICISA), 2013 International Conference on. IEEE, 2013, pp. 1–4.
[43] Ł. Radliński, “Enhancing Bayesian network model for integrated software quality prediction,” in Proc. Fourth International Conference on Information, Process, and Knowledge Management, Valencia. Citeseer, 2012, pp. 144–149.
[44] F. Pinel, P. Bouvry, B. Dorronsoro, and S. U. Khan, “Savant: Automatic parallelization of a scheduling heuristic with machine learning,” in Nature and Biologically Inspired Computing (NaBIC), 2013 World Congress on. IEEE, 2013, pp. 52–57.
[45] D. Novitasari, I. Cholissodin, and W. F. Mahmudy, “Optimizing SVR using local best PSO for software effort estimation,” Journal of Information Technology and Computer Science, vol. 1, no. 1, 2016.
[46] Ł. Radliński, “Towards expert-based modelling of integrated software quality,” Journal of Theoretical and Applied Computer Science, vol. 6, no. 2, pp. 13–26, 2012.
[47] T. Rongfa, “Defect classification method for software management quality control based on decision tree learning,” in Advanced Technology in Teaching-Proceedings of the 2009 3rd International Conference on Teaching and Computational Science (WTCS 2009). Springer, 2012, pp. 721–728.
[48] R. Rana and M. Staron, “Machine learning approach for quality assessment and prediction in large software organizations,” in Software Engineering and Service Science (ICSESS), 2015 6th IEEE International Conference on. IEEE, 2015, pp. 1098–1101.
[49] H. Wang, M. Kessentini, W. Grosky, and H. Meddeb, “On the use of time series and search based software engineering for refactoring recommendation,” in Proceedings of the 7th International Conference on Management of computational and collective intElligence in Digital EcoSystems. ACM, 2015, pp. 35–42.
[50] V. U. Challagulla, F. B. Bastani, I.-L. Yen, and R. A. Paul, “Empirical assessment of machine learning based software defect prediction techniques,” in Object-Oriented Real-Time Dependable Systems, 2005. WORDS 2005. 10th IEEE International Workshop on. IEEE, 2005, pp. 263–270.
[51] K. Kaminsky and G. Boetticher, “Building a genetically engineerable evolvable program (GEEP) using breadth-based explicit knowledge for predicting software defects,” in Fuzzy Information, 2004. Processing NAFIPS’04. IEEE Annual Meeting of the, vol. 1. IEEE, 2004, pp. 10–15.
[52] ——, “How to predict more with less, defect prediction using machine learners in an implicitly data starved domain,” in The 8th World Multiconference on Systemics, Cybernetics and Informatics, Orlando, FL. Citeseer, 2004.
[53] K. Kaminsky and G. D. Boetticher, “Better software defect prediction using equalized learning with machine learners,” Knowledge Sharing and Collaborative Engineering, 2004.
[54] O. Kutlubay and A. Bener, “A machine learning based model for software defect prediction,” working paper, Boğaziçi University, Computer Engineering Department, 2005.
[55] X. Ren, “Learn to predict ‘affecting changes’ in software engineering,” 2003.
[56] E. Ceylan, F. O. Kutlubay, and A. B. Bener, “Software defect identification using machine learning techniques,” in Software Engineering and Advanced Applications, 2006. SEAA’06. 32nd EUROMICRO Conference on. IEEE, 2006, pp. 240–247.
[57] Y. Kastro and A. B. Bener, “A defect prediction method for software versioning,” Software Quality Journal, vol. 16, no. 4, pp. 543–562, 2008.
[58] O. Kutlubay, B. Turhan, and A. B. Bener, “A two-step model for defect density estimation,” in Software Engineering and Advanced Applications, 2007. 33rd EUROMICRO Conference on. IEEE, 2007, pp. 322–332.
[59] A. S. Namin and M. Sridharan, “Bayesian reasoning for software testing,” in Proceedings of the FSE/SDP Workshop on Future of Software Engineering Research. ACM, 2010, pp. 349–354.
[60] C. Murphy and G. Kaiser, “Metamorphic runtime checking of non-testable programs,” Columbia University Dept of Computer Science Tech Report CUCS-042-09, p. 9293, 2009.
[61] W. Afzal, R. Torkar, R. Feldt, and T. Gorschek, “Genetic programming for cross-release fault count predictions in large and complex software projects,” Evolutionary Computation and Optimization Algorithms in Software Engineering, pp. 94–126, 2010.
[62] C. Murphy et al., “Using metamorphic testing at runtime to detect defects in applications without test oracles,” 2008.
[63] D. Qiu, S. Fang, and Y. Li, “A framework to discover potential deviation between program and requirement through mining object graph,” in Computer Application and System Modeling (ICCASM), 2010 International Conference on, vol. 4. IEEE, 2010, pp. V4-110.
[64] C. Murphy, G. E. Kaiser et al., “Automatic detection of defects in applications without test oracles,” Dept. Comput. Sci., Columbia Univ., New York, NY, USA, Tech. Rep. CUCS-027-10, 2010.
[65] W. Afzal, “Search-based approaches to software fault prediction and software testing,” Ph.D. dissertation, Blekinge Institute of Technology, 2009.
[66] M. K. Taghi, B. Cukic, and N. Seliya, “An empirical assessment on program module-order models,” Quality Technology & Quantitative Management, vol. 4, no. 2, pp. 171–190, 2007.
[67] J. H. Wang, N. Bouguila, and T. Bdiri, “Empirical evaluation of selected algorithms for complexity-based classification of software modules and a new model,” in Intelligent Systems: From Theory to Practice. Springer, 2010, pp. 99–131.
[68] H. Jin, Y. Wang, N.-W. Chen, Z.-J. Gou, and S. Wang, “Artificial neural network for automatic test oracles generation,” in Computer Science and Software Engineering, 2008 International Conference on, vol. 2. IEEE, 2008, pp. 727–730.
[69] J. Ferzund, S. N. Ahsan, and F. Wotawa, “Automated classification of faults in programs using machine learning techniques,” in Artificial Intelligence Techniques in Software Engineering Workshop, 2008.
[70] O. Maqbool and H. Babri, “Bayesian learning for software architecture recovery,” in Electrical Engineering, 2007. ICEE’07. International Conference on. IEEE, 2007, pp. 1–6.
[71] A. Okutan, “Software defect prediction using Bayesian networks and kernel methods,” Ph.D. dissertation, Işık University, 2012.
[72] D. Cotroneo, R. Pietrantuono, and S. Russo, “A learning-based method for combining testing techniques,” in Proceedings of the 2013 International Conference on Software Engineering. IEEE Press, 2013, pp. 142–151.
[73] D. Zhang, “A value-based framework for software evolutionary testing,” in Advances in Abstract Intelligence and Soft Computing. IGI Global, 2013, pp. 355–373.
[74] A. Okutan and O. T. Yıldız, “Software defect prediction using Bayesian networks,” Empirical Software Engineering, vol. 19, no. 1, pp. 154–181, 2014.
[75] S. Agarwal and D. Tomar, “A feature selection based model for software defect prediction,” assessment, vol. 65, 2014.
[76] G. Abaei and A. Selamat, “Important issues in software fault prediction: A road map,” in Handbook of Research on Emerging Advancements and Technologies in Software Engineering. IGI Global, 2014, pp. 510–539.
[77] A. Okutan and O. T. Yildiz, “A novel kernel to predict software defectiveness,” Journal of Systems and Software, vol. 119, pp. 109–121, 2016.
[78] X.-d. Mu, R.-h. Chang, and L. Zhang, “Software defect prediction based on competitive organization coevolutionary algorithm,” Journal of Convergence Information Technology (JCIT), vol. 7, no. 5, 2012.
[79] J. Cahill, J. M. Hogan, and R. Thomas, “Predicting fault-prone software modules with rank sum classification,” in Software Engineering Conference (ASWEC), 2013 22nd Australian. IEEE, 2013, pp. 211–219.
[80] R. Rana, M. Staron, C. Berger, J. Hansson, M. Nilsson, and W. Meding, “The adoption of machine learning techniques for software defect prediction: An initial industrial validation,” in Joint Conference on Knowledge-Based Software Engineering. Springer, 2014, pp. 270–285.
[81] T. Schulz, Ł. Radliński, T. Gorges, and
tion, 2011.
[126] W. Oizumi, L. Sousa, A. Garcia, R. Oliveira, A. Oliveira, O. Agbachi, and C. Lucena, “Revealing design problems in stinky code: a mixed-method study,” in Proceedings of the 11th Brazilian Symposium on Software Components, Architectures, and Reuse. ACM, 2017, p. 5.
[127] E. Fernandes, F. Ferreira, J. A. Netto, and E. Figueiredo, “Information systems development with pair programming: An academic quasi-experiment,” in Proceedings of the XII Brazilian Symposium on Information Systems on Brazilian Symposium on Information Systems: Information Systems in the Cloud Computing Era-Volume 1. Brazilian Computer Society, 2016, p. 64.
[128] G. Kasparov, Deep Thinking: Where Machine Intelligence Ends and Human Creativity Begins. Hachette UK, 2017.
[129] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[130] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., “Mastering the game of Go without human knowledge,” Nature, vol. 550, no. 7676, pp. 354–359, 2017.
[131] A. Trotman and C. Handley, “Programming contest strategy,” Computers & Education, vol. 50, no. 3, pp. 821–837, 2008.