Software Engineering For Machine Learning: A Case Study

Software Engineering for Machine Learning:
A Case Study
Saleema Amershi Andrew Begel Christian Bird Robert DeLine Harald Gall
Microsoft Research Microsoft Research Microsoft Research Microsoft Research University of Zurich
Redmond, WA USA Redmond, WA USA Redmond, WA USA Redmond, WA USA Zurich, Switzerland
samershi@microsoft.com andrew.begel@microsoft.com cbird@microsoft.com rdeline@microsoft.com gall@ifi.uzh.ch
Ece Kamar Nachiappan Nagappan Besmira Nushi Thomas Zimmermann

Microsoft Research Microsoft Research Microsoft Research Microsoft Research
Redmond, WA USA Redmond, WA USA Redmond, WA USA Redmond, WA USA
eckamar@microsoft.com nachin@microsoft.com besmira.nushi@microsoft.com tzimmer@microsoft.com
Abstract—Recent advances in machine learning have stim- techniques that have powered recent excitement in the software
ulated widespread interest within the Information Technology and services marketplace. Microsoft product teams have used
sector on integrating AI capabilities into software and services. machine learning to create application suites such as Bing
This goal has forced organizations to evolve their development
processes. We report on a study that we conducted on observing Search or the Cortana virtual assistant, as well as platforms
software teams at Microsoft as they develop AI-based applica- such as Microsoft Translator for real-time translation of text,
tions. We consider a nine-stage workflow process informed by voice, and video, Cognitive Services for vision, speech, and
prior experiences developing AI applications (e.g., search and language understanding for building interactive, conversational
NLP) and data science tools (e.g. application diagnostics and bug agents, and the Azure AI platform to enable customers to build
reporting). We found that various Microsoft teams have united
this workflow into preexisting, well-evolved, Agile-like software their own machine learning applications [1]. To create these
engineering processes, providing insights about several essential software products, Microsoft has leveraged its preexisting
engineering challenges that organizations may face in creating capabilities in AI and developed new areas of expertise across
large-scale AI solutions for the marketplace. We collected some the company.
best practices from Microsoft teams to address these challenges. In this paper, we describe a study in which we learned how
In addition, we have identified three aspects of the AI domain that
make it fundamentally different from prior software application various Microsoft software teams build software applications
domains: 1) discovering, managing, and versioning the data with customer-focused AI features. For that, Microsoft has
needed for machine learning applications is much more complex integrated existing Agile software engineering processes with
and difficult than other types of software engineering, 2) model AI-specific workflows informed by prior experiences in devel-
customization and model reuse require very different skills than oping early AI and data science applications. In our study, we
are typically found in software teams, and 3) AI components
are more difficult to handle as distinct modules than traditional asked Microsoft employees about how they worked through
software components — models may be “entangled” in complex the growing challenges of daily software development specific
ways and experience non-monotonic error behavior. We believe to AI, as well as the larger, more essential issues inherent in the
that the lessons learned by Microsoft teams will be valuable to development of large-scale AI infrastructure and applications.
other organizations. With teams across the company having differing amounts of
Index Terms—AI, Software engineering, process, data
work experience in AI, we observed that many issues reported
by newer teams dramatically drop in importance as the teams
I. I NTRODUCTION
mature, while some remain as essential to the practice of large-
Personal computing. The Internet. The Web. Mobile com- scale AI. We have made a first attempt to create a process
puting. Cloud computing. Nary a decade goes by without a maturity metric to help teams identify how far they have come
disruptive shift in the dominant application domain of the on their journeys to building AI applications.
software industry. Each shift brings with it new software As a key finding of our analyses, we discovered three funda-
engineering goals that spur software organizations to evolve mental differences to building applications and platforms for
their development practices in order to address the novel training and fielding machine-learning models than we have
aspects of the domain. seen in prior application domains. First, machine learning is all
The latest trend to hit the software industry is around about data. The amount of effort and rigor it takes to discover,
integrating artificial intelligence (AI) capabilities based on source, manage, and version data is inherently more complex
advances in machine learning. AI broadly includes technolo- and different than doing the same with software code. Second,
gies for reasoning, problem solving, planning, and learning, building for customizability and extensibility of models require
among others. Machine learning refers to statistical modeling teams to not only have software engineering skills but almost
Fig. 1. The nine stages of the machine learning workflow. Some stages are data-oriented (e.g., collection, cleaning, and labeling) and others are model-oriented
(e.g., model requirements, feature engineering, training, evaluation, deployment, and monitoring). There are many feedback loops in the workflow. The larger
feedback arrows denote that model evaluation and monitoring may loop back to any of the previous stages. The smaller feedback arrow illustrates that model
training may loop back to feature engineering (e.g., in representation learning).
always require deep enough knowledge of machine learning to in continuous integration and diagnostic-gathering, making it
build, evaluate, and tune models from scratch. Third, it can be simpler to implement continuous delivery.
more difficult to maintain strict module boundaries between Process changes not only alter the day-to-day development
machine learning components than for software engineering practices of a team, but also influence the roles that people
modules. Machine learning models can be “entangled” in play. 15 years ago, many teams at Microsoft relied heavily on
complex ways that cause them to affect one another during development triads consisting of a program manager (require-
training and tuning, even if the software teams building them ments gathering and scheduling), a developer (programming),
intended for them to remain isolated from one another. and a tester (testing) [6]. These teams’ adoption of DevOps
The lessons we identified via studies of a variety of teams combined the roles of developer and tester and integrated
at Microsoft who have adapted their software engineering the roles of IT, operations, and diagnostics into the mainline
processes and practices to integrate machine learning can help software team.
other software organizations embarking on their own paths In recent years, teams have increased their abilities to an-
towards building AI applications and platforms. alyze diagnostics-based customer application behavior, prior-
In this paper, we offer the following contributions. itize bugs, estimate failure rates, and understand performance
1) A description of how several Microsoft software en- regressions through the addition of data scientists [7], [8], who
gineering teams work cast into a nine-stage workflow helped pioneer the integration of statistical and machine learn-
for integrating machine learning into application and ing workflows into software development processes. Some
platform development. software teams employ polymath data scientists, who “do it
2) A set of best practices for building applications and all,” but as data science needs to scale up, their roles specialize
platforms relying on machine learning. into domain experts who deeply understand the business prob-
3) A custom machine-learning process maturity model for lems, modelers who develop predictive models, and platform
assessing the progress of software teams towards excel- builders who create the cloud-based infrastructure.
lence in building AI applications.
4) A discussion of three fundamental differences in how B. ML Workflow
software engineering applies to machine-learning–centric One commonly used machine learning workflow at Mi-
components vs. previous application domains. crosoft has been depicted in various forms across industry
and research [1], [9], [10], [11]. It has commonalities with
II. BACKGROUND prior workflows defined in the context of data science and data
A. Software Engineering Processes mining, such as TDSP [12], KDD [13], and CRISP-DM [14].
The changing application domain trends in the software Despite the minor differences, these representations have in
industry have influenced the evolution of the software pro- common the data-centered essence of the process and the
cesses practiced by teams at Microsoft. For at least a decade multiple feedback loops among the different stages. Figure 1
and a half, many teams have used feedback-intense Agile shows a simplified view of the workflow consisting of nine
methods to develop their software [2], [3], [4] because they stages.
needed to be responsive at addressing changing customer In the model requirements stage, designers decide which
needs through faster development cycles. Agile methods have features are feasible to implement with machine learning and
been helpful at supporting further adaptation, for example, which can be useful for a given existing product or for a
the most recent shift to re-organize numerous team’s prac- new one. Most importantly, in this stage, they also decide
tices around DevOps [5], which better matched the needs what types of models are most appropriate for the given
of building and supporting cloud computing applications and problem. During data collection, teams look for and integrate
platforms.1 The change to DevOps occurred fairly quickly available datasets (e.g., internal or open source) or collect their
because these teams were able to leverage prior capabilities own. Often, they might train a partial model using available
generic datasets (e.g., ImageNet for object detection), and then
1 https://docs.microsoft.com/en-us/azure/devops/learn/devops-at-microsoft/ use transfer learning together with more specialized data to
train a more specific model (e.g., pedestrian detection). Data related line of thought, recent work also discusses the impact
cleaning involves removing inaccurate or noisy records from that the use of ML-based software has on risk and safety
the dataset, a common activity to all forms of data science. concerns of ISO standards [21]. In the last five years, there
Data labeling assigns ground truth labels to each record. have been multiple efforts in industry to automate this process
For example, an engineer might have a set of images on hand by building frameworks and environments to support the ML
which have not yet been labeled with the objects present in workflow and its experimental nature [1], [22], [23]. However,
the image. Most of the supervised learning techniques require ongoing research and surveys show that engineers still struggle
labels to be able to induce a model. Other techniques (e.g., to operationalize and standardize working processes [9], [24],
reinforcement learning) use demonstration data or environment [23]. The goal of this work is to uncover detailed insights on
rewards to adjust their policies. Labels can be provided either ML-specific best practices used by developers at Microsoft.
by engineers themselves, domain experts, or by crowd workers We share these insights with the broader community aspiring
in online crowd-sourcing platforms. that such take-away lessons can be valuable to other companies
Feature engineering refers to all activities that are performed and engineers.
to extract and select informative features for machine learning
models. For some models (e.g. convolutional neural networks), D. Process Maturity
this stage is less explicit and often blended with the next stage, Software engineers face a constantly changing set of plat-
model training. During model training, the chosen models forms and technologies that they must learn to build the newest
(using the selected features) are trained and tuned on the applications for the software marketplace. Some engineers
clean, collected data and their respective labels. Then in learn new methods and techniques in school, and bring them
model evaluation, the engineers evaluate the output model to the organizations they work for. Other learn new skills on
on tested or safeguard datasets using pre-defined metrics. the job or on the side, as they anticipate their organization’s
For critical domains, this stage might also involve extensive need for latent talent [25]. Software teams, composed of
human evaluation. The inference code of the model is then individual engineers with varying amounts of experience in the
deployed on the targeted device(s) and continuously monitored skills necessary to professionally build ML components and
for possible errors during real-world execution. their support infrastructure, themselves exhibit varying levels
For simplicity the view in Figure 1 is linear, however, of proficiency in their abilities depending on their aggregate
machine learning workflows are highly non-linear and contain experience in the domain.
several feedback loops. For example, if engineers notice that The software engineering discipline has long considered
there is a large distribution shift between the training data software process improvement as one of its vital functions.
and the data in the real world, they might want to go back Researchers and practitioners in the field have developed sev-
and collect more representative data and rerun the workflow. eral well-known metrics to assess it, including the Capability
Similarly, they may revisit their modeling choices made in the Maturity Model (CMM) [26] and Six Sigma [27]. CMM rates
first stage, if the problem evolves or if better algorithms are the software processes of organizations on five levels, from
invented. While feedback loops are typical in Agile software initial (ad hoc processes), repeatable, defined, capable (i.e.,
processes, the peculiarity of the machine learning workflow quantitatively measured), and efficient (i.e., deliberate process
is related to the amount of experimentation needed to con- improvement). Inspired by CMM, we build a first maturity
verge to a good model for the problem. Indeed, the day- model for teams building systems and platforms that integrate
to-day work of an engineer doing machine learning involves machine learning components.
frequent iterations over the selected model, hyper-parameters,
and dataset refinement. Similar experimental properties have III. S TUDY
been observed in the past in scientific software [15] and We collected data in two phases: an initial set of interviews
hardware/software co-design [16]. This workflow can become to gather the major topics relevant to our research questions
even more complex if the system is integrative, containing and a wide-scale survey about the identified topics. Our study
multiple ML components which interact together in complex design was approved by Microsoft’s Ethics Advisory Board.
and unexpected ways [17].
A. Interviews
C. Software Engineering for Machine Learning Because the work practice around building and integrating
The need for adjusting software engineering practices in machine learning into software and services is still emerging
the recent era has been discussed in the context of hidden and is not uniform across all product teams, there is no
technical debt [18] and troubleshooting integrative AI [19], systematic way to identify the key stakeholders on the topic
[20]. This work identifies various aspects of ML system archi- of adoption. We therefore used a snowball sampling strategy,
tecture and requirements which need to be considered during starting with (1) leaders of teams with mature use of machine
system design. Some of these aspects include hidden feedback learning (ML) (e.g., Bing), (2) leaders of teams where AI is
loops, component entanglement and eroded boundaries, non- a major aspect of the user experience (e.g., Cortana), and (3)
monotonic error propagation, continuous quality states, and people conducting company-wide internal training in AI and
mismatches between the real world and evaluation sets. On a ML. As we chose informants, we picked a variety of teams
TABLE I responded, giving us a 13.6% response rate. For each open-
T HE STAKEHOLDERS WE INTERVIEWED FOR THE STUDY. response item, between two and four researchers analyzed the
responses through a card sort. Then, the entire team reviewed
Id Role Product Area Manager?
the card sort results for clarity and consistency.
I1 Applied Scientist Search Yes
I2 Applied Scientist Search Yes
Respondents were fairly well spread across all divisions of
I3 Architect Conversation Yes the company and came from a variety of job roles: Data and
I4 Engineering Manager Vision Yes applied science (42%), Software engineering (32%), Program
I5 General Manager ML Tools Yes
I6 Program Manager ML Tools Yes
management (17%), Research (7%), and other (1%). 21% of
I7 Program Manager Productivity Tools Yes respondents were managers and 79% were individual contrib-
I8 Researcher ML Tools Yes utors, helping us balance out the majority manager perspective
I9 Software Engineer Speech Yes
in our interviews.
I10 Program Manager AI Platform No In the next sections, we discuss our interview and survey
I11 Program Manager Community No
I12 Scientist Ads No results, starting with the range of AI applications developed by
I13 Software Engineer Vision No Microsoft, diving into best practices that Microsoft engineers
I14 Software Engineer Vision No have developed to address some of the essential challenges
in building large-scale AI applications and platforms, show-
1. Part 1 ing how the perception of the importance of the challenges
1.1. Background and demographics: changes as teams gain experience building AI applications, and
1.1.1. years of AI experience finally, describing our proposed AI process maturity model.
1.1.2. primary AI use case*
1.1.3. team effectiveness rating IV. A PPLICATIONS OF AI
1.1.4. source of AI components Many teams across Microsoft have augmented their appli-
1.2. Challenges* cations with machine learning and inference, some in some
1.3. Time spent on each of the nine workflow activities surprising domains. We asked survey respondents for the ways
1.4. Time spent on cross-cutting activities that they used AI on their teams. We card sorted this data
2. Part 2 (repeated for two activities where most time spent) twice, once to capture the application domain in which AI
2.1. Tools used* was being applied, and a second time to look at the (mainly)
2.2. Effectiveness rating ML algorithms used to build that application.
2.3. Maturity ratings We found AI is used in traditional areas such as search, ad-
3. Part 3 vertising, machine translation, predicting customer purchases,
3.1. Dream tools* voice recognition, and image recognition, but also saw it
3.2. Best practices* being used in novel areas, such as identifying customer leads,
3.3. General comments* providing design advice for presentations and word processing
documents, providing unique drawing features, healthcare, and
Fig. 2. The structure of the study’s questionnaire. An asterisk indicates an
open-response item. improving gameplay. In addition, machine learning is being
used heavily in infrastructure projects to manage incident
reporting, identify the most likely causes for bugs, monitor
to get different levels of experience and different parts of fraudulent fiscal activity, and to monitor network streams for
the ecosystem (products with AI components, AI frameworks security breaches.
and platforms, AI created for external companies). In all, we Respondents used a broad spectrum of ML approaches to
interviewed 14 software engineers, largely in senior leadership build their applications, from classification, clustering, dy-
roles. These are shown in Table I. The interviews were namic programming, and statistics, to user behavior modeling,
semi-structured and specialized to each informant’s role. For social networking analysis, and collaborative filtering. Some
example, when interviewing Informant I3, we asked questions areas of the company specialized further, for instance, Search
related to his work overseeing teams building the product’s worked heavily with ranking and relevance algorithms along
architectural components. with query understanding. Many divisions in the company
work on natural language processing, developing tools for
B. Survey entity recognition, sentiment analysis, intent prediction, sum-
Based on the results of the interviews, we designed an marization, machine translation, ontology construction, text
open-ended questionnaire whose focus was on existing work similarity, and connecting answers to questions. Finance and
practice, challenges in that work practice, and best practices Sales have been keen to build risk prediction models and do
(Figure 2). We asked about challenges both directly and forecasting. Internal resourcing organizations make use of de-
indirectly by asking informants to imagine “dream tools” and cision optimization algorithms such as resource optimization,
improvements that would make their work practice better. We planning, pricing, bidding, and process optimization.
sent the questionnaire to 4195 members of internal mailing The takeaway for us was that integration of machine learn-
lists on the topics of AI and ML. 551 software engineers ing components is happening all over the company, not just
on teams historically known for it. Thus, we could tell that we of these environments is to help engineers discover, gather,
were not just hearing from one niche corner of the company, ingest, understand, and transform data, and then train, deploy,
but in fact, we received responses from a broad range of and maintain models. In addition, these teams customize the
perspectives spread throughout. environments to make them easier to use by engineers with
varying levels of experience. “Visual tools help beginning data
V. B EST P RACTICES WITH M ACHINE L EARNING IN scientists when getting started, but once they know the ropes
S OFTWARE E NGINEERING and branch out, such tools may get in their way and they may
In this section, we present our respondents’ viewpoints on need something else.”
some of the essential challenges associated with building large-
scale ML applications and platforms and how they address B. Data availability, collection, cleaning, and management
them in their products. We categorized the challenges by Since many machine learning techniques are centered
card sorting interview and survey free response questions, and around learning from large datasets, the success of ML-centric
then used our own judgment as software engineering and AI projects often heavily depends on data availability, quality
researchers to highlight those that are essential to the practice and management [28]. Labeling datasets is costly and time-
of AI on software teams. consuming, so it is important to make them available for
use within the company (subject to compliance constraints).
A. End-to-end pipeline support
Our respondents confirm that it is important to “reuse the
As machine learning components have become more mature data as much as possible to reduce duplicated effort.” In
and integrated into larger software systems, our participants addition to availability, our respondents focus most heavily
recognized the importance of integrating ML development on supporting the following data attributes: “accessibility,
support into the traditional software development infrastruc- accuracy, authoritativeness, freshness, latency, structuredness,
ture. They noted that having a seamless development experi- ontological typing, connectedness, and semantic joinability.”
ence covering (possibly) all the different stages described in Automation is a vital cross-cutting concern, enabling teams
Figure 1 was important to automation. However, achieving this to more efficiently aggregate data, extract features, synthesize
level of integration can be challenging because of the differ- labelled examples. The increased efficiency enables teams to
ent characteristics of ML modules compared with traditional “speed up experimentation and work with live data while they
software components. For example, previous work in this experiment with new models.”
field [18], [19] found that variation in the inherent uncertainty We found that Microsoft teams have found it necessary to
(and error) of data-driven learning algorithms and complex blend data management tools with their ML frameworks to
component entanglement caused by hidden feedback loops avoid the fragmentation of data and model management activ-
could impose substantial changes (even in specific stages) ities. A fundamental aspect of data management for machine
which were previously well understood in software engineer- learning is the rapid evolution of data sources. Continuous
ing (e.g., specification, testing, debugging, to name a few). changes in data may originate either from (i) operations
Nevertheless, due to the experimental and even more iterative initiated by engineers themselves, or from (ii) incoming fresh
nature of ML development, unifying and automating the day- data (e.g., sensor data, user interactions). Either case requires
to-day workflow of software engineers reduces overhead and rigorous data versioning and sharing techniques, for example,
facilitate progress in the field. “Each model is tagged with a provenance tag that explains with
Respondents report to leverage internal infrastructure in the which data it has been trained on and which version of the
company (e.g. AEther2 ) or they have built pipelines specialized model. Each dataset is tagged with information about where
to their own use cases. It is important to develop a “rock solid, it originated from and which version of the code was used to
data pipeline, capable of continuously loading and massaging extract it (and any related features).” This practice is used for
data, enabling engineers to try out many permutations of AI mapping datasets to deployed models or for facilitating data
algorithms with different hyper-parameters without hassle.” sharing and reusability.
The pipelines created by these teams are automated, supporting
training, deployment, and integration of models with the C. Education and Training
product they are a part of. In addition, some pipeline engineers The integration of machine learning continues to become
indicated that “rich dashboards” showing the value provided more ubiquitous in customer-facing products, for example,
to users are useful. machine learning components are now widely used in produc-
Several respondents develop openly available IDEs to en- tivity software (e.g., email, word processing) and embedded
able Microsoft’s customers to build and deploy their models devices (i.e., edge computing). Thus, engineers with traditional
(e.g. Azure ML for Visual Studio Code3 and Azure ML software engineering backgrounds need to learn how to work
Studio4 ). According to two of our interviewees, the goal alongside of the ML specialists. A variety of players within
2 https://www.slideshare.net/MSTechCommunity/
Microsoft have found it incredibly valuable to scaffold their
ai-microsoft-how-we-do-it-and-how-you-can-too
engineers’ education in a number of ways. First, the company
3 https://marketplace.visualstudio.com/items?itemName=ms-toolsai.vscode-ai hosts a twice-yearly internal conference on machine learning
4 https://azure.microsoft.com/en-us/services/machine-learning-studio/ and data science, with at least one day devoted to introductions
to the basics of technologies, algorithms, and best practices. F. Compliance
In addition, employees give talks about internal tools and
Microsoft issued a set of principles around uses of AI in
the engineering details behind novel projects and product
the open world. These include fairness, accountability, trans-
features, and researchers present cutting-edge advances they
parency, and ethics. All teams at Microsoft have been asked to
have seen and contributed to academic conferences. Second,
align their engineering practices and the behaviors of fielded
a number of Microsoft teams host weekly open forums on
software and services in accordance with these principles.
machine learning and deep learning, enabling practitioners to
Respect for them is a high priority in software engineering
get together and learn more about AI. Finally, mailing lists and
and AI and ML processes and practices. A discussion of these
online forums with thousands of participants enable anyone
concerns is beyond the scope of this paper. To learn more
to ask and answer technical and pragmatic questions about
about Microsoft’s commitments to this important topic, please
AI and machine learning, as well as frequently share recent
read about its approach to AI.5
results from academic conferences.
G. Varied Perceptions
D. Model Debugging and Interpretability
We found that as a number of product teams at Microsoft in-
Debugging activities for components that learn from data
tegrated machine learning components into their applications,
not only focus on programming bugs, but also focus on
their ability to do so effectively was mediated by the amount of
inherent issues that arise from model errors and uncertainty.
prior experience with machine learning and data science. Some
Understanding when and how models fail to make accurate
teams fielded data scientists and researchers with decades of
predictions is an active research area [29], [30], [31], which is
experience, while others had to grow quickly, picking up their
attracting more attention as ML algorithms and optimization
own experience and more-experienced team members on the
techniques become more complex. Several survey respondents
way. Due to this heterogeneity, we expected that our survey
and the larger Explainable AI community [32], [33] propose
respondents’ perceptions of the challenges their teams’ faced
to use more interpretable models, or to develop visualization
in practicing machine learning would vary accordingly.
techniques that make black-box models more interpretable. For
We grouped the respondents into three buckets (low,
larger, multi-model systems, respondents apply modularization
medium, and high), evenly divided by the number of years
in a conventional, layered, and tiered software architecture to
of experience respondents personally had with AI. First, we
simplify error analysis and debuggability.
ranked each of the card sorted categories of respondents’
challenges divided by the AI experience buckets. This list is
E. Model Evolution, Evaluation, and Deployment
presented in Table II, initially sorted by the respondents with
ML-centric software goes through frequent revisions initi- low experience with AI.
ated by model changes, parameter tuning, and data updates, Two things are worth noticing. First, across the board,
the combination of which has a significant impact on system Data Availability, Collection, Cleaning, and Management, is
performance. A number of teams have found it important ranked as the top challenge by many respondents, no matter
to employ rigorous and agile techniques to evaluate their their experience level. We find similarly consistent ranking for
experiments. They developed systematic processes by adopting issues around the categories of end-to-end pipeline support
combo-flighting techniques (i.e., flighting a combination of and collaboration and working culture. Second, some of the
changes and updates), including multiple metrics in their ex- challenges rise or fall in importance as the respondents’ ex-
periment score cards, and performing human-driven evaluation perience with AI differs. For example, education and training
for more sensitive data categories. One respondent’s team is far more important to those with low experience levels in
uses “score cards for the evaluation of flights and storing AI than those with more experience. In addition, respondents
flight information: How long has it been flighted, metrics for with low experience rank challenges with integrating AI into
the flight, etc.” Automating tests is as important in machine larger systems higher than those with medium or high expe-
learning as it is in software engineering; teams create carefully rience. This means that as individuals (and their teams) gain
put-together test sets that capture what their models should do. experience building applications and platforms that integrate
However, it is important that a human remains in the loop. One ML, their increasing skills help shrink the importance of
respondent said, “we spot check and have a human look at the some of the challenges they perceive. Note, the converse also
errors to see why this particular category is not doing well, occurs. Challenges around tooling, scale, and model evolution,
and then hypothesize to figure out problem source.” evaluation, and deployment are more important for engineers
Fast-paced model iterations require more frequent deploy- with a lot of experience with AI. This is very likely because
ment. To ensure that system deployment goes smoothly, sev- these more experienced individuals are tasked with the more
eral engineers recommend not only to automate the training essentially difficult engineering tasks on their team; those with
and deployment pipeline, but also to integrate model building low experience are probably tasked to easier problems until
with the rest of the software, use common versioning reposi- they build up their experience.
tories for both ML and non-ML codebases, and tightly couple
the ML and non-ML development sprints and standups. 5 https://www.microsoft.com/en-us/ai/our-approach-to-ai
TABLE II
T HE TOP - RANKED CHALLENGES AND PERSONAL EXPERIENCE WITH AI. R ESPONDENTS WERE GROUPED INTO THREE BUCKETS ( LOW, MEDIUM , HIGH )
BASED ON THE 33 RD AND 67 TH PERCENTILE OF THE NUMBER OF YEARS OF AI EXPERIENCE THEY PERSONALLY HAD (N=308). T HE COLUMN Frequency
SHOWS THE INCREASE / DECREASE OF THE FREQUENCY IN THE MEDIUM AND HIGH BUCKETS COMPARED TO THE LOW BUCKETS . T HE COLUMN Rank
SHOWS THE RANKING OF THE CHALLENGES WITHIN EACH EXPERIENCE BUCKET, WITH 1 BEING THE MOST FREQUENT CHALLENGE .
Frequency Rank
Medium High Experience
Challenge vs. Low vs. Low Trend Low Medium High
Data Availability, Collection, Cleaning, and Management -2% 60% 1 1 1
Education and Training -69% -78% 1 5 9
Hardware Resources -32% 13% 3 8 6
End-to-end pipeline support 65% 41% 4 2 4
Collaboration and working culture 19% 69% 5 6 6
Specification 2% 50% 5 8 8
Integrating AI into larger systems -49% -62% 5 16 13
Education: Guidance and Mentoring -83% -81% 5 21 18
AI Tools 144% 193% 9 3 2
Scale 154% 210% 10 4 3
Model Evolution, Evaluation, and Deployment 137% 276% 15 6 4
We also compared the overall frequency of each kind of to which they rely on others’ and their own tools goes up,
challenge using the same three buckets of AI experience. making them think about their impact more often.
Looking again at the top ranked challenge, Data Availability, • End-to-end pipeline support was positively correlated with
Collection, Cleaning, and Management, we notice that it formal education (p < 0.01), implying that only those with
was reported by low and medium experienced respondents at formal education were working on building such a pipeline.
similar rates, but represented a lot more of the responses (60%) • Specifications were also positively correlated with formal
given by those with high experience. This also happened for education (p < 0.03), implying that those with formal
challenges related to Specifications. However, when looking education are the ones who write down the specifications
at Education and Training, Integrating AI into larger systems, for their models and engineering systems.
and Education: Guidance and Mentoring, their frequency drops The lesson we learn from these analyses is that the kinds
significantly from the rate reported by the low experience of issues that engineers perceive as important change as
bucket than reported by the medium and high buckets. We they grow in their experience with AI. Some concerns are
interpret this to mean that these challenges were less important transitory, related to one’s position within the team and the
to the medium and high experience respondents than to those accidental complexity of working together. Several others are
with low experience levels. Thus, this table gives a big picture more fundamental to the practice of integrating machine learn-
of both which problems are perceived as most important within ing into software applications, affecting many engineers, no
each experience bucket, and which problems are perceived as matter their experience levels. Since machine learning-based
most important across the buckets. applications are expected to continue to grow in popularity,
Finally, we conducted a logistic regression analysis to build we call for further research to address these important issues.
a model that could explain the differences in frequency when
controlling for personal AI experience, team AI experience,
VI. T OWARDS A M ODEL OF ML P ROCESS M ATURITY
overall work experience, the number of concurrent AI projects,
and whether or not a respondent had formal education in As we saw in Section V-G, we see some variance in the
machine learning or data science techniques. We found five experience levels of AI in software teams. That variation
significant coefficients: affects their perception of the engineering challenges to be
• Education and Training was negatively correlated with per- addressed in their day-to-day practices. As software teams
sonal AI experience with a coefficient of -0.18 (p < 0.02), mature and gel, they can become more effective and efficient
meaning that people with less AI experience found this to in delivering machine learning-based products and platforms.
be a more important issue. To capture the maturity of ML more precisely than us-
• Educating Others was positively correlated with personal AI ing a simple years-of-experience number, we created a ma-
experience with a coefficient of 0.26 (p < 0.01), meaning turity model with six dimensions evaluating whether each
that people with greater AI experience found this to be a workflow stage: (1) has defined goals, (2) is consistently
more important issue. implemented, (3) documented, (4) automated, (5) measured
• Tool issues are positively correlated with team AI experience and tracked, and (6) continuously improved. The factors are
with a coefficient of 0.13 (p < 0.001), meaning that as the loosely based on the concepts behind the Capability Maturity
team gains experience working on AI projects, the degree Model (CMM) [26] and Six Sigma [27], which are widely
used in software development to assess and improve maturity
of software projects.
In the survey, we asked respondents to report the maturity
for the two workflow stages that each participant spent the
most time on (measured by number of hours they reported
spending on each activity). Specifically, we asked participants
to rate their agreement with the following statements S1 ..S6
(bold text was in the original survey) using a Likert response
format from Strongly Disagree (1) to Strongly Agree (5):
S1: My team has goals defined for what to accomplish with
this activity.
S2: My team does this activity in a consistent manner.
S3: My team has largely documented the practices related
to this activity.
S4: My team does this activity mostly in an automated way. Fig. 3. The average overall effectiveness (OE) of a team’s ML practices
S5: My team measures and tracks how effective we are at divided by application domain (anonymized). The y-axis labels have been
completing this activity. elided for confidentiality. An ANOVA and Scott Knott test identified two
distinct groups to the OE metric, labeled in black (A–F) and red (G–I).
S6: My team continuously improves our practices related to
this activity.
We gathered this data for the stages that respondents were meaning that some respondents feel their teams are at different
most familiar with because we found that they often specialize levels of maturity than others. Second, an ANOVA and Scott
in various stages of the workflow. This question was intended Knott test show significant differences in the reported values,
to be lightweight so that respondents could answer easily, demonstrating the potential value of this metric to identify the
while at the same time accounting for the wide variety of various ML process maturity levels.
ML techniques applied. Rather than being prescriptive (i.e., do We recognize that these metrics represent a first attempt at
this to get to the next maturity level), our intention was to be quantifying a process metric to enable teams to assess how
descriptive (e.g., how much automation is there in a particular well they practice ML. In future work, we will refine our
workflow stage? how well is a workflow stage documented?). instrument and further validate its utility.
More work is needed to define maturity levels similar to CMM.
VII. D ISCUSSION
To analyze the responses, we defined an Activity Maturity
Index (AMI) to combine the individual scores into a single In this section, we synthesize our findings into three ob-
measure. This index is the average of the agreement with servations of some fundamental differences in the way that
the six maturity statements S1 ..S6 . As a means of validating software engineering has been adapted to support past popular
the Maturity Index, we asked participants to rate the Activity application domains and how it can be adapted to support
Effectiveness (AE) by answering “How effective do you think artificial intelligence applications and platforms. There may
your team’s practices around this activity are on a scale from be more differences, but from our data and discussions with
1 (poor) to 5 (excellent)?”. The Spearman correlation between ML experts around Microsoft, these three rose to prominence.
the Maturity Index and the Effectiveness was between 0.4982
and 0.7627 (all statistically significant at p < 0.001) for A. Data discovery and management
all AI activities. This suggests that the Maturity Index is a Just as software engineering is primarily about the code
valid composite measure that can capture the maturity and that forms shipping software, ML is all about the data that
effectiveness of AI activities. powers learning models. Software engineers prefer to design
In addition to the Activity Maturity Index and Activity and build systems which are elegant, abstract, modular, and
Effectiveness, we collected an Overall Effectiveness (OE) simple. By contrast, the data used in machine learning are vo-
score by asking respondents the question “How effectively luminous, context-specific, heterogeneous, and often complex
does your team work with AI on a scale from 1 (poor) to 5 to describe. These differences result in difficult problems when
(excellent)” Having the AMI, AE, and OE measures allowed ML models are integrated into software systems at scale.
us to compare the maturity and effectiveness of different Engineers have to find, collect, curate, clean, and process
organizations, disciplines, and application domains within Mi- data for use in model training and tuning. All the data has
crosoft, and identify areas for improvement. We plot one of to be stored, tracked, and versioned. While software APIs
these comparisons in Figure 3 and show the average overall are described by specifications, datasets rarely have explicit
effectiveness scores divided by nine of the most represented schema definitions to describe the columns and characterize
AI application domains in our survey. There are two things their statistical distributions. However, due to the rapid itera-
to notice. First, the spread of the y-values indicates that tion involved in ML, the data schema (and the data) change
the OE metric can numerically distinguish between teams, frequently, even many times per day. When data is ingested
from large-scale diagnostic data feeds, if ML engineers want architecture. Thus, separate modules are often assigned to
to change which data values are collected, they must wait separate teams. Module interactions are controlled by APIs
for the engineering systems to be updated, deployed, and which do dual duty to enable software modules to remain
propagated before new data can arrive. Even “simple” changes apart, but also describe the interfaces to minimize the amount
can have significant impacts on the volume of data collected, of communication needed between separate teams to make
potentially impacting applications through altered performance their modules work together [36], [37].
characteristics or increased network bandwidth usage. Maintaining strict module boundaries between machine
While there are very well-designed technologies to version learned models is difficult for two reasons. First, models are
code, the same is not true for data. A given data set may not easily extensible. For example, one cannot (yet) take an
contain data from several different schema regimes. When a NLP model of English and add a separate NLP model for
single engineer gathers and processes this data, they can keep ordering pizza and expect them to work properly together.
track of these unwritten details, but when project sizes scale, Similarly, one cannot take that same model for pizza and pair
maintaining this tribal knowledge can become a burden. To it with an equivalent NLP model for French and have it work.
help codify this information into a machine-readable form, The models would have to be developed and trained together.
Gebru et al. propose to use data sheets inspired by elec- Second, models interact in non-obvious ways. In large-
tronics to more transparently and reliably track the metadata scale systems with more than a single model, each model’s
characteristics of these datasets [34]. To compare datasets results will affect one another’s training and tuning processes.
against each other, the Datadiff [35] tool enables developers to In fact, one model’s effectiveness will change as a result of
formulate viable transformation functions over data samples. the other model, even if their code is kept separated. Thus,
B. Customization and Reuse even if separate teams built each model, they would have to
collaborate closely in order to properly train or maintain the
While it is well-understood how much work it takes to cus- full system. This phenomenon (also referred to as component
tomize and reuse code components, customizing ML models entanglement) can lead to non-monotonic error propagation,
can require much more. In software, the primary units of reuse meaning that improvements in one part of the system might
are functions, algorithms, libraries, and modules. A software decrease the overall system quality because the rest of the
engineer can find the source code for a library (e.g. on Github), system is not tuned to the latest improvements. This issue is
fork it, and easily make changes to the code, using the same even more evident in cases when machine learning models are
skills they use to develop their own software. not updated in a compatible way and introduce new, previously
Although fully-trained ML models appear to be functions unseen mistakes that break the interaction with other parts of
that one can call for a given input, the reality is far more the system which rely on it.
complex. One part of a model is the algorithm that powers
the particular machine learning technique being used (e.g.,
SVM or neural nets). Another is the set of parameters that
VIII. L IMITATIONS
controls the function (e.g., the SVM support vectors or neural
net weights) and are learned during training. If an engineer
wants to apply the model on a similar domain as the data it was Our case study was conducted with teams at Microsoft, a
originally trained on, reusing it is straightforward. However, large, world-wide software company with a diverse portfolio
more signficant changes are needed when one needs to run of software products. It is also one of the largest purveyors of
the model on a different domain or use a slightly different machine learning-based products and platforms. Some findings
input format. One cannot simply change the parameters with are likely to be specific to the Microsoft teams and team
a text editor. In fact, the model may require retraining, or members who participated in our interviews and surveys. Nev-
worse, may need to be replaced with another model. Both ertheless, given the high variety of projects represented by our
require the software developer to have machine learning skills, informants, we expect that many of the lessons we present in
which they may never have learned. Beyond that, retraining this paper will apply to other companies. Some of our findings
or rebuilding the model requires additional training data to be depend on the particular ML workflow used by some software
discovered, collected, and cleaned, which can take as much teams at Microsoft. The reader should be able to identify how
work and expertise as the original model’s authors put in. our model abstractions fit into the particulars of the models
they use. Finally, interviews and surveys rely on self-selected
C. ML Modularity informants and self-reported data. Wherever appropriate, we
Another key attribute of engineering large-scale software stated that findings were our informants’ perceptions and
systems is modularity. Modules are separated and isolated opinions. This is especially true with this implementation
to ensure that developing one component does not interfere of our ML process maturity model, which triangulated its
with the behavior of others under development. In addition, measures against other equally subjective measures with no
software modularity is strengthened by Conway’s Law, which objective baseline. Future implementations of the maturity
makes the observation that the teams that build each com- model should endeavor to gather objective measures of team
ponent of the software organize themselves similarly to its process performance and evolution.
IX. C ONCLUSION [15] J. E. Hannay, C. MacLeod, J. Singer, H. P. Langtangen, D. Pfahl, and
G. Wilson, “How do scientists develop and use scientific software?”
Many teams at Microsoft have put significant effort into in Proc. of the 2009 ICSE workshop on Software Engineering for
developing an extensive portfolio of AI applications and plat- Computational Science and Engineering. IEEE Computer Society,
2009, pp. 1–8.
forms by integrating machine learning into existing software [16] G. De Michell and R. K. Gupta, “Hardware/software co-design,” Proc.
engineering processes and cultivating and growing ML talent. of the IEEE, vol. 85, no. 3, pp. 349–365, 1997.
In this paper, we described the results of a study to learn [17] D. Bohus, S. Andrist, and M. Jalobeanu, “Rapid development of
multimodal interactive systems: a demonstration of platform for situated
more about the process and practice changes undertaken by a intelligence,” in Proc. of the 19th ACM International Conference on
number of Microsoft teams in recent years. From these find- Multimodal Interaction. ACM, 2017, pp. 493–494.
ings, we synthesized a set of best practices to address issues [18] D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner,
V. Chaudhary, M. Young, J.-F. Crespo, and D. Dennison, “Hidden
fundamental to the large-scale development and deployment of technical debt in machine learning systems,” in NIPS, 2015.
ML-based applications. Some reported issues were correlated [19] B. Nushi, E. Kamar, E. Horvitz, and D. Kossmann, “On human intellect
with the respondents’ experience with AI, while others were and machine failures: Troubleshooting integrative machine learning
systems.” in AAAI, 2017, pp. 1017–1025.
applicable to most respondents building AI applications. We [20] S. Andrist, D. Bohus, E. Kamar, and E. Horvitz, “What went wrong
presented a ML process maturity metric to help teams self- and why? diagnosing situated interaction failures in the wild,” in ICSR.
assess how well they work with machine learning and offer Springer, 2017, pp. 293–303.
[21] R. Salay, R. Queiroz, and K. Czarnecki, “An analysis of ISO 26262:
guidance towards improvements. Finally, we identified three Using machine learning safely in automotive software,” arXiv preprint
aspects of the AI domain that make it fundamentally different arXiv:1709.02435, 2017.
than prior application domains. Their impact will require [22] D. Baylor, E. Breck, H.-T. Cheng, N. Fiedel, C. Y. Foo, Z. Haque,
S. Haykal, M. Ispir, V. Jain, L. Koc et al., “TFX: A tensorflow-based
significant research efforts to address in the future. production-scale machine learning platform,” in Proc. of the 23rd ACM
SIGKDD. ACM, 2017, pp. 1387–1395.
R EFERENCES [23] V. Sridhar, S. Subramanian, D. Arteaga, S. Sundararaman, D. Roselli,
and N. Talagala, “Model governance: Reducing the anarchy of produc-
[1] M. Salvaris, D. Dean, and W. H. Tok, “Microsoft AI Platform,” in Deep tion ML,” in USENIX. USENIX Association, 2018, pp. 351–358.
Learning with Azure. Springer, 2018, pp. 79–98. [24] T. Wuest, D. Weimer, C. Irgens, and K.-D. Thoben, “Machine learning
[2] A. Begel and N. Nagappan, “Usage and perceptions of agile software in manufacturing: advantages, challenges, and applications,” Production
development in an industrial context: An exploratory study,” in First & Manufacturing Research, vol. 4, no. 1, pp. 23–45, 2016.
International Symposium on Empirical Software Engineering and Mea- [25] J. Sillito and A. Begel, “App-directed learning: An exploratory study,”
surement (ESEM 2007), Sept 2007, pp. 255–264. in 6th International Workshop on Cooperative and Human Aspects of
[3] A. Begel and N. Nagappan, “Pair programming: What’s in it for me?” in Software Engineering (CHASE), May 2013, pp. 81–84.
Proc. of the Second ACM-IEEE International Symposium on Empirical [26] C. Weber, B. Curtis, and M. B. Chrissis, The capability maturity model,
Software Engineering and Measurement, 2008, pp. 120–128. guidelines for improving the software process. Harlow: Addison Wesley,
[4] B. Murphy, C. Bird, T. Zimmermann, L. Williams, N. Nagappan, and 1994.
A. Begel, “Have agile techniques been the silver bullet for software [27] M. Alexander, Six Sigma: The breakthrough management strategy rev-
development at microsoft?” in 2013 ACM/IEEE Intl. Symp. on Empirical olutionizing the world’s top corporations. Taylor & Francis, 2001.
Software Engineering and Measurement, Oct 2013, pp. 75–84. [28] N. Polyzotis, S. Roy, S. E. Whang, and M. Zinkevich, “Data manage-
[5] M. Senapathi, J. Buchan, and H. Osman, “DevOps capabilities, practices, ment challenges in production machine learning,” in Proc. of the 2017
and challenges: Insights from a case study,” in Proc. of the 22nd ACM SIGMOD, 2017, pp. 1723–1726.
International Conference on Evaluation and Assessment in Software [29] T. Kulesza, M. Burnett, W.-K. Wong, and S. Stumpf, “Principles of
Engineering 2018, 2018, pp. 57–67. explanatory debugging to personalize interactive machine learning,” in
[6] T. D. LaToza, G. Venolia, and R. DeLine, “Maintaining mental models: Proc. of the 20th International Conference on Intelligent User Interfaces,
A study of developer work habits,” in Proc. of the 28th International 2015, pp. 126–137.
Conference on Software Engineering, 2006, pp. 492–501. [30] S. Amershi, M. Chickering, S. M. Drucker, B. Lee, P. Simard, and J. Suh,
[7] M. Kim, T. Zimmermann, R. DeLine, and A. Begel, “The emerging role “Modeltracker: Redesigning performance analysis tools for machine
of data scientists on software development teams,” in Proc. of the 38th learning,” in Proc. of the 33rd Annual ACM Conference on Human
International Conference on Software Engineering, 2016, pp. 96–107. Factors in Computing Systems, 2015, pp. 337–346.
[8] M. Kim, T. Zimmermann, R. DeLine, and A. Begel, “Data scientists in [31] B. Nushi, E. Kamar, and E. Horvitz, “Towards accountable AI: Hybrid
software teams: State of the art and challenges,” IEEE Transactions on human-machine analyses for characterizing system failure,” in HCOMP,
Software Engineering, vol. 44, no. 11, pp. 1024–1038, 2018. 2018, pp. 126–135.
[9] C. Hill, R. Bellamy, T. Erickson, and M. Burnett, “Trials and tribulations [32] D. Gunning, “Explainable artificial intelligence (XAI),” Defense Ad-
of developers of intelligent systems: A field study,” in Visual Languages vanced Research Projects Agency (DARPA), 2017.
and Human-Centric Computing (VL/HCC), 2016 IEEE Symposium on. [33] D. S. Weld and G. Bansal, “Intelligible artificial intelligence,” arXiv
IEEE, 2016, pp. 162–170. preprint arXiv:1803.04263, 2018.
[10] “Machine learning workflow,” https://cloud.google.com/ml- [34] T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. M. Wallach,
engine/docs/tensorflow/ml-solutions-overview, accessed: 2018-09-24. H. D. III, and K. Crawford, “Datasheets for datasets,” CoRR, vol.
[11] K. Patel, J. Fogarty, J. A. Landay, and B. Harrison, “Investigating abs/1803.09010, 2018.
statistical machine learning as a tool for software development,” in Proc. [35] C. Sutton, T. Hobson, J. Geddes, and R. Caruana, “Data diff: Inter-
of the SIGCHI Conference on Human Factors in Computing Systems. pretable, executable summaries of changes in distributions for data
ACM, 2008, pp. 667–676. wrangling,” in Proc. of the 24th ACM SIGKDD. ACM, 2018, pp.
[12] “The Team Data Science Process,” https://docs.microsoft.com/en- 2279–2288.
us/azure/machine-learning/team-data-science-process/, accessed: 2018- [36] C. R. B. de Souza, D. Redmiles, and P. Dourish, ““Breaking the Code”,
09-24. moving between private and public work in collaborative software
[13] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, “The KDD process for development,” in Proc. of the 2003 International ACM SIGGROUP
extracting useful knowledge from volumes of data,” Communications of Conference on Supporting Group Work, 2003, pp. 105–114.
the ACM, vol. 39, no. 11, pp. 27–34, 1996. [37] C. R. B. de Souza, D. Redmiles, L.-T. Cheng, D. Millen, and J. Patterson,
[14] R. Wirth and J. Hipp, “CRISP-DM: Towards a standard process model “Sometimes you need to see through walls: A field study of application
for data mining,” in Proc. 4th Intl. Conference on Practical Applications programming interfaces,” in Proc. of the 2004 ACM Conference on
of Knowledge Discovery and Data mining, 2000, pp. 29–39. Computer Supported Cooperative Work, 2004, pp. 63–71.
2018 IEEE International Conference on Software Testing, Verification and Validation Workshops
A Survey of Software Quality for Machine

Learning Applications
Satoshi Masuda, Kohichi Ono, Toshiaki Yasue, Nobuhiro Hosokawa
IBM Research - Tokyo
Chuou-ku, Tokyo, Japan 103-8510
Email: {smasuda, onono, yasue, carvin}@jp.ibm.com
Abstract—Machine learning (ML) is now widespread. Tradi- A. Survey objective

tional software engineering can be applied to the development
ML applications. However, we have to consider specific problems
The main objectives of the survey is to discover techniques
with ML applications in therms of their quality. In this paper, to evaluate and improve the software quality of ML applica-
we present a survey of software quality for ML applications tions, and discuss future software testing research area. The
to consider the quality of ML applications as an emerging sub objective of the survey is to find solutions to software
discussion. From this survey, we raised problems with ML testing problems for ML applications. Figure 1 shows an
applications and discovered software engineering approaches and
software testing research areas to solve these problems. We
example of an ML application and problems in software
classified survey targets into Academic Conferences, Magazines, quality for that ML applications from industry experiences.
and Communities. We targeted 16 academic conferences on We focus the following problems:
artificial intelligence and software engineering, including 78 • How to verify the answer that MLaaS return to unknown
papers. We targeted 5 Magazines, including 22 papers. The
results indicated key areas, such as deep learning, fault localization,
data.
and prediction, to be researched with software engineering and • How to verify insufficient or biased data because prob-
testing. lems because data determine the logic of ML applica-
Index Terms—Machine Learning, Software Quality, Software tions.
Engineering and Testing • How to verify the quality of end-to-end systems.
• How to notify users confidence of answer correctness
I. I NTRODUCTION from the system.
Machine learning (ML) applications, such as face recog- These problems have not been solved by traditional software
nition, question answering, and sales analysis, are now engineering or testing, and are candidates of research area for
widespread. Many ML services are available from APIs on research in software testing.
the cloud and are called ML-as-a-services (MLaaS). ML B. Survey targets
applications often consist of such MLaaS. When we develop
ML applications based on requirements, traditional software The 16 targeted academic conferences on artificial intelli-
engineering can be applied. However, we have to consider gence and software engineering are as follows:
specific problems with ML applications in terms of their • AAAI: Association for the Advancement of Artificial
quality. The problems are that training data determines the Intelligence
logic of ML applications, and the results of ML application • ASE: IEEE/ACM International Conference on Automated
to unknown data cannot be verified in terms of correctness. Software Engineering

Hence, new software engineering approaches are required to • CVPR: IEEE Conference on Computer Vision and Pattern
solve the problems. In this paper, we present a survey of Recognition

software quality for ML applications to consider the quality
of ML applications. From the survey, we raised problems Public cloud Private cloud Enterprise network
with ML applications and discovered software engineering How to verify quality of end to
approaches and software testing research areas to solve these How to notify
confidence of
Data Data end system
problems. Userss
answer?
MLaaS
M MLaaS insufficient Enterprise
E t i
or biased Application
We classified survey targets into Academic Conferences, How to verify data
ML answer to
Magazines, and Communities. We targeted 16 academic con- unknown data
ferences on artificial intelligence and software engineering, Mobile Cloud Transaction Enterprise
Application Messaging Connectivity Process
including 78 papers. We targeted 5 Magazines, including 22
papers.
: Actors of application : Messaging : Software testing research question
II. P ROCEDURE OF SURVEY
We present the procedure of our survey in this section. Fig. 1. ML Application and software testing research questions
0-7695-6432-1/18/$31.00 ©2018 IEEE 279

DOI 10.1109/ICSTW.2018.00061
• FAS*W: Foundations and Applications of Self* Systems, conferences. For instance, about 700 papers were accepted
IEEE International Workshops on at AAAI2017. This small number indicates that the topics
• ICML: International Conference on Machine Learning of quality, verification, engineering, and design are not yet
• ICST: IEEE International Conference on Software Test- major topics. Table II lists tags and the number of tags for
ing, Verification and Validation each evaluations sorted by the number of evaluations. Deep
• ICSTW: Workshop on IEEE International Conference on learning was the top-ranked tag. Deep learning has been the
Software Testing, Verification and Validation most important technique in ML, so there were many articles
• IEEE Transaction: IEEE Transactions on Software Engi- related to the search words quality, verification, engineering
neering and design. The second-ranked tag was Fault localization,
• IJCAI: International Joint Conference on Artificial Intel- and the third was Prediction. Fault localization is a software
ligence engineering term, so it indicates that software engineering may
• ISSTA: ACM SIGSOFT International Symposium on shift to ML.
Software Testing and Analysis
• ITAW: Information Theory and Applications Workshop
B. Software Quality methods and techniques for ML applica-
• KDD: ACM SIGKDD Conference on Knowledge Dis-
tions
covery and Data Mining We summarize the software-quality methods and techniques
• MELECON: Mediterranean Electrotechnical Conference for ML applications from the top 7 ranked tags in the survey
• NIPS: Annual Conference on Neural Information Pro- results.
cessing Systems 1) Deep Learning: Pei et al. discussed how to find a
• PUC: Personal and Ubiquitous Computing ”corner case” (case rarely occurs, a troublesome case) in a
• SOSP: ACM Symposium on Operating Systems Princi- deep learning system and a test method using it [80]. Hsu
ples mentioned the method in the article ”A New Way to Find Bugs
The five targeted magazines are as follows: in Self-Driving AI Could Save Lives” [96]. Bau et al. proposed
• AI Magazine
a technique for interpreting what internal firing of a deep
• IEEE Spectrum
learning model corresponds to using semantic segmentation
• MIT news - Artificial intelligence
[28]. Schulam et al. proposed a method of generating counter-
• Analytics Magazine
factual predictions [73]. The method targeted a circumstances
• OpenAI
that if prediction can be executed by using experience-based
data the predictions have good results, but experience-based
The two targeted communities are the developer communi-
data cannot be always used due to ethical problems. Zhang
ties of software engineers.
et al., Sangkloy et al. proposed verification method of visual
C. Survey Steps pattern in deep learning [14], [19]. Kokkinos et al. discussed an
We executed the survey in the following steps: innovative method of deep learning. They proposed a method
that train in an end-to-end manner a convolutional neural
1) Search articles by retrieving key words: quality, verifi-
network (CNN) that jointly handles low-, mid-, and high-level
cation, engineering, design and other words related to
vision tasks in a unified architecture [22].
software engineering.
2) Fault Localization: Wong et al. surveyed software fault
2) Read articles.
localization [44]. Software fault localization, the act of iden-
3) Tag one key word to articles to classify them. The tags
tifying the locations of faults in a program, is widely rec-
are based on the Machine Learning Dictionary [1] and
ognized as one of the most tedious, time consuming, and
Artificial Intelligence Vocabulary [2]. If we could not
expensive activities in program debugging. Fault localization
find an appropriate word in the vocabulary, we created
is challenging for ML applications [59]. Le et al. proposed
a tag from the contents of the article. Some academic
a new fault localization approach that employs a learning-to-
papers in the articles have key words in their contents;
rank strategy. The approach includes ranking methods based
however, the key words indicate so extensive knowledge
on their likelihood of being a root cause of a failure. Sun et al.
that we could not classify the articles.
investigated several coverage-based statistical fault localization
4) Make comments and evaluate. We evaluated the articles
metrics and reported on an empirical study they conducted to
as representative papers R or subsidiary papers S.
assess the relative importance of those metrics elements [37].
III. S URVEY R ESULTS 3) Prediction: Prediction and inference are important func-
We present the survey results in this section and summarize tions of applications with ML applications. As software sys-
the methods and techniques that described in the articles. tems increase in complexity and operate with less human
supervision, it becomes more difficult to use traditional tech-
A. Results niques to detect when software is not behaving as intended
Table I lists the conferences and the number of articles. [62]. For instance, predicting human behavior in front of
The total number of articles was 101 from the search results. software systems helps better design software. Katz et al. pro-
This number was too small with respect to articles in each posed techniques to model the behavior of executing programs
280
TABLE I TABLE II
N UMBER OF ARTICLES N UMBER OF TAGS
Category Name Number References Tags R 1 S 2 Sub Total

of Deep learning 3 6 9
articles Fault localization 3 3
Academic AAAI2016 3 [3], [4], [5] Prediction 3 3
conferences AAAI2017 4 [6], [7], [8], [9] MLaaS 2 3 5
ASE2015 1 [10] Multi agent 2 2 4
ASE2016 3 [11], [12], [13] Search-Based 2 2 4
CVPR017 17 [14], [15], [16], [17], [18], Model checking 2 1 3
[19], [20], [21], [22], [23], AI business 1 7 8
[24], [25], [26], [27], [28], Genetic algorithm 1 1 2
[29], [30] Robot 1 1 2
FAS*W2016 1 [31] Technical debt 1 1
ICML2015 1 [32] Democratizing AI 1 1
ICML2017 1 [33] AI ethics 1 1
ICST2015 3 [34], [35], [36] Continuous integration 1 1
ICST2016 3 [37], [38], [39] Adversarial 1 1
ICST2017 3 [40], [41], [42] AI for complex 1 1
ICSTW2011 1 [43] Lifelong Learning Machines 1 1
IEEE Trans- 1 [44] Graph quality 1 1
actions Testing 7 7
IJCAI2016 4 [45], [46], [47], [48] Neural network 7 7
IJCAI2017 6 [49], [50], [51], [52], [53], Safety 7 7
[54] Software development 6 6
ISSTA2013 3 [55], [56], [57] Software engineering 4 4
ISSTA2016 2 [58], [59] AI overview 3 3
ISSTA2017 4 [60], [61], [62], [63] Testing automation 2 2
ITAW2017 1 [64] Symbolic execution 1 1
KDD2016 2 [65], [66] Static analysis 1 1
KDD2017 2 [67], [68] Timed failure 1 1
MELECON16 1 [69] Auto encoder 1 1
NIPS2015 1 [70] Pairwise 1 1
NIPS2017 8 [71], [72], [73], [74], [75], Algorithm design 1 1
[76], [77], [78] Security 1 1
PUC2016 1 [79] Training data 1 1
SOSP2017 1 [80] Speaker verification 1 1
Community Commercial 1 [81] Linear dynamic logic 1 1
Developer 1 [82] 3d model 1 1
Magazines AI Magazine 9 [83], [84], [85], [86], [87], Mechanism design 1 1
[88], [89], [90], [91] Label data 1 1
Business In- 1 [92] Metrics 1 1
sider Total 28 73 101
IEEE Spec- 5 [93], [94], [95], [96], [97]
tram
1. R: Representative papers, 2. S: Subsidiary papers
Open AI 6 [98], [99], [100], [101],
[102], [103]
Total 101
application, Bosse et al. developed distributed ML with self-
organizing mobile agents for earthquake monitoring [31].
using low-level signals collected during executions [62]. The 5) Multi Agent: Multi agent is a key approach to adapt soft-
models provide a basis for predicting whether an execution ware engineering into ML applications such as autonomous
of a program or program unit under test represents intended vehicles and robots. Verifying a data-aware multi-agent system
behavior. Hotzkow et al. presented automatically inferring (DAMAS) is challenging because of the infinite state models
user expectations from the semantic contexts over multiple generated by their infinite-domain variables. Belardinelli et al.
applications [63]. Once the user expectations are established, proposed a method of parameterized DAMAS (P-DAMAS)
this knowledge can be used as an oracle as a system with an unbounded number of homogenous
4) MLaaS: Ribeiro et al. defined MLaaS as a ready-to-use agents, each assumed to be data-aware, i.e., endowed with
ML functions. They proposed an architecture to create a flexi- possibly infinite domains and interacting with an environment
ble and scalable MLaaS [32]. VanDerHerten et al. developed a composed of partially shared data [53]. Wooldridge et al.
proof of concept for adaptive modeling and sampling method- presented formal models through which rational verification
ologies for Internet-of-Things applications with MLaaS [69]. can be studied, and surveyed the complexity of key decision
Assem et al. developed a methodology to build an application problems in multi-agent systems [4].
with train, compare, decide, and change approach with MLaaS 6) Search-Based: Search-Based Software Engineering is
[79]. This is more practical than theoretical; however, it seems the name given to a body of work in which search-based
to be the most helpful reference in the real world. For concrete optimization is applied to software engineering [34]. McMinn
281
TABLE III
C ORRESPONDENCE OF PROBLEMS AND TAGS
Tags
Problems Deep Fault Prediction MLaaS Multi Search- Model
learning localiza- agent Based check-
tion ing
How to verify the answer that MLaaS return to unknown data [80] [62] [40]
How to verify insufficient or biased data because problems be- [73] [53]
cause data determine the logic of ML applications.
How to verify the quality of end-to-end systems. [59] [79] [4] [42] [47]
How to notify users confidence of answer correctness from the [63],
system. [66]
R EFERENCES
et al. presented Search-Based Software Testing, which is the [1] B. Wilson, “The Machine Learning Dictionary,” 2012. [Online].
use of a meta-heuristic optimizing search technique, such as a Available: http://www.cse.unsw.edu.au/ billw/mldict.html
Genetic Algorithm, to automate or partially automate a testing [2] G. S. Novak Jr., “Artificial Intelligence Vocabulary,” 2005. [Online].
Available: https://www.cs.utexas.edu/users/novak/aivocab.html
task; for example, the automatic generation of test data [43]. [3] B. Zarrieß and J. Claßen, “Decidable Verification of Golog Programs
Gay2017 proposed assessed search-based generation of test over Non-Local Effect Actions,” pp. 1109–1115.
suites that detect real faults [42]. [4] M. Wooldridge, J. Gutierrez, P. Harrenstein, E. Marchioni, G. Perelli,
and A. Toumi, “Rational Verification: From Model Checking to Equi-
7) Model Checking: Model checking is a traditional tech- librium Checking,” Proceedings of the 30th Conference on Artificial
nique of software engineering. However, describing applica- Intelligence (AAAI 2016), pp. 4184–4190, 2016.
[5] B. Bittner, M. Bozzano, A. Cimatti, and G. Zampedri, “Automated
tions models with ML allows the application model checking Verification and Tightening of Failure Propagation Models,” Proceed-
to verify the applications. Alechina et al. proposed a method ings of the 30th Conference on Artificial Intelligence (AAAI 2016), pp.
for verifying existence of resource-bounded coalition uniform 907–913, 2016.
[6] M. Witbrock, “AI for Complex Situations: Beyond Uniform Problem
strategies to be able to automatically verify properties of such Solving,” 2017.
systems using model-checking [47]. Tappler et al. presented [7] V. Vanhoucke and G. B. Robotics, “‘ OK Google , fold my laundry
learning-based approach to detecting failures by model-based s ’ il te plaı̂t ’.”
[8] T. Sunahase, Y. Baba, and H. Kashima, “Pairwise HITS : Quality
testing [40]. Estimation from Pairwise Comparisons in Creator-Evaluator Crowd-
sourcing Process,” Proceedings of the 31th Conference on Artificial
IV. D ISCUSSION Intelligence (AAAI 2017), no. Kleinberg, pp. 977–983, 2017.
[9] J. Goldsmith and E. Burton, “Why teaching ethics to AI practitioners
Table III shows articles that have correspondence between is important,” 31st AAAI Conference on Artificial Intelligence, AAAI
2017, pp. 110–114, 2017.
problems and top seven tags. For instance, Pei et al. [80] [10] A. N. Lam, A. T. Nguyen, H. A. Nguyen, and T. N. Nguyen,
discussed a corner-case in a deep learning system and a white- “Combining deep learning with information retrieval to localize buggy
box testing method using the corner-case. The corner-case files for bug reports,” Proceedings - 2015 30th IEEE/ACM International
Conference on Automated Software Engineering, ASE 2015, pp. 476–
is an unknown data for ML applications; hence, their paper 481, 2016.
corresponds to the problem. We put corresponding papers to [11] M. White, M. Tufano, C. Vendome, and D. Poshyvanyk, “Deep
learning code fragments for code clone detection,” Proceedings of
each problems in the same manner. As summarized in Section the 31st IEEE/ACM International Conference on Automated Software
3, we found several techniques or methods of software quality Engineering - ASE 2016, pp. 87–98, 2016.
for ML applications. Some of these techniques corresponded to [12] X. Li, Y. Liang, H. Qian, Y.-Q. Hu, L. Bu, Y. Yu, X. Chen, and
X. Li, “Symbolic Execution of Complex Program Driven by Machine
problems that we raised. However, the problems have not been Learning Based Constraint Solving,” Ase, pp. 554–559, 2016.
completely solved. Each papers covered some of the problems. [13] N. Li, Y. Lei, H. R. Khan, J. Liu, and Y. Guo, “Applying combi-
For instance, Pei et al. targeted deep learning systems but natorial test data generation to big data applications,” Proceedings of
the 31st IEEE/ACM International Conference on Automated Software
did not target other ML models such as k-means, decision- Engineering - ASE 2016, pp. 637–647, 2016.
tree, and support vector machine. ML applications consist of [14] W. Zhang, X. Cao, R. Wang, Y. Guo, and Z. Chen, “Binarized Mode
Seeking for Scalable Visual Pattern Discovery,” Cvpr2017, pp. 3864–
various ML models; therefore, techniques of software quality 3872, 2017.
or testing are required for verifying of ML applications. [15] J. Wu, J. B. Tenenbaum, and P. Kohli, “Neural Scene De-rendering,”
Proceedings of the IEEE Conference on Computer Vision and Pattern
V. C ONCLUSION Recognition, pp. 699–707, 2017.
[16] W. Treible, P. Saponaro, S. Sorensen, A. Kolagunda, M. O. Neal,
ML applications such as face recognition, question answer- B. Phelan, K. Sherbondy, and C. Kambhamettu, “CATS : A Color
and Thermal Stereo Benchmark,” Cvpr, pp. 134–142, 2017.
ing, and sales analysis are now widespread. New software [17] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and
engineering approaches are required to solve the problems. R. Webb, “Learning from Simulated and Unsupervised Images through
Adversarial Training,” pp. 2242–2251, 2016.
We presented a survey of software quality for ML applications [18] K. Sasaki, S. Iizuka, E. Simo-serra, and H. Ishikawa, “Joint Gap
to consider the quality of ML applications. From our survey Detection and Inpainting of Line Drawings,” Cvpr, pp. 5768–5776,
determined problems with ML applications and discovered 2017.
[19] P. Sangkloy, J. Lu, C. Fang, F. Yu, and J. Hays, “Scribbler: Controlling
software engineering approaches and software testing research Deep Image Synthesis with Sketch and Color,” pp. 6836–6845, 2016.
areas to solve these problems.
282
[20] K. Nakamura, S. Yeung, A. Alahi, and L. Fei-Fei, “Jointly on Software Testing, Verification and Validation, ICST 2016, pp. 69–79,
Learning Energy Expenditures and Activities using Egocentric 2016.
Multimodal Signals,” Cvpr, pp. 6817–6826, 2017. [Online]. Available: [40] M. Tappler, B. K. Aichernig, and R. Bloem, “Model-Based Testing IoT
http://vision.stanford.edu/pdf/nakamura2017cvpr.pdf Communication via Active Automata Learning,” Proceedings - 10th
[21] H. Le, T.-j. Chin, and D. Suter, “An Exact Penalty Method for Locally IEEE International Conference on Software Testing, Verification and
Convergent Maximum Consensus,” Cvpr2017, pp. 1888–1896, 1888. Validation, ICST 2017, pp. 276–287, 2017.
[22] I. Kokkinos, “UberNet: Training a ‘Universal’ Convolutional Neural [41] D. Pradhan, S. Wang, S. Ali, T. Yue, and M. Liaaen, “CBGA-
Network for Low-, Mid-, and High-Level Vision using Diverse Datasets ES: A Cluster-Based Genetic Algorithm with Elitist Selection for
and Limited Memory,” 2016. Supporting Multi-Objective Test Optimization,” Proceedings - 10th
[23] W. Kehl, F. Tombari, S. Ilic, and N. Navab, “Real-Time 3D Model IEEE International Conference on Software Testing, Verification and
Tracking in Color and Depth on a Single CPU Core,” 2017 IEEE Validation, ICST 2017, pp. 367–378, 2017.
Conference on Computer Vision and Pattern Recognition (CVPR), pp. [42] G. Gay, “The Fitness Function for the Job: Search-Based Generation
465–473, 2017. of Test Suites That Detect Real Faults,” Proceedings - 10th IEEE Inter-
[24] D. He, X. Yang, C. Liang, Z. Zhou, D. Kifer, C. L. Giles, and national Conference on Software Testing, Verification and Validation,
A. Ororbia, “Multi-scale FCN with Instance Aware Segmentation for ICST 2017, pp. 345–355, 2017.
Arbitrary Oriented Word Spotting In The Wild,” Cvpr 2017, no. 1, [43] P. McMinn, “Search-Based Software Testing: Past, Present and
2017. Future,” 2011 IEEE Fourth International Conference on Software
[25] C. Fan, J. Lee, M. Xu, K. K. Singh, Y. J. Lee, D. J. Crandall, and Testing, Verification and Validation Workshops, pp. 153–163, 2011.
M. S. Ryoo, “Identifying First-person Camera Wearers in Third-person [Online]. Available: http://ieeexplore.ieee.org/document/5954405/
Videos,” no. 1, pp. 4734–4742, 2017. [44] W. E. Wong, R. Gao, Y. Li, R. Abreu, and F. Wotawa, “A Survey
[26] Y. Chen and Y.-j. Liu, “Learning to Rank Retargeted Images,” Learn- on Software Fault Localization,” IEEE Transactions on Software En-
ing, pp. 3994–4002, 2017. gineering, vol. 42, no. 8, pp. 707–740, 2016.
[27] L. Castrejn, K. Kundu, R. Urtasun, and S. Fidler, “Annotating object [45] H. Narasimhan, S. Agarwal, and D. C. Parkes, “Automated mechanism
instances with a polygon-rnn,” in 2017 IEEE Conference on Computer design without money via machine learning,” IJCAI International Joint
Vision and Pattern Recognition (CVPR), July 2017, pp. 4485–4493. Conference on Artificial Intelligence, vol. 2016-Janua, pp. 433–439,
[28] D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba, “Network 2016.
dissection: Quantifying interpretability of deep visual representations,” [46] Y. F. Li, S. B. Wang, and Z. H. Zhou, “Graph quality judgement:
in 2017 IEEE Conference on Computer Vision and Pattern Recognition A large margin expedition,” IJCAI International Joint Conference on
(CVPR), July 2017, pp. 3319–3327. Artificial Intelligence, vol. 2016-Janua, pp. 1725–1731, 2016.
[29] X. Alameda-Pineda, A. Pilzer, D. Xu, N. Sebe, and E. Ricci, “Vi- [47] N. Alechina, M. Dastani, and B. Logan, “Verifying existence of
raliency: Pooling local virality,” in 2017 IEEE Conference on Computer resource-bounded coalition uniform strategies,” IJCAI International
Vision and Pattern Recognition (CVPR), July 2017, pp. 484–492. Joint Conference on Artificial Intelligence, vol. 2016-Janua, pp. 24–
[30] V. K. Adhikarla, M. Vinkler, D. Sumin, R. K. Mantiuk, K. Myszkowski, 30, 2016.
H. P. Seidel, and P. Didyk, “Towards a quality metric for dense light [48] S. Adriaensen and A. Nowé, “Towards a white box approach to
fields,” in 2017 IEEE Conference on Computer Vision and Pattern automated algorithm design,” IJCAI International Joint Conference on
Recognition (CVPR), July 2017, pp. 3720–3729. Artificial Intelligence, vol. 2016-Janua, pp. 554–560, 2016.
[31] S. Bosse, “Distributed machine learning with self-organizing mobile [49] V. Noroozi, L. Zheng, S. Bahaadini, S. Xie, and P. S. Yu, “SEVEN:
agents for earthquake monitoring,” Proceedings - IEEE 1st Interna- Deep SEmi-supervised verification networks,” IJCAI International
tional Workshops on Foundations and Applications of Self-Systems, Joint Conference on Artificial Intelligence, pp. 2571–2577, 2017.
FAS-W 2016, pp. 126–132, 2016. [50] P. Kouvaros and A. Lomuscio, “Verifying fault-tolerance in parame-
[32] M. Ribeiro, K. Grolinger, and M. A. Capretz, “MLaaS: terised multi-agent systems,” IJCAI International Joint Conference on
Machine Learning as a Service,” 2015 IEEE 14th Artificial Intelligence, pp. 288–294, 2017.
International Conference on Machine Learning and Applications [51] J. Kong and A. Lomuscio, “Model checking multi-agent systems
(ICMLA), no. c, pp. 896–902, 2015. [Online]. Available: against LDLK specifications,” IJCAI International Joint Conference
http://ieeexplore.ieee.org/document/7424435/ on Artificial Intelligence, pp. 1138–1144, 2017.
[33] B. Chen, D. Kumor, and E. Bareinboim, “Identification and Model [52] N. Gorogiannis and F. Raimondi, “A Novel Symbolic Approach to
Testing in Linear Structural Equation Models using Auxiliary Verifying Epistemic Properties of Programs ,” pp. 206–212, 2009.
Variables,” Proceedings of the 34th International Conference on [53] F. Belardinelli, L. Ibisc, and I. Toulouse, “Parameterised Verification
Machine Learning, vol. 70, pp. 757–766, 2017. [Online]. Available: of Data-aware Multi-agent Systems,” pp. 98–104, 2016.
http://proceedings.mlr.press/v70/chen17f.html [54] F. Belardinelli, L. Ibisc, A. Murano, and S. Rubin, “Verification of
[34] M. Harman, P. McMinn, J. De Souza, and S. Yoo, “Search Broadcasting Multi-Agent Systems against an Epistemic Strategy Logic
based software engineering: Techniques, taxonomy, tutorial,” Imperial College London,” pp. 91–97, 2014.
Search, vol. 2012, pp. 1–59, 2011. [Online]. Available: [55] O. Tripp, O. Weisman, and L. Guy, “Finding Your Way in the
http://discovery.ucl.ac.uk/1340709/ Testing Jungle: A Learning Approach to Web Security Testing,”
[35] N. Erman, V. Tufvesson, M. Borg, P. Runeson, and A. Ardö, “Navigat- Proceedings of the 2013 International Symposium on Software
ing information overload caused by automated testing - A clustering Testing and Analysis, pp. 347–357, 2013. [Online]. Available:
approach in multi-branch development,” 2015 IEEE 8th International http://doi.acm.org/10.1145/2483760.2483776
Conference on Software Testing, Verification and Validation, ICST 2015 [56] F. M. Kifetew, A. Panichella, A. De Lucia, R. Oliveto, and P. Tonella,
- Proceedings, 2015. “Orthogonal exploration of the search space in evolutionary test case
[36] R. Carbone, L. Compagna, A. Panichella, and S. E. Ponta, “Security generation,” Proceedings of the 2013 International Symposium on
threat identification and testing,” 2015 IEEE 8th International Con- Software Testing and Analysis - ISSTA 2013, p. 257, 2013. [Online].
ference on Software Testing, Verification and Validation, ICST 2015 - Available: http://dl.acm.org/citation.cfm?doid=2483760.2483789
Proceedings, 2015. [57] F. Howar, D. Giannakopoulou, and Z. Rakamarić, “Hybrid learning:
[37] S. F. Sun and A. Podgurski, “Properties of Effective Metrics for interface generation through static, dynamic, and symbolic analysis,”
Coverage-Based Statistical Fault Localization,” Proceedings - 2016 Proceedings of the 2013 International Symposium on Software Testing
IEEE International Conference on Software Testing, Verification and and Analysis - ISSTA 2013, p. 268, 2013. [Online]. Available:
Validation, ICST 2016, pp. 124–134, 2016. http://dl.acm.org/citation.cfm?doid=2483760.2483783
[38] K. Moran, M. Linares-Vasquez, C. Bernal-Cardenas, C. Vendome, and [58] I. Medeiros, N. Neves, and M. Correia, “DEKANT: A Static
D. Poshyvanyk, “Automatically Discovering, Reporting and Reproduc- Analysis Tool that Learns to Detect Web Application Vulnerabilities,”
ing Android Application Crashes,” Proceedings - 2016 IEEE Inter- Proceedings of the 14th ACM conference on Computer and
national Conference on Software Testing, Verification and Validation, communications security CCS 07, p. 529, 2007. [Online]. Available:
ICST 2016, pp. 33–44, 2016. http://portal.acm.org/citation.cfm?doid=1315245.1315311
[39] B. Marculescu, R. Feldt, and R. Torkar, “Using Exploration Focused [59] T.-D. B. Le, D. Lo, C. Le Goues, and L. Grunske, “A learning-to-rank
Techniques to Augment Search-Based Software Testing: An Experi- based fault localization approach using likely invariants,” Proceedings
mental Evaluation,” Proceedings - 2016 IEEE International Conference of the 25th International Symposium on Software Testing and
283
Analysis - ISSTA 2016, pp. 177–188, 2016. [Online]. Available: [80] K. Pei, Y. Cao, J. Yang, and S. Jana, “Deepxplore: Automated
http://dl.acm.org/citation.cfm?doid=2931037.2931049 whitebox testing of deep learning systems,” in Proceedings of the
[60] H. Spieker, A. Gotlieb, D. Marijan, and M. Mossige, “Reinforcement 26th Symposium on Operating Systems Principles, ser. SOSP ’17.
Learning for Automatic Test Case Prioritization and Selection in New York, NY, USA: ACM, 2017, pp. 1–18. [Online]. Available:
Continuous Integration,” Proceedings of 26th International Symposium http://doi.acm.org/10.1145/3132747.3132785
on Software Testing and Analysis (ISSTA’17), pp. 12—-22, 2017. [81] Amazon, “Amazon Alexa.” [Online]. Available:
[61] M. Santolucito, “Version Space Learning for Verification on Temporal https://developer.amazon.com/alexa
Differentials,” Proceedings of the 26th ACM SIGSOFT International [82] IBM, “With Watson Program,” 2018. [Online]. Available:
Symposium on Software Testing and Analysis, pp. 428–431, 2017. https://www.ibm.com/watson/with-watson/
[Online]. Available: http://doi.acm.org/10.1145/3092703.3098238 [83] R. G. Smith, “On the development of commercial expert systems,”
[62] D. S. Katz, “Understanding Intended Behavior using Models of Low- AI Magazine, vol. 5, no. 3, pp. 61–73, 1984. [Online]. Available:
Level Signals,” pp. 424–427. https://www.aaai.org/ojs/index.php/aimagazine/article/view/449
[63] J. Hotzkow, “Automatically Inferring and Enforcing User [84] S. Sievers, M. Ortlieb, and M. Helmert, “Efficient Implementation of
Expectations,” Proceedings of the 26th ACM SIGSOFT Pattern Database Heuristics for Classical Planning,” Symposium on
International Symposium on Software Testing and Analysis - Combinatorial Search, pp. 105–111, 2012.
ISSTA 2017, no. July, pp. 420–423, 2017. [Online]. Available: [85] A. Ramaswamy, B. Monsuez, and A. Tapus, “AI Dimensions in
http://dl.acm.org/citation.cfm?doid=3092703.3098236 Software Development for Human-Robot Interaction Systems,” pp.
[64] K. R. Varshney, “Engineering safety in machine learning,” 2016 Infor- 128–130, 2014.
mation Theory and Applications Workshop, ITA 2016, 2017. [86] D. S. Prerau, “Knowledge Acquisition in the Development of a Large
Expert System,” AI Magazine, vol. 8, no. 2, pp. 43–51, 1987.
[65] G. I. Webb and F. Petitjean, “A Multiple Test Correction for Streams
[87] G. Peter, “Knowledge-Based Software Engineering Conference,” 1992.
and Cascades of Statistical Hypothesis Tests,” Proceedings of the 22nd
[88] M. R. Lowry, “Software Engineering in the Twenty-First Century,” AI
ACM SIGKDD International Conference on Knowledge Discovery and
Magazine, vol. 13, no. 3, pp. 71–87, 1992.
Data Mining - KDD ’16, pp. 1255–1264, 2016. [Online]. Available:
[89] S. Giroux and N. Bier, “Cognitive Assistance to Meal Preparation:
http://dl.acm.org/citation.cfm?doid=2939672.2939775
Design, Implementation, and Assessment in a Living Lab,”
[66] M. T. Ribeiro, S. Singh, and C. Guestrin, “”why should i 2015 AAAI Spring . . . , pp. 01–25, 2015. [Online]. Available:
trust you?”: Explaining the predictions of any classifier,” in https://www.aaai.org/ocs/index.php/SSS/SSS15/paper/view/10329/10013
Proceedings of the 22Nd ACM SIGKDD International Conference [90] M. A. Cohen, F. E. Ritter, and S. R. Haynes, “Applying Software
on Knowledge Discovery and Data Mining, ser. KDD ’16. New Engineering to Agent Development,” pp. 25–44, 2010.
York, NY, USA: ACM, 2016, pp. 1135–1144. [Online]. Available: [91] S. Burton, K. Swanson, and L. Leonard, “Quality and Knowledge in
http://doi.acm.org/10.1145/2939672.2939778 Software Engineering,” AI Magazine, vol. 14, no. 4, pp. 43–50, 1993.
[67] R. Parekh, “Designing AI at Scale to Power Everyday Life,” Proceed- [92] R. Hollander, “Alibaba, Baidu, and Tencent, China’s powerhouses,
ings of the 23rd ACM SIGKDD International Conference on Knowledge focus on AI to surpass the US - Business Insider,” 2017. [Online].
Discovery and Data Mining - KDD ’17, pp. 27–27, 2017. [Online]. Available: http://www.businessinsider.com/alibaba-baidu-and-tencent-
Available: http://dl.acm.org/citation.cfm?doid=3097983.3105815 chinas-powerhouses-focus-on-ai-to-surpass-the-us-2017-11
[68] D. Baylor and E. Breck, “TFX: A TensorFlow-Based Production-Scale [93] A. Nordrum, “Automatic Speaker Verification Systems Can Be Fooled
Machine Learning Platform,” Kdd, pp. 1387–1395, 2017. by Disguising Your Voice - IEEE Spectrum,” 2017. [Online]. Avail-
[69] J. Van Der Herten, I. Couckuyt, D. Deschrijver, P. Demeester, and able: https://spectrum.ieee.org/tech-talk/telecom/security/automatic-
T. Dhaene, “Adaptive modeling and sampling methodologies for In- speaker-verification-systems-can-be-fooled-by-disguising-your-voice
ternet of Things applications,” Proceedings of the 18th Mediterranean [94] S. K. Moore, “DARPA Seeking AI That Learns
Electrotechnical Conference: Intelligent and Efficient Technologies and All the Time - IEEE Spectrum,” 2017. [Online].
Services for the Citizen, MELECON 2016, no. April, pp. 18–20, 2016. Available: https://spectrum.ieee.org/cars-that-think/robotics/artificial-
[70] D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, intelligence/darpa-seeking-ai-that-can-learn-all-the-time
V. Chaudhary, M. Young, and D. Dennison, “Hidden Technical [95] J. Hsu, “Deep Learning AI for NASA Powers Earth
Debt in Machine Learning Systems,” Nips, pp. 2494–2502, 2015. Robots - IEEE Spectrum,” 2017. [Online]. Available:
[Online]. Available: http://papers.nips.cc/paper/5656-hidden-technical- https://spectrum.ieee.org/automaton/robotics/artificial-intelligence/ai-
debt-in-machine-learning-systems.pdf startup-neurala-deep-learning-for-nasa-powers-earth-robots
[71] F. Yang, A. Ramdas, K. Jamieson, and M. J. Wainwright, “A framework [96] ——, “A New Way to Find Bugs in Self-Driving
for Multi-A(rmed)/B(andit) testing with online FDR control,” no. 3, AI Could Save Lives - IEEE Spectrum,” 2017. [On-
2017. [Online]. Available: http://arxiv.org/abs/1706.05378 line]. Available: https://spectrum.ieee.org/tech-talk/robotics/artificial-
[72] R. Sen, A. T. Suresh, K. Shanmugam, A. G. Dimakis, and S. Shakkot- intelligence/better-bug-hunts-in-selfdriving-car-ai-could-save-lives
tai, “Model-Powered Conditional Independence Test,” no. Nips, pp. [97] E. Ackerman, “After Mastering Singapore’s Streets, NuTonomy’s
1–11, 2017. Robo-taxis Are Poised to Take on New Cities - IEEE Spectrum,”
[73] P. Schulam and S. Saria, “Reliable Decision Support using 2016. [Online]. Available: https://spectrum.ieee.org/transportation/self-
Counterfactual Models,” no. Nips, 2017. [Online]. Available: driving/after-mastering-singapores-streets-nutonomys-robotaxis-are-
http://arxiv.org/abs/1703.10651 poised-to-take-on-new-cities
[74] J. Z. Liu and B. Coull, “Robust Hypothesis Test for Nonlinear Effect [98] N. Papernot, M. Abadi, Ú. Erlingsson, I. Goodfellow, and K. Talwar,
with Gaussian Processes,” Nips2017, vol. 2, no. Nips, pp. 1–10, 2017. “Semi-supervised Knowledge Transfer for Deep Learning from Private
[Online]. Available: http://arxiv.org/abs/1710.01406 Training Data,” no. 2015, pp. 1–16, 2016. [Online]. Available:
[75] H. C. L. Law, C. Yau, and D. Sejdinovic, “Testing and Learning http://arxiv.org/abs/1610.05755
on Distributions with Symmetric Noise Invariance,” no. Mmd, 2017. [99] T. Miyato, A. M. Dai, and I. Goodfellow, “Adversarial Training
[Online]. Available: http://arxiv.org/abs/1703.07596 Methods for Semi-Supervised Text Classification,” pp. 1–11, 2016.
[76] W. Jitkrittum, W. Xu, Z. Szabo, K. Fukumizu, and A. Gretton, “A [Online]. Available: http://arxiv.org/abs/1605.07725
Linear-Time Kernel Goodness-of-Fit Test,” no. Nips, pp. 0–3, 2017. [100] S. Huang, N. Papernot, I. Goodfellow, Y. Duan, and P. Abbeel,
[Online]. Available: http://arxiv.org/abs/1705.07673 “Adversarial Attacks on Neural Network Policies,” 2017. [Online].
[77] Z. Daniel Guo, P. S. Thomas, and E. Brunskill, “Using Available: http://arxiv.org/abs/1702.02284
Options and Covariance Testing for Long Horizon Off- [101] P. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and
Policy Policy Evaluation,” Advances in Neural Information D. Amodei, “Deep reinforcement learning from human preferences,”
Processing Systems 30 (NIPS 2017), no. Nips, 2017. [On- 2017. [Online]. Available: http://arxiv.org/abs/1706.03741
line]. Available: http://papers.nips.cc/paper/6843-using-options-and- [102] V. Cheung, J. Schneider, I. Sutskever, and G. Brockman,
covariance-testing-for-long-horizon-off-policy-policy-evaluation.pdf “Infrastructure for Deep Learning,” pp. 1–11, 2016. [Online].
Available: https://openai.com/blog/infrastructure-for-deep-learning/
[78] F. Cecchi and N. Hegde, “Adaptive Active Hypothesis Testing under
[103] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and
Limited Information,” no. d, pp. 1–9, 2017.
D. Mané, “Concrete Problems in AI Safety,” pp. 1–29, 2016.
[79] H. Assem, L. Xu, T. S. Buda, and D. O’Sullivan, “Machine learning
as a service for enabling Internet of Things and People,” Personal and
Ubiquitous Computing, vol. 20, no. 6, pp. 899–914, 2016.
284
1
Software Engineers vs.

Machine Learning Algorithms:
An Empirical Study Assessing
Performance and Reuse Tasks
Nathalia Nascimento, Carlos Lucena, Paulo Alencar and Donald Cowan
Abstract—Several papers have recently contained reports on applying machine learning (ML) to the
arXiv:1802.01096v2 [cs.SE] 7 Feb 2018
automation of software engineering (SE) tasks, such as project management, modeling and development.
However, there appear to be no approaches comparing how software engineers fare against machine-learning
algorithms as applied to specific software development tasks. Such a comparison is essential to gain insight
into which tasks are better performed by humans and which by machine learning and how cooperative work or
human-in-the-loop processes can be implemented more effectively. In this paper, we present an empirical study
that compares how software engineers and machine-learning algorithms perform and reuse tasks. The
empirical study involves the synthesis of the control structure of an autonomous streetlight application. Our
approach consists of four steps. First, we solved the problem using machine learning to determine specific
performance and reuse tasks. Second, we asked software engineers with different domain knowledge levels to
provide a solution to the same tasks. Third, we compared how software engineers fare against
machine-learning algorithms when accomplishing the performance and reuse tasks based on criteria such as
energy consumption and safety. Finally, we analyzed the results to understand which tasks are better
performed by either humans or algorithms so that they can work together more effectively. Such an
understanding and the resulting human-in-the-loop approaches, which take into account the strengths and
weaknesses of humans and machine-learning algorithms, are fundamental not only to provide a basis for
cooperative work in support of software engineering, but also, in other areas.
Index Terms—Machine learning, human-in-the-loop, software engineer, automatic software engineering,

internet of things, empirical study
1 I NTRODUCTION
S OFTWARE engineering processes can be very

complex, costly and time-consuming [1].
They typically consist of a collection of related
enhanced [2]. As a result, the field of software
engineering requires millions of skilled Infor-
mation Technology (IT) professionals to create
tasks [2] such as designing, implementing, main- millions of lines of code, which must be installed,
taining, testing and reusing software applications configured, tuned, and maintained. According to
[3]. In addition, as software has become embed- Kephart (2005) [4], in the near future, it will be ex-
ded in systems of all kinds, millions of computer tremely challenging to manage IT environments,
programs have to be corrected, adapted, and even for the most skilled IT professionals.
Several researchers have proposed the use of
• Laboratory of Software Engineering (LES) at the Depart- artificial intelligence, especially machine-learning
ment of Informatics, Pontifical Catholic University of Rio de (ML) techniques, to automate different software
Janeiro, Brazil. CAPES scholarship/Program 194/Process: engineering (SE) tasks [3], [5]–[20]. For exam-
88881.134630/2016-01
E-mail: nnascimento,lucena@inf.puc-rio.br ple, Zhang has extensively studied this theme
• See http://www.inf.puc-rio.br/∼nnascimento/ recently and in [3] he stated that:
• David R. Cheriton School of Computer Science. University ”The field of software engineering turns
of Waterloo, Canada. out to be a fertile ground where many
E-mail: palencar,dcowan@csg.uwaterloo.ca
software development and maintenance
2
tasks could be formulated as learning in software engineering, but also in other appli-
problems and approached in terms of cation areas.
learning algorithms.” This paper is organized as follows: Section
2 presents the empirical study describing re-
However, there is a lack of approaches to
search questions, hypotheses and the objective
compare how software engineers fare against
of the study. Section 3 presents the method se-
machine-learning algorithms for specific soft-
lected to collect our empirical data. Sections 4
ware development tasks. This comparison is crit-
and 5 present the experimental results. Section
ical in order to evaluate which S.E. tasks are
6 presents the threats to the validity of our ex-
better performed by automation and which re-
periment. Section 7 presents the related work.
quire human involvement or human-in-the-loop
The paper ends with concluding remarks and
approaches [21], [22]. In practice, because there
suggestions for future work.
are no explicit comparisons between the tasks
performed by engineers and automated proce-
dures, including machine learning, it is often not 1.1 Motivation
clear when to use automation in a specific setting. The theme of this paper, namely whether arti-
For example, a Brazilian company acquired a ficial intelligence such as machine learning, can
software system to select petroleum exploration benefit software engineering, has been investi-
models automatically, but the engineers decided gated since 1986, when Hebert A Simon pub-
they could provide a better solution manually. lished a paper entitled “Whether software engi-
However, when there was a comparison of the neering needs to be artificially intelligent” [24].
manual solution with the one provided automat- In this paper, Simon discussed “the roles that
ically by the system, it became clear that the humans now play versus the roles that could be
automated solution was better. This illustrates taken over by artificial intelligence in developing
that a lack of comparisons makes choosing a computer systems.” Notwithstanding, in 1993,
manual or an automated solution or a combined Ian Sommerville raised the following question
human-in-the-loop approach difficult. [25]: “What of the future - can Artificial Intelli-
This paper, contains an empirical study [23] to gence make a contribution to system engineer-
compare how software engineers and machine- ing?” In this paper [25], Sommervile performed a
learning algorithms achieve performance and literature review in applications of artificial intel-
reuse tasks. The empirical study uses a case study ligence to software engineering, and concluded
involving the creation of a control structure for that:
an autonomous streetlight application. The ap- “the contribution of AI will be in sup-
proach consists of four steps. First, the problem porting...activities that are characterized
was solved using machine learning to achieve by solutions to problems which are
specific performance and reuse of tasks. Sec- neither right nor wrong but which are
ond, we asked software engineers with different more or less appropriate for a par-
domain-knowledge levels to provide a solution ticular situation...For example, require-
to achieve the same tasks. Third, we compared ments specification and analysis which
how software engineers compare with machine- involves extensive consultation with do-
learning algorithms when accomplishing the per- main experts and in project manage-
formance and reuse tasks based on criteria such ment.”
as energy consumption and safety. Finally, the Several papers have since investigated the use
results were analyzed to understand which tasks of Machine Learning (ML) [26] in solving differ-
are better performed by either humans or al- ent software engineering (SE) tasks [5]–[20], [27]–
gorithms so that they can work together more [103]. These investigations include approaches
effectively. to: i) project management [27]–[49], dealing with
Such an understanding is essential in re- problems related to cost, time, quality prediction,
alizing novel human-in-the-loop approaches in and resource management; ii) defect prediction
which machine-learning procedures assist soft- [50]–[84]; iii) requirements management, focus-
ware developers in achieving tasks. Such human- ing on problems of classifying or representing re-
in-the-loop approaches, which take into account quirements [85]–[88], or generating requirements
the strengths and weaknesses of humans and [89]; iv) software development, such as code gen-
machine-learning algorithms, are fundamental eration [20], [68], [90]–[96], synthesis [97]–[101],
not only to provide a basis for cooperative work and code evaluation [102], [103].
3
Most of these papers present successful ap- control structure for each thing manually.
plications of machine learning in software engi-
neering, showing that ML techniques can provide
correct automatic solutions to some SE problems.
However, very few papers discuss whether or
not a domain expert could propose a manual
solution more appropriate for the particular
situation. “More appropriate”, means a solution 1.2 Objective
that provides better performance or increases an-
other quality that is important to a particular ap-
plication scenario, such as user preference [104]. In this context, we decided to ask the following
For example, in the medical and aviation engi- question: “How do software engineers compare
neering fields, trust [105] in a solution provided with machine-learning algorithms?” To explore
to the end-user is an important factor to consider this question, we selected the Internet of Things
for a solution to be more appropriate. However, as our application domain and then, compared
although many authors [106]–[109] have been a solution provided by a skilled IoT professional
promoting the use of neural networks [110] in with a solution provided by a learning algorithm
medicine, Abbas et al. [105] and Castelvecchi with respect to performance and reuse tasks.
[111] are among the few authors who questioned: In short, Figure 1 depicts the theory [121] that
“what is the trustworthiness of a prediction made we investigate in this paper. According to the
by an artificial neural network?” theory, the variables that we intend to isolate
In other application scenarios, such as many and measure are the performance and reusabil-
of those related to the Internet of Things (IoT) ity achieved from three kinds of solutions: i)
[112], [113], numerous authors [93], [95], [101], solutions provided by learning techniques; ii)
[114] consider the reuse of a solution as an im- solutions provided by software engineers with
portant quality. They agree that to achieve the IoT skills; and iii) solutions provided by software
goal of billions of things connected to the Internet engineers without IoT skills.
over the next few years [112], it is necessary to To evaluate the relationship among these vari-
find ways to reduce time to market. For example, ables, we performed an empirical study, using
it is desirable that the solution or parts of the FIoT [101]. As shown in Figure 1, we raised
solution to design autonomous streetlights [96] four research questions (RQx) to investigate our
for a specific scenario could be reused to design theory’s propositions (e.g hypotheses (H-RQx)).
streetlights for another scenario. We present these questions and hypotheses in
In particular, the Internet of Things has con- Section 2. To collect and analyze our empiri-
siderably increased the number of approaches cal data, we performed a controlled experiment.
that propose the use of machine learning to To perform this experiment, we reproduced the
automate software development [93]–[96], [101], problem of synthesizing the control structure
[115]. None of this research contains a compar- of autonomous streetlights using neuroevolution
ison of their results to experiments designed by (i.e. “a learning algorithm which uses genetic
IoT experts. For example, do Nascimento and Lu- algorithms to train neural networks” [122]) pre-
cena [101], [116] developed a hybrid framework sented in [118]. Then, we invited 14 software
that uses learning-based and manual program engineers to provide a solution for the same
synthesis for the Internet of Things (FIoT). They problem using the same architecture and envi-
generated four instances of the framework [101], ronment. Lastly, we compared the solution pro-
[108], [117], [118] and used learning techniques vided by the learning algorithm against the so-
to synthesize the control structure automatically. lutions provided by the software engineers. In
These authors stated that the use of machine this application of autonomous streetlights, we
learning made feasible the development of these are considering a “more appropriate” solution
applications. However, they did not present any as one that presents a better performance in
experiment without using learning techniques. In the main scenario [118] or can be satisfactorily
contrast, most of the solutions released for the reused in a new scenario, based on criteria such
Internet of Things, such as Apple’s HomeKit’s as minimal energy consumption and safety
approach [119] and Samsung Smart Things, [120] (that is, maximum visual comfort in illuminated
consider a software developer synthesizing the areas).
4
Technology SE Tasks
Actor
Synthesize and reuse
Machine Learning the control structure of
Techniques autonomous things
Software Engineers produce solutions

better than
with IoT skills

(H-RQ1, H-RQ2) Software System
produce solutions better Internet of Things

than (H-RQ3, H-RQ4)
without IoT skills Improve (H-RQ1, H-RQ3)
Performance
Increase (H-RQ2, H-RQ4)
Reusability
Fig. 1. Theory [121]: Machine Learning can create solutions more appropriate than software engineers in the context of
the Internet of Things.
2 H OW DO SOFTWARE ENGINEERS COM - RQ4. ...software engineers without IoT

PARE WITH MACHINE - LEARNING ALGO - skills with respect to their re-usability?
RITHMS ? A N E MPIRICAL S TUDY A D -
DRESSING P ERFORMANCE AND R EUSE IN 2.2 Hypotheses
THE I OT D OMAIN Each RQ is based on one or more hypotheses,
which are described next.
The experimental goal, based on the Goal- H - RQ1.
Question-Metric (GQM) method [123] is to use
the Framework for the Internet of Things (FIoT) • H0. An ML-based approach does not im-
for the purpose of comparing the use of an prove the performance of autonomous
automated approach against a manual approach things compared to solutions provided by
when synthesizing the control of autonomous IoT expert software engineers.
things with respect to their performance and • HA. An ML-based approach improves the
reuse. performance of autonomous things com-
For this purpose, we asked four research pared to solutions provided by IoT expert
questions (RQs) and performed a controlled ex- software engineers.
periment [23] (section 3) to investigate them. H - RQ2.
• H0. An ML-based approach does not in-
2.1 Questions crease the reuse of autonomous things
compared to solutions provided by IoT
In terms of synthesizing the control structure of expert software engineers.
autonomous things, how does the result from • HA. An ML-based approach increases the
a machine learning-based solution differ from reuse of autonomous things compared to
solutions provided by... solutions provided by IoT expert software
RQ1. ...software engineers with IoT engineers.
skills with respect to their perfor-
H - RQ3.
mance?
RQ2. ...software engineers with IoT • H0. An ML-based approach does not im-
skills with respect to their re-usability? prove the performance of autonomous
RQ3. ...software engineers without IoT things compared to solutions provided by
skills with respect to their perfor- software engineers without experience in
mance? IoT development.
5
• HA. An ML-based approach improves the TABLE 1

performance of autonomous things com- Implementing FIoT flexible points to synthesize streetlight
controllers using a ML-based approach.
pared to solutions provided by software
engineers without experience in IoT de-
FIoT Framework Light Control Application
velopment. Controller Three Layer Neural Network
H - RQ4. Collective Fitness Evaluation:
the solution is evaluated
• H0. An ML-based approach does not in- based on the energy
crease the reuse of autonomous things consumption, the number of
Making Evaluation people that finished their
compared to solutions provided by soft- routes after the
ware engineers without experience in IoT simulation ends, and the
development. total time spent by people
to move during their trip
• HA. An ML-based approach increases the
Evolutionary Algorithm:
reuse of autonomous things compared to Controller Adaptation Generate a pool of
solutions provided by software engineers at design time candidates to represent the
without experience in IoT development. neural network parameters
2.3 The object of the study: The Framework TABLE 2

for the Internet of Things (FIoT) Implementing FIoT flexible points to synthesize streetlight
controllers using a solution provided by a Software
The Framework for the Internet of Things (FIoT) Engineer
[101] is a hybrid software framework that allows
the developer to generate controller structures for FIoT Framework Light Control Application
autonomous things through learning or procedu- if-else module provided by a
Controller
software engineer
ral algorithms. Collective Fitness Evaluation:
If a researcher develops an application using the solution is evaluated
FIoT, the application will contain a Java software based on the energy
component already equipped with modules for consumption, the number of
Making Evaluation people that finished their
detecting autonomous things in an environment, routes after the
assigning a controller to a specific thing, creating simulation ends, and the
software agents, collecting data from devices and total time spent by people
to move during their trip
supporting the communication structure among Controller Adaptation
agents and devices. None
at design time
Some features are variable and may be select-
ed/developed according to the application type,
as follows: (i) a control module such as “if-else”, 3 C ONTROLLED E XPERIMENT
neural network or finite state machine; (2) an The first step of the experiment was to reproduce
adaptive technique to synthesize the controller the experiment presented in [118] by using a not
at design-time, such as reinforcement learning supervised learning approach. Then, we invited
[124] or genetic algorithm; and (iii) an evaluation 14 software engineers to provide a solution for
process to evaluate the behavior of autonomous the same problem. Finally, we compared the so-
things that are making decisions based on the lution provided through the learning algorithm
controller. against solutions provided by the participants.
For example, Table 1 summarizes how the
“Streetlight Control” application will adhere
to the proposed framework using a machine 3.1 Participant Analysis
learning-based approach, while extending the As we have described previously, the knowl-
FIoT flexible points. edge in the application domain is an important
Table 2 summarizes how the “Streetlight Con- variable in our empirical study. Therefore, before
trol” application will adhere to the proposed performing the controlled experiment, we asked
framework using a solution provided by a soft- participants to describe their experience with the
ware engineer, while extending the FIoT flexible development of applications based on the In-
points. ternet of Things, that is, developing distributed
Our goal is to provide both solutions to the systems with embedded characteristics, such as
same application and compare the results based providing each element of the system with sen-
on the same evaluation process. sors and actuators. As shown in Figure 2, 43% of
6
Medium Each streetlight has to execute three tasks

High every second: data collection, decision-making
14% and action enforcement. The first task consists
Low 22% of receiving data related to people flow, ambi-
21% ent brightness, data from the neighboring street-
lights and current light status (activation level
43% of sensors and the previous output value of
listeningDecision). The second task consists of
analyzing collected data and making decisions
None about actions to be enforced. The last task is
the action enforcement, which consists of setting
Fig. 2. Experience of participants in developing applica- the value of three output variables: (i) listen-
tions based on the Internet of Things. ingDecision, that enables the streetlight to receive
signals from neighboring streetlights in the next
cycle; (ii) wirelessTransmitter, a signal value to
participants have never developed an application be transmitted to neighboring streetlights; and
based on the Internet of Things and 57% have (iii) lightDecision, that activates the light’s OF-
developed at least one application. F/DIM/ON functions.
The interested reader may consult a more
3.2 Experiment: Streetlight Application extensive paper about the application scenario
[118] 1 .
In short, our experiment involves developing
autonomous streetlights. The overall goal of this 3.2.1 The Challenge
application is to reduce the energy consump-
As we explained to the participants, the tasks
tion while maintaining appropriate visibility in
of collecting data and enforcing actions have
illuminated areas [118]. For this purpose, we
already been implemented. The challenge was
provided each streetlight with ambient bright-
to provide a solution for the task of making
ness and motion sensors, and an actuator to
decisions, as depicted in Figure 4.
control light intensity. In addition, we also pro-
vided streetlights with wireless communicators
as shown in Figure 3. Therefore, the streetlights
are able to cooperate with each other to establish
the most likely routes for passers-by and thus
achieve the goal of minimizing energy consump-
tion.
Lighting
(Dark/DIM/Light) Light Decision
(OFF/DIM/ON)
Presence Fig. 4. The challenge: how does a streetlight make deci-
(Yes/No) sions based on collected data?
Wireless
Data collected from Transmitter
the closest street light (0.0/0.5/1.0) We provided pseudocode that considered all
(0.0/0.5/1.0) possible combinations of input variables. Then,
Listening participants decided how to set output variables
Previous Listening Decision
Decision
(Yes/No)
according to the collected data. Part 2 of this
(Yes/No)
pseudocode is depicted in Figure 5.
Each participant provided a different solu-
Fig. 3. Variables collected and set by streetlights. tion. Therefore, we conducted the experiment by
Each streetlight in the simulation has a mi- 1. All documents that we prepared to explain this appli-
crocontroller that is used to detect the proximity cation scenario to participants are available at
http://www.inf.puc-rio.br/.̃nnascimento/projects.html
of a person, and control the closest streetlight. A
2. The pseudocode that we provided to participants is
streetlight can change the status of its light to ON, available at:
OFF or DIM. http://www.inf.puc-rio.br/.̃nnascimento/projects.html
7
Fig. 5. Small portion of the pseudocode of the decision module that was filled by participants.
if (lighting_sensor = Medium AND detected_person =

NO AND data_collected = 0.0 AND
previous_listening_decision = YES) then {
light_decision = __X_OFF____DIM____ON
wireless_transmitter = __X_0.0 ___0.5____1.0
listening_decision = _X__YES ____NO
} if (lighting_sensor = Medium AND detected_person =
YES AND data_collected = 0.0 AND
previous_listening_decision = YES) then {
light_decision = ___OFF__X__DIM____ON
wireless_transmitter = _X__0.0 ___0.5____1.0
listening_decision = __X_YES ____NO
} if (lighting_sensor = Dark AND detected_person = NO
AND data_collected = 0.0 AND
previous_listening_decision = NO) then {
light_decision = __X_OFF____DIM____ON
wireless_transmitter = ___0.0 __X_0.5____1.0
listening_decision = ___YES __X__NO
}
Fig. 6. Small portion of the rule decisions that was synthesized according to the learning-based approach.
using each one. In addition, we also considered lights with broken lamps emit “0.5” from their
a “zeroed” solution, which always sets all values wireless transmitters.
to zero. This zeroed solution is supposed to be In addition, we also observed that a streetlight
the worst solution, since streetlights will always that is not broken switches its lamp ON if it de-
switch their lights to OFF. tects a persons proximity or receives “0.5” from a
wireless transmitter.
3.2.2 The solution generated by a machine-
learning algorithm 3.2.3 Scenario constraints
We compared the results from all of these ap- Before starting a solution, each participant
proaches to the result produced using the ma- should consider the following constraints:
chine learning approach. As do Nascimento and
Lucena explain in [118], the learning approach • Do not take light numbering into account,
uses a three-layer feedforward neural network since your solution may be used in differ-
combined with an evolutionary algorithm to gen- ent scenarios (see an example of a scenario
erate decision rules automatically. Figure 6 de- in Figure 7).
picts some of the rules that were generated by the • Three streetlights will go dark during the
evolved neural network. The interested reader simulation.
can consult more extensive papers [101], [118] or • People walk along different paths starting
read Nascimento’s dissertation [116] (chap. ii, sec. at random departure points. Their role is
iii). to complete their routes by reaching a des-
Based on the generated rules and the system tination point. The number of people that
execution, we observe that using the solution finished their routes after the simulation
provided by the neural network, only the street- ends, and the total time spent by people
8
moving during their trip are the most Figure 7 depicts the elements that are part of the
important factors for a good solution. application namely, streetlights, people, nodes
• A person can only move if his current and and edges.
next positions are not completely dark.
In addition, we also consider that people
walk slowly if the place is partially devoid
of light. Execution: 12 seconds
• The energy consumption also influences - A person moves from
the solution evaluation. one point to another in
one second or a second
• The energy consumption is proportional and a half.
- Street lights execute
to the light status (OFF/DIM/ON). cycles of 1 second
• We also consider the use of the wireless
ON
transmitter to calculate energy consump- DIM
OFF
tion (if the streetlight emits something Broken Lamps
different from “0.0”, it consumes 0.1 of departure points

target points
energy).
Therefore, each solution is evaluated after the
Fig. 7. Simulated Neighborhood.
simulation ends based on the energy consump-
tion, the number of people that finished their Nascimento and Lucena [118] modeled the
routes after the simulation ends, and the total scenario as a graph, in which a node represents
time spent by people moving during their trip. a streetlight position and an edge represents
the smallest distance between two streetlights.
(completedP eople × 100) The graph representing the streetlight network
pP eople = (1)
totalP eople consists of 18 nodes and 34 edges. Each node
(totalEnergy × 100) represents a streetlight. In the graph, the yellow,
pEnergy = 11×(timeSimulation×totalSmartLights) gray, black and red triangles represent the street-
( 10 ) light status (ON/DIM/OFF/Broken Lamp). Each
(2) edge is two-way and links two nodes. In addi-
(totalT imeT rip × 100)
pT rip = 3×timeSimulation tion, each edge has a light intensity parameter
(( (2) ) × totalP eople) that is the sum of the environmental light and the
(3) brightness from the streetlights in its nodes. Their
f itness = (1.0 × pP eople) − (0.6 × pT rip)− goal is to simulate different lighting in different
(0.4 × pEnergy) neighborhood areas.
(4)
Equations (1) through (4) show the values to
be calculated for the evaluation in which pP eople
is the percentage of the number of people that
completed their routes before the end of the 1 second/1 second and a half
simulation out of the total number of people
in the simulation; pEnergy is the percentage of
energy that was consumed by streetlights out of
the maximum energy value that could be con-
sumed during the simulation. We also considered
the use of the wireless transmitter to calculate
energy consumption; pT rip is the percentage of
the total duration time of peoples trips out of the
maximum time value that their trip could spend;
and f itness is the fitness of each candidate that Fig. 8. Person moving in the simulated Neighborhood.
encodes the proposed solution.
As depicted in Figure 8, only one person was
3.2.4 Example - Simulating the environment started in the scenario that we showed to partic-
We showed participants the same simulated ipants. For instance, the person starting at point
neighborhood scenario that was used by the ge- 0 has point 11 as a target. We ask participants to
netic algorithm to evolve the neural network. provide a solution to streetlights to assure that
9
this person will conclude his route before the the solution provided by each one of the 14
simulation ends after 12 seconds. participants 3 .
To provide a controlled experiment and be
3.2.5 New Scenario: Unknown environment able to compare the different solutions, we
started with only one person in the scenario and
The second step of the experiment consists of manually we set the parameters that were sup-
executing solutions from participants and the posed to be randomly selected, such as departure
learning approach in a new scenario, but with the and target points and broken lamps.
same constraints. This scenario, that is depicted Each experiment execution consists of execut-
in Figure 9 was not used by the learning algo- ing the simulated scenario three times: (i) night
rithm and was not presented to participants. (environmental light is equal to 0.0); (ii) late
The goal of this new part of the experiment afternoon (environmental light is equal to 0.5);
is to verify if the decision module that was de- and (iii) morning (environmental light is equal
signed to control streetlights in the first scenario to 1.0). The main idea is to determine how the
can be reused in another scenario. solution behaves during different parts of the
day. Figure 10 depicts the percentage of energy
that was spent according to the environmental
light for each one of the 16 different solutions. As
we described previously, we also considered the
use of the wireless transmitter to calculate energy
consumption. As expected, as streetlights using
the zeroed decision never switch their lights ON
and never emit any signal, the energy consumed
using this solution is always zero. It is possible
to observe that only the solutions provided by
the learning algorithm and by the 5th and 11th
participants do not expend energy when the en-
vironmental light is maximum. In fact, according
to the proposed scenario, there is no reason to
turn ON streetlights during the period of the day
with maximum illumination.
100
90
Fig. 9. Simulating a new neighborhood. 80
.
70
0
60
In this new scenario, we also only started
50
one person, who has the point 18 (yellow point)
(
40
as departure and the point 8 as target. As the 30
.)1
20
scenario is larger, we established a simulation
10
time of 30 seconds. 0
4 E XPERIMENT - PART 1 - R ESULTS .)1 ).5 .)1 ).5 % .)1 ).5
We executed the experiment 16 times, only Fig. 10. Scenario1: Percentage of energy spent in different
changing the decision solution of the au- parts of the day according to the participant solutions.
tonomous streetlights. In the first instance, we
set all outputs to zero (the zeroed solution) dur- Figure 11 depicts the percentage of time that
ing the whole simulation, which is supposed to was spent by the unique person in each one of
be the worst solution. For example, streetlights the simulations. As shown, the higher difference
never switch their lights ON. In the second in- between solutions occurs at night. If the time is
stance, we executed the experiment using the
best solution that was found by the learning 3. All files that were generated during the development
of this work, such as executable files and participants’
algorithm, according to the experiment presented solutions results, are available at
in [118]. Then, we executed the simulation for http://www.inf.puc-rio.br/.̃nnascimento/projects.html
10
100%, it means that the person did not complete simulation time did not allow the person to finish
the route, thus the solution did not work the route.
100 4.1 Discussion: Participants Knowledge in

90
80 IoT Versus Results
70
After executing the solution proposed by each
). ) (
60
50 participant, we connect that solution’s results
40
30
with the participant’s knowledge in the IoT do-
20 main, as shown in Table 3.
10
0
TABLE 3
1) 0 Correlation between participants expertises in the Internet
of Things with their solution results.
). ( ). ( % ). (
Experience
Fig. 11. Scenario1: Percentage of time spent by person with IoT Solution Does
to conclude his route based on different parts of the day Software Development Performance the
according to the participant solutions. Engineer (None/Low/ (Fitness solution
Medium/ Average) work?
Besides presenting the results of the different High)
1 High 55.48 Y
solutions in different parts of the day, the best
2 None 26.99 N
solution must be the one that presents the best 3 High 62.88 Y
result for the whole day. Thus, we calculated the 4 Low 62.49 Y
average of each one of the parameters (energy, 5 None 30.50 N
people, trip and fitness) that was achieved by 6 Low 51.09 Y
7 Medium 54.37 Y
solutions in different parts of the day. Figure 12 8 None 16.59 N
depicts a common average. We also calculated a 9 High 28.62 N
weighted average, taking into account the dura- 10 None 61.60 Y
tion of the parts of the day (we considered 12 11 None 29.67 N
12 Medium 47.81 Y
hours for the night period, 3h for dim and 9h for 13 None 30.32 N
the morning), but the results were very similar. 14 Low 56.91 Y
Learning 59.53 Y
100
zeroed 28.33 N
90
80
70 62.88
We observe a significant difference between
61.60
62.49 56.91 59.53
60 55.48
51.09
54.37
47.81 results from software engineers with any ex-
50
%
40
29.67 30.32
perience in IoT development and results from
26.99 30.50 28.62 28.33
30
20
16.59 software engineers without experience in IoT
10 development. Participant 10 is the only individ-
0
ual without knowledge of IoT that provided a
solution that works and participant 9 is the only
Fitness individual with any knowledge of IoT that did
not provide a working solution.
Fig. 12. Scenario1: Average of energy, trip and fitness
calculated for the different parts of the day according to the
participant solutions. 4.2 Hypothesis Testing
In this section, we investigate the hypotheses
As shown in Figure 12, based on the fitness related to the solutions’ performance evaluation
average, three participants namely 3, 4 and 10 (i.e H-RQ1 and H-RQ3), as presented in subsec-
provided a solution slightly better than the so- tion 2.2. Thus, we performed statistical analyses,
lution provided by the learning algorithm. Five as described by Peck and Devore [125], of the
other participants provided a solution that works measures presented in Table 3.
and the remaining six provided a solution that As shown in Table 4, we separated the results
does not work. As explained earlier, we have of the experiments into two groups: i) software
been considering an incorrect solution as one engineers with IoT knowledge and ii) software
in which the person did not finish the route engineers without IoT knowledge. Then, we cal-
before the simulation ends. Even increasing the culated the mean and the standard deviation of
11
the results achieved by each group performing

the experiment and compare each result against (52.46 − 59.53)
the value achieved using the ML-based solution. t − statistic : t = = −1.83 (6)
( 10.91
√ )
8
4.2.1 How does the evaluation result from a ma- According to t-statistic theory, we can safely
chine learning-based solution differ from solutions reject our null hypothesis if the t − statistic
provided by IoT expert software engineers with value is below the negative t − criticalvalue
respect to their performance? (threshold) [125]. This negative t − criticalvalue
H - RQ1. bounds the area of rejection of a T-distribution,
• H0. An ML-based approach does not im- as shown in Figure 13. In our experiment, as we
prove the performance of autonomous specified a statistical significance level of 0.01,
things compared to solutions provided by the probability of getting a T-value less or equal
IoT expert software engineers. than the negative t − criticalvalue is 1%. We cal-
• HA. An ML-based approach improves the culated the tcriticalvalue of this T-distribution
performance of autonomous things com- according to the T-table presented in Peck and
pared to solutions provided by IoT expert Devore (2011, pg 791) [125]. Accordingly, for a
software engineers. distribution with 7 degrees of freedom (see Table
4) and a confidence level of 99%, the negative
The first null hypothesis H0 claims that there tcriticalvalue is -3.00. As we depicted in Figure
is no difference between the mean performance 13, the test statistic of our sample is higher than
for IoT expert software engineers’ solutions and the tcriticalvalue.
the ML-based approach solution. The alternative
hypothesis claims that the ML-based approach
solution improves the performance of the appli-
cation in comparison to IoT expert software en-
gineers’ solutions. Thus, the claim is that the true
IoT expert software engineers’ solutions mean
is below the performance achieved by the ML-
based approach, that is = 59.53.
Therefore, we used the ML performance as
our hypothesizedvalue to test the following one- Fig. 13. Hypothesis H - RQ1 Test Graph.
sided hypothesis:
H0: µs e = 59.53 As the test statistic does not fall in the critical
H1: µs e < 59.53 region, we cannot safely reject this null hypoth-
where µs e denotes the true mean perfor- esis. Based on a t-value of -1.83 and a degree of
mance for all IoT expert software engineers’ so- freedom of 7, we could reject our null hypothesis
lutions. only if we had reduced the precision of our
For instance, we restricted the population experiment to 85%. Thus, we would fail to reject
sample to the number of software engineers that the null hypothesis and would not accept the
confirmed having experience with developing alternative hypothesis. Therefore, we cannot
applications for the Internet of Things. As shown state that an ML-based approach improves the
in Table 4, the performance mean (x) of the IoT performance of autonomous things compared
expert software engineers’ solutions is 52.46 and to solutions provided by IoT expert software
the standard deviation (σ ) is 10.91. To verify engineers.
if the data that we have is sufficient to accept
the alternative hypothesis, we need to verify the
4.2.2 How does the evaluation result from a ma-
probability of rejecting the null hypothesis [125].
chine learning-based solution differ from solutions
Assuming that the H0 is true, and using a statis-
provided by software engineers without IoT skills
tical significance level [125] of 0.01 (the chance of
with respect to their performance?
one in 100 of making an error), we computed the
test statistic (t − statistic), as follows [125]: H - RQ3.
• H0. An ML-based approach does not im-

(x − hypothesizedvalue) prove the performance of autonomous
t − statistic : t = (5)
( √σn ) things compared to solutions provided by
12
TABLE 4
Data to perform test statistic.
Degrees t
Standard
n Highest Mean of critical
Variable Median deviation
samples value x freedom value
σ
(n-1) (.99%)
Software
14 62.88 43.95 49.45 16.00 13 2.65
Engineers
Software
Engineers
8 62.88 52.46 54.92 10.91 7 3.00
with IoT
knowledge
Software
Engineers
6 61.60 32.61 30.00 15.15 5 3.37
without IoT
knowledge
Machine-
learning
1 59.53
based
approach
software engineers without experience in As the t − statistic value is below the

IoT development. negative tcriticalvalue (-4.35 < −3.37), we can
• HA. An ML-based approach improves the safely reject the null hypothesis, assuring that
performance of autonomous things com- the error chancing of making an error is lower
pared to solutions provided by software than 1%. Therefore, we accepted the alternative
engineers without experience in IoT de- hypothesis: An ML-based approach improves the
velopment. performance of autonomous things compared to
solutions provided by software engineers with no
For instance, we restricted the population experience in IoT development.
sample to the number of software engineers that
confirmed not having experience with develop-
ing applications for the Internet of Things. As 5 E XPERIMENT - PART 2 - R ESULTS
shown in Table 4, the performance mean (x) of As explained previously, the second part of the
the solutions from software engineers without experiment consists of translating the solution
experience in IoT development is 32.61 and the provided by machine learning and participants
standard deviation (σ ) is 15.15. Thus, the to an unknown environment. In this second part
of the experiment, we also executed the simula-
(32.61 − 59.53) tion 16 times: for each one of the participants’
t − statistic : t = = −4.35 (7) solutions, for the machine-learning solution and
( 15.15
√ )
6 for the zeroed solution.
Table 5 shows the results that were achieved
As shown in Table 4, this T-distribution has 5
by the different solutions at night in a simulation
degrees of freedom. Thus, for a confidence level
of 30 seconds. As shown, most of the solutions
of 99%, the negative tcriticalvalue is -3.37. As
did not work. The person in these simulations
we depicted in Figure 14, the test statistic of our
did not finish the route even when we increased
sample is below the tcriticalvalue .
the simulation time. Only the solution provided
by the machine-learning algorithm and by par-
ticipant 12 worked. Remember, this scenario was
not used by the machine-learning algorithm dur-
ing the training process. This solution was pro-
vided through machine learning for the first sce-
nario and it was just reused in this new scenario.
In other words, we did not restart the machine-
learning process.
We selected only those solutions that worked
Fig. 14. Hypothesis H - RQ3 Test Graph. and verified their results for the other periods of
13
the day (morning and late afternoon). As shown of 14 engineers, only one participant, who has
in Table 6, when considering the whole day, the experience with IoT development, provided a
machine-learning approach presented the best solution that worked.
result. Because the average time for the trip
was a little higher using the machine-learning
approach, the difference in energy consumption
Bad
between the two solutions is considerably higher.
93% 0% Work better than ML
7%
TABLE 5 Work
Using the same solution in a different environment - only
at night.
Software Fig. 15. Participants’ solution results in the second sce-

Energy% People% Trip% Fitness
Engineer nario.
1 6.50 0 100 -42.60
2 2.77 0 100 -41.11 Therefore, we can safely reject the null hy-
3 6.62 0 100 -42.65
4 4.30 0 100 -41.72 pothesis and accept both alternative hypotheses:
5 2.58 0 100 -41.03
6 6.88 0 100 -42.75 1) H-RQ2: H1: An ML-based approach in-
7 8.33 0 100 -43.33 creases the reuse of autonomous things
8 2.33 0 100 -60.26 compared to solutions provided by IoT
9 3.77 0 100 -41.51 expert software engineers.
10 3.78 0 100 -60.18
11 11.36 0 100 -44.54 2) H-RQ4: H1: 2) An ML-based approach
12 50.56 100 42.22 54.43 increases the reuse of autonomous things
13 2.77 0 100 -41.11 compared to solutions provided by soft-
14 4.50 0 100 -41.80 ware engineers with no experience with
Learning 24.44 100 61.11 53.55
zeroed 0 0 100 -40
IoT development.
TABLE 6 6 D ISCUSSION
Using the same solution in a different environment - day
average.
In this section, we analyze the empirical exper-
imental results to understand which tasks are
Energy% People% Trip% Fitness better performed by humans and which by al-
Average gorithms. This is important for selecting whether
Participant 50.52 100 38.14 56.90 software engineers or machine learning can ac-
12 complish a specific task better.
Average
8.46 100 46.29 68.83 In our empirical study, in which we have as-
Learning
sessed performance and reuse tasks, we accepted
three alternative hypotheses and rejected one:
5.1 Hypothesis Testing
In this section, we investigate the hypotheses re- Accepted:
lated to the solutions’ reuse evaluation, that is H- 1) An ML-based approach improves the
RQ2 and H-RQ4, as presented in subsection 2.2. performance of autonomous things com-
Their alternative hypotheses state that an ML- pared to solutions provided by software
based approach improves the performance of au- engineers without experience with IoT
tonomous things compared to solutions provided development.
by software engineers, software engineers with 2) An ML-based approach increases the
experience in IoT development, and software en- reuse of autonomous things compared
gineers without experience in IoT development, to solutions provided by IoT expert soft-
respectively. We planned to perform a statisti- ware engineers.
cal development to evaluate these hypotheses. 3) An ML-based approach increases the
However, as depicted in Figure 15, in the new reuse of autonomous things compared
scenario, 0% of participants provided a result to solutions provided by software engi-
better than the result provided by the machine- neers without experience with IoT devel-
learning solution. In addition, from the group opment.
14
Rejected: (2016) [127] report the diversity of participants as

1) An ML-based approach improves the another possible threat. Therefore, in our study,
performance of autonomous things com- we needed to be aware of at least two threats
pared to solutions provided by IoT ex- to validity namely: we have selected a sample of
pert software engineers. only 14 participants, which may not be enough
to achieve conclusive results; and our sample
Based on these results, we have found ev- consisted of only graduate students from two
idence that the use of machine-learning tech- Brazilian universities. Such a group may not be
niques can perform some SE tasks better than representative of all software engineers, who may
software engineers, considering solutions that have substantially more professional experience
improve performance and increase reuse. As il- and background.
lustrated in the experimental results, only one To mitigate the problems of the number of
of the 14 software engineers provided a solution participants and their diversity, we selected our
that could be reused in a new scenario. Further participants carefully. All of them have at least
none of those software engineers provided a so- two years of experience with software develop-
lution that works better than the ML’s solution ment. In addition, we allowed participants to
in this new scenario. If the flexibility of the ap- solve the problem by manipulating a pseudocode
plication is the most important factor, based on version, thereby avoiding gaps in the partici-
our results, we can safely recommend the use of pants’ knowledge, such as experience with a par-
machine learning. ticular programming language or architecture.
However, if we had considered performance Note that a survey was used to select participants
as the only important factor to evaluate the qual- and they all indicated a level of experience with
ity of these solutions, we have found evidence pseudocode. The pseudocode provided by each
that software engineers can perform SE tasks bet- participant was carefully translated into Java as
ter than machine learning, considering “better” this is the language supported by the Framework
as a solution that improves performance. As de- for the Internet of Things.
scribed in our experiments, we cannot state that Oizumi et al. (2017) reported a third threat
ML improves the performance of an application to validity in [126], namely, possible misunder-
in comparison to solutions provided by domain standings during the study. To mitigate this prob-
expert software engineers. This is also an interest- lem of misunderstandings, we asked all partici-
ing result as many researchers, especially in the pants to write about their understanding of the
IoT domain, have strictly focused on automating problem both before and after applying the solu-
software development. tion. All participants indicated that they under-
In brief, our experiment indicates that in some stood the task completely. We also asked them
cases, software engineers outperform machine- about their confidence in their proposed solu-
learning algorithms, whereas in other cases, they tion. Most of them evaluated their own solution
do not. The evidence shows that it is important to with the highest grade, allowing us to increase
know which one performs better in different sit- our confidence in the experimental results. In
uations in order to determine ways for software addition, we assisted the participants during the
engineers to work cooperatively and effectively entire study, making sure they understood the
with automated machine-learning procedures. experimental task
7 T HREATS TO VALIDITY 8 R ELATED W ORK

Although we have designed and conducted the Comparing intelligent machines to the ability of
experiments carefully, there are always factors a person to solve a particular problem is not
that can challenge the experiments validity. Some a new approach. This kind of discussion has
threats to validity as described in [123] could been promoted since the beginning of Artificial
indeed limit the legitimacy of our results. In this Intelligence. For example, in 1997, an important
section, we present the actions taken to mitigate moment in the history of technology happened
their impact of these factors on the research re- with Garry Kasparov’s 1997 chess match against
sults. the IBM supercomputer Deep Blue [128].
As Oizumi et al. (2017) report in [126], the Recently, Silver et al. (2016, 2017) [129], [130]
number of participants in the study can be a published a paper in the Nature Science Journal,
threat to validity. In addition, Fernandes et al comparing the performance of a ML technique
15
against the results achieved by the world cham- line of investigation is: “Could a software engi-
pion in the game of Go. In [130], Silver et al. neer solve a specific development task better than
(2017) state that their program “achieved super- an ML algorithm?”. Indeed, it is fundamental to
human performance.” evaluate which tasks are better performed by en-
Whiteson et al. [122] indirectly performed this gineers or ML procedures so that they can work
comparison, by evaluating the use of three dif- together more effectively and also provide more
ferent approaches of the neuroevolution learning insight into novel human-in-the-loop machine-
algorithm to solve the same tasks: (i) coevolution, learning approaches to support SE tasks.
that is mostly unassisted by human knowledge; This paper appears to be the first to pro-
(ii) layered learning, that is highly assisted; and vide an empirical study comparing how soft-
(iii) concurrent layered learning, that is a mixed ware engineers and machine-learning algorithms
approach. The authors state that their results achieve performance and reuse tasks. In brief,
“demonstrate that the appropriate level of hu- as a result of our experiment, we have found
man assistance depends critically on the diffi- evidence that in some cases, software engineers
culty of the problem.” outperform machine-learning algorithms, and in
Furthermore, there is also a new approach other cases, they do not. Further, as is typical in
in machine learning, called Automatic Machine experimental studies, although we have designed
Learning (Auto-ML) [100], which uses learning and conducted the experiment carefully, there are
to set the parameters of a learning algorithm au- always factors that can threaten the experiment’s
tomatically. In a traditional approach, a software validity. For example, some threats include the
engineer with machine learning skills is respon- number and diversity of the software engineers
sible for finding a good configuration for the involved in our experiment.
algorithm parameters. Zoth and Lee [100] present Understanding how software engineers fare
an Auto-ML-based approach to design a neural against ML algorithms is essential to support
network to classify images of a specific dataset. new methodologies for developing human-in-
In addition, they compared their results with the-loop approaches in which machine learning
the previous state-of-the-art model, which was automated procedures assist software developers
designed by an ML expert engineer. According in achieving their tasks. For example, method-
to Zoth and Lee [100] , their AutoML-based ap- ologies to define which agent (engineers or auto-
proach “can design a novel network architecture mated ML procedure) should execute a specific
that rivals the best human-invented architecture task in a software development set. Based on
in terms of test set accuracy.” Zoth and Lee this understanding, these methodologies can pro-
also showed that a machine-learning technique vide a basis for software engineers and machine
is capable of beating a software engineer with learning algorithms to cooperate in Software En-
ML skills in a specific software engineering task, gineering development more effectively.
but the authors do not discuss this subject in the
Future work to extend the proposed experi-
paper.
ment includes: (i) conducting further empirical
Our paper appears to be the first to pro-
studies to assess other SE tasks, such as design,
vide an empirical study to investigate the use
maintenance and testing; (ii) experimenting with
of a machine-learning techniques to solve a
other machine-learning algorithms such as re-
problem in the field of Software Engineering,
inforcement learning and backpropagation; and
by comparing the solution provided by a ML-
(iii) using different criteria to evaluate task exe-
based approach against solutions provided by
cution.
software engineers.
Possible tasks that could be investigated (refer
to (i)) include programming tasks, in which case
tasks performed by software development teams
9 C ONCLUSION AND F UTURE W ORK
and ML algorithms are compared. For example,
Several researchers have proposed the use of we could invite software developers from the
machine-learning techniques to automate soft- team with the highest score in the last ACM Inter-
ware engineering tasks. However, most of these national Collegiate Programming Contest [131],
approaches do not direct efforts toward asking which is one of the most important programming
whether ML-based procedures have higher suc- championships in the world, to be involved in
cess rates than current standard and manual this comparison. This competition evaluates the
practices. A relevant question in this potential capability of software engineers to solve complex
16
software problems. Software engineers are classi- international conference on Software Engineer-
fied according to the number of right solutions, ing. IEEE Computer Society Press, 1987,
performance of the solutions and development pp. 200–211.
time. [7] D. Partridge, “Artificial intelligence and
Another line of investigation could address software engineering: a survey of possibil-
the use of different qualitative or quantitative ities,” Information and Software Technology,
methodologies. For example, the task execu- vol. 30, no. 3, pp. 146–152, 1988.
tion comparison could rely on reference per- [8] L. C. Cheung, S. Ip, and T. Holden, “Survey
formances, such as the performance of highly of artificial intelligence impacts on infor-
successful performers [100], [129], [130] . This mation systems engineering,” Information
research work can also be extended by propos- and Software Technology, vol. 33, no. 7, pp.
ing, based on the comparison between the per- 499–508, 1991.
formance of engineers and ML algorithms, a [9] D. Partridge, Artificial Intelligence in Soft-
methodology for more effective task allocation. ware Engineering. Wiley Online Library,
This methodology could, in principle, lead to 1998.
more effective ways to allocate tasks such as soft- [10] A. Van Lamsweerde and L. Willemet, “In-
ware development in cooperative work involv- ferring declarative requirements specifica-
ing humans and automated procedures. Such tions from operational scenarios,” IEEE
human-in-the-loop approaches, which take into Transactions on Software Engineering, vol. 24,
account the strengths and weaknesses of humans no. 12, pp. 1089–1114, 1998.
and machine learning algorithms, are fundamen- [11] G. D. Boetticher, “Using machine learn-
tal to provide a basis for cooperative work in ing to predict project effort: Empirical case
software engineering and possibly in other areas. studies in data-starved domains,” in Model
Based Requirements Workshop. Citeseer,
2001, pp. 17–24.
ACKNOWLEDGMENTS [12] F. Padberg, T. Ragg, and R. Schoknecht,
This work has been supported by the Labora- “Using machine learning for estimating the
tory of Software Engineering (LES) at PUC-Rio. defect content after an inspection,” IEEE
Our thanks to CAPES, CNPq, FAPERJ and PUC- Transactions on Software Engineering, vol. 30,
Rio for their support through scholarships and no. 1, pp. 17–28, 2004.
fellowships. We would also like to thank the [13] D. Zhang, “Applying machine learning al-
software engineers who participated in our ex- gorithms in software development,” in The
periment. Proceedings of 2000 Monterey Workshop on
Modeling Software System Structures, 2000,
pp. 275–285.
R EFERENCES [14] ——, “Machine learning in value-based
software test data generation,” in Tools with
[1] F. Brooks and H. Kugler, No silver bullet. Artificial Intelligence, 2006. ICTAI’06. 18th
April, 1987. IEEE International Conference on. IEEE,
[2] R. S. Pressman, Software engineering: a prac- 2006, pp. 732–736.
titioner’s approach. Palgrave Macmillan, [15] D. Zhang and J. J. Tsai, Machine learning
2005. applications in software engineering. World
[3] Q. Zhang, “Software developments,” Engi- Scientific, 2005, vol. 16.
neering Automation for Reliable Software, p. [16] D. Zhang, “Machine learning and value-
292, 2000. based software engineering: a research
[4] J. O. Kephart, “Research challenges of auto- agenda.” in SEKE, 2008, pp. 285–290.
nomic computing,” in Software Engineering, [17] T. M. Khoshgoftaar, “Introduction to the
2005. ICSE 2005. Proceedings. 27th Interna- special issue on quality engineering with
tional Conference on. IEEE, 2005, pp. 15–22. computational intelligence,” 2003.
[5] J. Mostow, “Foreword what is ai? and what [18] D. Zhang, “Machine learning and value-
does it have to do with software engineer- based software engineering,” in Software
ing?” IEEE Transactions on Software Engi- Applications: Concepts, Methodologies, Tools,
neering, vol. 11, no. 11, p. 1253, 1985. and Applications. IGI Global, 2009, pp.
[6] D. Barstow, “Artificial intelligence and soft- 3325–3339.
ware engineering,” in Proceedings of the 9th
17
[19] D. Zhang and J. J. Tsai, “Machine learning prediction,” International Journal of Software
and software engineering,” in Tools with Engineering and Computing, vol. 2, no. 2, pp.
Artificial Intelligence, 2002.(ICTAI 2002). Pro- 95–109, 2010.
ceedings. 14th IEEE International Conference [30] W. Zhang, Y. Yang, and Q. Wang, “Han-
on. IEEE, 2002, pp. 22–29. dling missing data in software effort pre-
[20] M. D. Kramer and D. Zhang, “Gaps: a diction with naive bayes and em algo-
genetic programming system,” in Computer rithm,” in Proceedings of the 7th International
Software and Applications Conference, 2000. Conference on Predictive Models in Software
COMPSAC 2000. The 24th Annual Interna- Engineering. ACM, 2011, p. 4.
tional. IEEE, 2000, pp. 614–619. [31] Ł. Radliński, “A framework for inte-
[21] A. Holzinger, M. Plass, K. Holzinger, G. C. grated software quality prediction using
Crişan, C.-M. Pintea, and V. Palade, “To- bayesian nets,” Computational Science and
wards interactive machine learning (iml): Its Applications-ICCSA 2011, pp. 310–325,
applying ant colony algorithms to solve 2011.
the traveling salesman problem with the [32] P. O. O. Sack, M. Bouneffa, Y. Maweed,
human-in-the-loop approach,” in Interna- and H. Basson, “On building an integrated
tional Conference on Availability, Reliability, and generic platform for software quality
and Security. Springer, 2016, pp. 81–95. evaluation,” in Information and Communica-
[22] A. Holzinger, “Interactive machine learn- tion Technologies, 2006. ICTTA’06. 2nd, vol. 2.
ing for health informatics: when do we IEEE, 2006, pp. 2872–2877.
need the human-in-the-loop?” Brain Infor- [33] M. Reformat and D. Zhang, “Introduc-
matics, vol. 3, no. 2, pp. 119–131, 2016. tion to the special issue on:“software
[23] S. Easterbrook, J. Singer, M.-A. Storey, and quality improvements and estimations
D. Damian, “Selecting empirical methods with intelligence-based methods”,” Soft-
for software engineering research,” Guide to ware Quality Journal, vol. 15, no. 3, pp. 237–
advanced empirical software engineering, pp. 240, 2007.
285–311, 2008. [34] B. Twala, M. Cartwright, and M. Shepperd,
[24] H. A. Simon, “Whether software engineer- “Applying rule induction in software pre-
ing needs to be artificially intelligent,” IEEE diction,” in Advances in Machine Learning
Transactions on Software Engineering, no. 7, Applications in Software Engineering. IGI
pp. 726–732, 1986. Global, 2007, pp. 265–286.
[25] I. Sommerville, “Artificial intelligence and [35] V. U. Challagulla, F. B. Bastani, and I.-L.
systems engineering,” Prospects for Arti- Yen, “High-confidence compositional relia-
ficial Intelligence: Proceedings of AISB’93, bility assessment of soa-based systems us-
29 March-2 April 1993, Birmingham, UK, ing machine learning techniques,” in Ma-
vol. 17, p. 48, 1993. chine Learning in Cyber Trust. Springer,
[26] R. S. Michalski, J. G. Carbonell, and T. M. 2009, pp. 279–322.
Mitchell, Machine learning: An artificial in- [36] R. C. Veras, S. R. Meira, A. L. Oliveira,
telligence approach. Springer Science & and B. J. Melo, “Comparative study of clus-
Business Media, 2013. tering techniques for the organization of
[27] A. Marchetto and A. Trentini, “Evaluating software repositories,” in Hybrid Intelligent
web applications testability by combining Systems, 2007. HIS 2007. 7th International
metrics and analogies,” in Information and Conference on. IEEE, 2007, pp. 372–377.
Communications Technology, 2005. Enabling [37] I. Birzniece and M. Kirikova, “Interactive
Technologies for the New Knowledge Society: inductive learning service for indirect anal-
ITI 3rd International Conference on. IEEE, ysis of study subject compatibility,” in Pro-
2005, pp. 751–779. ceedings of the BeneLearn, 2010, pp. 1–6.
[28] S. Bouktif, F. Ahmed, I. Khalil, and G. Anto- [38] D. B. Hanchate, “Analysis, mathemati-
niol, “A novel composite model approach cal modeling and algorithm for software
to improve ality prediction,” Information project scheduling using bcga,” in In-
and Software Technology, vol. 52, no. 12, pp. telligent Computing and Intelligent Systems
1298–1311, 2010. (ICIS), 2010 IEEE International Conference on,
[29] L. Radlinski, “A survey of bayesian net vol. 3. IEEE, 2010, pp. 1–7.
models for software development effort [39] Z. Xu and B. Song, “A machine learning ap-
18
plication for human resource data mining H. Meddeb, “On the use of time series
problem,” Advances in Knowledge Discovery and search based software engineering for
and Data Mining, pp. 847–856, 2006. refactoring recommendation,” in Proceed-
[40] J. Wen, S. Li, Z. Lin, Y. Hu, and C. Huang, ings of the 7th International Conference on
“Systematic literature review of machine Management of computational and collective
learning based software development ef- intElligence in Digital EcoSystems. ACM,
fort estimation models,” Information and 2015, pp. 35–42.
Software Technology, vol. 54, no. 1, pp. 41– [50] V. U. Challagulla, F. B. Bastani, I.-L.
59, 2012. Yen, and R. A. Paul, “Empirical assess-
[41] E. Rashid, S. Patnayak, and V. Bhattacher- ment of machine learning based soft-
jee, “A survey in the area of machine learn- ware defect prediction techniques,” in
ing and its application for software quality Object-Oriented Real-Time Dependable Sys-
prediction,” ACM SIGSOFT Software Engi- tems, 2005. WORDS 2005. 10th IEEE Inter-
neering Notes, vol. 37, no. 5, pp. 1–7, 2012. national Workshop on. IEEE, 2005, pp. 263–
[42] H. A. Al-Jamimi and M. Ahmed, “Machine 270.
learning-based software quality prediction [51] K. Kaminsky and G. Boetticher, “Build-
models: state of the art,” in Information ing a genetically engineerable evolvable
Science and Applications (ICISA), 2013 Inter- program (geep) using breadth-based ex-
national Conference on. IEEE, 2013, pp. 1–4. plicit knowledge for predicting software
[43] Ł. Radliński, “Enhancing bayesian network defects,” in Fuzzy Information, 2004. Process-
model for integrated software quality pre- ing NAFIPS’04. IEEE Annual Meeting of the,
diction,” in Proc. Fourth International Confer- vol. 1. IEEE, 2004, pp. 10–15.
ence on Information, Process, and Knowledge [52] ——, “How to predict more with less, de-
Management, Valencia. Citeseer, 2012, pp. fect prediction using machine learners in
144–149. an implicitly data starved domain,” in The
[44] F. Pinel, P. Bouvry, B. Dorronsoro, and S. U. 8th world multiconference on systemics, cyber-
Khan, “Savant: Automatic parallelization netics and informatics, Orlando, FL. Citeseer,
of a scheduling heuristic with machine 2004.
learning,” in Nature and Biologically Inspired [53] K. Kaminsky and G. D. Boetticher, “Better
Computing (NaBIC), 2013 World Congress on. software defect prediction using equalized
IEEE, 2013, pp. 52–57. learning with machine learners,” Knowledge
[45] D. Novitasari, I. Cholissodin, and W. F. Sharing and Collaborative Engineering, 2004.
Mahmudy, “Optimizing svr using local [54] O. Kutlubay and A. Bener, “A machine
best pso for software effort estimation,” learning based model for software defect
Journal of Information Technology and Com- prediction,” working paer, Boaziçi University,
puter Science, vol. 1, no. 1, 2016. Computer Engineering Department, 2005.
[46] Ł. Radliński, “Towards expert-based mod- [55] X. Ren, “Learn to predict “affecting
elling of integrated software quality,” Jour- changes” in software engineering,” 2003.
nal of Theoretical and Applied Computer Sci- [56] E. Ceylan, F. O. Kutlubay, and A. B. Bener,
ence, vol. 6, no. 2, pp. 13–26, 2012. “Software defect identification using ma-
[47] T. Rongfa, “Defect classification method chine learning techniques,” in Software En-
for software management quality control gineering and Advanced Applications, 2006.
based on decision tree learning,” in Ad- SEAA’06. 32nd EUROMICRO Conference on.
vanced Technology in Teaching-Proceedings IEEE, 2006, pp. 240–247.
of the 2009 3rd International Conference on [57] Y. Kastro and A. B. Bener, “A defect pre-
Teaching and Computational Science (WTCS diction method for software versioning,”
2009). Springer, 2012, pp. 721–728. Software Quality Journal, vol. 16, no. 4, pp.
[48] R. Rana and M. Staron, “Machine learn- 543–562, 2008.
ing approach for quality assessment and [58] O. Kutlubay, B. Turhan, and A. B. Bener, “A
prediction in large software organizations,” two-step model for defect density estima-
in Software Engineering and Service Science tion,” in Software Engineering and Advanced
(ICSESS), 2015 6th IEEE International Con- Applications, 2007. 33rd EUROMICRO Con-
ference on. IEEE, 2015, pp. 1098–1101. ference on. IEEE, 2007, pp. 322–332.
[49] H. Wang, M. Kessentini, W. Grosky, and [59] A. S. Namin and M. Sridharan, “Bayesian
19
reasoning for software testing,” in Proceed- gramms using machine learning tech-
ings of the FSE/SDP workshop on Future of niques,” in Artificial Intelligence Techniques
software engineering research. ACM, 2010, in Software Engineering Workshop, 2008.
pp. 349–354. [70] O. Maqbool and H. Babri, “Bayesian learn-
[60] C. Murphy and G. Kaiser, “Metamor- ing for software architecture recovery,” in
phic runtime checking of non-testable pro- Electrical Engineering, 2007. ICEE’07. Inter-
grams,” Columbia University Dept of Com- national Conference on. IEEE, 2007, pp. 1–6.
puter Science Tech Report cucs-042-09, p. [71] A. Okutan, “Software defect prediction us-
9293, 2009. ing bayesian networks and kernel meth-
[61] W. Afzal, R. Torkar, R. Feldt, and ods,” Ph.D. dissertation, ISIK UNIVER-
T. Gorschek, “Genetic programming for SITY, 2012.
cross-release fault count predictions in [72] D. Cotroneo, R. Pietrantuono, and S. Russo,
large and complex software projects,” Evo- “A learning-based method for combining
lutionary Computation and Optimization Al- testing techniques,” in Proceedings of the
gorithms in Software Engineering, pp. 94–126, 2013 International Conference on Software En-
2010. gineering. IEEE Press, 2013, pp. 142–151.
[62] C. Murphy et al., “Using metamorphic test- [73] D. Zhang, “A value-based framework for
ing at runtime to detect defects in applica- software evolutionary testing,” in Advances
tions without test oracles,” 2008. in Abstract Intelligence and Soft Computing.
[63] D. Qiu, S. Fang, and Y. Li, “A framework to IGI Global, 2013, pp. 355–373.
discover potential deviation between pro- [74] A. Okutan and O. T. Yıldız, “Software de-
gram and requirement through mining ob- fect prediction using bayesian networks,”
ject graph,” in Computer Application and Sys- Empirical Software Engineering, vol. 19, no. 1,
tem Modeling (ICCASM), 2010 International pp. 154–181, 2014.
Conference on, vol. 4. IEEE, 2010, pp. V4– [75] S. Agarwal and D. Tomar, “A feature se-
110. lection based model for software defect
[64] C. Murphy, G. E. Kaiser et al., “Automatic prediction,” assessment, vol. 65, 2014.
detection of defects in applications without [76] G. Abaei and A. Selamat, “Important is-
test oracles,” Dept. Comput. Sci., Columbia sues in software fault prediction: A road
Univ., New York, NY, USA, Tech. Rep. CUCS- map,” in Handbook of Research on Emerging
027-10, 2010. Advancements and Technologies in Software
[65] W. Afzal, “Search-based approaches to soft- Engineering. IGI Global, 2014, pp. 510–539.
ware fault prediction and software test- [77] A. Okutan and O. T. Yildiz, “A novel kernel
ing,” Ph.D. dissertation, Blekinge Institute to predict software defectiveness,” Journal
of Technology, 2009. of Systems and Software, vol. 119, pp. 109–
[66] M. K. Taghi, B. Cukic, and N. Seliya, “An 121, 2016.
empirical assessment on program module- [78] X.-d. Mu, R.-h. Chang, and L. Zhang,
order models,” Quality Technology & Quan- “Software defect prediction based on com-
titative Management, vol. 4, no. 2, pp. 171– petitive organization coevolutionary algo-
190, 2007. rithm,” Journal of Convergence Information
[67] J. H. Wang, N. Bouguila, and T. Bdiri, “Em- Technology (JCIT) Volume7, Number5, 2012.
pirical evaluation of selected algorithms for [79] J. Cahill, J. M. Hogan, and R. Thomas,
complexity-based classification of software “Predicting fault-prone software modules
modules and a new model,” in Intelligent with rank sum classification,” in Software
Systems: From Theory to Practice. Springer, Engineering Conference (ASWEC), 2013 22nd
2010, pp. 99–131. Australian. IEEE, 2013, pp. 211–219.
[68] H. Jin, Y. Wang, N.-W. Chen, Z.-J. Gou, [80] R. Rana, M. Staron, C. Berger, J. Hans-
and S. Wang, “Artificial neural network for son, M. Nilsson, and W. Meding, “The
automatic test oracles generation,” in Com- adoption of machine learning techniques
puter Science and Software Engineering, 2008 for software defect prediction: An initial
International Conference on, vol. 2. IEEE, industrial validation,” in Joint Conference
2008, pp. 727–730. on Knowledge-Based Software Engineering.
[69] J. Ferzund, S. N. Ahsan, and F. Wotawa, Springer, 2014, pp. 270–285.
“Automated classification of faults in pro- [81] T. Schulz, Ł. Radliński, T. Gorges, and
20
W. Rosenstiel, “Predicting the flow of de- 546.

fect correction effort using a bayesian net- [92] J. Fu, F. B. Bastani, I.-L. YEN et al.,
work model,” Empirical Software Engineer- “Semantic-driven component-based auto-
ing, vol. 18, no. 3, pp. 435–477, 2013. mated code synthesis,” Semantic Comput-
[82] E. Rashid, “R4 model for case-based rea- ing, pp. 249–283, 2010.
soning and its application for software [93] A. Katasonov, O. Kaykova, O. Khriyenko,
fault prediction,” International Journal of S. Nikitin, and V. Y. Terziyan, “Smart
Software Science and Computational Intelli- semantic middleware for the internet of
gence (IJSSCI), vol. 8, no. 3, pp. 19–38, 2016. things.” ICINCO-ICSO, vol. 8, pp. 169–178,
[83] ——, “Improvisation of case-based reason- 2008.
ing and its application for software fault [94] L. Baresi, S. Guinea, and A. Shahzada,
prediction,” International Journal of Services “Short paper: Harmonizing heterogeneous
Technology and Management, vol. 21, no. 4-6, components in sesame,” in Internet of
pp. 214–227, 2015. Things (WF-IoT), 2014 IEEE World Forum on.
[84] J. K. Chhabra and A. Parashar, “Predic- IEEE, 2014, pp. 197–198.
tion of changeability for object oriented [95] L. Zhu, H. Cai, and L. Jiang, “Minson: A
classes and packages by mining change business process self-adaptive framework
history,” in Electrical and Computer Engi- for smart office based on multi-agent,” in
neering (CCECE), 2014 IEEE 27th Canadian e-Business Engineering (ICEBE), 2014 IEEE
Conference on. IEEE, 2014, pp. 1–6. 11th International Conference on. IEEE, 2014,
[85] G. Spanoudakis, A. S. d. Garcez, and pp. 31–37.
A. Zisman, “Revising rules to capture re- [96] J. F. De Paz, J. Bajo, S. Rodrı́guez, G. Vil-
quirements traceability relations: A ma- larrubia, and J. M. Corchado, “Intelligent
chine learning approach.” in SEKE, 2003, system for lighting control in smart cities,”
pp. 570–577. Information Sciences, vol. 372, pp. 241–255,
[86] M. Shin and A. Goel, “Modeling soft- 2016.
ware component criticality using a ma- [97] I. Birzniece, “The use of inductive learning
chine learning approach,” Artificial Intelli- in information systems,” in Proceedings of
gence and Simulation, pp. 440–448, 2005. the 16th International Conference on Informa-
[87] J. S. Shirabad, “Predictive techniques in tion and Software Technologies, 2010, pp. 95–
software engineering,” in Encyclopedia of 101.
Machine Learning. Springer, 2011, pp. 782– [98] D. Alrajeh, A. Russo, and S. Uchitel, “Infer-
789. ring operational requirements from scenar-
[88] A. A. Araújo, M. Paixao, I. Yeltsin, A. Dan- ios and goal models using inductive learn-
tas, and J. Souza, “An architecture based ing,” in Proceedings of the 2006 international
on interactive optimization and machine workshop on Scenarios and state machines:
learning applied to the next release prob- models, algorithms, and tools. ACM, 2006,
lem,” Automated Software Engineering, pp. pp. 29–36.
1–49, 2016. [99] A. M. Sharifloo, A. Metzger, C. Quinton,
[89] T. Tourwé, J. Brichau, A. Kellens, and L. Baresi, and K. Pohl, “Learning and evo-
K. Gybels, “Induced intentional software lution in dynamic software product lines,”
views,” Computer Languages, Systems & in Proceedings of the 11th International Sym-
Structures, vol. 30, no. 1, pp. 35–47, 2004. posium on Software Engineering for Adaptive
[90] J. S. Di Stefano and T. Menzies, “Machine and Self-Managing Systems. ACM, 2016, pp.
learning for software engineering: Case 158–164.
studies in software reuse,” in Tools with [100] B. Zoph and Q. V. Le, “Neural architecture
Artificial Intelligence, 2002.(ICTAI 2002). Pro- search with reinforcement learning,” arXiv
ceedings. 14th IEEE International Conference preprint arXiv:1611.01578, 2016.
on. IEEE, 2002, pp. 246–251. [101] N. M. do Nascimento and C. J. P. de Lu-
[91] J. Fu, F. B. Bastani, and I.-L. Yen, “Auto- cena, “Fiot: An agent-based framework for
mated ai planning and code pattern based self-adaptive and self-organizing applica-
code synthesis,” in Tools with Artificial In- tions based on the internet of things,” Infor-
telligence, 2006. ICTAI’06. 18th IEEE Interna- mation Sciences, vol. 378, pp. 161–176, 2017.
tional Conference on. IEEE, 2006, pp. 540– [102] F. Jacob and R. Tairas, “Code template in-
21
ference using language models,” in Proceed- “Softwarization of internet of things

ings of the 48th Annual Southeast Regional infrastructure for secure and smart
Conference. ACM, 2010, p. 104. healthcare,” Computer, vol. 50, no. 7, pp.
[103] B. Amal, M. Kessentini, S. Bechikh, J. Dea, 74–79, 2017.
and L. B. Said, “On the use of machine [114] I. Ayala, M. Amor, L. Fuentes, and J. M.
learning and search-based software engi- Troya, “A software product line process to
neering for ill-defined fitness function: a develop agents for the iot,” Sensors, vol. 15,
case study on software refactoring,” in In- no. 7, pp. 15 640–15 660, 2015.
ternational Symposium on Search Based Soft- [115] J.-P. Briot, N. M. de Nascimento, and C. J. P.
ware Engineering. Springer, 2014, pp. 31– de Lucena, “A multi-agent architecture for
45. quantified fruits: Design and experience,”
[104] Y. Peng, G. Wang, and H. Wang, “User in 28th International Conference on Soft-
preferences based software defect detection ware Engineering & Knowledge Engineering
algorithms selection using mcdm,” Informa- (SEKE’2016). SEKE/Knowledge Systems
tion Sciences, vol. 191, pp. 3–13, 2012. Institute, PA, USA, 2016, pp. 369–374.
[105] H. A. Abbass, E. Petraki, K. Merrick, J. Har- [116] N. M. Nascimento, “FIoT: An agent-
vey, and M. Barlow, “Trusted autonomy based framework for self-adaptive and
and cognitive cyber symbiosis: Open chal- self-organizing internet of things applica-
lenges,” Cognitive computation, vol. 8, no. 3, tions,” Master’s thesis, PUC-Rio, Rio de
pp. 385–408, 2016. Janeiro, Brazil, August 2015.
[106] W. G. Baxt, “Use of an artificial neural [117] N. M. d. Nascimento, C. J. P. d. Lucena,
network for the diagnosis of myocardial and H. Fuks, “Modeling quantified things
infarction,” Annals of internal medicine, vol. using a multi-agent system,” in IEEE / WIC
115, no. 11, pp. 843–848, 1991. / ACM International Conference on Web Intel-
[107] M. A. Mazurowski, P. A. Habas, J. M. ligence and Intelligent Agent Technology (WI-
Zurada, J. Y. Lo, J. A. Baker, and G. D. IAT), vol. 1. IEEE, 2015, pp. 26–32.
Tourassi, “Training neural network clas- [118] N. M. NASCIMENTO and C. J. P. LU-
sifiers for medical decision making: The CENA, “Engineering cooperative smart
effects of imbalanced datasets on classifica- things based on embodied cognition,” in
tion performance,” Neural networks, vol. 21, NASA/ESA Conference on Adaptive Hardware
no. 2, pp. 427–436, 2008. and Systems (AHS 2017). IEEE, 2017.
[108] N. M. do Nascimento, M. L. Viana, and [119] Apple, “Homekit,”
C. J. P. de Lucena, “An iot-based tool for https://developer.apple.com/homekit/,
human gas monitoring,” in IXV Congresso March 2017.
Brasileiro de Informatica em Saude (CBIS), [120] Samsung, “Samsung smart things,”
vol. 1. SBIS, 2016, pp. 96–98. https://www.smartthings.com, March
[109] R. Morejón, M. Viana, and C. Lucena, 2017.
“Generating software agents for data min- [121] D. I. Sjøberg, T. Dybå, B. C. Anda, and
ing: An example for the health data area,” J. E. Hannay, “Building theories in software
in International Conference on Software En- engineering,” Guide to advanced empirical
gineering & Knowledge Engineering-SEKE, software engineering, pp. 312–336, 2008.
2017. [122] S. Whiteson, N. Kohl, R. Miikkulainen, and
[110] S. Haykin, Neural Networks: A Com- P. Stone, “Evolving soccer keepaway play-
prehensive Foundation. Macmillan, 1994. ers through task decomposition,” Machine
[Online]. Available: http://books.google. Learning, vol. 59, no. 1-2, pp. 5–30, 2005.
com.br/books?id=PSAPAQAAMAAJ [123] C. Wohlin, P. Runeson, M. Höst, M. C.
[111] D. Castelvecchi, “Can we open the black Ohlsson, B. Regnell, and A. Wesslén, Exper-
box of ai?” Nature News, vol. 538, no. 7623, imentation in software engineering. Springer
p. 20, 2016. Science & Business Media, 2012.
[112] L. Atzori, A. Iera, and G. Morabito, “The [124] R. S. Sutton and A. G. Barto, Reinforcement
internet of things: A survey,” Computer net- learning: An introduction. MIT press Cam-
works, vol. 54, no. 15, pp. 2787–2805, 2010. bridge, 1998, vol. 1, no. 1.
[113] M. A. Salahuddin, A. Al-Fuqaha, [125] R. Peck and J. Devore, Statistics: The Explo-
M. Guizani, K. Shuaib, and F. Sallabi, ration & Analysis of Data. Nelson Educa-
22
tion, 2011.
[126] W. Oizumi, L. Sousa, A. Garcia, R. Oliveira,
A. Oliveira, O. Agbachi, and C. Lucena,
“Revealing design problems in stinky code:
a mixed-method study,” in Proceedings of
the 11th Brazilian Symposium on Software
Components, Architectures, and Reuse. ACM,
2017, p. 5.
[127] E. Fernandes, F. Ferreira, J. A. Netto, and
E. Figueiredo, “Information systems devel-
opment with pair programming: An aca-
demic quasi-experiment,” in Proceedings of
the XII Brazilian Symposium on Information
Systems on Brazilian Symposium on Infor-
mation Systems: Information Systems in the
Cloud Computing Era-Volume 1. Brazilian
Computer Society, 2016, p. 64.
[128] G. Kasparov, Deep Thinking: Where Machine
Intelligence Ends and Human Creativity Be-
gins. Hachette UK, 2017.
[129] D. Silver, A. Huang, C. J. Maddison,
A. Guez, L. Sifre, G. Van Den Driessche,
J. Schrittwieser, I. Antonoglou, V. Panneer-
shelvam, M. Lanctot et al., “Mastering the
game of go with deep neural networks and
tree search,” Nature, vol. 529, no. 7587, pp.
484–489, 2016.
[130] D. Silver, J. Schrittwieser, K. Simonyan,
I. Antonoglou, A. Huang, A. Guez, T. Hu-
bert, L. Baker, M. Lai, A. Bolton et al.,
“Mastering the game of go without human
knowledge,” Nature, vol. 550, no. 7676, pp.
354–359, 2017.
[131] A. Trotman and C. Handley, “Program-
ming contest strategy,” Computers & Edu-
cation, vol. 50, no. 3, pp. 821–837, 2008.

Software Engineering For Machine Learning: A Case Study

Uploaded by

Copyright:

Available Formats

Software Engineering For Machine Learning: A Case Study

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Software Engineering For Machine Learning: A Case Study

Uploaded by

Copyright:

Available Formats

What are the three aspects of the AI domain that make it fundamentally different from prior software application domains according to the paper?

What are the three aspects of the AI domain that make it fundamentally different from prior software application domains according to the paper?

What are some of the challenges that organizations may face in creating large-scale AI solutions according to the paper?

What are some of the challenges that organizations may face in creating large-scale AI solutions according to the paper?

Software Engineering for Machine Learning:

Ece Kamar Nachiappan Nagappan Besmira Nushi Thomas Zimmermann

A Survey of Software Quality for Machine

Abstract—Machine learning (ML) is now widespread. Tradi- A. Survey objective

to unknown data cannot be verified in terms of correctness. Software Engineering

solve the problems. In this paper, we present a survey of Recognition

0-7695-6432-1/18/$31.00 ©2018 IEEE 279

Category Name Number References Tags R 1 S 2 Sub Total

Software Engineers vs.

Index Terms—Machine learning, human-in-the-loop, software engineer, automatic software engineering,

S OFTWARE engineering processes can be very

Software Engineers produce solutions

with IoT skills

produce solutions better Internet of Things

2 H OW DO SOFTWARE ENGINEERS COM - RQ4. ...software engineers without IoT

• HA. An ML-based approach improves the TABLE 1

2.3 The object of the study: The Framework TABLE 2

Medium Each streetlight has to execute three tasks

if (lighting_sensor = Medium AND detected_person =

different from “0.0”, it consumes 0.1 of departure points

4 E XPERIMENT - PART 1 - R ESULTS .)1 ).5 .)1 ).5 % .)1 ).5

100 4.1 Discussion: Participants Knowledge in

the results achieved by each group performing

• H0. An ML-based approach does not im-

software engineers without experience in As the t − statistic value is below the

Software Fig. 15. Participants’ solution results in the second sce-

Rejected: (2016) [127] report the diversity of participants as

7 T HREATS TO VALIDITY 8 R ELATED W ORK

W. Rosenstiel, “Predicting the flow of de- 546.

ference using language models,” in Proceed- “Softwarization of internet of things

You might also like