3.3 Human-in-the-loop Approaches

The current reality of hyperparameter tuning is highly human-driven. Hyperparameter tuning is usually performed manually [9], following rules of thumb and experience accumulated through practice [14, 15]. Especially with complex models, the current trial-and-error process is inefficient in terms of time spent and computational load [22], and does not favor reproducibility, knowledge transfer, and collaboration (e.g., [10]). To address these limitations, as well as to increase the transparency of ML models to humans, current research efforts in the human-in-the-loop camp have focused on three visual analytics foci: model structure, model prediction, and model performance metrics.

Visualizing model structure. Many visualization techniques have been developed to help users understand the structure of different ML models. Liu et al. [17] developed a visual analytics system that shows all the layers, neurons, and other components of a convolutional neural network, as well as how the training data is processed within it. GANViz [29] visualizes the adversarial training process of generative adversarial nets with linked coordinated visualizations. Rather than aiming at an in-depth understanding of a specific model, we investigate more lightweight, general-purpose support for model-agnostic hyperparameter tuning.

Interpreting model prediction. There is a growing research interest in model-agnostic interpretation, which focuses on understanding model prediction behaviors. Ribeiro et al. developed LIME [23], a model-agnostic approach that learns an interpretable model locally around the prediction. This framework generalizes to different models and datasets, and efficiently enhances the interpretability of a given model. However, before drilling down into such expensive interpretation and copious details, we focus on comparing many alternative models and identifying better ones.

Interpreting model performance metrics. The lack of support for evaluating the performance of ML models has been known for at least a decade now. For example, Patel and collaborators [12], in a 2008 study with data scientists, observed that tuning ML models is an iterative and exploratory activity. It was hard for the data scientists in the study to track performance across iterations, which makes the tuning process challenging. They argued that ML tools should include more suitable visualizations to help data scientists with their work. A recent attempt to provide more suitable visualizations of model evaluation metrics and model comparisons was made by Tsay and colleagues [27]. This is a research area that we expect will receive increasing attention in the near future.

4 HYPERPARAMETER TUNING REQUIREMENTS

In the first phase of the project, we interviewed data science practitioners in industry to investigate their hyperparameter tuning practice, as ML experts and as colleagues of domain experts with business needs and knowledge. In this section, we characterize the key steps of the process and the user needs not yet addressed by current tools.

4.1 Method

We interviewed six data science practitioners. While all were experienced in hyperparameter tuning, they held various job roles, including "data scientist", "data science engineer and researcher", "machine learning engineer", and "software engineer for a data analytics product". We will refer to the six interviewees as P1, P2, P3, P4, P5, and P6.

4.2 Hyperparameter Tuning: Practice

Our interviews may be affected by potential sampling biases: the interviewees were based in the United States and the UK, and they worked for a software company that builds applications for machine learning and analytics on big data. However, the sample included a good variety of job roles and, we believe, can represent the needs of a broader range of data science practitioners in the industry (i.e.,
Figure 2: A text file showing some experiment history from P4
4.2.3 What performance metrics do you track? How?

The commonly used performance metrics for supervised models include accuracy, precision, recall, and ROC curves. Other examples are the learning curve (to check its slope and when it saturates) and training loss vs. validation loss (to check when the latter starts increasing while the former keeps decreasing, a sign of overfitting).
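To make the last check concrete, the minimal sketch below flags the epoch where validation loss starts rising while training loss keeps falling; the record format and patience threshold are our own illustrative assumptions, not part of any interviewee's tooling.

# Minimal sketch (our assumption, not an interviewee's actual tooling):
# flag the epoch where validation loss starts rising while training
# loss keeps falling -- the overfitting signal described above.

def detect_overfitting(history, patience=3):
    """history: list of (train_loss, val_loss) tuples, one per epoch.
    Returns the epoch index where overfitting likely began, or None."""
    rising = 0
    for epoch in range(1, len(history)):
        train_prev, val_prev = history[epoch - 1]
        train_cur, val_cur = history[epoch]
        # validation loss up while training loss still goes down
        if val_cur > val_prev and train_cur < train_prev:
            rising += 1
            if rising >= patience:
                return epoch - patience + 1
        else:
            rising = 0
    return None

history = [(0.90, 0.95), (0.60, 0.70), (0.45, 0.62),
           (0.35, 0.66), (0.28, 0.71), (0.22, 0.78)]
print(detect_overfitting(history, patience=2))  # -> 3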
4.2.4 Workflow

The six data science practitioners pointed to a similar underlying process of hyperparameter tuning. What we learned was consistent with the reports in the literature about the general process, which we summarized at the top of Figure 1. One interviewee (P1) summarized the process as follows: "We go through a typical data science workflow, which is to clean data, train a model, and then open it up over an API to a web front-end." We formalize the hyperparameter tuning process as a workflow with five sub-steps (shown in Figure 4), with the first four forming a loop.

Sub-step 1: Set hyperparameter values. At the outset of the workflow (the first square in Figure 4), the ML experts initiate the first batch of experiments by setting hyperparameter values based on their understanding of the data, the model algorithm, and the problem to solve. This sub-step recurs later as a restart of the loop if, after sub-step 4 (the fourth square in Figure 4), the ML expert decides that more tuning is still needed.
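To make the notion of a "batch" concrete, the sketch below expands candidate value lists into one configuration per experiment; the variable names and values are illustrative assumptions, not taken from the interviews.

# Minimal sketch (illustrative assumption): expand candidate value
# lists into one configuration per experiment, i.e., one "batch".
from itertools import product

candidates = {
    "batch_size": [28, 56, 128],
    "dropout_rate": [0.3, 0.5, 0.7],
    "num_epochs": [6, 12],
}

# Cartesian product: every combination becomes one experiment.
batch = [dict(zip(candidates, values))
         for values in product(*candidates.values())]

for i, config in enumerate(batch):
    print(f"experiment {i}: {config}")  # 3 * 3 * 2 = 18 experiments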
Sub-step 2: Hypothesize the impact of tuned hyperparameters with results of all experiments. At this stage the ML expert has just run a batch of experiments and wants to answer two questions: 1) What are the most impactful hyperparameters in these experiments? Is hyperparameter X relevant? 2) How do the hyperparameters influence the performance of the model, and which performance metrics should be considered? This sub-step is performed with the support of summative reports of the hyperparameters and performance metrics for a full batch of experiments.

Sub-step 3: Validate hypotheses with details of individual experiments. The ML expert may need to drill into the details of specific experiments to test the hypotheses developed in sub-step 2: 1) What do the details of this experiment say about my hypotheses? 2) Do I trust the predictions of this model by looking at the results? This sub-step represents an in-depth investigation that starts and ends back in sub-step 2 (see the bidirectional arrow in Figure 4). It is performed with the support of detailed reports on hyperparameters and performance metrics from an individual experiment. Typically, multiple micro-iterations occur between sub-steps 2 and 3.
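As a rough illustration of the kind of summative report that supports sub-step 2, the sketch below ranks hyperparameters by the strength of their correlation with a chosen metric across a batch; this is our own simplification (linear correlation only, with made-up numbers), not the analysis any interviewee described.

# Rough sketch (our simplification, not an interviewee's method):
# rank hyperparameters by |correlation| with a metric across a batch.
import pandas as pd

results = pd.DataFrame({
    "batch_size":   [28, 28, 56, 56, 128, 128],
    "dropout_rate": [0.3, 0.7, 0.3, 0.7, 0.3, 0.7],
    "accuracy":     [0.91, 0.92, 0.93, 0.95, 0.96, 0.97],
})

impact = (results[["batch_size", "dropout_rate"]]
          .corrwith(results["accuracy"])
          .abs()
          .sort_values(ascending=False))
print(impact)  # batch_size correlates most strongly in this toy batch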
Sub-step 4: Decide if more tuning is needed. Once the ML expert has analyzed the results of the current batch of experiments, s/he needs to decide: 1) Does the (best) model performance meet my expectations? If not, 2) Will more training improve the model performance, and will it be worth the effort, given the resources? This sub-step is performed with the support of the summative and detailed reports from the prior two sub-steps.

Sub-step 5: Review and Save Progress. If the ML expert decides, in sub-step 4, that no more tuning is needed, then s/he answers these questions: 1) How well am I able to recall the tuning process and communicate its results? 2) How useful are the records of my tuning progress? What is missing? What is superfluous? This sub-step is performed with the support of a final project-level report summarizing all the experiments from all batches, plus any comments or reminders the practitioner recorded during the tuning process.
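One plausible minimal form of such a project-level report is sketched below: all batches are concatenated and written out together with the practitioner's notes. The file layout and column names are purely our assumptions.

# Plausible sketch (file layout is our assumption): merge per-batch
# results and practitioner notes into one project-level report.
import pandas as pd

batch1 = pd.DataFrame({"experiment": [0, 1], "accuracy": [0.91, 0.95]})
batch2 = pd.DataFrame({"experiment": [2, 3], "accuracy": [0.96, 0.97]})
notes = {0: "baseline", 3: "best so far; largest batch size"}

report = pd.concat([batch1, batch2], ignore_index=True)
report["note"] = report["experiment"].map(notes).fillna("")
report.to_csv("project_report.csv", index=False)
print(report)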
4.3 Hyperparameter Tuning: Support Needed

Following the workflow in Figure 4, we identified at each step needs for visual analytics support that are not fully addressed by current data science tools.

4.3.1 Analytics of batches of experiments

One of the most evident needs emerging from the interviews is the need to aggregate results across experiments and conduct group-level comparisons among the experiments in a batch. The data science practitioners need visualizations that help determine which hyperparameter values are satisfying and which ones require more exploration. They also need to interactively customize the visualization. In the words of one interviewee (P2): "Visualization is to bring human interpretation in hyperparameter tuning... build a visualization that is a natural interpretation of the actual data passed in, ... Users always want variation, they should be allowed to customize the visualization". Currently, these visualizations are created manually and in an ad hoc fashion. A data science practitioner (P1) summarized his current practice as follows: "You try different combinations of hyperparameters, keep track of the performance, and visualize it somehow".
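The kind of group-level comparison described here can be approximated in a few lines; the sketch below (our illustration, with made-up numbers) compares mean accuracy per hyperparameter value across a batch.

# Illustrative sketch (made-up numbers): group-level comparison of
# experiments in a batch, aggregated per hyperparameter value.
import pandas as pd

batch = pd.DataFrame({
    "dropout_rate": [0.3, 0.3, 0.5, 0.5, 0.7, 0.7],
    "accuracy":     [0.91, 0.93, 0.92, 0.94, 0.96, 0.97],
})

# Which dropout rates look satisfying, which need more exploration?
summary = batch.groupby("dropout_rate")["accuracy"].agg(["mean", "std"])
print(summary.sort_values("mean", ascending=False))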
4.3.2 Analytics of individual experiments

A second general need is support for investigating the results and metadata of an individual experiment (trained model). As shown in Figure 4, these investigations happen as drill-down analyses to test the current working hypotheses. For example, the user may need to understand the training history of an experiment and whether the model predictions can be trusted. Additionally, s/he may want to review the metadata as a reminder of the hyperparameter values used. If the experiment is worth tracking for later comparisons, s/he can record some notes from the analysis. In particular, three of the six practitioners (P1, P4, P6) mentioned the need to review interpretability details such as examples of misclassifications by a supervised ML model. P1: "Interpretability is also an important factor to track when tuning hyperparameters. Is it predicting the right class for the right reason? I would want to get examples which get classified correctly or not." [23].

With computationally expensive experiments that may take hours, analysis of the training progress is desirable. Two interviewees explicitly pointed to the need for "early stopping" (P1) of ineffective experiments. P1: "If I see the loss curve jumping all over the place, or very noisy in comparison to the other loss curves, I don't even need to ever look at that model again, I just throw it out immediately".
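A crude way to operationalize P1's "noisy loss curve" test is to compare each experiment's loss fluctuation against the batch; the sketch below is our own approximation of that idea (with an arbitrary threshold and made-up curves), not a feature of any tool the interviewees used.

# Crude sketch (our approximation of P1's rule of thumb): flag an
# experiment for early stopping when its loss curve fluctuates much
# more than the other curves in the batch.
import statistics

def noisiness(losses):
    """Mean absolute epoch-to-epoch change of a loss curve."""
    return statistics.mean(abs(b - a) for a, b in zip(losses, losses[1:]))

curves = {
    "exp0": [0.9, 0.6, 0.45, 0.35, 0.30],
    "exp1": [0.9, 0.5, 0.80, 0.40, 0.75],  # jumping all over the place
    "exp2": [0.8, 0.55, 0.42, 0.33, 0.28],
}

scores = {name: noisiness(c) for name, c in curves.items()}
typical = statistics.median(scores.values())
for name, score in scores.items():
    if score > 2 * typical:  # threshold is arbitrary
        print(f"{name}: candidate for early stopping ({score:.2f})")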
4.3.3 Informing the next round of exploration

A third general need revolves around support for making complex decisions. The decision is whether to run a new batch of experiments and, if so, what new set of hyperparameter values to use, given the results of the current batch of experiments. One of the challenges is to monitor and reason about many hyperparameters at the same time. This is particularly evident when training deep learning models. The data science practitioners have to restrict the number of variables to keep in mind because of limited cognitive capacity. When deciding what small subset of the hyperparameter space to explore next, they need to capture the insights from the analysis of the existing experiments (see the first two needs, above). It is worth remarking that this decision consists in balancing observations, expectations, and resources: i.e., the performance observed in the current batch of experiments (observations), the desired level of performance given the problem (expectations), and the resources, such as time and computation, available to the project (resources). While the observations are in the tool, the expectations and resources are mostly implicit knowledge in the head of the data science practitioner; hence the need for visual analytics tools that involve the human. As summarized by P4: "There's never been a point in any project I've ever worked on where hyperparameter tuning was done. It's really just I [judging if] I have seen enough and I'm willing to make a decision about what the hyperparameter should be [to meet expectations]. So it's more of a question of timelines and schedules".

4.3.4 Project-level memory and communication

The fourth need pertains to memory and communication support at the project level. About memory support, P5 describes the need to capture what was done so as to easily recall the analysis trajectory later: "Now in the report, we have the accuracy (performance), but we do not capture what I changed. Over time we will probably forget what we've done, such as the numbers we have changed. So being able to track them is important. We will be able to go back and see how the model has improved. [A] 'rewinding' [capability]." Several interviewees (e.g., P1, P6) also mentioned that a project-level summary or report on all experiments should allow filtering, annotating, and comparing experiments: e.g., delete or archive an experiment, mark it as promising, filter, annotate, tag, and select and compare two experiments. For example, P1 reports that his current practice is to compare two experiments at a time, in detail. P6 reports that, at the end of her tuning project, she typically selects the best 2-3 experiments from the list and then runs the models on a new dataset, as a final "blind validation" step. Some interviewees suggested that project-level reporting would help collaborate with colleagues and communicate the results to the domain experts who requested the model. In P5's words, "to be able to communicate to the business sponsors outside our team how well the model is performing, and also, we would use it internally for [guiding future] tuning [...] [and] do a comparison between models as well".
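The operations the interviewees list (archive, tag, annotate, compare) suggest a simple project-level log; the sketch below is a hypothetical minimal data model for such a log, not the design of any existing tool.

# Hypothetical minimal data model (not an existing tool's design) for
# a project-level log supporting archive / tag / annotate operations.
from dataclasses import dataclass, field

@dataclass
class ExperimentRecord:
    exp_id: int
    hyperparameters: dict
    metrics: dict
    tags: set = field(default_factory=set)
    note: str = ""
    archived: bool = False

log = [
    ExperimentRecord(0, {"dropout_rate": 0.5}, {"accuracy": 0.94}),
    ExperimentRecord(1, {"dropout_rate": 0.7}, {"accuracy": 0.97}),
]

log[1].tags.add("promising")          # mark as promising
log[0].note = "baseline run"          # annotate
log[0].archived = True                # archive
best = [r for r in log if "promising" in r.tags and not r.archived]
print(best)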
5 HYPERTUNER PROTOTYPE AND EVALUATION

To explore how to leverage visual analytics support in the hyperparameter tuning workflow (Figure 4), we implemented and evaluated HyperTuner, an interactive prototype.
5.1 Implementation and Example Data

HyperTuner is a web-based application implemented on the Django framework [3]. The visualization components are developed with Bokeh [4] and D3.js [1].
Core Concepts in the Prototype. In the prototype, a user launches a training script with multiple hyperparameter settings as a Run, where each setting results in an Experiment. The user interactions supported around a specific Run correspond to completing a full loop connecting the first four sub-steps of the workflow in Figure 4. Once the experiments are completed, the prototype reports and visualizes the values of the performance metrics obtained from each experiment with respect to the corresponding hyperparameter value settings. The user typically completes multiple Runs until satisfied with the results, then selects the best models and reports these as the outcome of the entire model tuning Project.
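One way to picture the Project/Run/Experiment hierarchy is as nested records; the sketch below is our reading of these concepts, not HyperTuner's actual data model.

# Our reading of the Project / Run / Experiment hierarchy described
# above -- an illustrative sketch, not HyperTuner's actual data model.
from dataclasses import dataclass, field

@dataclass
class Experiment:               # one hyperparameter setting, trained once
    hyperparameters: dict
    metrics: dict = field(default_factory=dict)

@dataclass
class Run:                      # one batch of experiments from one launch
    experiments: list[Experiment] = field(default_factory=list)

@dataclass
class Project:                  # the whole tuning effort, many runs
    runs: list[Run] = field(default_factory=list)

project = Project()
run = Run([Experiment({"batch_size": b}) for b in (28, 56, 128)])
project.runs.append(run)
print(len(project.runs[0].experiments))  # -> 3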
Prototype Design and Contents. The design of the prototype visualizes the experiment results depending on the data types and values of the hyperparameters and performance metrics, and thus is model-agnostic and can be applied to different models (see core concepts above). However, to demonstrate the prototype with real-world use cases, we ran a real model tuning project and populated the visualizations with realistic data. Specifically, we built a simple convolutional neural network (CNN) and applied it to predict the MNIST dataset of handwritten digits [2]. As a proof of concept, we streamlined a training set of 60,000 examples and a test set of 10,000 examples.

The prototype can visualize experiment results of different models, substituting the current views with the data types and values of the corresponding hyperparameters and metrics. We made a strategic design decision to use a grid of scatter plots to visualize the training results with minimum manipulation and to leave it to data scientists to "perform the art" (P4) of interpretation. It is also common practice to use grids of scatter plots to explore, at an early stage of the analysis, correlations among sets of variables (see the visualizations in statistical tools such as SPSS and SAS).
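For concreteness, a CNN of the kind described could be trained as in the sketch below, with the three hyperparameters from the use case exposed as tunable variables; the paper does not name the training framework, so Keras is our assumption for illustration.

# Hedged sketch: a simple CNN on MNIST with the three hyperparameters
# from the use case as tunable variables. The framework (Keras) is our
# assumption; the paper does not specify one.
from tensorflow import keras
from tensorflow.keras import layers

def train_experiment(batch_size=128, dropout_rate=0.5, num_epochs=6):
    (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
    x_train = x_train[..., None] / 255.0   # 60,000 training examples
    x_test = x_test[..., None] / 255.0     # 10,000 test examples

    model = keras.Sequential([
        layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dropout(dropout_rate),
        layers.Dense(10, activation="softmax"),  # 10-way classifier
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    history = model.fit(x_train, y_train, batch_size=batch_size,
                        epochs=num_epochs, validation_data=(x_test, y_test))
    return history.history  # per-epoch metrics for the dashboards

# train_experiment(batch_size=128, dropout_rate=0.7, num_epochs=6)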
Use Cases. Based on the data obtained from training the CNN model with 8 experiments, below we describe the prototype implementation through the following use case with two phases.

Sarah is a data scientist who is building a 10-way classifier with the CNN model to recognize the handwritten digits in the MNIST dataset. She has implemented the model skeleton in a Python script with the hyperparameters as tunable variables, and needs to decide what values to use.

Phase 1: Sarah sets a list of values for each hyperparameter and launches a batch of Experiments in the current Run. After obtaining the training results, she makes sense of these results and decides how to continue the tuning process.

Phase 2: Sarah stops the tuning process, cleans up the training records, and saves her progress as a report.

5.2 Phase 1

Set Value to Launch Experiments. Sarah starts by experimenting with three hyperparameters: batch size (the number of samples that will be propagated through the network), dropout rate (the probability at which randomly selected neurons are ignored during training), and number of epochs (one epoch is one forward and backward pass of the entire dataset through the neural network). She sets several candidate values for each of the three hyperparameters (Figure 5, left) and leaves the remaining hyperparameters at default values. Notably, by adding an asterisk (*) after the step size she indicates that the step size increases by multiplying the previous value by two rather than adding two each time (e.g., 28,56,128 instead of 28,30,32). Then she selects the metrics she wants to use to measure the model performance (Figure 5, right). In response to her parameter setting actions, the tool automatically generates the command to execute the script via a command-line interface, which she commonly uses to run scripts (see the bottom left field in Figure 5). She can further customize and add more hyperparameters in the final command, and choose to log more performance metrics (the bottom right drop-down menu of Figure 5).
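The asterisk convention and the auto-generated command could look roughly like the sketch below; the spec syntax ("start:stop:step*") and the script name are assumptions we made for illustration, not HyperTuner's actual implementation.

# Illustrative sketch of the two behaviors described above; the spec
# syntax and the script name are our assumptions, not HyperTuner's
# actual implementation.
from itertools import product

def expand(spec):
    """'28:128:2*' -> [28, 56, 112]; '6:10:2' -> [6, 8, 10]."""
    start, stop, step = spec.split(":")
    multiply = step.endswith("*")
    step = int(step.rstrip("*"))
    values, v = [], int(start)
    while v <= int(stop):
        values.append(v)
        v = v * step if multiply else v + step
    return values

settings = {"batch_size": expand("28:128:2*"), "num_epochs": expand("6:12:6")}
for combo in product(*settings.values()):
    args = " ".join(f"--{k} {v}" for k, v in zip(settings, combo))
    print(f"python train_cnn.py {args}")  # generated CLI command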
Figure 5: Initial Parameter Setting to Launch Experiments
Run Dashboard. After the experiments are launched and completed, Sarah reviews the results of all the experiments, summarized in the run dashboard (Figure 6 A). By viewing the parameter panel on the left she is reminded of the hyperparameter values she had set when she launched this run, plus the metrics she selected to assess performance. For each hyperparameter and metric, she can scan the current value ranges under each slider bar. On the right, she sees both a table and a set of visualizations. The experiment results are summarized in the table at the top: it lists experiment ID, status, hyperparameters tuned, and performance metrics obtained. Under the table, she finds two types of visualizations. The first is an aggregated line chart showing the performance metrics (lines) obtained for each of the eight experiments (x-axis). She can click on the legend to choose which performance metric to view. The second is a grid of 12 scatter plots (three rows, four columns) showing the detailed results for each metric-hyperparameter combination: each row corresponds to one hyperparameter (always shown on the x-axis) and each column corresponds to one performance metric (always shown on the y-axis).
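The hyperparameter-by-metric grid can be sketched as below; the prototype renders it with Bokeh/D3, so matplotlib is our stand-in here, and the data is made up (a smaller 3-by-2 grid rather than the dashboard's 3-by-4).

# Sketch of the hyperparameter-by-metric scatter-plot grid described
# above. The prototype uses Bokeh/D3; matplotlib is our stand-in, and
# the numbers are made up.
import matplotlib.pyplot as plt

hyperparams = {"num_epochs": [6, 6, 6, 12, 12, 12, 6, 12],
               "batch_size": [28, 56, 128, 28, 56, 128, 28, 56],
               "dropout_rate": [0.3, 0.5, 0.7, 0.3, 0.5, 0.7, 0.5, 0.5]}
metrics = {"accuracy": [0.91, 0.93, 0.96, 0.92, 0.94, 0.97, 0.90, 0.95],
           "loss": [0.30, 0.25, 0.15, 0.28, 0.22, 0.12, 0.33, 0.18]}

rows, cols = len(hyperparams), len(metrics)
fig, axes = plt.subplots(rows, cols, figsize=(3 * cols, 2.5 * rows))
for i, (hp, x) in enumerate(hyperparams.items()):
    for j, (m, y) in enumerate(metrics.items()):
        axes[i][j].scatter(x, y)          # one experiment per point
        axes[i][j].set_xlabel(hp)         # hyperparameter on x-axis
        axes[i][j].set_ylabel(m)          # metric on y-axis
fig.tight_layout()
plt.show()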
the table, she analyzes the configuration summaries or metadata of
Sarah notices that experiment 4 has worse performance than the experiment (label 2), the line charts showing the performance
the others. She suspects that it’s because this experiment had a metrics (lines) within epoch (label 3) and across epochs (label 4). In
low number of epochs. So by brushing over the top right scatter- each of these line charts, she clicks on the legend to choose which
plot, she selects the experiments with the smaller number of epochs performance metric to view. On the lower right, she finds a visualiza-
(num epochs=6, Figure 6 A.1) to see how these experiments per- tion that is specific to the current model and dataset. In this case, she
formed (number of epochs corresponds to the first row). Since all sees a confusion matrix as a heat map. This visualization helps her
views in this dashboard are coordinated, the brushing operation re- assess if she can trust the model trained in the current experiment.
sults in selecting three experiments across all views, including the Specifically, she inspects the cells that show what digits are more
table at the top (Figure 6 B.2). It also results into updated sliders frequently misclassified and why by looking at the examples shown,
in the parameter panel on the left: the lower and upper limit of upon cell hovering, under the matrix. For example, she hovers over
each range (blue circles) in each slider is automatically re-positioned row 2 and column 6 and finds out that there are 14 data points that
to reflect the hyperparameter and performance metrics of the ex- are actually digit “2” but classified as digit “6” (see frequency 14
periments selected by the brushing (Figure 6 B.3). At this point, in the matrix, magnified in Figure 7 label 5), and the images at the
Experiment Dashboard. Sarah clicks on the Epoch Observatory sub-tab and enters the experiment dashboard, where the left panel and the table at the top persist from the run dashboard where she was earlier. Here, in the table, she selects one of the three rows (experiments) she is investigating. She replays the training process of the individual experiment (Figure 7, label 1). She repeats this process with the other two experiments. She is investigating what the loss and accuracy curves look like in each experiment and, specifically, in each epoch. This will help her find a good trade-off between good final performance and the amount of noise (i.e., metric fluctuations) in the training process. In the experiment dashboard, under the table, she analyzes the configuration summaries, or metadata, of the experiment (label 2), and the line charts showing the performance metrics (lines) within an epoch (label 3) and across epochs (label 4). In each of these line charts, she clicks on the legend to choose which performance metric to view. On the lower right, she finds a visualization that is specific to the current model and dataset. In this case, she sees a confusion matrix as a heat map. This visualization helps her assess whether she can trust the model trained in the current experiment. Specifically, she inspects the cells that show which digits are more frequently misclassified, and why, by looking at the examples shown under the matrix upon cell hovering. For example, she hovers over row 2 and column 6 and finds out that there are 14 data points that are actually digit "2" but classified as digit "6" (see frequency 14 in the matrix, magnified in Figure 7, label 5); the images at the bottom are examples of those misclassified data points. This gives her a sense of the quality of the model predictions.
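The underlying confusion-matrix inspection can be reproduced in a few lines; the sketch below uses scikit-learn on made-up labels, whereas the prototype's own rendering is a Bokeh/D3 heat map, not this code.

# Sketch of the confusion-matrix inspection described above, using
# scikit-learn on made-up labels; the prototype renders its own
# Bokeh/D3 heat map, not this code.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([2, 2, 2, 6, 6, 5, 2, 2])      # actual digits
y_pred = np.array([2, 6, 6, 6, 6, 5, 2, 6])      # model predictions

cm = confusion_matrix(y_true, y_pred, labels=range(10))
print(cm[2, 6])   # how often digit "2" is classified as digit "6" -> 3

# Indices of the misclassified examples to show under the matrix.
misclassified = np.where((y_true == 2) & (y_pred == 6))[0]
print(misclassified)  # -> [1 2 7]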
Support to Decide the Next Batch of Experiments. After examining the current experiment results in the global and local views
Figure 8: Project Dashboard