
Benchmarking Automated Machine Learning and Explainable AI

frameworks

Summary

As the name suggests, AutoML is the automation of machine learning tasks. It serves as a
bridge between varying levels of expertise when designing a machine learning system,
with the goal of democratizing AI and making it more accessible. There are various
approaches to this objective and many frameworks that put it to practical use.

Explainable AI is largely self-explanatory: the objective of this field of study is to make
machine learning models more interpretable and transparent, shining a light on these
so-called “black box” models. This report is a quantitative comparison of popular
open-source frameworks for AutoML and Explainable AI, and it comprises two parts.

Part I : Benchmarking of AutoML frameworks

1. Introduction
There are numerous approaches to AutoML, each with its own theoretical foundations.
Since a fair comparison of the underlying theory is not possible, the frameworks must
instead be compared on performance across various datasets and machine learning tasks.

2. Selected Frameworks

2.1 Auto-Sklearn

Figure 1. Pipeline optimisation process of Auto-Sklearn

Auto-Sklearn uses the sklearn framework to automatically create machine learning
pipelines. It does not perform neural architecture search for deep neural networks;
instead, it uses Bayesian Optimisation for the hyperparameter tuning of standard
machine learning algorithms. Note: version 0.5.2 was used in the testing of this
framework.

Salient features:
1. It includes various feature engineering methods such as one-hot-encoding,
numeric feature standardization, PCA and more.
2. It handles missing values and comes with 15 feature preprocessing algorithms out
of the box.
3. The models use sklearn estimators for regression and classification, which
provides an easy integration into existing sklearn environments.
4. It computes 38 statistics for a dataset and initializes the hyperparameters to the
optimised parameters of the most similar previously seen dataset (similarity
calculated using the L1 norm).
5. It uses the optimisation framework SMAC3, which implements Bayesian search
over the hyperparameter space.
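The warm-start step in point 4 can be sketched as a nearest-neighbour lookup over meta-feature vectors; the dataset names and three-dimensional vectors below are invented for illustration (Auto-Sklearn's real meta-features are the 38 statistics mentioned above):

```python
import numpy as np

# Invented meta-feature vectors for previously optimised datasets.
known = {
    "dataset_a": np.array([0.2, 1.5, 3.0]),
    "dataset_b": np.array([0.9, 0.4, 2.1]),
}
# Meta-features of the new dataset to be warm-started.
new = np.array([1.0, 0.5, 2.0])

# Most similar previously seen dataset under the L1 (Manhattan) norm;
# its tuned hyperparameters would seed the Bayesian optimisation.
closest = min(known, key=lambda name: np.abs(known[name] - new).sum())
```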

Drawbacks:
1. It cannot process natural language inputs and cannot distinguish between
numeric and categorical inputs, which must be specified beforehand.
2. It also cannot handle string inputs and requires integer encoding of
categorical strings.
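Drawback 2 means categorical strings must be integer encoded before they reach the model; a minimal sketch using scikit-learn's OrdinalEncoder (the toy colour column is made up):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy categorical column that Auto-Sklearn could not ingest directly.
X = np.array([["red"], ["green"], ["red"], ["blue"]])

# Map each category to an integer code (alphabetical: blue=0, green=1, red=2).
encoder = OrdinalEncoder()
X_int = encoder.fit_transform(X)
```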

2.2 H2O

H2O is an open-source machine learning framework with its own algorithms that execute
on a server cluster accessible through a variety of interfaces and programming languages.
It includes an automatic machine learning module that uses these algorithms to
generate a pipeline. H2O is developed in Java and includes Python, JavaScript, Tableau, R
and Flow (web UI) bindings. Note: version 3.28.0.1 was used for this comparison.

Salient Features:
1. It has a high level of abstraction aimed at making it accessible to everyone
regardless of expertise.
2. It performs an exhaustive search over feature engineering methods and model
hyperparameters for optimizing its pipelines.
3. Supports imputation, one-hot-encoding, standardisation for feature engineering
and automatically deals with categorical features.

4. It supports two methods of hyperparameter optimisation, cartesian grid search and
random grid search.
5. Supported models include generalized linear models, basic deep learning models,
gradient boosting machines and distributed random forests.
6. User configuration of the AutoML pipeline is limited to algorithm choice, stopping
time and the number of validation folds.
7. It uses meta-learning methods such as stacking and creates different ensembles of
trained models, finally producing a leaderboard of the best-performing models.
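The stacking idea in point 7 can be illustrated outside H2O; the sketch below uses scikit-learn's StackingClassifier as an analogue (this is not H2O's API, and the base learners and synthetic data are arbitrary choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Out-of-fold predictions of the base learners feed a meta-learner,
# mirroring the stacked ensembles H2O AutoML builds from its trained models.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X_tr, y_tr)
accuracy = stack.score(X_te, y_te)
```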

Drawbacks:
1. Heavy resource consumption (memory and storage).

2.3 TPOT

Figure 2. Machine learning processes automated by TPOT

TPOT (Tree-based Pipeline Optimisation Tool) is a genetic-programming-based pipeline
optimiser that automatically creates machine learning pipelines. It automates certain
processes of the machine learning system design cycle, as shown in the figure above.
Note: version 0.11.0 was used for this comparison.

Salient Features:
1. Like Auto-Sklearn, TPOT sources its data manipulators and algorithms from
sklearn.
2. Training time can be restricted by setting a time limit or population size, and the
search space can be restricted with a configuration file.
3. The optimisation process can be paused and resumed.
4. TPOT's biggest feature is that it can export the optimised pipeline as code, which
can then be modified manually.
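Because TPOT exports plain scikit-learn code, the result of a run looks roughly like the script below; the particular operators here are illustrative, since the exported pipeline depends on what the genetic search finds:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# An exported TPOT pipeline is an ordinary sklearn pipeline like this,
# which can be edited by hand before being deployed.
pipeline = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=42))
pipeline.fit(X_tr, y_tr)
accuracy = pipeline.score(X_te, y_te)
```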

Drawbacks:
1. TPOT cannot automatically process natural language inputs or categorical inputs,
which must be integer encoded before the data is fed in.
2. Since it uses genetic programming, running times can be long before high
accuracy is attained, but given enough time it tends to find well-performing parameters.

3. Benchmarking Methodology

To compare these three frameworks quantitatively, we record performance across
different datasets and tasks. Since most raw datasets need extensive cleaning and
preprocessing, OpenML was chosen as the dataset source: its data is available in a
clean format and can be accessed through an API. A total of eight datasets were chosen
with increasing instance counts (four regression, four classification); details are
provided towards the end of the report. Code snippets for each framework were written to
automate the benchmarking process. A time limit of two minutes was set for each
framework to ensure a fair comparison. Performance was recorded as F1 score for
classification tasks, R2 score for regression tasks, and prediction time. To establish a
degree of consistency, models were trained with two random seeds and the performances
averaged. The total compute time for benchmarking all frameworks is
(8 datasets × 3 frameworks × 2 seeds × 2 minutes runtime) = 96 minutes.
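The loop behind this methodology can be sketched as follows; the harness uses a scikit-learn DummyClassifier as a stand-in for the AutoML frameworks, and the synthetic data is invented for the example:

```python
import time

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def benchmark(model, X, y, seeds=(0, 1)):
    """Average test F1 score and prediction time over random seeds,
    using the 75:25 split and two seeds described above."""
    scores, times = [], []
    for seed in seeds:
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.25, random_state=seed
        )
        model.fit(X_tr, y_tr)
        start = time.perf_counter()
        predictions = model.predict(X_te)
        times.append(time.perf_counter() - start)
        scores.append(f1_score(y_te, predictions))
    return float(np.mean(scores)), float(np.mean(times))


# Synthetic stand-in task; each AutoML framework would be slotted in here.
X = np.random.RandomState(0).rand(200, 4)
y = (X[:, 0] > 0.5).astype(int)
mean_f1, mean_time = benchmark(DummyClassifier(strategy="most_frequent"), X, y)
```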

3.1 Framework parameters:

Auto-sklearn: time limit of 2 minutes and per_run_time_limit of 30 seconds (time
spent optimising each model).

TPOT: time limit of 2 minutes, population_size of 15, and 5 cross-validation folds
for internally evaluating models.

H2O: time limit of 2 minutes.

Note: Train-test split is 75:25.

3.2 Comparison metrics:


● R2 score for regression tasks.
● F1 score for classification tasks.
● Time for prediction in seconds.

Metric            Calculation                                          Reading
F1 score          2 × (Precision × Recall) / (Precision + Recall)      Higher value indicates better performance; value lies in [0, 1]
R2 score          R2 ≡ 1 − SS_res / SS_tot, where SS_res is the        Higher value indicates better performance; value lies in [0, 1]
                  residual sum of squares and SS_tot is the total
                  sum of squares
Time to predict   Prediction time in seconds                           Lower value indicates better performance
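Both scores can be computed directly with scikit-learn; the toy labels below are made up to show the arithmetic:

```python
from sklearn.metrics import f1_score, r2_score

# F1: precision = 1.0 (one predicted positive, correct),
# recall = 0.5 (one of two true positives found).
f1 = f1_score([1, 1, 0, 0], [1, 0, 0, 0])  # 2*(1.0*0.5)/(1.0+0.5) = 2/3

# R2: SS_res = 0.5, SS_tot = 8, so R2 = 1 - 0.5/8 = 0.9375.
r2 = r2_score([3.0, 5.0, 7.0], [2.5, 5.0, 7.5])
```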

Figures 3 and 4 showcase the performance of the frameworks in classification and
regression tasks. Note that the dataset IDs are in order of increasing dataset size.

Figures 5 and 6 show the average classification and regression performance. H2O clearly
performs best in classification tasks, while TPOT and H2O have similar performance in
regression tasks.

Figures 7 and 8 show prediction time performance. H2O performs the worst on
prediction time, and TPOT gives the fastest predictions.

3.3 Numerical Values:

Data ID    Auto-Sklearn    H2O       TPOT

Classification
1464       0.5650          0.8839    0.4981
40701      0.8205          0.8744    0.8281
1046       0.9596          0.9694    0.9648
1461       0.5906          0.9501    0.5423

Regression
196        0.8769          0.8200    0.9055
308        0.9237          0.9380    0.8778
537        0.4097          0.7942    0.7747
344        0.9921          0.9996    0.9970

Data ID    Auto-Sklearn    H2O       TPOT

Time to make predictions (seconds)
1464       0.1090          0.0092    0.0058
40701      0.1400          0.8153    0.0114
1046       0.3087          0.4126    0.0280
1461       0.0145          0.8221    0.0154
196        0.0171          0.2113    0.0045
308        0.0788          0.6163    0.0021
537        0.1463          0.6187    0.0072
344        0.3751          0.4140    0.0038

4. Results:

Note: TPOT generally performs better given a longer training time; with default
parameters TPOT takes approximately an hour to run, so the 2-minute time limit was
set to reduce compute time.

4.1 Classification:
In classification tasks H2O outperforms the other two frameworks by a significant
margin. Looking at Figure 3, we can see that Auto-Sklearn and TPOT perform poorly
when the dataset is either small (~500 instances) or large (~40k instances).

4.2 Regression:
In regression tasks, TPOT and H2O perform best, with TPOT slightly better on
average. According to Figure 4, Auto-Sklearn performs on par with the other two on
most datasets but performs poorly on dataset 537; there is no conclusive reason for
this behaviour.

4.3 Prediction time:

TPOT gives the fastest predictions, with Auto-sklearn a close second. H2O performs
poorly in this regard, averaging about 0.5 seconds over the 8 datasets.

4.4 Ease of Use:

Qualitatively, H2O provides the smoothest experience, requiring the fewest lines of
code to generate a model with almost no data preprocessing. Auto-sklearn and TPOT
need some basic preprocessing before data is fed to the model.

5. Overall Conclusion:

Based on the collected data, H2O performs best on classification datasets and TPOT
performs best on regression datasets. TPOT gives the fastest predictions, while H2O is
the slowest to generate predictions. To give an overall rating of the frameworks, we
take a weighted sum of performances along different axes.

The overall comparison considers regression score, classification score, prediction
time, and ease of use, weighted as 70% accuracy, 20% prediction time and 10% ease of
use. Ease of use scores - H2O: 8/10, Auto-sklearn: 7/10, TPOT: 7/10.
Note: this comparison serves only to give an overall rating; it is neither fully
quantitative nor rigorous.
Overall Rating (just for comparison) :
H2O - 7.8/10
Auto-Sklearn - 7.1/10
TPOT - 7.4/10
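The weighted sum can be written out explicitly; the accuracy and prediction-time subscores below are hypothetical placeholders (only the ease-of-use score of 8/10 for H2O comes from this report), chosen solely to show how an overall rating like 7.8/10 is assembled:

```python
def overall_rating(accuracy, prediction_time, ease_of_use):
    """Weighted rating out of 10: 70% accuracy, 20% prediction time,
    10% ease of use; every subscore is already on a 0-10 scale."""
    return 0.7 * accuracy + 0.2 * prediction_time + 0.1 * ease_of_use


# Hypothetical subscores: accuracy 9/10, prediction speed 3.5/10,
# ease of use 8/10 (the report's H2O figure).
rating = overall_rating(9.0, 3.5, 8.0)
```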

Dataset Information:

Classification Datasets-

OpenML Data ID    Name                               Instances    Features    Classes
1464              blood-transfusion-service-center   748          5           2
40701             Churn                              5000         21          2
1046              Mozilla4                           15545        6           2
1461              Bank-Marketing                     45211        17          2

Regression Datasets-

OpenML Data ID    Name        Instances    Features
196               AutoMPG     398          8
308               Puma32H     8192         33
537               Houses      20640        9
344               MV          40768        11

Part II: Review of Explainable AI frameworks

1. Introduction
In recent years, with the introduction of machine learning algorithms and techniques
into the mainstream, most businesses are looking for ways to generate value from this
new paradigm of predictive modelling. Also, with computing power getting stronger
every day, the complexity of models that can be trained is
