
PERSPECTIVES

Scientific machine learning benchmarks

Jeyan Thiyagalingam, Mallikarjun Shankar, Geoffrey Fox and Tony Hey

Abstract | Deep learning has transformed the use of machine learning technologies for the analysis of large experimental datasets. In science, such datasets are typically generated by large-scale experimental facilities, and machine learning focuses on the identification of patterns, trends and anomalies to extract meaningful scientific insights from the data. In upcoming experimental facilities, such as the Extreme Photonics Application Centre (EPAC) in the UK or the international Square Kilometre Array (SKA), the rate of data generation and the scale of data volumes will increasingly require the use of more automated data analysis. However, at present, identifying the most appropriate machine learning algorithm for the analysis of any given scientific dataset is a challenge due to the potential applicability of many different machine learning frameworks, computer architectures and machine learning models. Historically, for modelling and simulation on high-performance computing systems, these issues have been addressed through benchmarking computer applications, algorithms and architectures. Extending such a benchmarking approach and identifying metrics for the application of machine learning methods to open, curated scientific datasets is a new challenge for both scientists and computer scientists. Here, we introduce the concept of machine learning benchmarks for science and review existing approaches. As an example, we describe the SciMLBench suite of scientific machine learning benchmarks.

In the past decade, a subfield of artificial intelligence (AI), namely, deep learning (DL) neural networks (or deep neural networks, DNNs), has enabled significant breakthroughs in many scientifically and commercially important applications1. Such neural networks are themselves a subset of a wide range of machine learning (ML) methods.

ML methods have been widely used for many years in several domains of science, but DNNs have been transformational and are gaining a lot of traction in many scientific communities2,3. Most of the national, international and big laboratories that host large-scale experimental facilities, as well as commercial entities capable of large-scale data processing (big tech), are now relying on DNN-based data analytic methods to extract insights from their increasingly large datasets. A recent success from industry is the use of DL to find solutions to the protein folding problem4. Current developments point towards specializing these ML approaches to be more domain-specific and domain-aware5–7, and aiming to connect the apparent 'black-box' successes of DNNs with the well-understood approaches from science.

The overarching scope of ML in science is broad. A non-exhaustive list includes the identification of patterns, anomalies and trends from relevant scientific datasets, the classification and prediction of such patterns and the clustering of data. The data are not always experimental or observational but can also be synthetic data. There are three approaches for developing ML-based solutions, namely, supervised, unsupervised and reinforcement learning. In supervised learning, the ML model is trained with examples to perform a given task. In this case, the training data used must contain the 'ground truth' or labels. Supervised learning is, therefore, possible only when there is a labelled subset of the data. Once trained, the learned model can be deployed for real-time usage, such as pattern classification or estimation — which is often referred to as 'inference'. Because of the difficulty in generating labelled data for supervised learning, particularly for experimental datasets, it is often difficult to apply supervised learning directly. To circumvent this limitation, training is often performed on simulated data, which provides an opportunity to have relevant labels. However, the simulated data may not be representative of the real data and the model may, therefore, not perform satisfactorily when used for inferencing. The unsupervised learning technique, in contrast, does not rely on labels. A simple example of this technique is clustering, where the aim is to identify several groups of data points that have common features. Another example is identification of anomalies in data. Example algorithms include k-means clustering8, Support Vector Machines (SVMs)9 or neural-network-based autoencoders10. Finally, reinforcement learning relies on a trial-and-error approach to learn a given task, with the learning system being positively rewarded whenever it behaves correctly and penalized whenever it behaves incorrectly11. Each of these learning paradigms has a large number of algorithms, and modern developmental approaches are often hybrid and use one or more of these techniques together. This leaves many choices of ML algorithms for any given problem.
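To make the first two of these paradigms concrete, the short sketch below trains a supervised SVM classifier on a labelled subset of synthetic data and, separately, clusters the same points with k-means without using the labels. The synthetic dataset and model settings are illustrative only and are not drawn from any benchmark discussed in this Perspective.

```python
# A minimal sketch contrasting supervised and unsupervised learning.
# The synthetic data and model choices are illustrative only.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.cluster import KMeans

# Synthetic "experimental" data: three clusters in 2D, with known labels.
X, y = make_blobs(n_samples=300, centers=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Supervised learning: an SVM trained on the labelled subset.
svm = SVC(kernel="rbf").fit(X_train, y_train)
print("SVM test accuracy:", svm.score(X_test, y_test))

# Unsupervised learning: k-means sees only the unlabelled points.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("k-means cluster sizes:", np.bincount(kmeans.labels_))
```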



In practice, the selection of an ML algorithm for a given scientific problem is more complex than just selecting one of the ML technologies and any particular algorithm. The selection of the most effective ML algorithm is based on many factors, including the type, quantity and quality of the training data, the availability of labelled data, the type of problem being addressed (prediction, classification and so on), the overall accuracy and performance required, and the hardware systems available for training and inferencing. With such a multidimensional problem consisting of a choice of ML algorithms, hardware architectures and a range of scientific problems, selecting an optimal ML algorithm for a given task is not trivial. This constitutes a significant barrier for many scientists wishing to use modern ML methods in their scientific research.

In this Perspective, we discuss what are suitable scientific ML benchmarks and how to develop guidelines and best practices to assist the scientific community in successfully exploiting these methods. Developing such guidelines and best practices at the community level will not only benefit the science community but also highlight where further research into ML algorithms, computer architectures and software solutions for using ML in scientific applications is needed.

We refer to the development of guidelines and best practices as benchmarking. The applications used to demonstrate the guidelines and best practices are referred to as benchmarks. The notion of benchmarking computer systems and applications has been a fundamental cornerstone of computer science, particularly for compiler, architectural and system development, with a key focus on using benchmarks for ranking systems, such as the TOP500 or Green500 (refs12–16). However, our notion of scientific ML benchmarking has a different focus and, in this Perspective, we restrict the term 'benchmarking' to ML techniques applied to scientific datasets. Firstly, these ML benchmarks can be considered as blueprints for use on a range of scientific problems, and, hence, are aimed at fostering the use of ML in science more generally. Secondly, by using these ML benchmarks, a number of aspects in an ML ecosystem can be compared and contrasted. For example, it is possible to rank different computer architectures for their performance or to rank different ML algorithms for their effectiveness. Thirdly, these ML benchmarks are accompanied by relevant scientific datasets on which the training and/or inference will be based. This is different to conventional benchmarks for high-performance computing (HPC), where there is little dependency on datasets. The establishment of a set of open, curated scientific datasets with associated ML benchmarks is, therefore, an important step for scientists to be able to effectively use ML methods in their research and also to identify further directions for ML research.

Machine learning benchmarks for science
In this section, we discuss the elements of a scientific benchmark and the focus of scientific benchmarking, along with relevant examples.

Elements of a benchmark for science. As discussed above, a scientific ML benchmark is underpinned by a scientific problem and should have two elements: first, the dataset on which this benchmark is trained or inferenced upon and, second, a reference implementation, which can be in any programming language (such as Python or C++). The scientific problem can be from any scientific domain. A collection of such benchmarks can make up a benchmark suite, as illustrated in Fig. 1.

Fig. 1 | The notion of a machine learning benchmark and a benchmark suite. a | Elements of a scientific machine learning (ML) benchmark: one or more datasets together with a reference implementation. b | Building a scientific ML benchmark suite that integrates different scientific ML benchmarks from various scientific disciplines, such as environmental sciences, astronomy, particle physics, material sciences and life sciences.

Focus of benchmarking. There are three separate aspects of scientific benchmarking that apply in the context of ML benchmarks for science, namely, scientific ML benchmarking, application benchmarking and system benchmarking. These are explained below.
• Scientific ML benchmarking. This is concerned with algorithmic improvements that help reach the scientific targets specified for a given dataset. In this situation, one wishes to test algorithms and their performance on fixed data assets, typically with the same underlying hardware and software environment. This type of benchmark is characterized by the dataset, together with some specific scientific objectives. The data are obtained from a scientific experiment and should be rich enough to allow different methods of analysis and exploration. Examples of metrics could include the F1 score for training accuracy, time to solution and any domain-specific metric(s); a minimal sketch of how such metrics might be computed is given after this list. A more detailed discussion on metrics can be found in the next section.
• Application benchmarking. This aspect of ML benchmarks is concerned with exploring the performance of the complete ML application (covering loading of inputs from files, pre-processing, application of ML, post-processing and writing outputs to files) on different hardware and software environments. This can also be referred to as an end-to-end ML application benchmark. A typical performance target for these types of benchmarks may include training time or even complete time to solution. Such application benchmarks can also be used to evaluate the performance of the overall system, as well as that of particular subsystems (hardware, software libraries, runtime environments, file systems and so on). For example, in the case of image classification, the relevant performance metric could be a throughput measure (for example, images per second) for training or inference, or the time to solution of the classification problem (including I/O, ML, and pre-processing and post-processing), or the scaling properties of the application.
• System benchmarking. This is concerned with investigating performance effects of the system hardware architecture on improving the scientific outcomes/targets. These benchmarks have similarities with application benchmarks, but they are characterized by primarily focusing on a specific operation that exercises a particular part of the system, independent of the broader system environment. Suitable metrics could be time to solution, the number of floating-point operations per second achieved or aspects of network and data movement performance.
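The sketch below shows one way the metrics named above (an F1 score, a time to solution and a throughput figure) might be computed from quantities that a benchmark run already produces. The labels, predictions and timing are placeholders rather than output from any real benchmark.

```python
# A minimal sketch of the kinds of metrics discussed above.
# The labels, predictions and timings are placeholders, not real benchmark output.
import time
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])   # ground-truth labels
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 1])   # model predictions

start = time.perf_counter()
# ... training or inference would run here ...
elapsed = time.perf_counter() - start          # time to solution, in seconds

n_images = len(y_true)
metrics = {
    "f1_score": f1_score(y_true, y_pred, average="macro"),
    "time_to_solution_s": elapsed,
    "throughput_images_per_s": n_images / max(elapsed, 1e-9),
}
print(metrics)
```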


Examples of scientific machine learning benchmarks. Scientific ML benchmarks are ML applications that solve a particular scientific problem from a specific scientific domain. For example, this can be as simple as an application that classifies the experimental data in some way, or as complex as inferring the properties of a material from neutron scattering data. Some examples are given below.
• Inferring the structure of multiphase materials from X-ray diffuse multiple scattering data. Here, ML is used to automatically identify the phases of materials using classification2.
• Estimating the photometric red shifts of galaxies from survey data17. Here, ML is used for estimation.
• Clustering of microcracks in a material using X-ray scattering data18. Here, ML uses an unsupervised learning technique.
• Removing noise from microscope data to improve the quality of images. ML is used for its capability to perform high-quality regression of pixel values19.

More detailed examples are provided in later sections.

The benchmarking process
Although it is possible to provide a collection of ML-specific scientific applications (with relevant datasets) as benchmarks for any of the purposes mentioned above, the exact process of benchmarking requires the following elements, given below.
• Metrics of choice. First, depending on the focus, the exact metric by which different benchmarks are compared may vary. For example, if science is the focus, then this metric may vary from benchmark to benchmark. However, if the focus is system-level benchmarking, it is possible to agree on a common set of metrics that can span across a range of applications. However, in the context of ML, owing to the uncertainty around the underlying ML model(s), dataset(s) and system hardware (for example mixed-precision systems), it may be more meaningful to ensure that uncertainties of the benchmark outputs are quantified and compared wherever necessary. Likewise, the level of explainability of methods (and, hence, outputs) can be a differentiator between different ML methods and, hence, of benchmarks. In this way, the explainability of different ML implementations for a given benchmark problem could be considered as a metric as well, provided this can be well quantified. Another axis could be around energy efficiency, such as the ability of an ML implementation to perform training or inference with minimum power or energy requirements. It is clearly essential to agree upon the appropriate figures of merit and metrics to be used for comparing different implementations of benchmarks.
• Framework. Providing just a collection of disparate applications without a coherent mechanism for evaluation requires users to perform a set of fairly complex benchmarking operations that are relevant to their specific goals. Ideally, the benchmark suite should, therefore, offer a framework that not only helps users to achieve their specific goals but also unifies aspects that are common to all applications in the suite, such as benchmark portability, flexibility and logging.
• Reporting and compliance. Finally, how these results are reported is important. In many cases, a benchmark framework as discussed above addresses this concern. However, there are often some specific compliance aspects that must be followed to ensure that the benchmarking process is carried out fairly across different hardware platforms.

There are also a number of challenges that need to be addressed when dealing with the development of ML benchmarks; these are given below.
• Data. In the previous section, we highlighted the significance of data when using ML for scientific problems. The availability of curated, large-scale, scientific datasets — which can be either experimental or simulated data — is the key to developing useful ML benchmarks for science. Although a lot of scientific data are openly available, the curation, maintenance and distribution of large-scale datasets for public consumption is a challenging process. A good benchmarking suite needs to provide a wide range of curated scientific datasets coupled with the relevant applications. Reliance on external datasets has the danger of not having full control or even access to those datasets.
• Distribution. A scientific ML benchmark comprises a reference ML implementation together with a relevant dataset, and both of these must be available to the users. Since realistic dataset sizes can be in the terabytes range, the access and downloading of these datasets is not always straightforward.
• Coverage. Benchmarking is a very broad topic and providing benchmarks to cover the different focus areas highlighted above, across a range of scientific disciplines, is not a trivial task. A good benchmark suite should provide a good coverage of methods and goals, and should be extensible.
• Extensibility. Although developing scientific ML benchmarks can be valuable for scientists, it can be time consuming to develop benchmarking-specific codes. If the original scientific application needs substantial refactoring to be converted into a benchmark, this will not be an attractive option for scientists. Any benchmarking framework should, therefore, try to minimize the amount of code refactoring required for conversion into a benchmark.

In addition to these challenges, ML benchmarks need to address a number of other issues, such as problems with overtraining and overfitting. In most cases, such issues can be covered by requiring compliance with some general rules for the benchmarks — such as specifying the set of hyperparameters that are open to tuning. Although one may consider these as aspects of scientific ML benchmarking, they are best handled through explicit specification of the rules of the benchmarking process. For example, the training and validation data, and cross-validation procedures, should aim to mitigate the dangers of overfitting.
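One lightweight way to make such rules explicit is to express them as data that a framework can check automatically. The following sketch is purely illustrative: the field names, ranges and the check itself are invented for this example and do not correspond to the rules of any suite discussed in this Perspective.

```python
# A purely illustrative sketch of benchmark "rules" expressed as data:
# which hyperparameters may be tuned, and how the data must be split.
from dataclasses import dataclass, field


@dataclass
class BenchmarkRules:
    tunable_hyperparameters: dict = field(default_factory=dict)  # name -> allowed range
    fixed_hyperparameters: dict = field(default_factory=dict)    # values that must not change
    validation_split: float = 0.2        # fraction of data held out for validation
    cross_validation_folds: int = 5      # folds used when reporting accuracy


rules = BenchmarkRules(
    tunable_hyperparameters={"learning_rate": (1e-5, 1e-2), "batch_size": (16, 256)},
    fixed_hyperparameters={"epochs": 50, "random_seed": 42},
)


def check_submission(hyperparameters: dict, rules: BenchmarkRules) -> None:
    """Reject runs that tune anything outside the declared search space."""
    for name, value in hyperparameters.items():
        if name in rules.fixed_hyperparameters:
            assert value == rules.fixed_hyperparameters[name], f"{name} must stay fixed"
        else:
            low, high = rules.tunable_hyperparameters[name]
            assert low <= value <= high, f"{name} outside allowed range"


check_submission({"learning_rate": 3e-4, "batch_size": 64, "epochs": 50}, rules)
```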



Benchmarking initiatives
Comparing different ML techniques is not a new requirement and is increasingly becoming common in ML research. In fact, this approach has been fundamental for the development of various ML techniques. For example, the ImageNet20,21 dataset spurred a competition to improve computer image analysis and understanding, and has been widely recognized for driving innovation in DL. A recent example of an application and system benchmark is the High-Performance LINPACK for Accelerator Introspection (HPL-AI) benchmark22, which aims to drive AI innovation by focusing on the performance benefits of reduced (and mixed) precision computing. However, providing a blueprint of applications, guidelines and best practices in the context of scientific ML is a relatively new and unaddressed requirement. There have been a number of efforts on this aspect that address some of the challenges we highlighted above. In this brief overview of these benchmarking initiatives, we explicitly exclude conventional benchmarking activities in other areas of computer science, such as benchmarks for HPC systems, compilers and subsystems, such as memory, storage and networking12,23.

Instead of giving an exhaustive technical review covering very-fine-grained aspects, we give a high-level overview of the various ML benchmark initiatives, focusing on the requirements discussed in the previous sections. We shall, therefore, cover the following aspects:
• Benchmark focus: science, application (end-to-end) and system.
• Benchmark process: metrics, framework, reporting and compliance.
• Benchmark challenges: data, distribution, coverage and extensibility.

In the context of ML benchmarking, there are several initiatives, such as Deep500 (ref.24), RLBench25, CORAL-2 (ref.26), DAWNBench27, AIBench28, MLCommons29 and SciMLBench30, as well as specific community initiatives (such as the well-known community competitions organized by Kaggle31). We overview these initiatives below and note that a specific benchmarking initiative may or may not support all the aspects listed above or, in some cases, may only offer partial support.

Deep500. The Deep500 (ref.24) initiative proposes a customizable and modular software infrastructure to aid in comparing the wide range of DL frameworks, algorithms, libraries and techniques. The key idea behind Deep500 is its modular design, where DL is factorized into four distinct levels: operators, network processing, training and distributed training. Although this approach aims to be neutral and overarching, and also able to accommodate a wide variety of techniques and methods, the process of mapping a code to a new framework has impeded its adoption for new benchmark development. Furthermore, despite its key focus on DL, neural networks and a very customizable framework, benchmarks or applications are not included by default and are left for the end user to provide, as is support for reporting. The main limitation is the lack of a suite of representative benchmarks.

RLBench. RLBench25 is a benchmark and learning environment featuring hundreds of unique, hand-crafted tasks. The focus is on a set of tasks to evaluate new algorithmic developments around reinforcement learning, imitation learning, multitask learning, geometric computer vision and, in particular, few-shot learning. The tasks are very specific and can be considered as building blocks of large-scale applications. However, the environment currently lacks support for the classes of benchmarking discussed above.

CORAL-2. The CORAL-2 (ref.26) benchmarks are computational problems relevant to a scientific domain or to data science, and are typically backed by a community code. Vendors are then expected to evaluate and optimize these codes to demonstrate the value of their proposed hardware in accelerating computational science. This allows a vendor to rigorously demonstrate the performance capabilities and characteristics of a proposed machine on a benchmark suite that should be relevant for computational scientists. The ML and data science tools in CORAL-2 include a number of ML techniques across two suites, namely, the big data analytics (BDAS) and DL (DLS) suites. Whereas the BDAS suite covers conventional ML techniques, such as principal components analysis (PCA), k-means clustering and SVMs, the DLS suite relies on the ImageNet20,21 and CANDLE32 benchmarks, which are primarily used for testing scalability aspects, rather than purely focusing on the science. Similarly, the BDAS suite aims to exercise the memory constraints (PCA), computing capabilities (SVMs) or both these aspects (k-means) and is also concerned with communication characteristics. Although these benchmarks are oriented at ML, the constraints and benchmark targets are narrowly specified and emphasize scalability capabilities. The overall coverage of science in the CORAL-2 benchmark suite is quite broad, but the footprint of the ML techniques is limited to the BDAS and DLS suites, and there is little focus on scientific data distribution for algorithm improvement.

AIBench. The AIBench initiative is supported by the International Open Benchmark Council (BenchCouncil)28. The Council is a non-profit international organization that aims to promote standardizing, benchmarking, evaluating and incubating big data, AI and other emerging technologies. The scope of AIBench is very comprehensive and includes a broad range of internet services, including search engines, social networks and e-commerce. The underlying ML-specific tasks in these areas include image classification, image generation, translation (image-to-text, image-to-image, text-to-image, text-to-text), object detection, text summarization, advertising and natural language processing. The relevant datasets are open and the primary metric is system performance for a fixed target. One of the important components of the AIBench initiative is HPC AI500 (ref.33), a standalone benchmark suite for evaluating HPC systems running DL workloads. The suite covers a number of representative scientific problems from various domains, with each workload being a real-world scientific DL application, such as extreme weather analysis33. The suite includes reference implementations, datasets and other relevant software, along with relevant metrics. This HPC ML suite compares best to the SciMLBench work discussed below. The AIBench environment also enforces some level of compliance for reporting ranking information of hardware systems.

DAWNBench. DAWNBench27 is a benchmark suite for end-to-end DL training and inference. The end-to-end aspect is ideal for application-level and system-level benchmarking. Instead of focusing on model accuracy, DAWNBench provides common DL workloads for quantifying training time, training cost, inference latency and inference cost across different optimization strategies, model architectures, software frameworks, clouds and hardware. There are two key benchmarks in the suite — image classification (using the ImageNet and CIFAR-10 (ref.34) datasets) and natural-language-processing-based question answering35 (based on the Stanford Question Answering Dataset or SQuAD35) — both covering training and inference. DAWNBench does not offer the notion of a framework and does not have a focus on science. With key metrics around time and cost (for training and inference), DAWNBench is predominantly targeted towards end-to-end system and application performance. Although the datasets are public and open, no distribution mechanisms have been adopted by DAWNBench.


Benchmarks from the MLCommons working groups. MLCommons is an international initiative aimed at improving all aspects of the ML landscape and covers benchmarking, datasets and best practices. The consortium has several working groups with different foci for ML applications. Among these working groups, two are of interest here: HPC and Science. The MLCommons HPC benchmark29 suite focuses on scientific applications that use ML, and especially DL, at the HPC scale. The codes and data are specified in such a way that execution of the benchmarks on supercomputers will help understand detailed aspects of system performance. The focus is on performance characteristics particularly relevant to HPC applications, such as model–system interactions, optimization of the workload execution and reducing execution and throughput bottlenecks. The HPC orientation also drives this effort towards exploration of benchmark scalability.

By contrast, the MLCommons Science benchmark36 suite focuses specifically on the application of ML methods to scientific applications and includes application examples from several scientific domains. The recently announced information on the science benchmarks at Supercomputing 2021 will spur improvements in defining datasets for advancing ML for science. The suite currently lacks a supportive framework for running the benchmarks but, as with the rest of MLCommons, does enforce compliance for reporting of the results. The benchmarks cover the three areas of benchmarking — science, application and system.

SciMLBench. The Scientific Machine Learning Benchmark suite — or SciMLBench30 — is specifically focused on scientific ML and covers nearly every aspect of the cases discussed in the previous sections. A detailed description of the SciMLBench initiative is given in the next section.

Other community initiatives. In addition to various efforts mentioned above, there are other efforts towards AI benchmarking by specific research communities. Two examples are WeatherBench37 and MAELSTROM38 from the weather and climate communities, both of which have specific goals and include relevant data and baseline techniques. However, these efforts are not full benchmark suites, and, instead, are engineered as individual benchmarks, ideally to be integrated as part of a suite.

Although community-based competitions, such as Kaggle31, can be seen as a benchmarking activity, these competitions do not have a coherent methodology or a controlled approach for developing benchmarks. In particular, the competitions do not provide a framework for running the benchmarks, nor do they consider data distribution methods. Each competition is individually constructed and relies on its own dataset, set of rules and compliance metrics. The competitions address concerns such as dataset curation, choice of metric, presentation of results and robustness against overfitting, for example. Although such challenge competitions can provide a blueprint for using ML technologies for specific research communities, the competitions are generally short lived and are, therefore, unlikely to deliver best practices or guidelines for the long term.

The SciMLBench approach
The SciMLBench approach has been developed by the authors of this article, members of the Scientific Machine Learning Group at the Rutherford Appleton Laboratory, in collaboration with researchers at Oak Ridge National Laboratory and at the University of Virginia. Among all the approaches reviewed above, only the SciMLBench benchmark suite attempts to address all of the concerns discussed previously. To the best of our knowledge, the SciMLBench approach is unique in its versatility compared with the other approaches and its key focus is on scientific ML.

Core components. SciMLBench has three components, given below.
• Benchmarks. The benchmarks are ML applications written in Python that perform a specific scientific task. These applications are included by default and users are not required to find or write their own applications. On the scale of micro-apps, mini-apps and apps, these codes are full-fledged applications. Each benchmark aims to solve a specific scientific problem (such as those discussed earlier). The set of benchmarks is organized into specific themes, including DL-focused benchmarks, training- or inference-intensive benchmarks, benchmarks emphasizing uncertainty quantification, benchmarks focusing on specific scientific problems (such as denoising19, nonlinear dynamical systems5, and physics-informed neural networks5) and benchmarks focusing on surrogate modelling39. Although the current set of benchmarks and their relevant datasets are all image based, the design of SciMLBench allows for datasets that are multimodal or include mixed types of data.
• Datasets. Each benchmark relies on one or more datasets that can be used, for example, for training and/or inferencing. These datasets are open, task or domain specific and compliant with respect to the FAIR guidelines (Findable, Accessible, Interoperable and Reusable40). Since most of these datasets are large, they are hosted separately on one of the laboratory servers (or mirrors) and are automatically or explicitly downloaded on demand.
• Framework. The framework serves two purposes. Firstly, at the user level, it facilitates an easier approach to the actual benchmarking, logging and reporting of the results. Secondly, at the developer level, it provides a coherent application programming interface (API) for unifying and simplifying the development of ML benchmarks.

The SciML framework is the basic fabric upon which the benchmarks are built. It is both extensible and customizable, and offers a set of APIs. These APIs enable easier development of benchmarks based on this framework and are defined with layers of abstractions. Example APIs (and their abstractions) are given below; a hypothetical sketch of the first two entry points follows the list.
• The entry point for the framework to run the benchmark in training mode, abstracted to all benchmark developers (scientists), requires the API to follow a specific signature. If defined, the benchmark can then be called to run in training mode. If this is undefined and the benchmark is invoked in training mode, it will fail.
• The entry point for the framework to run the benchmark in inference mode, abstracted to all benchmark developers (scientists), requires the API to follow a specific signature. If defined, the benchmark can be called to run in inference mode. If this is undefined and the benchmark is invoked in inference mode, it will fail.
• Control of logging. APIs for logging of details are available at different granularities. At the highest (abstraction) level, this can be simply the starting and stopping of logging. At the fine-grained level, it can be controlling what is specifically being logged.
• Controlling the execution of benchmarks. These APIs are designed for advanced benchmark developers to control aspects around the actual execution of benchmarks and would be expected to be seldom used by scientists.
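As a rough illustration of the entry-point idea, a benchmark module might expose its training and inference modes as two functions with fixed signatures that a framework can discover and call. The function names, signatures and metric handling below are hypothetical and are not the documented SciMLBench API; they merely sketch the abstraction described above.

```python
# A hypothetical sketch of the entry-point idea described above. The function
# names and signatures are illustrative only; consult the SciMLBench
# documentation for the actual API.
import json
import time
from pathlib import Path


def sciml_bench_training(params: dict, dataset_dir: Path, output_dir: Path) -> dict:
    """Entry point a framework could call when the benchmark runs in training mode."""
    start = time.perf_counter()
    # ... benchmark-specific data loading and model training would go here ...
    metrics = {"mode": "training", "time_s": time.perf_counter() - start}
    (output_dir / "training_metrics.json").write_text(json.dumps(metrics))
    return metrics


def sciml_bench_inference(params: dict, model_dir: Path, dataset_dir: Path,
                          output_dir: Path) -> dict:
    """Entry point a framework could call when the benchmark runs in inference mode."""
    start = time.perf_counter()
    # ... benchmark-specific model loading and prediction would go here ...
    metrics = {"mode": "inference", "time_s": time.perf_counter() - start}
    (output_dir / "inference_metrics.json").write_text(json.dumps(metrics))
    return metrics


if __name__ == "__main__":
    out = Path(".")
    print(sciml_bench_training({}, Path("data"), out))
    print(sciml_bench_inference({}, Path("model"), Path("data"), out))
```

If either function is missing, a framework built around this convention can simply refuse to run the benchmark in the corresponding mode, which mirrors the failure behaviour described in the list above.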



These APIs, in contrast to APIs from other frameworks, such as Deep500, are layered and are not fine grained. In other words, APIs from SciMLBench are abstracted enough for the benchmarking process to be automated as much as possible, instead of providing APIs for obtaining fine-grained measurements, such as runtime or I/O or communication times. In fact, SciMLBench retains these measurements and makes them available for detailed analysis, but the focus is on science rather than on performance. In addition, these APIs are totally independent of the application, whereas APIs in frameworks like Deep500 are intended to reflect the operational semantics of the layers or operations of the neural networks.

The SciMLBench framework is independent of architecture, and the minimum system requirement is determined by the specific benchmark. There is a built-in logging mechanism that captures all potential system-level and benchmark-level outputs during execution, leaving end users or benchmark designers to decide the content and format of the report from these detailed logs. The central component that links benchmarks, datasets and the framework is the framework configuration tool. The most attractive part of the framework is the possibility of simply using existing codes as benchmarks, with only a few API calls necessary to register the benchmarks. Finally, the framework is designed with scalability in mind, so that benchmarks can be run on any computer, ranging from a single system to a large-scale supercomputer. This level of support is essential, even if the included benchmarks, on their own, are scalable.

Benchmarks and datasets. The currently released version of SciMLBench has three benchmarks with their associated datasets. The benchmarks from this release represent scientific problems drawn from material sciences and environmental sciences, listed below.
• Diffuse multiple scattering (DMS_Structure). This benchmark uses ML for classifying the structure of multiphase materials from X-ray scattering patterns. More specifically, the ML-based approach enables automatic identification of phases. This application is particularly useful for the materials science community, as diffuse multiple scattering allows investigation of multiphase materials from a single measurement — something that is not possible with standard X-ray experiments. However, manual analysis of the data can be extremely laborious, involving searching for patterns to identify important motifs (triple intersections) that allow for inference of information. This is a multilabel classification problem (as opposed to a binary classification problem, as in the cloud masking example discussed below). The benchmark relies on a simulated dataset of size 8.6 GB with three-channel images of resolution 487 × 195 pixels.
• Cloud masking (SLSTR_Cloud). Given a set of satellite images, the challenge for this benchmark is to classify each pixel of each satellite image as either cloud or non-cloud (clear sky). This problem is known as 'cloud masking' and is crucial for several important applications in earth observation. In a conventional, non-ML setting, this task is typically performed using either thresholding or Bayesian methods. The benchmark exercises DL and includes two datasets, DS1-Cloud and DS2-Cloud, with sizes of 180 GB and 1.2 TB, respectively. The datasets contain multispectral images with resolutions of 2,400 × 3,000 pixels and 1,200 × 1,500 pixels.
• Electron microscopy image denoising (EM_Denoise). This benchmark uses ML for removing noise from electron microscopy images. This improves the signal-to-noise ratio of the image and is often used as a precursor to more complex techniques, such as surface reconstruction or tomographic projections. Effective denoising can facilitate low-dose experiments in producing images with a quality comparable with that obtained in high-dose experiments. Likewise, greater time resolution can also be achieved with the aid of effective image denoising procedures. This benchmark exercises complex DL techniques on a simulated dataset of size 5 GB, consisting of 256 × 256 images covering noised and denoised (ground truth) datasets.

The next release of the suite will include several more examples from various domains with large datasets, such as a scanning electron tomography benchmark from material sciences, a benchmark for quantifying damage to optical lenses in laser physics and another denoising benchmark for cryogenic electron microscopic images from the life sciences domain.

Fig. 2 | Moving the benchmark datasets to the evaluation point. A benchmark has two components: a code and the associated datasets. Whenever a user wants to use a benchmark, the code component can easily be downloaded directly from the server. The data component, however, requires careful delivery. The associated datasets are often too large to be downloaded directly from the server. Instead, they are pushed to the object storage, where they are carefully curated and backed up. This curated dataset is then pulled on demand by the user when a benchmark that requires this dataset is to be used. Because the exact location of the dataset can lead to delays, these datasets are often mirrored and can also be made available as part of cloud environments. This way, the download location can be opted for by the user (or automatically selected by the downloading component). The dotted lines imply that the data can come from any of the locations and can be specified. The 'pull' aspect means that the data are downloaded on demand (pulled by the user). The 'push' component means that the dataset distribution is managed by a server or the framework.


Table 1 | Overall assessment of various scientific machine learning benchmarking approaches

Benchmark                 Focus                                 Process                              Challenges
                          Scientific  Application  System      Metrics  Framework  Reporting        Data     Distribution  Coverage  Extensibility
Deep500                   None        None         Partial     Full     Full       Partial          None     None          None      Partial
RLBench                   None        Partial      Partial     Full     None       Partial          Partial  Partial       Partial   Partial
CORAL-2 (DLS/BDAS)        Partial     Full         Full        Full     Partial    Partial          None     None          Full      None
AIBench + HPC AI500       Full        Full         Full        Full     None       Full             Partial  Partial       Partial   Partial
DAWNBench                 None        Full         Full        Full     None       Partial          None     None          None      None
MLCommons Science         Full        Full         Partial     Full     None       Partial          Partial  Partial       Full      Partial
SciMLBench                Full        Full         Full        Full     Full       Partial          Full     Full          Full      Full
Community competitions    Partial     None         None        Partial  None       Partial          Partial  None          Partial   None

In qualitatively assessing how far each approach addresses the concerns, we have indicated whether they offer no support (none), partial or questionable support (partial) or fully support the concern (full).

Benchmark focus. With the full-fledged capability of the framework to log all activities, and with a detailed set of metrics, it is possible for the framework to collect a wide range of performance details that can later be used for deciding the focus. For example, SciMLBench can be used for science benchmarking (to improve scientific results through different ML approaches), application-level benchmarking and system-level benchmarking (gathering end-to-end performance, including I/O and network performance). This is made possible thanks to the detailed logging mechanisms within the framework. These logging mechanisms rely on various low-level details for gathering system-specific aspects, such as memory, GPU or CPU usages. Furthermore, there are APIs available for logging all the way from the very simple request of starting and stopping the logging process to controlling what is specifically being logged, such as science-specific outputs or domain-specific metrics. Since the logging process includes all relevant details (including the runtime or the power and energy usage, where permitted), the benchmark designer or developer is responsible for deciding on the appropriate metric, depending on the context. For example, it is possible for the developer to rely on a purely scientific metric or to specify a metric to quantify the energy efficiency of the benchmark.

Benchmarking process. With the framework handling most of the complexity of collecting performance data, there is the opportunity to cover a wide range of metrics (even retrospectively, after the benchmarks have been run) and have the ability to control the reporting and compliance through controlled runs. However, it is worth noting that, although the framework can support and collect a wide range of runtime and science performance aspects, the choice is left to the user to decide the ultimate metrics to be reported. For example, the performance data collected by the framework can be used to generate a final figure of merit to compare different ML models or hardware systems for the same problem. The benchmarks can be executed purely using the framework or using containerized environments, such as Docker or Singularity. Although running benchmarks natively using the framework is possible, native code execution on production systems is often challenging and ends up demanding various dependencies. For these reasons, executing these benchmarks on containerized environments is recommended on production, multinode clusters. We have found that the resulting container execution overheads are minimal.

Data curation and distribution. SciMLBench uses a carefully designed curation and distribution mechanism (a process illustrated in Fig. 2), given below.
• Each benchmark has one or more associated datasets. These benchmark–dataset associations are specified through a configuration tool that is not only framework friendly but also interpretable by scientists.
• As the scientific datasets are usually large, they are not maintained along with the code. Instead, they are maintained in a separate object storage, whose exact locations are visible to the benchmarking framework and to users.
• Users downloading benchmarks will only download the reference implementations (code) and not the data. This enables fast downloading of the benchmarks and the framework. Since not all datasets will be of interest to everyone, this approach prevents unnecessary downloading of large datasets.
• The framework takes the responsibility for downloading datasets on demand or when the user launches the benchmarking process.

In addition to these basic operational aspects, the benchmark datasets are stored in an object storage to enable better resiliency and repair mechanisms compared with simple file storage. The datasets are also mirrored in several locations to enable the framework to choose the data source closest to the location of the user. The datasets are also regularly backed up, as they constitute valuable digital assets.

Extensibility and coverage. The overall design of SciMLBench supports several user scenarios: the ability to add new benchmarks with little knowledge of the framework, ease of use, platform interoperability and ease of customization. The design relies on two API calls, which are illustrated in the documentation with a number of toy examples, as well as some practical examples.

Conclusion
In this Perspective, we have highlighted the need for scientific ML benchmarks and explained how they differ from conventional benchmarking initiatives. We have outlined the challenges in developing a suite of useful scientific ML benchmarks. These challenges span a number of issues, ranging from the intended focus of the benchmarks and the benchmarking processes, to challenges around actually developing a useful ML benchmark suite. A useful scientific ML suite must, therefore, go beyond just providing a disparate collection of ML-based scientific applications. The critical aspect here is to provide support for end users not only to be able to effectively use the ML benchmarks but also to enable them to develop new benchmarks and extend the suite for their own purposes.

We overviewed a number of contemporary efforts for developing ML benchmarks, of which only a subset has a focus on ML for scientific applications. Almost none of these initiatives considers the problem of the efficient distribution of large datasets. The majority of the approaches rely on externally sourced datasets, with the implicit assumption that users will take care of the data issues. We discussed in more detail the SciMLBench initiative, which includes a benchmark framework that not only addresses the majority of these concerns but is also designed for easy extensibility.

The characteristics of these ML benchmark initiatives are summarized in Table 1, which shows that the benchmarking community has several issues to address to ensure that the scientific community is equipped with the right set of tools to become more efficient in leveraging the use of ML technologies in science.



Code availability statement
The relevant code for the benchmark suite can be found at https://github.com/stfc-sciml/sciml-bench.

Jeyan Thiyagalingam1, Mallikarjun Shankar2, Geoffrey Fox3 and Tony Hey1 ✉
1Rutherford Appleton Laboratory, Science and Technology Facilities Council, Harwell Campus, Didcot, UK.
2Oak Ridge National Laboratory, Oak Ridge, TN, USA.
3Computer Science and Biocomplexity Institute, University of Virginia, Charlottesville, VA, USA.
✉e-mail: Tony.Hey@stfc.ac.uk

https://doi.org/10.1038/s42254-022-00441-7
Published online 6 April 2022

1. Sejnowski, T. J. The Deep Learning Revolution (MIT Press, 2018).
2. Hey, T., Butler, K., Jackson, S. & Thiyagalingam, J. Machine learning and big scientific data. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 378, 20190054 (2020).
3. Callaway, E. 'It will change everything': DeepMind's AI makes gigantic leap in solving protein structures. Nature 588, 203–204 (2020).
4. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
5. Raissi, M., Perdikaris, P. & Karniadakis, G. E. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 378, 686–707 (2019).
6. Greydanus, S., Dzamba, M. & Yosinski, J. in Advances in Neural Information Processing Systems Vol. 32 (eds Wallach, H. et al.) (Curran Associates, Inc., 2019).
7. Butler, K., Le, M., Thiyagalingam, J. & Perring, T. Interpretable, calibrated neural networks for analysis and understanding of inelastic neutron scattering data. J. Phys. Condens. Matter 33, 194006 (2021).
8. Hartigan, J. A. & Wong, M. A. A k-means clustering algorithm. J. R. Stat. Soc. C Appl. Stat. 28, 100–108 (1979).
9. Cortes, C. & Vapnik, V. Support-vector networks. Mach. Learn. 20, 273–297 (1995).
10. Baldi, P. in Proceedings of ICML Workshop on Unsupervised and Transfer Learning Vol. 27 (eds Guyon, I., Dror, G., Lemaire, V., Taylor, G. & Silver, D.) 37–49 (PMLR, 2012).
11. Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction (MIT Press, 2018).
12. Dongarra, J. & Luszczek, P. in Encyclopedia of Parallel Computing (ed. Padua, D.) 844–850 (Springer, 2011).
13. Sakalis, C., Leonardsson, C., Kaxiras, S. & Ros, A. in 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) 101–111 (IEEE, 2016).
14. Bailey, D. H. in Encyclopedia of Parallel Computing (ed. Padua, D.) 1254–1259 (Springer, 2011).
15. Petitet, A., Whaley, R., Dongarra, J. & Cleary, A. HPL – a Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers (ICL-UTK Computer Science Department, 2008).
16. Dongarra, J. & Luszczek, P. in Encyclopedia of Parallel Computing (ed. Padua, D.) 2055–2057 (Springer, 2011).
17. Henghes, B., Pettitt, C., Thiyagalingam, J., Hey, T. & Lahav, O. Benchmarking and scalability of machine-learning methods for photometric redshift estimation. Mon. Not. R. Astron. Soc. 505, 4847–4856 (2021).
18. Müller, A., Karathanasopoulos, N., Roth, C. C. & Mohr, D. Machine learning classifiers for surface crack detection in fracture experiments. Int. J. Mech. Sci. 209, 106698 (2021).
19. Ede, J. M. & Beanland, R. Improving electron micrograph signal-to-noise with an atrous convolutional encoder-decoder. Ultramicroscopy 202, 18–25 (2019).
20. Deng, J. et al. in 2009 IEEE Conference on Computer Vision and Pattern Recognition 248–255 (IEEE, 2009).
21. Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. Commun. ACM 60, 84–90 (2017).
22. HPL-AI benchmark. https://hpl-ai.org/.
23. Müller, M., Whitney, B., Henschel, R. & Kumaran, K. in Encyclopedia of Parallel Computing (ed. Padua, D.) 1886–1893 (Springer, 2011).
24. Ben-Nun, T. et al. in 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 66–77 (IEEE, 2019).
25. James, S., Ma, Z., Rovick Arrojo, D. & Davison, A. J. RLBench: The robot learning benchmark & learning environment. IEEE Robot. Autom. Lett. 5, 3019–3026 (2020).
26. CORAL-2 benchmarks. https://asc.llnl.gov/coral-2-benchmarks.
27. Coleman, C. A. et al. in 31st Conference on Neural Information Processing Systems (NIPS 2017) (2017).
28. BenchCouncil AIBench. https://www.benchcouncil.org/aibench/index.html.
29. MLCommons HPC Benchmark. https://mlcommons.org/en/groups/training-hpc/.
30. Thiyagalingam, J. et al. SciMLBench: A benchmarking suite for AI for science. https://github.com/stfc-sciml/sciml-bench (2021).
31. Kaggle Competitions. https://www.kaggle.com/.
32. Wu, X. et al. in Proceedings of the 48th International Conference on Parallel Processing 78 (Association for Computing Machinery, 2019).
33. Jiang, Z. et al. in 2021 IEEE International Conference on Cluster Computing (CLUSTER) 47–58 (IEEE, 2021).
34. Krizhevsky, A., Nair, V. & Hinton, G. The CIFAR-10 dataset. Canadian Institute for Advanced Research http://www.cs.toronto.edu/~kriz/cifar.html (2010).
35. Rajpurkar, P., Zhang, J., Lopyrev, K. & Liang, P. in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing 2383–2392 (Association for Computational Linguistics, 2016).
36. MLCommons Science. https://mlcommons.org/en/groups/research-science/.
37. Rasp, S. et al. WeatherBench: a benchmark data set for data-driven weather forecasting. J. Adv. Model. Earth Syst. 12, e2020MS002203 (2020).
38. The MAELSTROM Project. https://www.maelstrom-eurohpc.eu/.
39. Cai, L. et al. Surrogate models based on machine learning methods for parameter estimation of left ventricular myocardium. R. Soc. Open Sci. 8, 201121 (2021).
40. Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3, 160018 (2016).

Acknowledgements
We would like to thank Samuel Jackson, Kuangdai Leng, Keith Butler and Juri Papay from the Scientific Machine Learning Group at the Rutherford Appleton Laboratory, Junqi Yin and Aristeidis Tsaris from Oak Ridge National Laboratory and the MLCommons Science Working Group for valuable discussions. This work was supported by Wave 1 of the UKRI Strategic Priorities Fund under the EPSRC grant EP/T001569/1, particularly the 'AI for Science' theme within that grant, by the Alan Turing Institute and by the Benchmarking for AI for Science at Exascale (BASE) project under the EPSRC grant EP/V001310/1. This research also used resources from the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science user facility supported under contract DE-AC05-00OR22725, and from the Science and Technology Facilities Council, particularly that of the Pearl AI resource.

Author contributions
J.T., M.S., G.F. and T.H. conceptualized the idea of scientific benchmarking. J.T. designed the SciMLBench framework, data architecture and conceptualized the overarching set of features. T.H. has overseen the overall developmental efforts, along with J.T., M.S. and G.F. All authors have contributed towards the writing of the manuscript.

Competing interests
The authors declare no competing interests.

Peer review information
Nature Reviews Physics thanks Tal Ben-Nun, Prasanna Balaprakash and the other, anonymous, reviewer for their contribution to the peer review of this work.

Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© Springer Nature Limited 2022
