
ACCELERATING AI INNOVATION:
A Comprehensive Guide to ML Data Processing, Benchmarking, and Framework Comparison for Optimal Performance Across CPUs and GPUs
Contents

Introduction
Challenges
How Ray helps maximize resources
    Data Streaming
    Heterogeneous pipeline efficiency
Benchmarks
    Batch inference benchmarks for ML
    CloudSort Benchmark
Customer testimonials
    Uber’s ML platform migrating from Spark to Ray
    Predibase trains 4x faster using Ray Datasets
FAQ
    High-level ML data processing comparison
    How Ray compares to Apache Spark
    How Ray compares to Dask
    How Anyscale compares to Databricks

Introduction
Data processing is essential for the functioning of machine learning and is frequently the most complex and expensive workflow to handle. With the popularity of deep learning and generative AI, the infrastructure needed to support these workloads has become increasingly involved. Tasks such as media transcoding, vector embeddings, computer vision, and NLP all require scaling on top of a heterogeneous cluster or across AI accelerators and CPUs to process data efficiently and quickly.

This whitepaper aims to cover the challenges associated with processing complex or unstructured data for ML. It explores how Ray enables you to scale your existing ML and Python code and optimize any ML data workload with minimal code changes. Ray schedules jobs across IO, network, CPU, GPU, and xPU with low latency, streams data across the cluster, and maximizes GPU utilization by leveraging spot instances and auto-scaling across common infrastructure such as Kubernetes and public cloud providers such as AWS and Google Cloud.

We also show benchmarks comparing Ray against other common alternatives, such as Amazon SageMaker and Spark, where Ray is up to 17x faster and scales linearly while maximizing GPU utilization on datasets that are terabytes in size. Finally, we highlight the benefits through customer case studies (Uber, Predibase, Dendra, Blue River Tech, and LiveEO).

Challenges

Some of the most demanding machine learning (ML) use cases involve data pipelines spanning CPU and GPU devices in distributed environments. These situations arise in various workloads, including:

Batch inference - which involves a CPU-intensive preprocessing stage (e.g., video decoding or image resizing) before a GPU-intensive model makes predictions.

Distributed training - where similar CPU-heavy transformations are required to prepare or augment the dataset prior to GPU training.

Real-time inference - where a combination of business rules, lookups to internal systems, and GPU-based prediction is needed to power an application.

In many of these workloads, the preprocessing steps are often the bottleneck and lead to idle GPUs. This can happen when preprocessing requires parallelism across IO, networks, and multiple nodes, as well as significant memory to buffer results between the CPU and GPU, leading to excess cost and underutilized resources.

Consider the decoding of compressed images in memory. A typical JPEG decompression ratio is 10:1, resulting in significant memory pressure and CPU load, as the output can be ten times larger than the input. The challenge intensifies with other data types such as video: H264 decompresses at roughly a 2000:1 ratio, generating 200 GB of frame output from a 100 MB input file. This poses a problem for practitioners attempting to distribute CPU-heavy preprocessing across multiple CPUs or machines, as they have to handle intermediate results much larger than their already substantial source datasets.

[Figure: a Decode → Inference → Save pipeline. Intermediate data (e.g., 10x larger than the GB-scale input) is spilled to disk in cluster storage before the GB-scale inference results are saved.]

The figure below illustrates the workflow in which the load and preprocessing steps occur on the CPU (Stage 1), followed by inference on GPUs (Stage 2), and result saving on the CPU as the final step (Stage 3). This setup can result in data spilling to remote storage when intermediate results, such as decoded video frames, exceed the memory capacity of the cluster. The bulk synchronous parallel (BSP) approach is therefore not memory efficient for heterogeneous workloads: it increases overhead between stages, uses resources inefficiently, and is more likely to result in cost overruns.

[Figure: bulk synchronous execution over three data partitions. For each partition, Load and Preprocess run on CPU (Stage 1), Inference runs on GPU (Stage 2), and Save runs on CPU (Stage 3), with intermediate results written to and read from cluster storage between stages.]

These unnecessary overheads can be avoided with end-to-end pipelining (i.e., streaming), as the sketch below illustrates.
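To make the contrast concrete, here is a minimal, framework-agnostic Python sketch. The decode_frames and run_inference helpers are hypothetical stand-ins for a CPU-heavy decode stage and a GPU model call; they are not part of any library. The bulk version materializes every intermediate frame before inference begins, while the pipelined version keeps only a small window of intermediate data alive at any time.

import numpy as np

# Hypothetical stand-ins: decode_frames() mimics a high-ratio video decode and
# run_inference() mimics a model call.
def decode_frames(video_file):
    for _ in range(100):
        yield np.zeros((720, 1280, 3), dtype=np.uint8)  # one decoded frame

def run_inference(frame):
    return float(frame.mean())  # placeholder "prediction"

def bulk_pipeline(video_files):
    # Bulk synchronous execution: every intermediate frame is materialized
    # before inference starts, so memory grows with the full decoded dataset
    # (e.g., ~200 GB of frames from a 100 MB H264 input).
    frames = [frame for f in video_files for frame in decode_frames(f)]
    return [run_inference(frame) for frame in frames]

def streaming_pipeline(video_files):
    # End-to-end pipelining: frames are decoded lazily and consumed as they
    # are produced, so only a small buffer of intermediate data is alive.
    for f in video_files:
        for frame in decode_frames(f):
            yield run_inference(frame)

# Hypothetical file names; decode_frames above ignores them.
predictions = list(streaming_pipeline(["clip_01.mp4", "clip_02.mp4"]))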

How Ray helps maximize resources

Data Streaming

Starting with Ray 2.4, the default execution strategy of Ray Data is streaming. The streaming Dataset API is fully backwards compatible with the existing API: a dataset can still be transformed lazily with map operations, shuffled, and cached or materialized in memory:

import ray

# Create a dataset over parquet files
ds: ray.data.Dataset = ray.data.read_parquet(...)

# Transform the dataset
ds = ds.map_batches(my_preprocess_fn)
ds = ds.map_batches(my_model_fn)

# Iterate over dataset batches in streaming fashion
for batch in ds.iter_batches():
    print(batch)

# Materialize all contents of the dataset in memory
ds = ds.materialize()

In summary, the Ray Data API now leverages streaming execution for improved performance on large datasets, with the same simple transformation API as in previous Ray versions.

Heterogeneous pipeline efficiency

To seamlessly execute streaming topologies, Ray Data provides several optimizations under the hood, including:

Memory stability: Ray Data relies on the underlying Ray scheduler to schedule tasks and actors, but still needs to manage back-pressure across the streaming topology to bound memory usage and avoid object store spilling. It does this by only scheduling new tasks if doing so keeps the streaming execution under the configured resource limits. Intuitively, enforcing a cap on intermediate result memory usage is needed to avoid degrading to bulk execution.

Data locality: While Ray already places tasks on nodes where their input arguments are local, Ray Data’s streaming backend extends this to optimize the scheduling of actor tasks. For example, in a pipeline like the sketch below, a lot of network traffic is avoided between the `decode_frames` tasks and the `FrameAnnotator` actors by routing decoded frames to actors on the same node.

Fault tolerance: Ray Data leverages Ray’s built-in fault tolerance to handle object loss in large jobs. When objects are lost, they are recomputed based on their task lineage, which is tracked by Ray. If actors were needed to produce these objects, Ray restarts those actors prior to re-submitting the task.
Benchmarks
Batch inference benchmarks for ML

[Figure: 10 GB ResNet-50 batch inference throughput in images/sec (higher is better). Ray reaches roughly 312 img/sec, the three Spark configurations (single-cluster, single-cluster Iterator API, and multi-cluster Iterator API) fall between roughly 109 and 148 img/sec, and SageMaker Batch Transform reaches roughly 19 img/sec.]

In the blog post titled “Offline Batch Inference | Ray, Apache Spark & SageMaker”, benchmarks for offline batch inference compared AWS SageMaker Batch Transform, Apache Spark, and Ray Data. The chart above shows that Ray Data processes images up to 17x faster than SageMaker Batch Transform and 2x faster than Spark for offline image classification. Ray Data also scales effectively to terabyte-sized datasets while maintaining 90%+ GPU utilization.

CloudSort Benchmark
Ray breaks the $1/TB barrier as the world’s most cost-efficient sorting system.

UC Berkeley announced a new world record on the CloudSort benchmark using Ray. The Sky Computing Lab at UC Berkeley developed Exoshuffle, a new architecture for building distributed shuffle that is simple and flexible while achieving high performance. Building on this architecture and Ray, Exoshuffle-CloudSort is now the most cost-efficient way to sort 100 TB of data on the public cloud, using only $97 worth of cloud resources, or $0.97 per terabyte. This is 33% more cost-efficient than the previous world record, set by Apache Spark in 2016, and 15% cheaper when factoring in decreasing hardware costs.

Customer testimonials
Uber’s ML platform migrating from Spark to Ray

By running their large-scale deep learning workloads on top of a heterogeneous (CPU + GPU) Ray cluster, the Uber team achieved a 50% savings on the ML compute portion of training a large-scale deep learning job.

[Figure: Heterogeneous Ray Cluster Performance. Performance comparison for training an I/O-intensive benchmark model with a large amount of data and a small record size.]

Uber’s internal Autotune service uses Ray Tune, resulting in up to a 4x speedup on their hyperparameter tuning jobs.

Predibase trains 4x faster using Ray Datasets
Petastorm, a library that allows various ML frameworks to read Parquet files, is commonly used when transformations are performed in Spark. Predibase, a low-code deep learning company, leveraged Ray Datasets to achieve 4x better performance on their training workloads compared to Petastorm.

LiveEO

“ By integrating Anyscale and Prefect into its data science operations, LiveEO achieved an optimization
of up to 65% in its geospatial AI workloads, reducing both costs and runtime by significant margins and
streamlining its complex data pipelines. With these improvements, LiveEO is providing better service
to its customers, who maintain critical infrastructure such as electricity and transportation. For these
customers, the change equates to more reliable service, quicker product updates, and safer operations.

Blue River Technology

“ By adopting the Anyscale Platform to run Ray on AWS, Blue River cut regression testing time for AI
workloads in product development by more than half. With this improved productivity, Blue River now
provides faster insights from computer vision images and rolls out new product updates sooner.

SewerAI
“ Using Anyscale, SewerAI achieved remarkable gains in the speed at which it can process vital inspection data, while at the same time lowering the cost and resources required:

Faster processing: 3x faster batch inference, reducing processing time from one hour to 20 minutes
Resource efficiency: 50% reduction in required machines, and 95% GPU utilization, up from 25% previously
Accelerated testing: nearly instantaneous code testing vs. 10 minutes
Cost savings: 75%+ reduction in total cost of ownership (TCO) compared to AWS Batch

FAQ

How does Ray compare to Apache Spark?

Ray and Spark are both distributed computing frameworks that can be used for large-scale data processing, but they have some differences in their design and functionality. Here are some of the key differences between Ray and Spark:

1. Design Philosophy: Ray was designed with a focus on distributed computing for machine learning and deep learning workloads, while Spark was designed with a focus on data processing and analytics.

2. API: Ray provides a lower-level, more flexible API that allows developers to build custom distributed applications, while Spark provides a higher-level, more structured API that is optimized for data processing tasks. As such, Spark depends on domain-specific language APIs based on a DataFrame, with a schema and structured data.

3. Resource Management: A heterogeneous cluster is not possible in Databricks. This matters if you are trying to optimize across IO, network, CPUs, and GPUs efficiently.

4. Language and execution engine: Ray provides Python-native bindings to an efficient C++ core, whereas Spark is JVM-based. Data science is primarily done in Python, while data processing and ETL have traditionally been done on the JVM.

5. Scheduler: Ray has a decentralized scheduler that schedules jobs with millisecond latency, versus seconds in Spark.

6. Data Formats: Spark has built-in support for distributed processing of structured data using SQL, DataFrames, and Datasets, while Ray focuses more on distributed computing for machine learning workloads and provides support for common deep learning frameworks such as TensorFlow and PyTorch, as well as Apache Arrow.

7. Distributed Object Store: Ray is unique in offering a Plasma-based distributed object store in which data created by remote tasks is accessible anywhere in the cluster as futures. This allows hundreds of tasks to be scheduled asynchronously (see the sketch after this answer).

In summary, Ray and Spark have different design philosophies and target different use cases, so the choice between the two will depend on the specific requirements of your application. If you need to process large amounts of structured data on a homogeneous cluster (CPU or GPU) and want a higher-level API, Spark might be the better choice. However, if you need more flexibility and want to build custom distributed applications in Python on a heterogeneous cluster (CPU and GPU), or if you’re working with machine learning workloads, Ray might be a better fit.
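As a rough illustration of point 7, the following minimal sketch (with hypothetical preprocess and predict functions) shows how Ray tasks return futures (object references) backed by the distributed object store, letting many tasks be scheduled asynchronously before any result is fetched:

import ray

ray.init()

@ray.remote
def preprocess(record):
    # Hypothetical CPU-side preprocessing.
    return record * 2

@ray.remote
def predict(features):
    # Hypothetical model call.
    return features + 1

# Each .remote() call returns immediately with an ObjectRef (a future) whose
# value lives in Ray's distributed object store; futures can be passed
# directly into downstream tasks without fetching them first.
futures = [predict.remote(preprocess.remote(i)) for i in range(100)]

# Only block when the results are actually needed.
results = ray.get(futures)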

How does Ray compare to Dask?

Ray and Dask are both distributed computing frameworks for Python that enable parallel and distributed processing of large-scale data. Here are some of the key differences between the two:

1. Design Philosophy: Ray and Dask have different design philosophies. Ray was designed to provide a general-purpose distributed computing framework with a focus on machine learning and deep learning workloads, while Dask was designed to provide a parallel computing library for the PyData ecosystem, namely NumPy, scikit-learn, and Pandas.

2. Task Scheduling: Ray and Dask both use task scheduling to distribute workloads across a cluster, but they use different task schedulers. Ray uses a task scheduler based on the actor model, which allows for more fine-grained control over resources and dependencies. Dask uses a task scheduler based on a directed acyclic graph (DAG) that is optimized for data processing workloads.

3. API: Ray and Dask have different APIs. Ray provides a lower-level API that allows developers to build custom distributed applications and integrate them with popular machine learning frameworks like TensorFlow and PyTorch. Dask provides a higher-level API that is optimized for parallel processing of data using familiar APIs like Pandas and NumPy.

4. Resource Management: Ray and Dask both support dynamic scaling and can run on a variety of cluster managers, but they have different approaches to resource management. Ray uses an external resource manager like Kubernetes or VMs, while Dask has its own distributed scheduler and can manage resources internally.

5. Scheduler: Ray has a decentralized scheduler with millisecond-latency job scheduling, while Dask has a global central scheduler.

In summary, Ray and Dask have different design philosophies and target different use cases, so the choice between the two will depend on the specific requirements of your application. If you need a general-purpose distributed computing framework with a focus on machine learning, Ray might be the better choice. However, if you're working with data processing workloads and want a higher-level API, Dask might be a better fit.

How does Anyscale compare to Databricks?

In addition to the Ray vs. Spark comparison above, Databricks provides many capabilities for organizations to build their ML platform. Databricks primarily uses the notebook as its core primitive and provides many features to unify data, analytics, and ML. Anyscale provides the best place to run your Ray workloads as a managed service. Ray is more flexible and developer-focused, with Anyscale Workspaces providing a managed IDE experience through VS Code and Jupyter Notebook.

Anyscale’s philosophy is to provide the best ML compute and let organizations integrate with the best tools in each category, so they can build a flexible, best-of-breed ML platform.
