Machine Learning
What does infrastructure mean?
ML systems are complex
● More components
○ A request might jump 20-30 hops before response
○ A problem occurs, but where?
[Diagram: a request hops across load balancers, mesh routing, microservices, containers, schedulers, serverless/Lambda functions]
More complex systems, better infrastructure needed
● Automate boilerplate code
● Reuse/share code
● Reduce surface area for bugs
● Infrastructure: the set of fundamental facilities and systems that support the
sustainable functionality of households and firms.
● ML infrastructure: the set of fundamental facilities that support the
development and maintenance of ML systems.
Every company’s infrastructure needs are different
● e.g. 63K requests/sec | 234M requests/hr
[Survey chart: n=66 | claypot.ai]
Infrastructure Layers
Storage & Compute Layer
Storage
● Where data is collected and stored
● Simplest form: HDD, SSD
● More complex forms: data lake, data warehouse
● Examples: S3, Redshift, Snowflake, BigQuery
See Lecture 2
Storage: heavily commoditized
● Most companies use storage provided by other companies (e.g. cloud)
● Storage has become so cheap that most companies just store everything
Compute layer: engine to execute your jobs
● Compute resources a company has access to
● Mechanism to determine how these resources can be used
● Simplest form: a single CPU/GPU core
● Most common form: cloud compute
Compute unit
● Compute layer can be sliced into smaller compute units to be used concurrently
○ A CPU core might support 2 concurrent threads; each thread is used as a compute unit to execute its own job
○ Multiple CPUs can be joined to form a large compute unit to execute a large job
[Diagram: examples of compute units: a job as the unit, or a pod (a wrapper around containers) as the unit]
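A minimal sketch of the first sub-bullet, using only Python's standard library: a pool of two threads, each acting as a compute unit that executes its own job (the job function is illustrative).

from concurrent.futures import ThreadPoolExecutor

def job(x):
    # stand-in for real work, e.g. preprocessing one shard of data
    return x * x

# 2 workers ~ 2 concurrent threads, each used as a compute unit
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(job, range(8)))
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]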
Compute layer: how to execute jobs
1. Load data into memory
2. Perform operations on that data
a. Operations: add, subtract, multiply, convolution, etc.
Compute layer: memory
● Amount of memory
○ Straightforward
○ An instance with 8GB of memory is more expensive than an instance with 2GB of memory
● I/O bandwidth: speed at which data can be loaded into memory
Compute layer: speed of ops
● Most common metric: FLOPS
○ Floating Point Operations Per Second
- Google, 2021
Compute layer: speed of ops
● Most common metric: FLOPS
● Contentious
○ What exactly is an op?
■ If 2 ops are fused together, is it 1 or 2 ops?
○ Peak perf at 1 teraFLOPS doesn’t mean your app will run at 1 teraFLOPS
Compute layer: utilization
● Utilization = actual FLOPS / peak FLOPS (the higher, the better)
● Dependent on how fast data can be loaded into memory
“Tensor Cores are very fast. So fast … that they are idle most of the time as they are waiting for memory to arrive from global memory. For example, during BERT Large training, which uses huge matrices — the larger, the better for Tensor Cores — we have utilization of about 30%.” – Tim Dettmers
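A small worked example of the utilization formula (the hardware numbers below are illustrative, not from the lecture):

# Hypothetical accelerator: 312 teraFLOPS peak; profiling shows the job
# sustains ~94 teraFLOPS because it spends much of its time waiting on memory.
peak_flops = 312e12
actual_flops = 94e12
utilization = actual_flops / peak_flops
print(f"utilization: {utilization:.0%}")  # ~30%, matching the BERT Large figure quoted above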
Compute layer: if not FLOPS, then what?
● How long it will take this compute unit to do common workloads
● MLPerf measures hardware on common ML tasks, e.g.
○ Train a ResNet-50 model on the ImageNet dataset
○ Use a BERT-large model to generate predictions for the SQuAD dataset
MLPerf is also contentious
Compute layer: evaluation
● Memory
● Cores
● I/O bandwidth
● Cost
Public Cloud vs. Private Data Centers
● Like storage, compute is largely commoditized
(Source: 2020 – The Year that Cloud Service Revenues Finally Dwarfed Enterprise Spending on Data Centers | Synergy Research Group)
Benefits of cloud
● Easy to get started
● Appealing to variable-sized workloads
○ Private: would need 100 machines upfront; most will be idle most of the time
○ Cloud: pay for 100 machines only when needed (autoscaling!)
Drawbacks of cloud: cost
● Cloud spending: ~50% of cost of revenue
“Across 50 of the top public software companies currently utilizing cloud infrastructure, an estimated $100B of market value is being lost … due to cloud impact on margins — relative to running the infrastructure themselves.”
The Cost of Cloud, a Trillion Dollar Paradox | Andreessen Horowitz (2021)
Cloud repatriation
● Process of moving workloads from cloud to private data centers
Multicloud strategy
● To optimize cost
● To avoid cloud vendor lock-in
Development Environment
● Text editors & notebooks
○ Text editors: where you write code, e.g. VSCode, Vim
○ Notebooks: Jupyter notebooks, Colab
■ Also work with arbitrary artifacts that aren’t code (e.g. images, plots, tabular data)
■ Stateful: only need to rerun from the failed step instead of from the beginning
■ [Example: notebook infrastructure at Netflix]
● Versioning
○ Git: code versioning
○ DVC: data versioning
○ WandB: experiment versioning
● CI/CD test suite: test your code before pushing it to staging/prod
Dev env: underestimated
“if you have time to set up only one piece of infrastructure well, make it
the development environment for data scientists.”
Ville Tuulos, Effective Data Science Infrastructure (2022)
Standardize dev environments
● Standardize dependencies with versions
● Standardize tools & versions
● Standardize hardware: cloud dev env
○ Simplify IT support
○ Security: revoke access if laptop is stolen
○ Bring your dev env closer to prod env
○ Make debugging easier
Dev to prod
● Elastic compute: can stop/start instances at will
● How to recreate the required environment in a new instance?
Container
● Step-by-step instructions on how to recreate an environment in which your
model can run:
○ install this package
○ download this pretrained model
○ set environment variables
○ navigate into a folder
○ etc.
Transformers Dockerfile
● CUDA/cuDNN
● bash/git/python3
● Jupyter notebook
● TensorFlow/Pytorch
● transformers
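A minimal sketch of how such an environment could be rebuilt programmatically, using the Docker SDK for Python (docker-py); the base image, packages, and model are illustrative, not the actual Transformers Dockerfile.

import io
import docker  # pip install docker

# The step-by-step instructions above, expressed as a (hypothetical) Dockerfile
dockerfile = b"""
FROM python:3.10-slim
# install these packages
RUN pip install transformers torch
# set environment variables
ENV HF_HOME=/models
# navigate into a folder
WORKDIR /app
# download this pretrained model
RUN python -c "from transformers import pipeline; pipeline('sentiment-analysis')"
"""

client = docker.from_env()
image, logs = client.images.build(fileobj=io.BytesIO(dockerfile), tag="my-model-env")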
Multiple containers: dependency management
● Different containers can pin conflicting dependencies, e.g. one app on NumPy 1.21, another on NumPy 0.8
Multiple containers: cost saving
● Different steps need different resources, e.g. one container needs more memory but a CPU is enough, while another needs less memory but needs a GPU
Container orchestration
● Helps deploy and manage containerized applications on a serverless cluster
● Spins containers up/down
Breakout exercise
Group of 4, 10 mins
● What have been the most difficult parts of working on the project?
● What else do you need to work on for the final demo?
Resource Management
Resource management
● Pre-cloud: resources are finite; more resources for one app means fewer resources for other apps
● Cloud: resources are practically infinite; more resources for one app don’t have to affect other apps
● This simplifies the allocation challenge
ML workloads
● Repetitive
○ Batch prediction
○ Periodic retraining
○ Periodic analytics
● Dependencies
○ E.g. train depends on featurize
Cron: extremely simple
● Schedule jobs to run at fixed time intervals
● Report the results
● Can’t handle dependencies between jobs (e.g. run train only after featurize succeeds)
Scheduler
● Schedulers are cron programs that can handle dependencies
Scheduler
● Most schedulers require you to specify your workloads as a DAG:
○ Directed
○ Acyclic
○ Graph
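A minimal sketch of what handling dependencies means, using only Python's standard library; the job names follow the earlier example, where train depends on featurize.

from graphlib import TopologicalSorter  # Python 3.9+

# Each job maps to the set of jobs it depends on
dag = {
    "featurize": set(),
    "train": {"featurize"},
    "batch_predict": {"train"},
}

def run(job):
    print(f"running {job}")

# Execute jobs in an order that respects the dependency edges
for job in TopologicalSorter(dag).static_order():
    run(job)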
Scheduler
● Can handle event-based & time-based triggers
○ Run job A whenever X happens
● If a job fails, specify how many times to retry before giving up
● Jobs can be queued, prioritized, and allocated resources
○ If a job requires 8GB of memory and 2 CPUs, scheduler needs to find an instance with 8GB of
memory and 2 CPUs
Scheduler: SLURM example
#!/bin/bash
#SBATCH -J JobName
#SBATCH --time=11:00:00 # Maximum run time (time limit) for the job
#SBATCH --mem-per-cpu=4096 # Memory, in MB, to be allocated per CPU
#SBATCH --cpus-per-task=4 # Number of cores per task
Scheduler: optimize utilization
● Schedulers are aware of:
○ resources available
○ resources needed for each job
● Sophisticated schedulers (e.g. Google Borg) can reclaim unused resources
○ If I estimate that my job needs 8GB and it only uses 4GB, reclaim 4GB for other jobs
Scheduler challenge
● General purpose schedulers are extremely hard to design
● Need to handle any workload with any number of concurrent machines
● If the scheduler is down, every workflow it touches will also be down
Scheduler to Orchestrator
● Scheduler: when to run jobs
○ Handle jobs, queues, user-level quotas, etc.
○ Typically used for periodic jobs like batch jobs
● Orchestrator: where to run jobs
○ Handle containers, instances, clusters, replication, etc.
○ Provision: allocate more instances to the instance pool as needed
○ Typically used for long-running jobs like services
Scheduler & orchestrator
● Schedulers usually have some orchestrating capacity and vice versa
○ Schedulers like SLURM and Google’s Borg have some orchestrating capacity
○ Orchestrators like HashiCorp Nomad and K8s come with some scheduling capacity
● Often, schedulers are run on top of orchestrators
○ Run Spark’s job scheduler on top of K8s
○ Run AWS Batch scheduler on top of EKS
Data science workflow management
Data science workflow
● Can be defined using:
○ Code (Python)
○ Configuration files (YAML)
● Examples: Airflow, Argo, KubeFlow, Metaflow
Airflow
● 1st-gen data science workflow management
● Champion of “configuration-as-code”
● Wide range of operators to expand capabilities
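A minimal sketch of Airflow's configuration-as-code style (DAG id, schedule, and task bodies are illustrative):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def featurize():
    ...  # compute features

def train():
    ...  # train on the features

with DAG("train_pipeline", start_date=datetime(2023, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    featurize_task = PythonOperator(task_id="featurize", python_callable=featurize)
    train_task = PythonOperator(task_id="train", python_callable=train)
    featurize_task >> train_task  # train runs only after featurize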
Airflow: cons
● Monolithic
○ The entire workflow as a container
● Non-parameterized
○ E.g. need to define another workflow if you want to change the learning rate
● Static DAG
○ Can’t handle workloads with an unknown number of records
Argo: next gen
● Created to address Airflow’s problems
○ Containerized
○ Fully parameterized
○ Dynamic DAG
Argo: cons
● YAML-based configs
○ Can get very messy
● Only runs on K8s clusters
○ Can’t easily test in dev environment
Kubeflow & Metaflow: same code in dev & prod
● Allows data scientists to use the same code in both dev and prod environments
Kubeflow: more mature but more boilerplate
Metaflow: less mature but cleaner API
● Run notebook code in the cloud with a line of code (@batch)
○ Run experiments locally
○ Once ready, run code on AWS Batch
● Can run different steps of the same workflow in different envs
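A minimal sketch of the @batch pattern (the CPU/memory numbers are illustrative):

from metaflow import FlowSpec, batch, step

class TrainFlow(FlowSpec):

    @step
    def start(self):
        # runs locally
        self.next(self.train)

    @batch(cpu=4, memory=16000)  # this one step runs on AWS Batch
    @step
    def train(self):
        # runs in the cloud
        self.next(self.end)

    @step
    def end(self):
        # back to running locally
        pass

if __name__ == "__main__":
    TrainFlow()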
ML Platform
ML platform: story time
1. Anna started working on recsys at company X
2. To deploy the recsys, Anna’s team needed to build tools like model deployment, model store, feature store, etc.
3. Other teams at X started deploying models and needed to build the same tools
4. X decided to build a centralized platform to serve multiple ML use cases: the ML platform
ML platform: key components
● Model deployment
● Model store
● Feature store
Deployment: online | batch prediction
See Lecture 8
● Deployment service:
○ Package your model & dependencies
○ Push the package to production
○ Expose an endpoint for prediction
● The most common MLOps tool
○ Cloud providers: SageMaker (AWS), Vertex AI (GCP), AzureML (Azure), etc.
○ Independent: MLflow Models, Seldon, Cortex, Ray Serve, etc.
● Not all can do batch + online prediction well
○ e.g. some companies use Seldon for online prediction, but Databricks for batch
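A minimal sketch of what "expose an endpoint for prediction" can look like; Flask and the pickled artifact are illustrative choices, not any particular deployment service's API.

import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical packaged model artifact
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.json["features"]
    prediction = model.predict([features])[0]
    return jsonify({"prediction": float(prediction)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)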
Deployment service: model quality challenge
● How to ensure a model’s quality before and during deployment?
○ Traditional code: CI/CD, PR review
○ ML: ???, ???
Model store
● Simplest form: store all models in blob storage like S3
● Problem: when something happens, how to figure out:
○ Who/which team is responsible for this model?
○ Was the correct model binary deployed?
○ Were the correct features used?
○ Is the code up-to-date?
○ Did something happen with the data pipeline?
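A minimal sketch of the "simplest form" above, using boto3 (the bucket and key names are illustrative):

import boto3

s3 = boto3.client("s3")
# Upload a serialized model; the key encodes team/model/version only by convention
s3.upload_file("model.pkl", "my-models-bucket", "recsys/v3/model.pkl")

As the questions above suggest, a raw blob store on its own answers none of them; that is what artifact tracking adds.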
Model store: artifact tracking
● Track all metadata necessary to debug a model later
● Severely underestimated
Model store: artifact tracking at Stitch Fix
Feature store: key challenges
1. Feature management
a. Multiple models might share features, e.g. churn prediction & conversion prediction
b. How to allow different teams to find & use high-value features discovered by other teams?
2. Feature consistency
a. During training, features might be written in Python
b. During deployment, features might be written in Java
c. How to ensure consistency between different feature pipelines?
3. Feature computation
a. It might be expensive to compute the same feature multiple times for different models
b. How to store computed features so that other models can use them?
Solutions: a feature catalog addresses feature management; a data warehouse addresses feature computation.
Other ML platform components
● Monitoring (ML & ops metrics)
● Experimentation platform
● Measurement (business metrics)
Evaluate MLOps tools
1. Does it work with your cloud provider?
2. Open-source or managed service?
3. Data security requirements
Machine Learning Systems Design
Next class: Final project workshops