Machine Learning
What does infrastructure mean?
ML systems are complex
● More components
○ A request might jump 20-30 hops before response
○ A problem occurs, but where?
[Diagram: a request hops across load balancers, mesh routing, microservices, containers, schedulers, serverless/Lambda functions]
More complex systems, better infrastructure needed
● Automate boilerplate code
● Reuse/share code
● Reduce surface area for bugs
● Infrastructure: the set of fundamental facilities and systems that support the
sustainable functionality of households and firms.
● ML infrastructure: the set of fundamental facilities that support the
development and maintenance of ML systems.
Every company’s infrastructure needs are different
● e.g. 63K requests/sec | 234M requests/hr
[Survey chart: n=66 | claypot.ai]
Infrastructure Layers
Storage & Compute Layer
Storage
● Where data is collected and stored
● Simplest form: HDD, SSD
● More complex forms: data lake, data warehouse
● Examples: S3, Redshift, Snowflake, BigQuery
See Lecture 2
Storage: heavily commoditized
● Most companies use storage provided by other companies (e.g. cloud)
● Storage has become so cheap that most companies just store everything
Compute layer: engine to execute your jobs
● Compute resources a company has access to
● Mechanism to determine how these resources can be used
● Simplest form: a single CPU/GPU core
● Most common form: cloud compute
Compute unit
● Compute layer can be sliced into smaller compute units to be used concurrently
○ A CPU core might support 2 concurrent threads; each thread is used as a compute unit to execute its own job
○ Multiple CPUs can be joined to form a large compute unit to execute a large job
[Diagram: examples of compute units: a job as the unit, or a pod (a wrapper around containers) as the unit]
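A minimal sketch of the first sub-bullet, using only Python's standard library: a pool of two threads, each acting as a compute unit that executes its own job (the job function is illustrative).

from concurrent.futures import ThreadPoolExecutor

def job(x):
    # stand-in for real work, e.g. preprocessing one shard of data
    return x * x

# 2 workers ~ 2 concurrent threads, each used as a compute unit
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(job, range(8)))
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]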
Compute layer: how to execute jobs
1. Load data into memory
2. Perform operations on that data
a. Operations: add, subtract, multiply, convolution, etc.
Compute layer: memory
● Amount of memory
○ Straightforward
○ An instance with 8GB of memory is more expensive than an instance with 2GB of memory
● I/O bandwidth: speed at which data can be loaded into memory
Compute layer: speed of ops
● Most common metric: FLOPS
○ Floating Point Operations Per Second
- Google, 2021
Compute layer: speed of ops
● Most common metric: FLOPS
● Contentious
○ What exactly is an op?
■ If 2 ops are fused together, is it 1 or 2 ops?
○ Peak perf at 1 teraFLOPS doesn’t mean your app will run at 1 teraFLOPS
Compute layer: utilization
● Utilization = actual FLOPS / peak FLOPS (the higher, the better)
● Dependent on how fast data can be loaded into memory
“Tensor Cores are very fast. So fast … that they are idle most of the time as they are waiting for memory to arrive from global memory. For example, during BERT Large training, which uses huge matrices — the larger, the better for Tensor Cores — we have utilization of about 30%.” – Tim Dettmers
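A small worked example of the utilization formula (the hardware numbers below are illustrative, not from the lecture):

# Hypothetical accelerator: 312 teraFLOPS peak; profiling shows the job
# sustains ~94 teraFLOPS because it spends much of its time waiting on memory.
peak_flops = 312e12
actual_flops = 94e12
utilization = actual_flops / peak_flops
print(f"utilization: {utilization:.0%}")  # ~30%, matching the BERT Large figure quoted above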
Compute layer: if not FLOPS, then what?
● How long it will take this compute unit to do common workloads
● MLPerf measures hardware on common ML tasks, e.g.
○ Train a ResNet-50 model on the ImageNet dataset
○ Use a BERT-large model to generate predictions for the SQuAD dataset
MLPerf is also contentious
Compute layer: evaluation
● Memory
● Cores
● I/O bandwidth
● Cost
Public Cloud vs. Private Data Centers
● Like storage, compute is largely commoditized
(Source: 2020 – The Year that Cloud Service Revenues Finally Dwarfed Enterprise Spending on Data Centers | Synergy Research Group)
Benefits of cloud
● Easy to get started
● Appealing to variable-sized workloads
○ Private: would need 100 machines upfront; most will be idle most of the time
○ Cloud: pay for 100 machines only when needed (autoscaling!)
Drawbacks of cloud: cost
● Cloud spending: ~50% of cost of revenue
“Across 50 of the top public software companies currently utilizing cloud infrastructure, an estimated $100B of market value is being lost … due to cloud impact on margins — relative to running the infrastructure themselves.”
The Cost of Cloud, a Trillion Dollar Paradox | Andreessen Horowitz (2021)
Cloud repatriation
● Process of moving workloads from cloud to private data centers
Multicloud strategy
● To optimize cost
● To avoid cloud vendor lock-in
Development Environment
● Text editors & notebooks
○ Text editors: where you write code, e.g. VSCode, Vim
○ Notebooks: Jupyter notebooks, Colab
■ Also work with arbitrary artifacts that aren’t code (e.g. images, plots, tabular data)
■ Stateful: only need to rerun from the failed step instead of from the beginning
■ [Example: notebook infrastructure at Netflix]
● Versioning
○ Git: code versioning
○ DVC: data versioning
○ WandB: experiment versioning
● CI/CD test suite: test your code before pushing it to staging/prod
Dev env: underestimated
“if you have time to set up only one piece of infrastructure well, make it
the development environment for data scientists.”
Ville Tuulos, Effective Data Science Infrastructure (2022)
Standardize dev environments
● Standardize dependencies with versions
● Standardize tools & versions
● Standardize hardware: cloud dev env
○ Simplify IT support
○ Security: revoke access if laptop is stolen
○ Bring your dev env closer to prod env
○ Make debugging easier
Dev to prod
● Elastic compute: can stop/start instances at will
● How to recreate the required environment in a new instance?
Container
● Step-by-step instructions on how to recreate an environment in which your
model can run:
○ install this package
○ download this pretrained model
○ set environment variables
○ navigate into a folder
○ etc.
Transformers Dockerfile
● CUDA/cuDNN
● bash/git/python3
● Jupyter notebook
● TensorFlow/Pytorch
● transformers
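A minimal sketch of how such an environment could be rebuilt programmatically, using the Docker SDK for Python (docker-py); the base image, packages, and model are illustrative, not the actual Transformers Dockerfile.

import io
import docker  # pip install docker

# The step-by-step instructions above, expressed as a (hypothetical) Dockerfile
dockerfile = b"""
FROM python:3.10-slim
# install these packages
RUN pip install transformers torch
# set environment variables
ENV HF_HOME=/models
# navigate into a folder
WORKDIR /app
# download this pretrained model
RUN python -c "from transformers import pipeline; pipeline('sentiment-analysis')"
"""

client = docker.from_env()
image, logs = client.images.build(fileobj=io.BytesIO(dockerfile), tag="my-model-env")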
Multiple containers: dependency management
● Different containers can pin conflicting dependencies, e.g. one app on NumPy 1.21, another on NumPy 0.8
Multiple containers: cost saving
● Different steps need different resources, e.g. one container needs more memory but a CPU is enough, while another needs less memory but needs a GPU
Container orchestration
● Helps deploy and manage containerized applications on a serverless cluster
● Spins containers up/down
Breakout exercise
Group of 4, 10 mins
● What have been the most difficult parts of working on the project?
● What else do you need to work on for the final demo?
Resource Management
Resource management
● Pre-cloud: resources are finite; more resources for one app means fewer resources for other apps
● Cloud: resources are practically infinite; more resources for one app don’t have to affect other apps
● This simplifies the allocation challenge
ML workloads
● Repetitive
○ Batch prediction
○ Periodic retraining
○ Periodic analytics
● Dependencies
○ E.g. train depends on featurize
Cron: extremely simple
● Schedule jobs to run at fixed time intervals
● Report the results
● Can’t handle dependencies between jobs (e.g. run train only after featurize succeeds)
Scheduler
● Schedulers are cron programs that can handle dependencies
Scheduler
● Most schedulers require you to specify your workloads as a DAG:
○ Directed
○ Acyclic
○ Graph
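A minimal sketch of what handling dependencies means, using only Python's standard library; the job names follow the earlier example, where train depends on featurize.

from graphlib import TopologicalSorter  # Python 3.9+

# Each job maps to the set of jobs it depends on
dag = {
    "featurize": set(),
    "train": {"featurize"},
    "batch_predict": {"train"},
}

def run(job):
    print(f"running {job}")

# Execute jobs in an order that respects the dependency edges
for job in TopologicalSorter(dag).static_order():
    run(job)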
Scheduler
● Can handle event-based & time-based triggers
○ Run job A whenever X happens
● If a job fails, specify how many times to retry before giving up
● Jobs can be queued, prioritized, and allocated resources
○ If a job requires 8GB of memory and 2 CPUs, scheduler needs to find an instance with 8GB of
memory and 2 CPUs
Scheduler: SLURM example
#!/bin/bash
#SBATCH -J JobName
#SBATCH --time=11:00:00 # Maximum run time (time limit) for the job
#SBATCH --mem-per-cpu=4096 # Memory, in MB, to be allocated per CPU
#SBATCH --cpus-per-task=4 # Number of cores per task
Scheduler: optimize utilization
● Schedulers are aware of:
○ resources available
○ resources needed for each job
● Sophisticated schedulers (e.g. Google Borg) can reclaim unused resources
○ If I estimate that my job needs 8GB and it only uses 4GB, reclaim 4GB for other jobs
Scheduler challenge
● General purpose schedulers are extremely hard to design
● Need to handle any workload with any number of concurrent machines
● If the scheduler is down, every workflow it touches will also be down
Scheduler to Orchestrator
● Scheduler: when to run jobs
○ Handle jobs, queues, user-level quotas, etc.
○ Typically used for periodic jobs like batch jobs
● Orchestrator: where to run jobs
○ Handle containers, instances, clusters, replication, etc.
○ Provision: allocate more instances to the instance pool as needed
○ Typically used for long-running jobs like services
Scheduler & orchestrator
● Schedulers usually have some orchestrating capacity and vice versa
○ Schedulers like SLURM and Google’s Borg have some orchestrating capacity
○ Orchestrators like HashiCorp Nomad and K8s come with some scheduling capacity
● Often, schedulers are run on top of orchestrators
○ Run Spark’s job scheduler on top of K8s
○ Run AWS Batch scheduler on top of EKS
Data science workflow management
Data science workflow
● Can be defined using:
○ Code (Python)
○ Configuration files (YAML)
● Examples: Airflow, Argo, KubeFlow, Metaflow
Airflow
● 1st-gen data science workflow management
● Champion of “configuration-as-code”
● Wide range of operators to expand capabilities
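A minimal sketch of Airflow's configuration-as-code style (DAG id, schedule, and task bodies are illustrative):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def featurize():
    ...  # compute features

def train():
    ...  # train on the features

with DAG("train_pipeline", start_date=datetime(2023, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    featurize_task = PythonOperator(task_id="featurize", python_callable=featurize)
    train_task = PythonOperator(task_id="train", python_callable=train)
    featurize_task >> train_task  # train runs only after featurize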
Airflow: cons
● Monolithic
○ The entire workflow as a container
● Non-parameterized
○ E.g. need to define another workflow if you want to change the learning rate
● Static DAG
○ Can’t handle workloads with an unknown number of records
Argo: next gen
● Created to address Airflow’s problems
○ Containerized
○ Fully parameterized
○ Dynamic DAG
Argo: cons
● YAML-based configs
○ Can get very messy
● Only runs on K8s clusters
○ Can’t easily test in dev environment
Kubeflow & Metaflow: same code in dev & prod
● Allows data scientists to use the same code in both dev and prod environments
Kubeflow: more mature but more boilerplate
Metaflow: less mature but cleaner API
● Run notebook code in the cloud with a line of code (@batch)
○ Run experiments locally
○ Once ready, run code on AWS Batch
● Can run different steps of the same workflow in different envs
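A minimal sketch of the @batch pattern (the CPU/memory numbers are illustrative):

from metaflow import FlowSpec, batch, step

class TrainFlow(FlowSpec):

    @step
    def start(self):
        # runs locally
        self.next(self.train)

    @batch(cpu=4, memory=16000)  # this one step runs on AWS Batch
    @step
    def train(self):
        # runs in the cloud
        self.next(self.end)

    @step
    def end(self):
        # back to running locally
        pass

if __name__ == "__main__":
    TrainFlow()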
ML Platform
ML platform: story time
1. Anna started working on recsys at company X
2. To deploy the recsys, Anna’s team needed to build tools like model deployment, model store, feature store, etc.
3. Other teams at X started deploying models and needed to build the same tools
4. X decided to build a centralized platform to serve multiple ML use cases: the ML platform
ML platform: key components
● Model deployment
● Model store
● Feature store
Deployment: online | batch prediction
See Lecture 8
● Deployment service:
○ Package your model & dependencies
○ Push the package to production
○ Expose an endpoint for prediction
● The most common MLOps tool
○ Cloud providers: SageMaker (AWS), Vertex AI (GCP), AzureML (Azure), etc.
○ Independent: MLflow Models, Seldon, Cortex, Ray Serve, etc.
● Not all can do batch + online prediction well
○ e.g. some companies use Seldon for online prediction, but Databricks for batch
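A minimal sketch of what "expose an endpoint for prediction" can look like; Flask and the pickled artifact are illustrative choices, not any particular deployment service's API.

import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical packaged model artifact
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.json["features"]
    prediction = model.predict([features])[0]
    return jsonify({"prediction": float(prediction)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)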
Deployment service: model quality challenge
● How to ensure a model’s quality before and during deployment?
○ Traditional code: CI/CD, PR review
○ ML: ???, ???
Model store
● Simplest form: store all models in blob storage like S3
● Problem: when something happens, how to figure out:
○ Who/which team is responsible for this model?
○ Was the correct model binary deployed?
○ Were the correct features used?
○ Is the code up-to-date?
○ Did something happen with the data pipeline?
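A minimal sketch of the "simplest form" above, using boto3 (the bucket and key names are illustrative):

import boto3

s3 = boto3.client("s3")
# Upload a serialized model; the key encodes team/model/version only by convention
s3.upload_file("model.pkl", "my-models-bucket", "recsys/v3/model.pkl")

As the questions above suggest, a raw blob store on its own answers none of them; that is what artifact tracking adds.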
Model store: artifact tracking
● Track all metadata necessary to debug a model later
● Severely underestimated
Model store: artifact tracking at Stitch Fix
Feature store: key challenges
1. Feature management
a. Multiple models might share features, e.g. churn prediction & conversion prediction
b. How to allow different teams to find & use high-value features discovered by other teams?
2. Feature consistency
a. During training, features might be written in Python
b. During deployment, features might be written in Java
c. How to ensure consistency between different feature pipelines?
3. Feature computation
a. It might be expensive to compute the same feature multiple times for different models
b. How to store computed features so that other models can use them?
Solutions: a feature catalog addresses feature management; a data warehouse addresses feature computation.
Other ML platform components
● Monitoring (ML & ops metrics)
● Experimentation platform
● Measurement (business metrics)
Evaluate MLOps tools
1. Does it work with your cloud provider?
2. Open-source or managed service?
3. Data security requirements
Machine Learning Systems Design
Next class: Final project workshops