Machine Learning

Machine Learning Systems Design

Lecture 15: ML Infrastructure & Platform

Reply in Zoom chat:

Are you ready for demo day?

CS 329S (Chip Huyen, 2022) | cs329s.stanford.edu


Logistics
● Demo Day in person, Mar 9
● [TBD] Will likely move to 3.15pm - 6.15pm
● Final project workshop next Wed!

2
What does infrastructure mean?

3
ML systems are complex
● More components
○ A request might jump 20-30 hops before response
○ A problem occurs, but where?
■ Containers
■ Schedulers
■ Serverless
■ Lambda functions
■ Mesh routing
■ Load balancers
■ Microservices
4
More complex systems, better infrastructure needed

● Automate boilerplate code
● Reuse/share code
● Reduce surface area for bugs

5
● Infrastructure: the set of fundamental facilities and systems that support the
sustainable functionality of households and firms.
● ML infrastructure: the set of fundamental facilities that support the
development and maintenance of ML systems.

6
Every company’s infrastructure needs are different

63K requests/sec
234M requests/hr

● GB - TBs of data daily


● 10s - 100s data scientists
● 3+ models

9
Every company’s infrastructure needs are different

63K requests/sec
234M requests/hr

Vast majority of apps (reasonable scale)

10
n=66 | claypot.ai
11
Infrastructure Layers

12
Storage & Compute Layer

14
Storage
● Where data is collected and stored
● Simplest form: HDD, SSD
● More complex forms: data lake, data warehouse
● Examples: S3, Redshift, Snowflake, BigQuery

See Lecture 2

15
Storage: heavily commoditized
● Most companies use storage provided by other companies (e.g. cloud)
● Storage has become so cheap that most companies just store everything

16
Compute layer: engine to execute your jobs
● Compute resources a company has access to
● Mechanism to determine how these resources can be used

17
Compute layer: engine to execute jobs
● Simplest form: a single CPU/GPU core
● Most common form: cloud compute

18
Compute unit
● Compute layer can be sliced into smaller compute units to be used
concurrently
○ A CPU core might support 2 concurrent threads, each thread is used as a compute unit to
execute its own job
○ Multiple CPUs can be joined to form a large compute unit to execute a large job

19
Compute unit
● Unit: job
● Unit: pod (wrapper around container)

20
Compute layer: how to execute jobs
1. Load data into memory
2. Perform operations on that data
a. Operations: add, subtract, multiply, convolution, etc.

To add arrays A and B


1. Load A & B into memory
2. Perform addition on A and B

21
Compute layer: how to execute jobs
1. Load data into memory
2. Perform operations on that data
a. Operations: add, subtract, multiply, convolution, etc.

If A & B don’t fit into memory, it’ll be impossible to do the ops without out-of-memory algorithms

To add arrays A and B


1. Load A & B into memory
2. Perform addition on A and B
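
One way to read the point about data that doesn’t fit into memory is to process it in pieces. The following is a minimal sketch, assuming A and B can be streamed element by element (for example, read from disk). The names `add_in_chunks` and `chunk_size` are illustrative and not from the lecture.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def add_in_chunks(
    a: Iterable[float], b: Iterable[float], chunk_size: int
) -> Iterator[List[float]]:
    """Yield A + B chunk by chunk, loading only one chunk of each input at a time."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    a_iter, b_iter = iter(a), iter(b)
    while True:
        # Step 1: load the next piece of A and B into memory.
        a_chunk = list(islice(a_iter, chunk_size))
        b_chunk = list(islice(b_iter, chunk_size))
        if not a_chunk and not b_chunk:
            return
        if len(a_chunk) != len(b_chunk):
            raise ValueError("A and B must have the same length")
        # Step 2: perform the operation on the loaded piece.
        yield [x + y for x, y in zip(a_chunk, b_chunk)]


for partial_sum in add_in_chunks(range(5), range(5), chunk_size=2):
    print(partial_sum)
```

Each yielded chunk can be written back to storage before the next one is loaded, so peak memory depends on `chunk_size` rather than on the size of A and B.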

22
Compute layer: how to execute jobs
1. Load data into memory
2. Perform operations on that data
a. Operations: add, subtract, multiply, convolution, etc.

To add arrays A and B

1. Load A & B into memory
2. Perform addition on A and B

Important metrics of compute layer:

1. Memory
2. Speed of computing ops

23
Compute layer: memory
● Amount of memory
○ Straightforward
○ An instance with 8GB of memory is more expensive than an instance with 2GB of memory

24
Compute layer: memory
● Amount of memory
● I/O bandwidth: speed at which data can be loaded into memory

25
Compute layer: speed of ops
● Most common metric: FLOPS
○ Floating Point Operations Per Second

“A Cloud TPU v2 can perform up to 180 teraflops, and the TPU v3 up to 420 teraflops.”
- Google, 2021

26
Compute layer: speed of ops
● Most common metric: FLOPS
● Contentious
○ What exactly is an op?
■ If 2 ops are fused together, is it 1 or 2 ops?
○ Peak perf at 1 teraFLOPS doesn’t mean your app will run at 1 teraFLOPS

27
Compute layer: utilization
● Utilization = actual FLOPS / peak FLOPS

If peak is 1 trillion FLOPS but the job runs at 300 billion FLOPS
-> utilization = 0.3
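
The utilization arithmetic can be written out as a short sketch. The function name `utilization` and its argument names are illustrative.

```python
def utilization(actual_flops: float, peak_flops: float) -> float:
    """Fraction of the hardware's peak FLOPS that a job actually achieves."""
    if peak_flops <= 0:
        raise ValueError("peak_flops must be positive")
    return actual_flops / peak_flops


# Peak of 1 trillion FLOPS, job running at 300 billion FLOPS.
print(utilization(actual_flops=300e9, peak_flops=1e12))  # prints 0.3
```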

28
Compute layer: utilization
● Utilization = actual FLOPS / peak FLOPS
○ The higher, the better
● Dependent on how fast data can be loaded into memory

Tensor Cores are very fast. So fast … that they are idle most of the time as they
are waiting for memory to arrive from global memory.

For example, during BERT Large training, which uses huge matrices — the
larger, the better for Tensor Cores — we have utilization of about 30%.

- Tim Dettmers, 2020


29
Compute layer: if not FLOPS, then what?

30
Compute layer: if not FLOPS, then what?
● How long it will take this compute unit to do common workloads
● MLPerf measures hardware on common ML tasks, e.g.
○ Train a ResNet-50 model on the ImageNet dataset
○ Use a BERT-large model to generate predictions for the SQuAD dataset

31
MLPerf is also contentious

32
Compute layer: evaluation
● Memory
● Cores
● I/O bandwidth
● Cost

33
Public Cloud vs. Private Data Centers
● Like storage, compute is largely commoditized

34
2020 – The Year that Cloud Service Revenues Finally Dwarfed Enterprise Spending on Data Centers | Synergy Research Group
Benefits of cloud
● Easy to get started
● Appealing to variable-sized workloads
○ Private: would need 100 machines upfront, most will be idle most of the time
○ Cloud: pay for 100 machines only when needed

35
Benefits of cloud
● Easy to get started
● Appealing to variable-sized workloads
○ Private: would need 100 machines upfront, most will be idle most of the time
○ Cloud: pay for 100 machines only when needed

Autoscaling!
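
A rough way to see why variable-sized workloads favor the cloud is to count machine-hours. This sketch assumes a private data center must be sized for peak demand, while autoscaled cloud capacity follows demand hour by hour. The demand numbers and function names are hypothetical, and real pricing differs per machine-hour between the two options.

```python
from typing import Sequence


def private_machine_hours(hourly_demand: Sequence[int]) -> int:
    """Machines bought upfront for peak demand exist for every hour, used or idle."""
    return max(hourly_demand) * len(hourly_demand)


def cloud_machine_hours(hourly_demand: Sequence[int]) -> int:
    """With autoscaling, only the machines needed in each hour are paid for."""
    return sum(hourly_demand)


demand = [5, 5, 100, 10]  # hypothetical number of machines needed in each hour
print(private_machine_hours(demand))  # prints 400
print(cloud_machine_hours(demand))    # prints 120
```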

36
Drawbacks of cloud: cost
● Cloud spending: ~50% cost of revenue

37
The Cost of Cloud, a Trillion Dollar Paradox | Andreessen Horowitz (2021)
Drawbacks of cloud: cost
“Across 50 of the top public software companies currently utilizing cloud
infrastructure, an estimated $100B of market value is being lost … due to cloud
impact on margins — relative to running the infrastructure themselves.”

The Cost of Cloud, a Trillion Dollar Paradox | Andreessen Horowitz (2021)

38
Cloud repatriation
● Process of moving workloads from cloud to private data centers

A large chunk
due to cloud
repatriation

39
Multicloud strategy
● To optimize cost
● To avoid cloud vendor lock-in

“81% of respondents said they are working with two or more providers”
- Gartner (2019)

40
Development Environment

41
Development Environment
● Text editors & notebooks
○ Where you write code, e.g. VSCode, Vim

42
Development Environment
● Notebook: Jupyter notebooks, Colab
○ Also works with arbitrary artifacts that aren’t code (e.g. images, plots, tabular data)
○ Stateful
■ Only need to run from the failed step instead of from the beginning

43
Development Environment
● Notebook at Netflix

44
Development Environment
● Text editors & notebooks
● Versioning
○ Git: code versioning
○ DVC: data versioning
○ WandB: experiment versioning

45
Development Environment
● Text editors & notebooks
● Versioning
● CI/CD test suite: test your code before pushing it to staging/prod

46
Dev env: underestimated

“if you have time to set up only one piece of infrastructure well, make it
the development environment for data scientists.”
Ville Tuulos, Effective Data Science Infrastructure (2022)

47
Standardize dev environments
● Standardize dependencies with versions

48
Standardize dev environments
● Standardize dependencies with versions
● Standardize tools & versions

49
Standardize dev environments
● Standardize dependencies with versions
● Standardize tools & versions
● Standardize hardware: cloud dev env
○ Simplify IT support
○ Security: revoke access if laptop is stolen
○ Bring your dev env closer to prod env
○ Make debugging easier

50
Dev to prod
● Elastic compute: can stop/start instances at will
● How to recreate the required environment in a new instance?

51
Container
● Step-by-step instructions on how to recreate an environment in which your
model can run:
○ install this package
○ download this pretrained model
○ set environment variables
○ navigate into a folder
○ etc.

52
Transformers Dockerfile

● CUDA/cuDNN
● bash/git/python3
● Jupyter notebook
● TensorFlow/PyTorch
● transformers

53
Multiple containers: dependency management

● NumPy 1.21

● NumPy 0.8

54
Multiple containers: cost saving

● More memory
● CPU ok

● Less memory
● Need GPU

55
Container orchestration
● Helps deploy and manage containerized applications on a serverless cluster
● Spinning up/down containers

56
Breakout exercise

57
Group of 4, 10 mins
● What have been the most difficult parts of working on the project?
● What else do you need to work on for the final demo?

58
Resource Management

59
Resource management

              Pre-cloud                         Cloud

Resources     Finite                            Practically infinite

Implication   More resources for an app =       More resources for an app
              fewer resources for other apps    don’t have to affect other apps

Goal          Utilization                       Utilization + cost efficiency

60
Resource management

● Practically infinite resources simplify the allocation challenge

61
Resource management

● In the cloud, it's OK to use more resources if that helps engineers be more productive

62
ML workloads
● Repetitive
○ Batch prediction
○ Periodical retraining
○ Periodical analytics
● Dependencies
○ E.g. train depends on featurize

63
Cron: extremely simple
● Schedule jobs to run at fixed time intervals
● Report the results

65
Cron: extremely simple
● Cron can’t handle dependencies between jobs

66
Scheduler
● Schedulers are cron programs that can
handle dependencies

67
Scheduler
● Most schedulers require you to specify your
workloads as DAGs

This is a DAG
● Directed
● Acyclic
● Graph
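
To make the DAG idea concrete, here is a minimal sketch that orders jobs so each one runs only after the jobs it depends on. It uses `graphlib.TopologicalSorter` from the Python standard library (3.9+). The job names are hypothetical, apart from "train depends on featurize", which comes from the earlier slide.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Map each job to the set of jobs it depends on (hypothetical workflow).
dependencies: Dict[str, Set[str]] = {
    "featurize": {"extract"},
    "train": {"featurize"},
    "evaluate": {"train"},
    "batch_predict": {"train"},
}


def execution_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Return one valid run order; raises graphlib.CycleError if deps has a cycle."""
    return list(TopologicalSorter(deps).static_order())


print(execution_order(dependencies))
```

The "acyclic" requirement is what makes such an order exist: if two jobs depended on each other, neither could start first.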

68
Scheduler
● Can handle event-based & time based triggers
○ Run job A whenever X happens
● If a job fails, specify how many times to retry before giving up
● Jobs can be queued, prioritized, and allocated resources
○ If a job requires 8GB of memory and 2 CPUs, scheduler needs to find an instance with 8GB of
memory and 2 CPUs
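
The retry and resource-matching behavior can be sketched as follows. This is a simplified illustration with a first-fit placement rule, which is an assumption here; real schedulers use more elaborate policies. All class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Instance:
    name: str
    free_memory_gb: int
    free_cpus: int


@dataclass
class Job:
    name: str
    memory_gb: int
    cpus: int
    max_retries: int = 2  # retries allowed after the first attempt


def find_instance(job: Job, instances: List[Instance]) -> Optional[Instance]:
    """First fit: return the first instance with enough free memory and CPUs."""
    for instance in instances:
        if instance.free_memory_gb >= job.memory_gb and instance.free_cpus >= job.cpus:
            return instance
    return None


def run_with_retries(job: Job, run: Callable[[], None]) -> bool:
    """Run the job once, then retry up to job.max_retries times before giving up."""
    for _ in range(job.max_retries + 1):
        try:
            run()
            return True
        except Exception:
            continue  # a real scheduler would log the failure and back off here
    return False


pool = [Instance("small", free_memory_gb=4, free_cpus=2),
        Instance("large", free_memory_gb=16, free_cpus=8)]
print(find_instance(Job("train", memory_gb=8, cpus=2), pool).name)  # prints large
```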

69
Scheduler: SLURM example

#!/bin/bash
#SBATCH -J JobName
#SBATCH --time=11:00:00 # Time limit for the job (wall-clock time)
#SBATCH --mem-per-cpu=4096 # Memory, in MB, to be allocated per CPU
#SBATCH --cpus-per-task=4 # Number of cores per task

70
Scheduler: optimize utilization
● Schedulers aware of:
○ resources available
○ resources needed for each job
● Sophisticated schedulers (e.g. Google Borg) can reclaim unused resources
○ If I estimate that my job needs 8GB and it only uses 4GB, reclaim 4GB for other jobs
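
The reclaiming arithmetic from the example above, as a small sketch. This only illustrates the bookkeeping and is not a description of how Borg implements it. The function name and the job names are hypothetical.

```python
from typing import Dict, Tuple


def reclaimable_gb(requested_gb: float, used_gb: float) -> float:
    """Memory a scheduler could hand to other jobs: requested minus actually used."""
    return max(requested_gb - used_gb, 0.0)


# Hypothetical jobs: name -> (requested GB, used GB).
jobs: Dict[str, Tuple[float, float]] = {"my_job": (8, 4), "other_job": (2, 2)}
total = sum(reclaimable_gb(req, used) for req, used in jobs.values())
print(total)  # prints 4.0
```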

71
Scheduler challenge
● General purpose schedulers are extremely hard to design
● Need to handle any workload with any number of concurrent machines
● If scheduler is down, every workflow this scheduler touches will also be
down

72
Scheduler to Orchestrator
● Scheduler: when to run jobs
● Orchestrator: where to run jobs

73
Scheduler to Orchestrator
● Scheduler: when to run jobs
○ Handle jobs, queues, user-level quotas, etc.
● Orchestrator: where to run jobs
○ Handle containers, instances, clusters, replication, etc.
○ Provision: allocate more instances to the instance pool as needed

74
Scheduler to Orchestrator
● Scheduler: when to run jobs
○ Handle jobs, queues, user-level quotas, etc.
○ Typically used for periodical jobs like batch jobs
● Orchestrator: where to run jobs
○ Handle containers, instances, clusters, replication, etc.
○ Provision: allocate more instances to the instance pool as needed
○ Typically used for long-running jobs like services

75
Scheduler & orchestrator
● Schedulers usually have some orchestrating capacity and vice versa
○ Schedulers like SLURM and Google’s Borg have some orchestrating capacity
○ Orchestrators like HashiCorp Nomad and K8s come with some scheduling capacity
● Often, schedulers are run on top of orchestrators
○ Run Spark’s job scheduler on top of K8s
○ Run AWS Batch scheduler on top of EKS

76
Data science workflow management

77
Data science workflow
● Can be defined using:
○ Code (Python)
○ Configuration files (YAML)
● Examples: Airflow, Argo, Kubeflow, Metaflow

78
Airflow
● 1st gen data science workflow management
● Champion of “configuration-as-code”
● Wide range of operators to expand capabilities

79
Airflow: cons
● Monolithic
○ The entire workflow as a container
● Non-parameterized
○ E.g. need to define another workflow if you want to change learning rate
● Static DAG
○ Can’t handle workloads with unknown number of records

80
Argo: next gen
● Created to address Airflow’s problems
○ Containerized
○ Fully parameterized
○ Dynamic DAG

81
Argo: cons
● YAML-based configs
○ Can get very messy
● Only runs on K8s clusters
○ Can’t easily test in dev environment

82
Kubeflow & Metaflow: same code in dev & prod
● Allows data scientists to use the same code in both dev and prod
environments

83
Kubeflow: more mature but more boilerplate

84
Metaflow: less mature but cleaner API
● Run notebook code in the cloud with a line of code (@batch)
○ Run experiments locally
○ Once ready, run code on AWS Batch
● Can run different steps of the same workflow in different envs

85
ML Platform

86
Model platform: story time
1. Anna started working on recsys at company X
2. To deploy recsys, Anna’s team needs to build tools like model deployment, model store, feature store, etc.
3. Other teams at X started deploying models and needed to build the same
tools
4. X decided to have a centralized platform to serve multiple ML use cases

ML Platform

87
ML platform: key components
● Model deployment
● Model store
● Feature store

88
Deployment: online | batch prediction
See Lecture 8
● Deployment service:
○ Package your model & dependencies
○ Push the package to production
○ Expose an endpoint for prediction
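
As a rough illustration of the "expose an endpoint" step, the sketch below wraps a model in a function that takes a JSON request body and returns a JSON response. The request format (`{"features": [...]}`), `make_endpoint`, and `toy_model` are assumptions for illustration; a real deployment service would also handle packaging, HTTP serving, authentication, and scaling.

```python
import json
from typing import Callable, Sequence


def make_endpoint(model: Callable[[Sequence[float]], float]) -> Callable[[str], str]:
    """Wrap a model in a handler that maps a JSON request to a JSON response."""

    def predict_endpoint(request_body: str) -> str:
        try:
            features = json.loads(request_body)["features"]
        except (json.JSONDecodeError, KeyError, TypeError):
            return json.dumps({"error": "expected a JSON object with a 'features' list"})
        return json.dumps({"prediction": model(features)})

    return predict_endpoint


def toy_model(features: Sequence[float]) -> float:
    """Stand-in for a trained model: returns the mean of the features."""
    return sum(features) / len(features)


endpoint = make_endpoint(toy_model)
print(endpoint('{"features": [1, 2, 3]}'))
```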

89
Deployment: online | batch prediction
● Deployment service:
○ Package your model & dependencies
○ Push the package to production
○ Expose an endpoint for prediction
● The most common MLOps tool
○ Cloud providers: SageMaker (AWS), Vertex AI (GCP), AzureML (Azure), etc.
○ Independent: MLflow Models, Seldon, Cortex, Ray Serve, etc.

90
Deployment: online | batch prediction
● Deployment service:
○ Package your model & dependencies
○ Push the package to production
○ Expose an endpoint for prediction
● The most common MLOps tool
● Not all can do batch + online prediction well
○ e.g. some companies use Seldon for online prediction, but Databricks for batch

91
Deployment service: model quality challenge
● How to ensure a model’s quality pre- and during deployment?
○ Traditional code: CI/CD, PR review
○ ML: ???, ???

92
Model store
● Simplest form: store all models in blob storage like S3
● Problem:
○ When something happens, how to figure out:
■ Who/which team is responsible for this model?
■ If the correct model binary was deployed?
■ If the features used are correct?
■ If the code is up-to-date?
■ If something happened with the data pipeline?

93
Model store: artifact tracking
● Track all metadata necessary to debug
a model later
● Severely underestimated
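
A minimal sketch of what tracked metadata might look like, with one field per debugging question from the previous slide (owner, binary, features, code, data pipeline). The field names and the use of a SHA-256 fingerprint are assumptions for illustration, not a description of any particular model store.

```python
import hashlib
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ModelRecord:
    model_name: str
    owner_team: str                 # who is responsible for this model?
    binary_sha256: str              # was the correct model binary deployed?
    feature_names: Tuple[str, ...]  # are the features used correct?
    code_commit: str                # is the code up to date?
    data_pipeline_version: str      # did something change in the data pipeline?


def fingerprint(model_binary: bytes) -> str:
    """Content hash used to identify a model binary."""
    return hashlib.sha256(model_binary).hexdigest()


def is_deployed_binary_correct(record: ModelRecord, deployed_binary: bytes) -> bool:
    """Compare what is running in production against what was registered."""
    return fingerprint(deployed_binary) == record.binary_sha256


binary = b"hypothetical serialized model"
record = ModelRecord(
    model_name="recsys",
    owner_team="recommendations",
    binary_sha256=fingerprint(binary),
    feature_names=("user_age", "item_category"),
    code_commit="abc123",          # placeholder commit id
    data_pipeline_version="v1",    # placeholder version
)
print(is_deployed_binary_correct(record, binary))  # prints True
```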

94
Model store: artifact tracking at Stitch Fix

95
Feature store: key challenges
1. Feature management
a. Multiple models might share features, e.g. churn prediction & conversion prediction
b. How to allow different teams to find & use high-value features discovered by other teams?

96
Feature store: key challenges
1. Feature management
2. Feature consistency
a. During training, features might be written in Python
b. During deployment, features might be written in Java
c. How to ensure consistency between different feature pipelines?

97
Feature store: key challenges
1. Feature management
2. Feature consistency
3. Feature computation
a. It might be expensive to compute the same feature multiple times for different models
b. How to store computed features so that other models can use?
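
The feature computation challenge can be illustrated with a small in-memory cache: a feature is computed once per entity and then reused by any model that asks for it. This is a sketch under simplifying assumptions (no persistence, no expiry or freshness handling). `FeatureCache` and the feature names are hypothetical.

```python
from typing import Callable, Dict, Hashable, Tuple


class FeatureCache:
    """Compute each (feature, entity) value at most once and share it across models."""

    def __init__(self) -> None:
        self._compute_fns: Dict[str, Callable[[Hashable], float]] = {}
        self._values: Dict[Tuple[str, Hashable], float] = {}

    def register(self, name: str, compute_fn: Callable[[Hashable], float]) -> None:
        """Register how a named feature is computed for an entity."""
        self._compute_fns[name] = compute_fn

    def get(self, name: str, entity_id: Hashable) -> float:
        """Return the stored value, computing and storing it on first request."""
        key = (name, entity_id)
        if key not in self._values:
            self._values[key] = self._compute_fns[name](entity_id)
        return self._values[key]


calls = []


def expensive_purchase_count(user_id: Hashable) -> float:
    calls.append(user_id)  # track how often the expensive computation runs
    return 42.0            # placeholder value


store = FeatureCache()
store.register("purchase_count_30d", expensive_purchase_count)
churn_input = store.get("purchase_count_30d", "user_1")       # computed here
conversion_input = store.get("purchase_count_30d", "user_1")  # reused here
print(len(calls))  # prints 1
```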

98
Feature store: key challenges
1. Feature management -> Feature catalog
2. Feature consistency
3. Feature computation -> Data warehouse

99
Other ML platform components
● Monitoring (ML & ops metrics)
● Experimentation platform
● Measurement (business metrics)

100
Evaluate MLOps tools
1. Does it work with your cloud provider?
2. Open-source or managed service?
3. Data security requirements

101
Machine Learning Systems Design
Next class: Final project workshops

cs329s.stanford.edu | Chip Huyen
