15: ML
Operating Systems
Brad Campbell – bradjc@virginia.edu
https://www.cs.virginia.edu/~bjc8c/class/cs6456-f19/
Background: Machine Learning
[Figure: a labeled training set¹ (e.g., images labeled "baseball-bat") is used to train a model. ¹Caltech-256 dataset]
Background: Machine Learning
• Most widely used training algorithm: Gradient Descent.
• Given a model $f_w$ and a data set $\{(x_i, y_i)\}_{i=1}^n$
• Minimize the distance between predicted labels and true labels: $\min_w \sum_{i=1}^n \ell(f_w(x_i), y_i)$
• Gradient of the loss function: $\nabla_w \ell$
• The most common way to solve the minimization is gradient descent: update $w$ towards the negative gradient direction, $w \leftarrow w - \gamma \nabla_w \ell$
• E.g., linear function $f_w(x) = w^\top x$, least-squares distance $\ell_i = (w^\top x_i - y_i)^2$:
• The gradient $\nabla_w \ell_i = 2(w^\top x_i - y_i)\,x_i$ has the same dimensionality as $w$
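A minimal sketch (not from the slides) of this update rule for the linear least-squares example above, in NumPy; the function name, learning rate, and synthetic data are illustrative.

```python
import numpy as np

def train_linear_least_squares(X, y, lr=0.1, steps=500):
    """Batch gradient descent for f_w(x) = w^T x with squared-error loss."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        residual = X @ w - y             # predicted labels minus true labels
        grad = 2.0 * X.T @ residual / n  # gradient has the same shape as w
        w -= lr * grad                   # step in the negative gradient direction
    return w

# Usage: recover a known weight vector from noisy synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=200)
print(train_linear_least_squares(X, y))  # approximately [1.0, -2.0, 0.5]
```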
Background: Machine Learning
• Gradient Descent:
• Model is parameterized by $w$;
• The gradient on $w$, $\nabla_w \ell$, has the same dimensionality as $w$; computing it dominates the cost of training.

FOR t = 1:T
    $\Delta w_t$ = ...                     // compute update (can be parallelized)
    $w \leftarrow w - \gamma \Delta w_t$   // apply update ($\gamma$ is the update rate)
END

Synchronous method: updates to $w$ do not break the dependency of the FOR loop.
Asynchronous method: updates to $w$ ignore the dependency of the FOR loop.
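A minimal sketch (not from the slides) contrasting the two update styles with thread-pool "workers"; the shard layout and function names are illustrative, and the asynchronous variant is Hogwild-style (no locking).

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def shard_gradient(w, X, y):
    """Least-squares gradient computed on one worker's data shard."""
    return 2 * X.T @ (X @ w - y) / len(y)

def sync_step(w, shards, lr, pool):
    """Synchronous: barrier after each step; all gradients see the same w."""
    grads = list(pool.map(lambda s: shard_gradient(w, *s), shards))
    return w - lr * np.mean(grads, axis=0)

def async_steps(w, shards, lr, pool):
    """Asynchronous: each worker applies its update as soon as it is ready,
    so gradients may be computed against stale parameters."""
    def work(shard):
        g = shard_gradient(w, *shard)   # may read a stale snapshot of w
        np.subtract(w, lr * g, out=w)   # apply in place, no barrier
    list(pool.map(work, shards))
    return w
```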
Background: NN Training
Process:
• Take an input image
• Compute the loss function (forward pass)
• Compute error gradients (backward pass)
• Update weights
• Repeat

Problem:
• The training process is usually time consuming, especially when the training set / model size is large
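A minimal sketch (not from the slides) of this loop for a tiny one-hidden-layer network in NumPy: forward pass, backward pass, weight update, repeat. The shapes, ReLU/MSE choices, and random data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 8))          # batch of 32 "images", 8 features each
y = rng.normal(size=(32, 1))          # targets
W1 = rng.normal(size=(8, 16)) * 0.1   # hidden-layer weights
W2 = rng.normal(size=(16, 1)) * 0.1   # output-layer weights
lr = 0.01

for step in range(100):
    # Forward pass: compute predictions and the loss.
    h = np.maximum(x @ W1, 0.0)       # hidden layer with ReLU
    pred = h @ W2
    loss = np.mean((pred - y) ** 2)

    # Backward pass: propagate error gradients layer by layer.
    d_pred = 2 * (pred - y) / len(y)
    d_W2 = h.T @ d_pred
    d_h = d_pred @ W2.T
    d_h[h <= 0] = 0.0                 # ReLU gradient
    d_W1 = x.T @ d_h

    # Update weights, then repeat.
    W1 -= lr * d_W1
    W2 -= lr * d_W2
```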
Today: Systems to answer…
1. How to create ML models
2. How to create ML models even faster
3. How to check if ML models are any good
Is running or training ML an OS problem?
SOSP = ACM Symposium on Operating Systems Principles
OSDI = USENIX Symposium on Operating Systems Design and Implementation
Background: DistBelief, parameter-server architecture
Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato,
Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. 2012. Large scale distributed deep networks. NIPS'12.
Background: Shortcomings of DistBelief
1. Difficulty of implementing new layers
   a. C++ classes implement layers
   b. A configuration file defines the DNN architecture
   c. Not flexible enough for researchers
2. Refining algorithms
   a. SGD is the heart of training, but the update rule is fixed inside the parameter server
   b. Some techniques need atomicity, which the get/put interface cannot accommodate
3. Supporting new algorithms
   a. If it doesn't conform to feed-forward, it doesn't map well to DistBelief (EM, random forests, RL, adversarial ML)
4. Scaling down to other environments
   a. Designed to run on a distributed cluster of multi-core machines
   b. Augmented with GPGPU support for convolutional NNs
TensorFlow: Solution Strategy
1. Execution flexibility via the dataflow abstraction
   a. Makes it easy to extract parallelism
2. Provides DFGs of primitive operators
   a. Softmax, convolution, matrix multiplication, ...
   b. Makes it easy to experiment with novel layers
   c. Automatic gradient calculation
3. Deferred execution
   a. Offload the larger chunks of work where possible...
4. Common abstraction for accelerators
   a. Easy to integrate new accelerators into the fold
   b. Operators are specialized for different devices
5. Common data primitive: Tensor
TensorFlow API Example
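The original slide's code listing is not reproduced in this text; below is a minimal sketch of the graph-construction style the paper describes, written against the TF 1.x API (tf.compat.v1 in current releases). The model (softmax regression on 784-dimensional inputs) and all names are illustrative.

```python
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

x = tf.placeholder(tf.float32, [None, 784])   # input images
y = tf.placeholder(tf.float32, [None, 10])    # one-hot labels
W = tf.Variable(tf.zeros([784, 10]))          # mutable model state lives in the graph
b = tf.Variable(tf.zeros([10]))

logits = tf.matmul(x, W) + b                  # operators become dataflow-graph nodes
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=y, logits=logits))
train_op = tf.train.GradientDescentOptimizer(0.5).minimize(loss)  # gradients added automatically

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Each sess.run deferred-executes only the subgraph needed for train_op:
    # sess.run(train_op, feed_dict={x: batch_xs, y: batch_ys})
```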
TensorFlow Implementation
Execution Model
• Single DFG represents all computation and state
for ML algorithm
• Input preprocessing, mathematical operators,
parameters, parameter update rules
• Communication explicit, simplifying scheduling and
partitioning
Computation is a DFG
Execution Model
• Single DFG represents all computation and state for ML
algorithm
• Input preprocessing, mathematical operators, parameters,
parameter update rules
• Communication explicit, simplifying scheduling and
partitioning
• Differences with existing DF systems:
• Concurrent execution on overlapping subgraphs supported
• Individual vertices contain sharable, mutable state
Communication is explicit...
• TensorFlow handles the glue
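A minimal sketch (not from the slides) of explicit placement with tf.device: TensorFlow partitions the graph per device and inserts the Send/Recv "glue" on every cross-device edge. The device strings and soft-placement fallback are illustrative.

```python
import tensorflow as tf

tf.config.set_soft_device_placement(True)   # fall back if a device is absent

@tf.function
def two_device_step(a, b):
    with tf.device("/CPU:0"):
        h = tf.matmul(a, a)                 # placed in the CPU partition
    with tf.device("/GPU:0"):
        return tf.matmul(h, b)              # cross-device edge => implicit data transfer

a = tf.random.normal([64, 64])
b = tf.random.normal([64, 16])
print(two_device_step(a, b).shape)          # (64, 16)
```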
Fault Tolerance
• Days or many hours to train models -- fault tolerance is key
• Don’t need perfect recovery mechanisms
• Strong consistency is not needed for ML
• User-level checkpointing operations and client library for
configuration
• SAVE/RESTORE
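A minimal sketch (not from the slides) of user-level checkpointing with the modern tf.train.Checkpoint API; the paper describes lower-level Save/Restore operations, and the variables and path here are illustrative.

```python
import tensorflow as tf

step = tf.Variable(0, dtype=tf.int64)
weights = tf.Variable(tf.zeros([784, 10]))
ckpt = tf.train.Checkpoint(step=step, weights=weights)
manager = tf.train.CheckpointManager(ckpt, "/tmp/tf_ckpts", max_to_keep=3)

# Periodically during training: write a snapshot (no strong consistency needed).
manager.save()

# On restart: restore the most recent snapshot (a no-op if none exists yet).
ckpt.restore(manager.latest_checkpoint)
```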
Time-to-Quality vs Time-per-Update
• Synchronous updates can be delayed by stragglers
• SGD found to be robust to asynchrony
• Asynchronous = better utilization...
• But with GPGPUs fewer machines needed, less coordination
overhead
• Asynchronous operation can suffer from stale updates
• TensorFlow can handle asynchronous or synchronous updates...
• Also, synchronous + backup workers
• Idea: with N workers and K backup workers, simply take the updates from the first N workers that complete (see the sketch below)
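A minimal sketch (not from the slides) of the backup-worker idea: launch N + K gradient tasks and aggregate the first N that finish, dropping the stragglers. The worker interface and thread-pool "cluster" are illustrative.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor, as_completed

def sync_step_with_backups(w, compute_gradient, shards, lr, n_needed, pool):
    """shards has N + K entries; only the first n_needed gradients are used."""
    futures = [pool.submit(compute_gradient, w, shard) for shard in shards]
    grads = []
    for fut in as_completed(futures):
        grads.append(fut.result())
        if len(grads) == n_needed:          # drop updates from the K stragglers
            break
    return w - lr * np.mean(grads, axis=0)
```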
How to evaluate this?
TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
Tianqi Chen and Thierry Moreau, University of Washington; Ziheng Jiang, University of
Washington, AWS; Lianmin Zheng, Shanghai Jiao Tong University; Eddie Yan, Haichen
Shen, and Meghan Cowan, University of Washington; Leyuan Wang, UC Davis,
AWS; Yuwei Hu, Cornell; Luis Ceze, Carlos Guestrin, and Arvind
Krishnamurthy, University of Washington
OSDI’18
DeepXplore: Automated Whitebox Testing of Deep Learning Systems
Kexin Pei¹, Yinzhi Cao², Junfeng Yang¹, Suman Jana¹
¹Columbia University, ²Lehigh University
SOSP’17
Existing DL testing methods are seriously limited
[Figure: a traditional program's control flow graph (x = 0; if (x == 8) x += 1; else x += 2) shown side by side with a neural network]
Quick Summary of DeepXplore
[Figure: two DNNs under test output different decisions ("Left" vs. "Right") on the same mutated input]
• Mutate inputs using gradient descent (DNNs are differentiable)
• On the new input, activate different neurons
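A minimal sketch (not DeepXplore's actual implementation) of gradient-based input mutation: because DNNs are differentiable, we can take the gradient of an objective with respect to the input itself and step the input, here to push two models' predictions apart. The models, objective, and step sizes are illustrative stand-ins.

```python
import tensorflow as tf

def mutate_input(x, model_a, model_b, steps=10, step_size=0.01):
    """Nudge input x so that model_a and model_b disagree more strongly."""
    x = tf.Variable(x)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            # Objective: total difference between the two models' outputs.
            objective = tf.reduce_sum(tf.abs(model_a(x) - model_b(x)))
        grad = tape.gradient(objective, x)
        x.assign_add(step_size * tf.sign(grad))  # gradient-ascent step on the input
    return x.numpy()
```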
How to achieve multiple goals simultaneously
[Figure: toy DNN with inputs such as "eyes", "vedge", and "wheel" feeding a "Face" output; each neuron's activation is a weighted sum of its inputs, and a neuron counts as covered when its activation exceeds the threshold of 0.75]
• Neuron coverage in the example: 4/7 ≈ 57%
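A minimal sketch (not DeepXplore's code) of computing neuron coverage as the fraction of neurons whose activation exceeds a threshold on one input. It assumes the per-layer activation arrays have already been extracted from the model under test, and it thresholds raw activations for simplicity.

```python
import numpy as np

def neuron_coverage(layer_activations, threshold=0.75):
    """Fraction of neurons with activation > threshold on a single input."""
    activated, total = 0, 0
    for act in layer_activations:
        act = np.asarray(act, dtype=float).ravel()
        activated += int(np.sum(act > threshold))   # neurons that "fire"
        total += act.size
    return activated / total

# Toy usage mirroring the slide's numbers: 4 of 7 neurons fire => ~57%.
print(neuron_coverage([np.array([3.0, 0.2, 2.0, 0.1]),
                       np.array([1.0, 0.9, 0.0])]))
```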
Implementation
[Figure: DeepXplore implementation overview]
Evaluation setup and results summary
Sample corner-case errors for images
[Figure: sample corner-case error images from the Driving, ImageNet, and MNIST datasets]
Limitations of Approach
• Common to all differential testing solutions