Week-4 Lecture Notes
Edge AI - Intelligence at the Edge
Dr. Rajiv Misra, Professor
Dept. of Computer Science & Engineering
Indian Institute of Technology Patna
rajivm@iitp.ac.in
Contents
• Review of Deep Learning
• Optimizing AI for the Edge
• Knowledge Distillation
• Federated Learning
Review of Deep Learning
Figure: nested scope of the fields: Artificial Intelligence ⊇ Machine Learning ⊇ Representation Learning ⊇ Deep Learning.
Source: Deep Learning, Ian Goodfellow and Yoshua Bengio and Aaron Courville, MIT Press 2016
Optimizing Models
Model compression and acceleration techniques can be divided into the following categories:
• Model binarization
• Parameter sharing
• Low-rank factorization
Figure source: Knowledge Distillation: A Survey, Jianping Gou, Baosheng Yu, Stephen J. Maybank, Dacheng Tao, arXiv:2006.05525v7
Knowledge Distillation
• In vanilla knowledge distillation, knowledge is transferred from the teacher model to the
student model by minimizing a loss function in which the target is the distribution of class
probabilities predicted by the teacher model.
• Specifically, Knowledge Distillation (KD) is accomplished by minimizing the Kullback-Leibler (KL) divergence between the predictions of the teacher and the student.
Figure: the teacher and student logits each pass through a softmax; the KD loss between the two predictions is minimized.
Knowledge Distillation: KL-Divergence
● KL divergence is a measure of how much one probability distribution diverges from another.
Properties of KL divergence:
● Non-negativity
● Not symmetric
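For reference, with teacher distribution p and student distribution q over the classes, the divergence being minimized can be written in the standard form (stated here for completeness; the exact notation on the slide is not reproduced):
D_KL(p ∥ q) = Σᵢ pᵢ log(pᵢ / qᵢ)
It satisfies D_KL(p ∥ q) ≥ 0, with equality only when p = q, and in general D_KL(p ∥ q) ≠ D_KL(q ∥ p).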
Knowledge Distillation: Types of Knowledge
● Response-based
● Feature-based
● Relation-based
Knowledge Types: Response-Based Knowledge
Why not use "hard targets"?
Hard (one-hot) labels say nothing beyond which class is correct, whereas the teacher's soft targets also encode how probable the other classes are, and it is this extra signal that the student distils.
Knowledge Modeling
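The figures for the knowledge-modeling slides are not reproduced in these notes. As a hedged illustration of the standard soft-target formulation (teacher and student logits softened with a temperature T, combined with the usual cross-entropy on hard labels), here is a minimal PyTorch-style sketch; the temperature, the weighting factor alpha, and the function name kd_loss are illustrative choices, not taken from the lecture:

import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft targets: teacher and student class probabilities at temperature T
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    log_soft_student = F.log_softmax(student_logits / T, dim=1)

    # KL divergence between the softened teacher and student predictions,
    # scaled by T^2 so its gradient magnitude matches the hard-label loss
    distill = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)

    # Ordinary cross-entropy on the ground-truth (hard) labels
    hard = F.cross_entropy(student_logits, labels)

    # Weighted combination of the distillation and hard-label terms
    return alpha * distill + (1.0 - alpha) * hard

During training, the same mini-batch is passed through the (frozen) teacher and the student, and this loss is back-propagated through the student only.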
Distillation Methods
Federated Learning
Federated Learning: Distributed ML
● 2016: the term FL is first coined by Google researchers
● We have already seen real-world deployments by companies and researchers for large-scale IoT devices
● Several open-source libraries are under development: PySyft, TensorFlow Federated, FATE, Flower, Substra...
● FL is highly multidisciplinary: it involves machine learning, numerical optimization, privacy & security, networks, systems, hardware...
Federated Learning: Decentralised data
● Federated Learning (FL) aims to collaboratively train an ML model while keeping the data decentralized
● It enables devices to learn from each other: ML training is brought close to where the data is generated
● The nodes in the network share a central server, but instead of sending data to that server, each node sends only its locally trained model
Gradient Descent Procedure
The procedure starts with initial values for the coefficient or coefficients of the function. These could be 0.0 or a small random value.
coefficient = 0.0
The cost of the coefficients is evaluated by plugging them into the function and calculating the cost.
cost = f(coefficient) or cost = evaluate(f(coefficient))
We need to know the slope so that we know the direction (sign) in which to move the coefficient values in order to get a lower cost on the next iteration.
delta = derivative(cost)
We can now update the coefficient values. A learning rate parameter (alpha) must be specified that controls how much the coefficients can change on each update.
coefficient = coefficient - (alpha * delta)
This process is repeated until the cost of the coefficients is 0.0 or close enough to 0. It does require you to know the gradient of your cost function or the function you are optimizing.
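A minimal Python sketch of this loop; the quadratic example cost f(x) = (x - 3)^2, its derivative, and the hyperparameter values are illustrative assumptions, not values from the lecture:

def gradient_descent(derivative, coefficient=0.0, alpha=0.1, n_iters=100):
    # Repeat the update rule: coefficient = coefficient - (alpha * delta)
    for _ in range(n_iters):
        delta = derivative(coefficient)            # slope of the cost at the current value
        coefficient = coefficient - alpha * delta  # move against the slope to lower the cost
    return coefficient

# Example: minimize f(x) = (x - 3)^2, whose derivative is 2 * (x - 3)
best = gradient_descent(lambda x: 2.0 * (x - 3.0))
print(best)  # approaches 3.0, where the cost is 0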
Gradient Descent Algorithm
Gradient descent update rule:
θ = θ − α ⋅ ∇J(θ)
Advantages:
• Easy computation
• Easy to implement
• Easy to understand
Edge Computing ML: FL
• FL trains an ML algorithm on local data samples distributed over multiple edge devices or servers without any exchange of data.
• Federated learning distributes deep learning by eliminating the necessity of pooling the data into a single place.
Figure: deep learning model training.
Finding the function: model training
How is this aggregation applied? The FedAvg algorithm
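The FedAvg algorithm itself is not reproduced in these notes. As a hedged sketch, its core aggregation step is a data-size-weighted average of the client model parameters; the NumPy code below illustrates that step only, and the function and variable names are illustrative:

import numpy as np

def fedavg(client_weights, client_sizes):
    # client_weights: one list of per-layer NumPy arrays per client
    # client_sizes: number of local training samples on each client
    total = float(sum(client_sizes))
    n_layers = len(client_weights[0])
    averaged = []
    for layer in range(n_layers):
        # Each client's layer is scaled by its share of the total data
        layer_avg = sum(
            (size / total) * weights[layer]
            for weights, size in zip(client_weights, client_sizes)
        )
        averaged.append(layer_avg)
    return averaged

With four clients, for example, fedavg([w1, w2, w3, w4], [1200, 800, 500, 1500]) returns the layer parameters of the aggregated main model.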
Example: FL with i.i.d. data
Each client trains its model locally; the training process is carried out separately for each client.
Only the learned model parameters are sent to a trusted center, where they are combined to form the aggregated main model.
The trusted center then sends the aggregated main model back to the clients, and this process is repeated.
Apple personalizes Siri without hoovering up data
The tech giant is using privacy-preserving machine learning to improve its voice assistant while keeping your data on your phone: training happens on the data available locally.
In this way, raw audio of users' Siri requests never leaves their iPhones and iPads, but the assistant continuously gets better at identifying the right speaker. In addition to federated learning, Apple also uses something called differential privacy to add a further layer of protection. The technique injects a small amount of noise into any raw data before it is fed into a local machine-learning model. The additional step makes it exceedingly difficult for malicious actors to reverse-engineer the original audio files from the trained model.
Federated Learning: Training
● Suppose we have a cluster of four connected devices and one central server that holds an untrained model.
● We send a copy of the model to each node.
● Each node receives a copy of that model.
Federated Learning: Training
● Now all the nodes in the network have the untrained model received from the server.
Federated Learning: Training
● In the next step, each node works with its own data; this does not mean that the data is shared.
● Every node has its own data, on which it will train the model.
Federated Learning: Training
● Each node trains the model to fit the data that it has, so the model is adapted to its local data.
Federated Learning: Training
● The server now combines the models received from the nodes by taking an average, i.e., it aggregates all the received models.
● The resulting central model, trained by aggregating the models from each node, captures the patterns in the training data of all the nodes.
Federated Learning: Training
● Once the model is aggregated, the server sends a copy of the updated model back to the nodes.
● Everything is achieved at the edge, so no data is shared, which preserves privacy and keeps the communication overhead low.
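Putting the steps above together, one communication round can be sketched as follows; train_locally stands for a client-side training routine and is a hypothetical placeholder rather than an API shown in the lecture, and the aggregation reuses the fedavg sketch given earlier:

def federated_round(server_model, clients, train_locally):
    local_models, local_sizes = [], []
    for client in clients:
        # 1. The server sends a copy of the current model to the node
        # 2. The node trains that copy on its own data; raw data never leaves the node
        updated_model, n_samples = train_locally(server_model, client)
        local_models.append(updated_model)
        local_sizes.append(n_samples)
    # 3. The server aggregates the returned models (e.g., with the fedavg sketch above)
    server_model = fedavg(local_models, local_sizes)
    # 4. The aggregated model is sent back to the nodes at the start of the next round
    return server_model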
Federated Learning: Challenges
Systems heterogeneity
● Size of data
● Computational power
● Network stability
● Local learner
● Learning rate
Expensive communication
● Communication in the network can be slower than local computation by many orders of magnitude.
Federated Learning: Challenges
Dealing with non-i.i.d. data (i.i.d.: independent and identically distributed)
● Learning from non-i.i.d. data is difficult and slow because each IoT device needs the model to go in a particular direction
● If the data distributions are very different, learning a single model that performs well for all IoT devices may require a very large number of parameters
● Another direction for dealing with non-i.i.d. data is thus to lift the requirement that the learned model be the same for all IoT devices ("one size fits all")
● Instead, we can allow each IoT device k to learn a (potentially simpler) personalized model θ_k, but design the objective so as to enforce some kind of collaboration
● When local datasets are non-i.i.d., FedAvg suffers from client drift
● To avoid this drift, one must use fewer local updates and/or smaller learning rates, which hurts convergence
Federated Learning: Challenges
Preserving privacy
● ML models are susceptible to various attacks on data privacy
● Membership inference attacks try to infer the presence of a known individual in the training set, e.g., by exploiting the confidence in model predictions
● Reconstruction attacks try to infer some of the points used to train the model, e.g., by differencing attacks
● Federated Learning offers an additional attack surface because the server and/or other clients observe model updates (not only the final model)
Key differences with Distributed Learning
Data distribution
● In distributed learning, data is centrally stored (e.g., in a data center)
○ The main goal is just to train faster
○ We control how data is distributed across workers: usually, it is distributed uniformly at random across workers
● In FL, data is naturally distributed and generated locally
○ Data is not independent and identically distributed (non-i.i.d.), and it is imbalanced
● Bandwidth and power consumption are concerns
● High cost of data transfer
When NOT to apply Federated Learning
● Healthcare (wearables, drug discovery, prognostics, etc.)
● Enterprise/corporate IT (chat, issue trackers, emails, etc.)
Conclusion
• What knowledge distillation is
• Types of knowledge distillation
• Methods of knowledge distillation
• Understanding of Federated Learning
• Different issues with federated learning
Thank You!
References
• Deep Learning, Ian Goodfellow, Yoshua Bengio and Aaron Courville, MIT Press, http://www.deeplearningbook.org
• Knowledge Distillation: A Survey, Jianping Gou, Baosheng Yu, Stephen J. Maybank, Dacheng Tao, arXiv:2006.05525v7