Week-4 Lecture Notes

The document provides an overview of edge AI and knowledge distillation techniques for optimizing AI models. It discusses how knowledge distillation transfers knowledge from a teacher model to a student model through minimizing the KL divergence between their predictions. Various distillation methods and knowledge types are also covered, along with federated learning which decentralizes model training.


Week 4 Lecture 1

Edge AI - Intelligence at the Edge

Dr. Rajiv Misra, Professor
Dept. of Computer Science & Engineering
Indian Institute of Technology Patna
rajivm@iitp.ac.in
Contents
• Review of Deep Learning
• Optimizing AI for the Edge
• Knowledge Distillation
• Federated Learning
Review of Deep Learning

Artificial Intelligence
  Machine Learning
    Representation Learning
      Deep Learning

Source: Deep Learning, Ian Goodfellow, Yoshua Bengio and Aaron Courville, MIT Press 2016
Optimizing Models
Model compression and acceleration techniques can be divided into the following categories:

• Parameter pruning and sharing:
   - model quantization (a small sketch follows this list)
   - model binarization
   - parameter sharing
• Low-rank factorization
• Transferred compact convolutional filters
• Knowledge distillation (KD)
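As a rough illustration of the quantization item above, here is a minimal sketch of symmetric, per-tensor int8 post-training quantization using NumPy. The weight array and scaling rule are generic assumptions for illustration, not a specific method from this lecture.

import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor quantization: map the largest magnitude to 127
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights from the int8 representation
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)   # stand-in for one layer's weights
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max abs quantization error:", np.abs(w - w_hat).max())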
Knowledge Distillation

Figure source: Knowledge Distillation: A Survey, Jianping Gou, Baosheng Yu, Stephen J. Maybank, Dacheng Tao, arXiv:2006.05525v7
Knowledge Distillation
• In vanilla knowledge distillation, knowledge is transferred from the teacher model to the student model by minimizing a loss function in which the target is the distribution of class probabilities predicted by the teacher model.
• Specifically, knowledge distillation (KD) is accomplished by minimizing the Kullback-Leibler (KL) divergence between the predictions of the teacher and the student.

[Diagram: the teacher logits and the student logits are each passed through a softmax to obtain predictions; the KD loss minimizes the KL divergence between the two predictions.]
Knowledge Distillation: KL-Divergence
● KL divergence is a measure of how much one probability distribution diverges from another:

   D_KL(P || Q) = Σ_i P(i) · log( P(i) / Q(i) )

● Non-negativity: D_KL(P || Q) ≥ 0
● Not symmetric: in general, D_KL(P || Q) ≠ D_KL(Q || P)
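A minimal numerical check of these two properties; the distributions p and q below are made up for illustration.

import numpy as np

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_i P(i) * log(P(i) / Q(i)), for discrete distributions with no zero entries
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

print(kl_divergence(p, q))   # non-negative
print(kl_divergence(q, p))   # generally different from the value above (not symmetric)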
Knowledge Distillation
Types of knowledge that can be distilled:
 response-based
 feature-based
 relation-based
Knowledge Types: Response-Based Knowledge
Why not use “hard targets”?
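Hard targets are one-hot labels, so they carry no information about how the teacher ranks the other classes; the teacher's softened ("soft target") probabilities do. Below is a small sketch in the spirit of Hinton et al.'s response-based distillation; the logits and the temperature value are illustrative assumptions, not values from the lecture.

import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; larger T gives a softer, more informative distribution
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

teacher_logits = np.array([8.0, 2.0, 1.0])   # illustrative teacher outputs for 3 classes
student_logits = np.array([5.0, 3.0, 2.0])   # illustrative student outputs
T = 4.0                                      # illustrative distillation temperature

hard_target = np.array([1.0, 0.0, 0.0])      # one-hot label: no information about the other classes
soft_target = softmax(teacher_logits, T)     # softened teacher prediction: encodes class similarities

# Response-based KD loss: KL divergence between softened teacher and student predictions,
# scaled by T^2 as in Hinton et al. (2015)
p_t = softmax(teacher_logits, T)
p_s = softmax(student_logits, T)
kd_loss = float(np.sum(p_t * (np.log(p_t) - np.log(p_s)))) * T * T

print("soft target:", soft_target)
print("KD loss:", kd_loss)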
Knowledge Modeling
Distillation Methods
Federated Learning
Federated Learning: Distributed ML
● 2016: the term FL is first coined by Google researchers
● We have already seen some real-world deployments by companies and researchers for large-scale IoT devices
● Several open-source libraries are under development: PySyft, TensorFlow Federated, FATE, Flower, Substra...
● FL is highly multidisciplinary: it involves machine learning, numerical optimization, privacy & security, networks, systems, hardware...
Federated Learning: Decentralised data
● Federated Learning (FL) aims to collaboratively train an ML model while keeping the data decentralized.
● It enables devices to learn from each other: ML training is brought close to where the data is generated.
● In a network of nodes with a central server, the nodes do not share their data with the server; instead, each node sends its locally trained model to the server.
Gradient Descent Procedure
The procedure starts off with initial values for the coefficient or coefficients of the function. These could be 0.0 or a small random value.
coefficient = 0.0

The cost of the coefficients is evaluated by plugging them into the function and calculating the cost.
cost = f(coefficient) or cost = evaluate(f(coefficient))

We need to know the slope so that we know the direction (sign) in which to move the coefficient values in order to get a lower cost on the next iteration.
delta = derivative(cost)

We can now update the coefficient values. A learning rate parameter (alpha) must be specified that controls how much the coefficients can change on each update.
coefficient = coefficient - (alpha * delta)

This process is repeated until the cost of the coefficients is 0.0 or close to it. Gradient descent does require you to know the gradient of the cost function or of the function you are optimizing. A runnable sketch of this procedure is given below.
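A minimal, runnable sketch of the procedure for a simple quadratic cost; the cost function J(θ) = (θ − 3)², the learning rate, and the stopping threshold are illustrative assumptions, not taken from the lecture.

def cost(theta):
    return (theta - 3.0) ** 2           # example cost with its minimum at theta = 3

def derivative(theta):
    return 2.0 * (theta - 3.0)          # gradient of the example cost

theta = 0.0      # initial coefficient value
alpha = 0.1      # learning rate

for step in range(200):
    delta = derivative(theta)           # slope at the current coefficient
    theta = theta - alpha * delta       # move against the slope to reduce the cost
    if cost(theta) < 1e-8:              # stop when the cost is close to zero
        break

print(step, theta, cost(theta))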
Gradient Descent Algorithm
Gradient descent update rule:

   θ = θ − α · ∇J(θ)

Advantages:
• Easy computation
• Easy to implement
• Easy to understand
Edge Computing ML: FL

• moves the processing to the edge nodes so that the clients’ data can be kept locally.
• trains an ML algorithm on the local data samples distributed over multiple edge devices or servers without any exchange of data.
• distributes deep learning by eliminating the need to pool the data in a single place.
• the model is trained at different sites over numerous iterations.
Edge Computing ML: FL
Finding the function: deep learning model training
How is this aggregation applied? The FedAvg algorithm
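As a concrete illustration of this aggregation, here is a minimal FedAvg-style sketch: each client performs a few local gradient steps on its own data, and the server averages the returned parameters, weighted by local dataset size. The linear-regression setup, client data, and hyperparameters are made up for illustration and are not the exact algorithm shown on the slide.

import numpy as np

def local_update(w, X, y, lr=0.1, epochs=5):
    # Client-side step: a few gradient-descent iterations on the local least-squares loss
    w = w.copy()
    for _ in range(epochs):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

def fedavg(client_data, rounds=20, dim=2):
    # Server-side loop: broadcast the global model, collect local models, average them
    w_global = np.zeros(dim)
    sizes = np.array([len(y) for _, y in client_data], dtype=float)
    for _ in range(rounds):
        local_models = [local_update(w_global, X, y) for X, y in client_data]
        # Weighted average: weights proportional to each client's dataset size
        w_global = np.average(local_models, axis=0, weights=sizes)
    return w_global

# Three clients, each with its own private data drawn from the same linear model
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for n in (30, 50, 20):
    X = rng.normal(size=(n, 2))
    y = X @ true_w + 0.1 * rng.normal(size=n)
    clients.append((X, y))

print(fedavg(clients))   # should be close to [2.0, -1.0]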
Example: FL with i.i.d. data
Each client trains its model locally: the model training process is carried out separately for each client.
Only the learned model parameters are sent to a trusted center, which combines them to produce the aggregated main model.
The trusted center then sends the aggregated main model back to the clients, and the process is repeated.
Apple personalizes Siri without hoovering up data
The tech giant is using privacy-preserving machine learning to improve its voice assistant while keeping your data on your phone.

It relies primarily on a technique called federated learning.

It allows Apple to train different copies of a speaker recognition model across all its users’ devices, using only the audio data available locally.

It then sends just the updated models back to a central server to be combined into a master model.

In this way, raw audio of users’ Siri requests never leaves their iPhones and iPads, but the assistant continuously gets better at identifying the right speaker. In addition to federated learning, Apple also uses something called differential privacy to add a further layer of protection. The technique injects a small amount of noise into any raw data before it is fed into a local machine-learning model. The additional step makes it exceedingly difficult for malicious actors to reverse-engineer the original audio files from the trained model.
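As a rough illustration of the differential-privacy step described above, the sketch below injects a small amount of Gaussian noise into a local feature vector before it would be used for on-device training. The feature values and noise scale are arbitrary assumptions; real deployments calibrate the noise to a formal privacy budget and this is not Apple's actual mechanism.

import numpy as np

def add_dp_noise(x, noise_std=0.1, rng=None):
    # Inject small Gaussian noise into a local feature vector before it is used for training
    rng = rng or np.random.default_rng()
    return x + rng.normal(scale=noise_std, size=np.shape(x))

raw_features = np.array([0.42, 1.7, -0.3])   # made-up locally extracted features (e.g. from audio)
print(add_dp_noise(raw_features))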
Federated Learning: Training
● There are connected devices: say we have a cluster of four devices, and there is one central server that holds an untrained model.
● We send a copy of the model to each node.
● Each node receives a copy of that model.
Federated Learning: Training
● Now every node in the network has the untrained model received from the server.
Federated Learning: Training
● In the next step, each node uses its own data; "taking data" here does not mean that the data is shared.
● Every node has its own data, on which it is going to train the model.
Federated Learning: Training
● Each node trains the model to fit the data that it has, adapting the model to its own data.
Federated Learning: Training
● The server then combines all the models received from the nodes by taking an average, i.e., it aggregates the models received from the nodes.
● The resulting central model, obtained by aggregating the models from each node, captures the patterns in the training data across all the nodes.
Federated Learning: Training
● Once the models are aggregated, the server sends a copy of the updated model back to the nodes.
● Everything is achieved at the edge, so no data sharing takes place, which means privacy is preserved and the communication overhead is much lower.
Federated Learning: Challenges
Systems heterogeneity
● Size of data
● Computational power
● Network stability
● Local learner
● Learning rate

Expensive communication
● Communication in the network can be slower than local computation by many orders of magnitude.
Federated Learning: Challenges
Dealing with non-i.i.d. data (i.i.d.: independent and identically distributed)
● Learning from non-i.i.d. data is difficult/slow because each IoT device needs the model to move in a particular direction.
● If data distributions are very different, learning a single model that performs well for all IoT devices may require a very large number of parameters.
● Another direction for dealing with non-i.i.d. data is thus to lift the requirement that the learned model be the same for all IoT devices (“one size fits all”).
● Instead, we can allow each IoT device k to learn a (potentially simpler) personalized model θ_k, but design the objective so as to enforce some kind of collaboration.
● When local datasets are non-i.i.d., FedAvg suffers from client drift.
● To avoid this drift, one must use fewer local updates and/or smaller learning rates, which hurts convergence.
Federated Learning: Challenges
Preserving privacy
● ML models are susceptible to various attacks on data privacy.
● Membership inference attacks try to infer the presence of a known individual in the training set, e.g., by exploiting the confidence of model predictions.
● Reconstruction attacks try to infer some of the points used to train the model, e.g., by differencing attacks.
● Federated learning offers an additional attack surface because the server and/or other clients observe model updates (not only the final model).
Key differences with Distributed Learning
Data distribution
● In distributed learning, data is centrally stored (e.g., in a data center)
  ○ The main goal is just to train faster
  ○ We control how data is distributed across workers: usually, it is distributed uniformly at random across workers
● In FL, data is naturally distributed and generated locally
  ○ Data is not independent and identically distributed (non-i.i.d.), and it is imbalanced

Additional challenges that arise in FL
● Enforcing privacy constraints
● Dealing with the possibly limited reliability/availability of participants
● Achieving robustness against malicious parties
Federated Learning: Concerns
When to apply federated learning
● Data privacy is needed
● Bandwidth and power consumption are concerns
● High cost of data transfer

When NOT to apply federated learning
● When more data won’t improve your model (construct a learning curve)
● When additional data is uncorrelated
● When performance is already at its ceiling
Federated Learning: Applications
● Predictive maintenance / industrial IoT
● Smartphones
● Healthcare (wearables, drug discovery, prognostics, etc.)
● Enterprise/corporate IT (chat, issue trackers, emails, etc.)
Conclusion

In this lecture we discussed:
• What knowledge distillation is
• Types of knowledge used in distillation
• Methods of knowledge distillation
• An understanding of federated learning
• Different issues with federated learning
Thank You!
References
• Deep Learning, Ian Goodfellow, Yoshua Bengio and Aaron Courville, MIT Press, http://www.deeplearningbook.org
• Knowledge Distillation: A Survey, Jianping Gou, Baosheng Yu, Stephen J. Maybank, Dacheng Tao, arXiv:2006.05525v7
