
On-Device Machine Learning: An Algorithms and Learning

Theory Perspective

SAUPTIK DHAR, America Research Center, LG Electronics


JUNYAO GUO, America Research Center, LG Electronics

JIAYI (JASON) LIU, America Research Center, LG Electronics


SAMARTH TRIPATHI, America Research Center, LG Electronics
UNMESH KURUP, America Research Center, LG Electronics
MOHAK SHAH, America Research Center, LG Electronics
The predominant paradigm for using machine learning models on a device is to train a model in the cloud
and perform inference using the trained model on the device. However, with the increasing number of smart
devices and improved hardware, there is growing interest in performing model training on the device. Given this
surge in interest, a comprehensive survey of the field from a device-agnostic perspective sets the stage for
both understanding the state-of-the-art and for identifying open challenges and future avenues of research.
However, on-device learning is an expansive field with connections to a large number of related topics in
AI and machine learning (including online learning, model adaptation, one/few-shot learning, etc.). Hence,
covering such a large number of topics in a single survey is impractical. This survey finds a middle ground
by reformulating the problem of on-device learning as resource constrained learning where the resources
are compute and memory. This reformulation allows tools, techniques, and algorithms from a wide variety
of research areas to be compared equitably. In addition to summarizing the state-of-the-art, the survey also
identifies a number of challenges and next steps for both the algorithmic and theoretical aspects of on-device
learning.
ACM Reference Format:
Sauptik Dhar, Junyao Guo, Jiayi (Jason) Liu, Samarth Tripathi, Unmesh Kurup, and Mohak Shah. 2020. On-
Device Machine Learning: An Algorithms and Learning Theory Perspective. 1, 1 (July 2020), 45 pages.
https://doi.org/10.1145/nnnnnnn.nnnnnnn

Contents
Abstract 1
Contents 1
1 Introduction 2
1.1 On-device Learning 3
1.1.1 Definition of an Edge Device 3
1.1.2 Training Models on an Edge Device 4
1.2 Scope of this survey 6
1.3 How to read this survey 6
Authors’ addresses: Sauptik Dhar, America Research Center, LG Electronics, sauptik.dhar@lge.com; Junyao Guo, America
Research Center, LG Electronics, junyao.guo@lge.com; Jiayi (Jason) Liu, America Research Center, LG Electronics,
jason.liu@lge.com; Samarth Tripathi, America Research Center, LG Electronics, samarth.tripathi@lge.com; Unmesh Kurup,
America Research Center, LG Electronics, unmesh@ukurup.com; Mohak Shah, America Research Center, LG Electronics,
mohak@mohakshah.com.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2020 Association for Computing Machinery.
XXXX-XXXX/2020/7-ART $15.00
https://doi.org/10.1145/nnnnnnn.nnnnnnn






2 Resource Constraints in On-Device Learning 7
2.1 Processing Speed 7
2.2 Memory 7
2.3 Power Consumption and Energy Efficiency 8
2.4 Typical Use Case and Hardware 8
3 Algorithms for On-Device Learning 8
3.1 Resource Footprint Characterization 9
3.1.1 Asymptotic Analysis 9
3.1.2 Resource Profiling 11
3.1.3 Resource Modeling and Estimation 13
3.1.4 New Metrics 14
3.2 Resource Efficient Training 14
3.2.1 Lightweight ML Algorithms 15
3.2.2 Reducing Model Complexity 15
3.2.3 Modifying Optimization Routines 17
3.2.4 Data Compression 21
3.2.5 New Protocols for Data Observation 21
3.3 Resource Efficient Inference 22
3.4 Challenges in Resource-Efficient Algorithm Development 23
4 Theoretical Considerations for On-Device Learning 24
4.1 Formalization of the Learning problem 25
4.1.1 Traditional Machine Learning Problem 25
4.1.2 Resource Constrained Machine Learning Problem 25
4.2 Learning theories 26
4.2.1 Traditional Learning Theories 26
4.2.2 Resource-Constrained Learning Theories 29
4.3 Challenges in Resource-Efficient Theoretical Research 31
5 Discussion 31
5.1 Summary of the Current State-of-the-art in On-device Learning 31
5.2 Research & Development Challenges 34
6 Conclusion 36
References 37

1 INTRODUCTION
The addition of intelligence to a device carries the promise of a seamless experience that is tailored
to each user’s specific needs while maintaining the integrity of their personal data. The current
approach to making such intelligent devices is based on a cloud paradigm where data is collected
at the device level and transferred to the cloud. Once transferred, this data is then aggregated with
data collected from other devices, processed, and used to train a machine learning model. When the
training is done, the resulting model is pushed from the cloud back to the device where it is used to
improve the device’s intelligent behavior. In the cloud paradigm, all machine learning that happens
on the device is inference, that is, the execution of a model that was trained in the cloud. This
separation of roles – data collection and inference on the edge, data processing and model training
in the cloud – is natural given that end-user devices have form-factor and cost considerations that impose limits on the amount of computing power and memory they support, as well as the energy
that they consume.
Cloud-based systems have access to nearly limitless resources and are constrained only by cost
considerations, making them ideal for resource-intensive tasks like data storage, data processing,
and model building. However, the cloud-based paradigm also has drawbacks that will become more
pronounced as AI becomes a ubiquitous aspect of consumer life. The primary considerations are
in the privacy and security of user data as this data needs to be transmitted to the cloud and stored
there, most often, indefinitely. Transmission of user data is open to interference and capture, and
stored data leaves open the possibility of unauthorized access.
In addition to privacy and security concerns, the expectation for intelligent devices will be that
their behavior is tailored specifically to each consumer. However, cloud-trained models are typically
less personalized as they are built from data aggregated from many consumers, and each model is
built to target broad user segments because building individual models for every consumer and
every device is cost prohibitive in most cases. This de-personalization also applies to distributed
paradigms like federated learning that typically tend to improve a global model based on averaging
the individual models [127].
Finally, AI-enabled devices will also be expected to learn and respond instantly to new scenarios
but cloud-based training is slow because added time is needed to transmit data and models back
and forth from the device. Currently, most use cases do not require real-time model updates, and
long delays between data collection and model updates are not a serious drawback. But, as intelligent
behavior becomes commonplace and expected, there will be a need for real-time updates, like in
the case of connected vehicles and autonomous driving. In such situations, long latency becomes
untenable and there is a need for solutions where model updates happen locally and not in the
cloud.
As devices become more powerful, it becomes possible to address the drawbacks of the cloud
model by moving some or all of the model development onto the device itself. Model training, espe-
cially in the age of deep learning, is often the most time-consuming part of the model development
process, making it the obvious area of focus to speed up model development on the device. Doing
model training on the device is often referred to variously as Learning on the Edge and On-device
Learning. However, we distinguish between these terms, with learning on the edge used as a broad
concept to signify the capability of real or quasi-real time learning without uploading data to the
cloud while on-device learning refers specifically to the concept of doing model training on the
resource-constrained device itself.

1.1 On-device Learning


1.1.1 Definition of an Edge Device. Before we elaborate on on-device learning, it is helpful to
define what we mean by a device, or specifically an edge device, in the context of on-device
learning. We define an edge device to be a device whose compute, memory, and energy resources
are constrained and cannot be easily increased or decreased. These constraints may be due to
form-factor considerations (it is not possible to add more compute or memory or battery without
increasing the size of the device) or due to cost considerations (there is enough space to add a
GPU to a washing machine but this would increase its cost prohibitively). This definition of an
edge device applies to all such consumer and industrial devices where resource constraints place
limitations on what is available for building and training AI models. Cloud solutions such as
Amazon AWS, Google Cloud Platform, and Microsoft Azure, or even on-premise computing clusters, do
not fit the edge definition because it is easy to provision additional resources as needed. Likewise,
a workstation would not be considered an edge device because it is straightforward to replace its
CPU, add more memory, and even add an additional GPU card. A standard laptop, on the other
hand, would be considered an edge device as it is not easy to add additional resources as needed,
even though its resources generally far exceed what is normally available in a consumer edge device.

Fig. 1. Levels of constraints on edge devices. Only the topics in gray (Learning Theory and Algorithms) fall
under the scope of this survey.

1.1.2 Training Models on an Edge Device. The primary constraints on training models on-device in
a reasonable time-frame are the lack of compute and memory on the device. Speeding up training is
possible either by adding more resources to the device or using these resources more effectively or
some combination of the two. Fig 1 shows a high-level breakdown of the different levels at which
these approaches can be applied. Each level in this hierarchy abstracts implementation details of
the level below it and presents an independent interface to the level above it.
(1) Hardware: At the bottom of the hierarchy are the actual chipsets that execute all learning
algorithms. Fundamental research in this area aims at improving existing chip design (by
developing chips with more compute and memory, and lower power consumption and
footprint) or developing new designs with novel architectures that speed up model training.
While hardware research is a fruitful avenue for improving on-device learning, it is an
expensive process that requires large capital expenditure to build laboratories and fabrication
facilities, and usually involves long timescales for development.
(2) Libraries: Every machine learning algorithm depends on a few key operations (such as
Multiply-Add in the case of neural networks). The libraries that support these operations
are the interface that separates the hardware from the learning algorithms. This separation
allows for algorithm development that is not based on any specific hardware architecture.
Improved libraries can support faster execution of algorithms and speed up on-device training.
However, these libraries are heavily tuned to the unique aspects of the hardware on which
the operations are executed. This dependency limits the amount of improvement that can be
gained by new libraries.
(3) Algorithms: Since on-device learning techniques are grounded in their algorithmic im-
plementations, research in novel algorithm development is an important part of making
model training more efficient. Such algorithm development can take into account resource
constraints as part of the model training process. Algorithm development leads to hardware-
independent techniques but the actual performance of each algorithm is specific to the exact domain, environment, and hardware, and needs to be verified empirically for each configuration. Depending on the number of choices available in each of these dimensions, the
verification space could become very large.
(4) Theory: Every learning algorithm is based on an underlying theory that guarantees certain
aspects of its performance. Developing novel theories targeted at on-device learning helps us
understand how algorithms will perform under resource-constrained settings. However, while
theoretical research is flexible enough to apply across classes of algorithms and hardware
systems, it is limited due to the inherent difficulty of such research and the need to implement
a theory in the form of an algorithm before its utility can be realized.

Fig. 2. Different approaches to improving on-device learning. Topics in gray are covered in this survey.

Fig 2 shows an expanded hierarchical view of the different levels of the edge learning stack and
highlights different ways to improve the performance of model training on the device at each level.
The hardware approaches involve either adding additional resources to the restricted form-factor of
the device or developing novel architectures that are more resource efficient. Software approaches
to improve model training involve either improving the performance of computing libraries such
as OpenBLAS, CUDA, and cuDNN, or improving the performance of the machine learning algorithms
themselves. Finally, theoretical approaches help direct new research on ML algorithms and improve
our understanding of existing techniques and their generalizability to new problems, environments,
and hardware.


1.2 Scope of this survey


There is a large ongoing research effort, mainly in academia, that looks at on-device learning from
multiple points of view including single vs. multiple edge devices, hardware vs. software vs. theory,
and domain of application such as healthcare vs. consumer devices vs. autonomous cars. Given
the significant amount of research in each of these areas it is important to restrict this survey to a
manageable subset that targets the most important aspects of on-device learning.
We first limit this survey to the Algorithms and Learning Theory levels in Fig 1. This allows
us to focus on the machine learning aspects of on-device learning and develop new techniques
that are independent of specific hardware. We also limit the scope of this survey to learning on a
single device. This restriction makes the scope of the survey manageable while also providing a
foundation on which to expand to distributed settings. In addition, at the right level of abstraction,
a distributed edge system can be considered as a single device with an additional resource focused
on communication latency. This view allows us to extend single-device algorithms and theories to
the distributed framework at a later stage.
The goal of this survey then is to provide a large-scale view of the current state-of-the-art in
algorithmic and theoretical advances for on-device learning on single devices. To accomplish this
goal, the survey reformulates the problem of on-device learning as one of resource constrained
learning. This reformulation describes the efficiency of on-device learning using two resources
– compute and memory – and provides a foundation for a fair comparison of different machine
learning and AI techniques and their suitability for on-device learning. Finally, this survey identifies
challenges in algorithms and theoretical considerations for on-device learning and provides the
background needed to develop a road map for future research and development in this field.

1.3 How to read this survey


This survey is a comprehensive look at the current state-of-the-art in training models on resource-
constrained devices. It is divided into 4 main sections excluding the introduction and the conclusion.
Section 2 briefly introduces the resources and their relevance to on-device learning. Sections 3 and
4 respectively focus on the algorithmic and theoretical levels of the edge platform hierarchy. Finally
section 5 provides a brief summary and identifies various challenges in making progress towards
a robust framework for on-device learning. For those interested in specific aspects of on-device
learning, sections 3 and 4 are mostly self-contained and can be read separately.
Resource Constraints in On-Device Learning (in Section 2) : briefly discusses the relevant
resources that differentiate on-device learning from a cloud-based system. Most existing research at
the various levels in Fig 1 is targeted towards addressing on-device learning when there is limited
availability of these resources.
Algorithm Research (in Section 3): addresses recent algorithmic developments towards accu-
rately capturing the hardware constraints in a software framework and then surveys the state-of-
the-art in machine learning algorithms that take into account resource constraints. This section
categorizes the algorithms from a computational perspective (i.e. the underlying computational
model used).
Theory Research (in Section 4) : addresses on-device learning from a statistical perspective
and surveys traditional learning theories forming the basis of most of the algorithm designs
addressed in section 3. It later addresses the ‘un-learnability’ problem in a resource constrained
setting and surveys newer resource constrained learning theories. Such newer theories abstract the
resource constraints (i.e. memory, processing speed, etc.) as an information bottleneck and provide
performance guarantees for learning under these settings.


Finally, section 5 summarizes the previous sections and addresses some of the open challenges
in on-device learning research.

2 RESOURCE CONSTRAINTS IN ON-DEVICE LEARNING


The main difference between traditional machine learning and learning/inference on the edge is the
additional constraints imposed by device resources. Designing AI capabilities that run on a device
necessitates building machine learning algorithms that achieve high model accuracy while
satisfying the resource constraints enforced by the device. This section discusses these critical
resource constraints that pose major challenges while designing learning algorithms for edge
devices.

2.1 Processing Speed


The response time is often among the most critical factors for the usability of any on-device
application [140]. The two commonly used measurements are throughput and latency. Throughput
is measured as the rate at which the input data is processed. To maximize throughput, it is common
to group inputs into batches, resulting in higher hardware utilization. However, batching incurs additional
wait time for aggregating data into batches. Hence, for time-critical use cases, latency is the more
frequently used measure. Latency characterizes the time interval between a single input and its
response. Although throughput is the inverse of latency (when batch size is fixed to 1), the runtime
of an application may vary dramatically depending on whether computations are optimized for
throughput vs. latency [183]. To simplify our discussion, in this survey we use an abstract notion
of runtime as a proxy for both throughput and latency.
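To make the throughput/latency distinction concrete, the following minimal sketch times a stand-in workload (a single dense layer in NumPy, chosen arbitrarily) at different batch sizes; batch size 1 gives the latency view, larger batches the throughput view.

```python
import time
import numpy as np

IN_DIM, OUT_DIM = 1024, 256
WEIGHTS = np.random.rand(IN_DIM, OUT_DIM).astype(np.float32)

def run_model(batch):
    # Stand-in for a real inference step: a single dense layer.
    return batch @ WEIGHTS

def measure(batch_size, n_trials=50):
    batch = np.random.rand(batch_size, IN_DIM).astype(np.float32)
    start = time.perf_counter()
    for _ in range(n_trials):
        run_model(batch)
    elapsed = time.perf_counter() - start
    latency = elapsed / n_trials                  # seconds per processed batch
    throughput = batch_size * n_trials / elapsed  # samples per second
    return latency, throughput

for bs in (1, 8, 64):
    lat, thr = measure(bs)
    print(f"batch={bs:3d}  latency={lat * 1e3:.2f} ms  throughput={thr:,.0f} samples/s")
```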
For the physical system, the processing speed dictates the runtime of an application. This speed
is typically measured in clock frequency (i.e. the number of cycles per second) of a processor.
Within each cycle, a processor carries out a limited number of operations based on the hardware
architecture and the types of operations. For scientific computations, FLOPS is frequently used
to measure the number of floating point operations per second. Another frequently used measure,
specifically for matrix-intensive computations such as those in machine learning algorithms, is the
multiply–accumulate (MAC) operation count. Thus, besides increasing the clock frequency, efficiently combining
multiple operations into a single cycle is also an important topic for improving the processing
speed.
On a separate note, the processing speed of a system is also sensitive to the communication
latency among the components inside a system. As discussed before, the communication latency
aspect is better aligned to the distributed/decentralized computing paradigm for edge learning and
is not addressed in this survey.

2.2 Memory
At the heart of any machine learning algorithm is the availability of data for building the model. The
second important resource for building AI driven on-device applications is memory. The memory
of a computing device provides immediate data access as compared to other storage components.
However, this speed of data access comes at a higher cost. For reasons of cost, most edge devices are
designed with limited memory. As such, the memory footprint of edge applications is typically
tailored for a specific target device.
Advanced machine learning algorithms often take a significant amount of memory during model
building through storage of the model parameters, auxiliary variables, etc. For instance, even
relatively simple image classification models like ResNet-50 can take megabytes of memory space
(see Table 3). Therefore, designing a lightweight model is a key aspect of accomplishing machine
learning on the device. A detailed survey of such techniques is covered in sections 3 and 4.
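As a back-of-the-envelope illustration of these memory costs, the sketch below estimates parameter and activation memory from layer shapes alone; the three-layer network is a made-up toy example, and the training estimate only counts gradients on top of parameters and activations.

```python
# Rough memory estimate from layer shapes alone (a sketch, not a profiler).
# Each entry: (number of parameters, number of output activations).
BYTES_PER_FLOAT32 = 4

toy_layers = [
    (3 * 3 * 3 * 32,  32 * 112 * 112),   # conv: 3x3, 3->32 channels
    (3 * 3 * 32 * 64, 64 * 56 * 56),     # conv: 3x3, 32->64 channels
    (64 * 7 * 7 * 10, 10),               # fully connected classifier
]

params      = sum(p for p, _ in toy_layers)
activations = sum(a for _, a in toy_layers)

# Training roughly also needs gradients (same size as the parameters) and
# optimizer state; inference needs only parameters plus live activations.
inference_mb = (params + activations) * BYTES_PER_FLOAT32 / 2**20
training_mb  = (2 * params + activations) * BYTES_PER_FLOAT32 / 2**20

print(f"~{params / 1e6:.2f}M parameters")
print(f"inference footprint ~{inference_mb:.1f} MB, training footprint ~{training_mb:.1f} MB")
```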


Besides the model size, querying the model parameters for processing is both time-consuming
and energy-intensive [166]. For example, a single MAC operation requires three memory reads and
one memory write. In the worst case, these reads and writes may be on the off-chip memory rather
than the on-chip buffer. This would result in a significant throughput bottleneck and cause orders
of magnitude higher energy consumption [22].

2.3 Power Consumption and Energy Efficiency


Power consumption is another crucial factor for on-device learning. An energy efficient solution
can prolong the battery lifetime and cut maintenance costs. The system power, commonly studied
in hardware development, is the ratio of energy consumption to the time span of a given task.
However, it is not a suitable measure for machine learning applications on the edge. First, the power
consumption depends on the volume of computation required, e.g., the data throughput. Second, the
application is often capped at the maximum power of the device when the learning task is intensive.
Therefore, to better quantify power consumption, the total energy consumption along with the
throughput is recommended for comparing energy efficiency.
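A minimal sketch of the recommended comparison, assuming power samples from an external power monitor and using invented numbers: it combines total energy with throughput into a samples-per-joule figure.

```python
# Energy efficiency from sampled power and measured runtime (illustrative values).
power_samples_watts = [3.1, 3.4, 3.3, 3.6, 3.2]   # e.g. read from a power monitor
runtime_seconds = 12.0                             # wall-clock time of the task
samples_processed = 4800                           # work completed in that time

avg_power = sum(power_samples_watts) / len(power_samples_watts)
energy_joules = avg_power * runtime_seconds        # total energy for the task
throughput = samples_processed / runtime_seconds   # samples per second
efficiency = samples_processed / energy_joules     # samples per joule

print(f"energy = {energy_joules:.1f} J, throughput = {throughput:.0f} samples/s, "
      f"efficiency = {efficiency:.1f} samples/J")
```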
The energy consumption of a particular AI-driven application on a specific device jointly
depends on a number of factors such as runtime and memory. Capturing these dependencies is almost
never deterministic. Hence, most existing research estimates the power/energy usage through a
surrogate function which typically depends on the memory and runtime of an application. A more
detailed survey of such advanced approaches is covered in section 3.

Table 1. Comparison of hardware requirements.

Use | Device | Hardware Chip | Computing | Memory | Power
Workstation | NVIDIA DGX-2 [1] | 16 NVIDIA Tesla V100 GPUs | 2 PFLOPS | 512 GB | 10 kW
Mobile Phone | Pixel 3 [2] | Qualcomm Snapdragon™ 845 [3] | 727 GFLOPS | 4 GB | 34 mW (Snapdragon)
Autonomous Driving | NVIDIA DRIVE AGX Xavier [4] | NVIDIA Xavier processor | 30 TOPS | 16 GB | 30 W
Smart home | Amazon Echo [5] | TI DM3725 ARM Cortex-A8 | up to 1 GHz | 256 MB | 4 W (peak)
General IoT | Qualcomm AI Engine [6] | Hexagon 685 and Adreno 615 | 2.1 TOPS | 2-4 GB | 1 W
General IoT | Raspberry Pi 3 [7] | Broadcom ARM Cortex A53 | 1.2 GHz | 1 GB | 0.58 W
General IoT | Arduino Uno Rev3 [8] | ATmega328P | 16 MHz | 2 KB | 0.3 W

[1] http://images.nvidia.com/content/pdf/dgx-2-print-datasheet-738070-nvidia-a4-web.pdf
[2] https://store.google.com/product/pixel_3_specs
[3] https://www.qualcomm.com/products/snapdragon-845-mobile-platform
[4] https://www.nvidia.com/en-us/self-driving-cars/drive-platform/hardware/
[5] https://www.ifixit.com/Teardown/Amazon+Echo+Teardown/33953
[6] https://www.qualcomm.com/products/vision-intelligence-400-platform
[7] https://www.raspberrypi.org/magpi/raspberry-pi-3-specs-benchmarks/
[8] https://store.arduino.cc/usa/arduino-uno-rev3

2.4 Typical Use Case and Hardware


Finally, we conclude this section by providing a brief landscape of the variety of edge devices used
in several use-case domains and their resource characterization. As seen in Table 1, learning on the
edge spans a wide spectrum of hardware specifications. Hence, designing machine learning models
for edge devices requires a very good understanding of the resource constraints, and appropriately
incorporating these constraints into the systems, algorithms and theoretical design levels.

3 ALGORITHMS FOR ON-DEVICE LEARNING


The algorithms approach targets developing resource-efficient techniques that work with existing
resource-constrained platforms. This section provides a detailed survey on the various approaches to analyze and estimate a Machine Learning (ML) model’s resource footprint, and the state-of-the-art
algorithms proposed for on-device learning.
We present these algorithms and model optimization approaches in a task-agnostic manner. This
section discusses general approaches that adapt both traditional ML algorithms and deep learning
models to a resource constrained setting. These approaches can be applied to multiple tasks such
as classification, detection, regression, image segmentation and super-resolution.
Many traditional ML algorithms, such as SVM and Random Forest, are already suitable for
multiple tasks and do not need special consideration per task. Deep learning approaches, on the
other hand, do vary considerably from task to task. However, for the tasks mentioned above, these
networks generally deploy a CNN or RNN as the backbone network for feature extraction [90, 129].
As a consequence the resource footprint of training the backbone networks would directly affect the
training performance of the overall model. For example, deep learning based image segmentation
methods usually use ResNet as the backbone, while adding additional modules such as graphical
models, modifications to the convolution computation, or combining backbone CNNs with other
architectures such as encoder-decoders to achieve their goal [129].
Given these commonalities, we expect the approaches presented in this section to be generalizable
to different tasks and refer the readers to surveys [50, 90, 129, 190] for detailed comparison of
task-specific models. More importantly, we categorize the directions in which improvements can be
made, which could serve as a guideline for proposing novel resource-efficient techniques for new
models/tasks. Note that for CNN benchmarking, we use the image classification task as an example
as benchmarking datasets are well established for this area. Tasks such as scene segmentation
and super-resolution currently lack standard benchmarks making it difficult to equitably compare
models.

3.1 Resource Footprint Characterization


Before adapting ML algorithms to the resource-constrained setting, it is important to first understand
the resource requirements of common algorithms and identify their resource bottlenecks. The
conventional approach adopts an asymptotic analysis (such as the Big-O notation) of the algorithm’s
implementation. For DNNs, an alternate analysis technique is to use hardware-agnostic metrics
such as the number of parameters and number of operations (FLOPs/MACs). However, it has
been demonstrated that hardware-agnostic metrics are too crude to predict the real performance
of algorithms [119, 123, 202] because actual performance depends heavily on the specific platform and
framework used. This hardware dependency has led to a number of efforts aimed at measuring
the true resource requirements of many algorithms (mostly DNNs) on specific platforms. A third
approach to profiling proposes building regression models that accurately estimate and predict
resource consumption of DNNs from their weights and operations. We will provide an overview
of all these resource characterization approaches in this subsection as well as new performance
metrics for algorithm analysis that incorporate the algorithm’s resource footprint.

3.1.1 Asymptotic Analysis. In this section, we present a comparative overview of the computational
and space complexities of both traditional machine learning algorithms and DNNs using asymptotic
analysis and hardware-agnostic metrics.
Traditional Machine Learning Algorithms: Due to the heterogeneity of model architectures
and optimization techniques, there is no unified approach that characterizes the resource utilization
performance of traditional machine learning algorithms. The most commonly used method is
asymptotic analysis that quantifies computational complexity and space complexity using the
Big-O notation. Table 2 summarizes the computational and space complexities of 10 popular
machine learning algorithms based on their implementation in Map-Reduce [26] and Scikit-Learn [103]. Note that algorithm complexity can vary across implementations, but we believe the results
demonstrated in Table 2 are representative of the current landscape. For methods that require
iterative optimization algorithms or training steps, the training complexity is estimated for one
iteration.

Table 2. Comparison of traditional machine learning algorithms. Notation: m = number of training samples;
n = input dimension; c = number of classes; N_tree = number of trees; m_sv = number of support vectors.

Algorithm | Model size | Optimization | Training complexity | Inference complexity
Decision tree | O(m) | - | O(mn log(m)) | O(log(m))
Random forest | O(N_tree m) | - | O(N_tree mn log(m)) | O(N_tree log(m))
SVM | O(n) | gradient descent | O(m² n) | O(m_sv n)
Logistic regression | O(n) | Newton-Raphson | O(mn² + n³) | O(n)
kNN | O(mn) | - | - | O(mn)
Naive Bayes | O(nc) | - | O(mn + nc) | O(nc)
Linear regression | O(n) | matrix inversion | O(mn² + n³) | O(n)
k-Means | - | - | O(mnc) | -
EM | - | - | O(mn² + n³) | -
PCA | - | eigen-decomposition | O(mn² + n³) | -

Some key observations can be made from Table 2. First, most traditional machine learning
algorithms (except tree-based methods and kNN) have a model size that is linear in the input
dimension and therefore do not require much memory (compared to DNNs, which will be discussed
later). Second, except for kNN, which requires distance calculations between the test data and
all training samples, the inference step of other algorithms is generally very fast. Third, some
methods require complex matrix operations such as matrix inversion and eigen-decomposition
with computational complexity around O(n³) [26]. Therefore, one should consider whether these
matrix operations can be efficiently supported by the targeted platform when deploying these
methods on resource-constrained devices.
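As a worked example of reading Table 2, the sketch below plugs a hypothetical dataset size (m, n) into the asymptotic training costs of a few algorithms; the constants hidden by the Big-O notation are ignored, so the numbers only indicate relative scaling.

```python
# Compare order-of-magnitude training costs from Table 2 (hidden constants ignored).
m, n = 100_000, 256            # hypothetical: training samples, input dimension

logreg_ops = m * n**2 + n**3   # Newton-Raphson step: O(mn^2 + n^3)
svm_ops    = m**2 * n          # kernel SVM training: O(m^2 n)
knn_ops    = 0                 # kNN has no training phase

print(f"logistic regression ~{logreg_ops:.2e} ops per iteration")
print(f"SVM                 ~{svm_ops:.2e} ops")
print(f"kNN training        ~{knn_ops} ops (the cost is paid at inference: O(mn) per query)")
```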
In terms of accuracy of traditional machine learning algorithms, empirical studies have been
carried out using multiple datasets [2, 49, 207]. However, for on-device learning, there are few
studies that analyze both accuracy and complexity of ML algorithms and the tradeoffs therein.
Deep Neural Networks: Deep neural networks have shown superior performance in many
computer vision and natural language processing tasks. To harvest their benefits for edge learning,
one has to first evaluate whether the models and the number of operations required can fit into the
available resources on a given edge device. Accuracy alone cannot justify whether a DNN is suitable
for on-device deployment. To this end, recent studies that propose new DNN architectures also
consider MACs/FLOPs and number of weights to provide a qualitative estimate of the model size
and computational complexity. However, both memory usage and energy consumption also depend
on the feature map or activations [22, 119]. Therefore, in the following, we review popular DNNs
in terms of accuracy, weights, activations and MACs. Accuracies of models are excerpted from
best accuracies reported on open platforms and the papers that proposed these models. Number
of weights, activations and MACs are either excerpted from papers that first proposed the model
or calculated using the Netscope tool [34]. Note that these papers only count the MACs involved
in a forward pass. Since the majority of the computational load in training DNNs happens in the
backward pass, the values reported do not give a realistic picture of the computational complexity
for model training.
Compared to other network architectures, most on-device learning research focuses on CNNs,
which can be structured as an acyclic graph and analyzed layer-by-layer. Specifically, a CNN
consists of two types of layers, namely, the convolutional layer (CONV) and the fully connected
layer (FC). It has been shown in [22] that the resource requirement of these two layers can be
very different due to their different data flow and data reuse patterns. In Table 3, we summarize
the layer-wise statistics of popular high-performance CNN models submitted to the ImageNet
challenge. In general, these models are very computation and memory intensive, especially during
training. Training requires allocating memory to weights, activations, gradients, data batches, and
workspace, which is at least hundreds of MBs if not GBs. These models are hard to deploy on
a resource-constrained device, let alone be trained on one. To enable CNN deployment on edge
devices, models with smaller sizes and more compact architectures are proposed, which we will
review in Section 3.2.2.

Table 3. Comparison of popular CNNs.

Metric | AlexNet [96] | VGG-16 [157] | GoogLeNet [167] | ResNet-18 [69] | ResNet-50 [69] | Inception v3 [168]
Top-1 acc. | 57.2 | 71.5 | 69.8 | 69.6 | 76.0 | 76.9
Top-5 acc. | 80.2 | 91.3 | 90.0 | 89.2 | 93.0 | 93.7
Input size | 227×227 | 224×224 | 224×224 | 224×224 | 224×224 | 299×299
# of stacked CONV layers | 5 | 13 | 21 | 17 | 49 | 16
CONV weights | 2.3M | 14.7M | 6.0M | 9.5M | 23.6M | 22M
CONV activations | 0.94M | 15.23M | 6.8M | 3.2M | 11.5M | 10.6M
CONV MACs | 666M | 15.3G | 1.43G | 1.8G | 3.9G | 3.8G
# of FC layers | 3 | 3 | 1 | 1 | 1 | 1
FC weights | 58.7M | 125M | 1M | 0.5M | 2M | 2M
FC activations | 9K | 9K | 2K | 1.5K | 3K | 3K
FC MACs | 58.7M | 125M | 1M | 0.5M | 2M | 2M
Total weights | 61M | 138M | 7M | 10M | 25.6M | 24M
Total activations | 0.95M | 15.24M | 6.8M | 3.2M | 11.5M | 10.6M
Total MACs | 724M | 15.5G | 1.43G | 1.8G | 3.9G | 3.8G
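Per-layer numbers of the kind reported in Table 3 follow from standard counting formulas for CONV and FC layers; the sketch below shows the arithmetic, using the first convolutional layer of AlexNet (11×11 kernel, 3→96 channels, 55×55 output) as the example.

```python
def conv_layer_stats(c_in, c_out, k, h_out, w_out):
    """Weights, output activations, and MACs of a standard convolutional layer."""
    weights = k * k * c_in * c_out        # ignoring the bias term
    activations = c_out * h_out * w_out   # one forward pass, batch size 1
    macs = weights * h_out * w_out        # each weight is reused at every output position
    return weights, activations, macs

def fc_layer_stats(n_in, n_out):
    """Weights, output activations, and MACs of a fully connected layer."""
    return n_in * n_out, n_out, n_in * n_out

# Example: AlexNet's first CONV layer (11x11 kernel, 3 -> 96 channels, 55x55 output map).
w, a, m = conv_layer_stats(c_in=3, c_out=96, k=11, h_out=55, w_out=55)
print(f"weights={w / 1e3:.1f}K  activations={a / 1e3:.1f}K  MACs={m / 1e6:.1f}M")
```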
3.1.2 Resource Profiling. The most accurate way to quantify the resource requirements of machine
learning algorithms is to measure them during deployment. Table 4 summarizes current efforts
on DNN benchmarking using various platforms and frameworks. For inference, we only present
benchmarks that use at least one edge device such as a mobile phone or an embedded computing
system (such as NVidia Jetson TX1). However, as it has not been feasible to train DNNs at large-scale
on edge devices yet, we present training benchmarks utilizing single or multiple high-performance
CPU or GPUs. Interestingly, apart from measuring model-level performance, three benchmarks [1,
52, 146] further decompose the operations involved in running DNNs and profile micro-architectural
performance. We believe that these finer-grained measurements can provide more insights into
resource requirement estimation for both training and inference, which are composed of these
basic operations.
As opposed to DNNs, there are few profiling results reported for on-device deployment of
traditional machine learning algorithms, and those that exist usually arise from comparing these algorithms to
newly developed ones. For example, inference time and energy are profiled in [98] for a newly
proposed tree-based Bonsai method, local deep kernel learning, single hidden layer neural network,
and gradient boosted decision tree on the Arduino Uno micro-controller board. However, memory
footprint is not profiled. Some other works empirically analyze the complexity and performance of
machine learning algorithms [113, 205] where the experiments are conducted on computers but not
resource constrained devices. To better understand the resource requirements of traditional machine
learning methods, more systematic experiments need to be designed to profile their performance
on different platforms and various frameworks.

Table 4. DNN profiling benchmarks.

Benchmark | Platform | Framework | Model | Metric | Highlight
AI Android [86] | 57 SoC processors; 200 mobile phones | TensorFlow Lite | CNNs for 9 image processing tests | inference: runtime | the most comprehensive benchmark on CNN deployment on Android systems
NAS [23] | Intel i5-7600, NVidia Jetson TX1, Xiaomi Redmi Note 4 | N/A | 8 light-weight CNNs from neural architecture search methods | inference: time and memory | survey of models generated from NAS methods on resource-limited devices
Fathom [1] | Skylake i7-6700k CPU with 32GB RAM; NVidia GeForce GTX 960 | TensorFlow | 3 CNNs, 2 RNNs, 1 DRL, 1 Autoencoder, 1 Memory Network | training: single-/multi-thread execution time by op type (e.g., MatMul, Conv2D, etc., 22 in total) | micro-architectural analysis; exploration of parallelism
BigDataBench [52] | Intel Xeon E5-2620 V3 CPU with 64GB RAM | Hadoop, Spark, JStorm, MPI, Impala, Hive, TensorFlow, Caffe (each supporting different models) | Micro benchmarks: 21 ops, e.g., sort, conv, etc.; Comp benchmarks: 23, e.g., pagerank, kmeans, etc. | training: runtime, bandwidth, memory (L1-L3 cache and DRAM) | operating system level benchmarking; large variety of models
DAWNBench [27] | Nvidia K80 GPU | TensorFlow, PyTorch | ResNets | training: time-to-accuracy vs. optimization techniques, inference runtime vs. training runtime | defining a new metric by end-to-end training time to a certain validation accuracy
DeepBench [146] | Training: Intel Xeon Phi 7250, 7 types of NVidia GPUs; Inference: 3 NVidia GPUs, iPhone 6 & 7, Raspberry Pi 3 | N/A | Operations: Matmul, Conv, Recurrent Ops, All-reduce | training and inference runtime, FLOPs | profiling basic operations with respect to input dimensions
TBD [213] | 16 machines each with a Xeon 28-core CPU and 1-4 NVidia Quadro P4000 / NVidia TITAN Xp GPUs | TensorFlow, MXNet, CNTK | 3 CNNs, 3 RNNs, 1 GAN, 1 DRL | training: throughput, GPU compute utilization, FP32 utilization, CPU utilization, memory | profiling of the same model across frameworks; multi-GPU multi-machine training
SyNERGY [148] | NVidia Jetson TX1 | Caffe | 11 CNNs | inference: energy consumption (both per layer and network), SIMD instructions, bus access | proposing a multi-variable regression model to predict energy consumption
DNNAnalysis [16] | NVidia Jetson TX1 | Torch7 | 14 CNNs | inference: runtime, power, memory, accuracy vs. throughput, parameter utilization | comparison of CNN inference on an embedded computing system
DNNBenchmarks [10] | NVidia Jetson TX1; NVidia TITAN Xp GPU | PyTorch | 44 CNNs | inference: accuracy, memory, FLOPs, runtime | very comprehensive comparison of CNN inference on both an embedded computing system and a powerful workstation

Table 5. DNN resource requirements modeling. ASIC: Application-Specific Integrated Circuit. Matmul: matrix
multiplication. RMSPE: root mean square percentage error.

Work | Platform | Framework | Metric | Measured features | Regression model | Relative error
Augur [119] | NVidia TK1, TX1 | Caffe | inference: memory, time | matrix dimensions in matmul, weights, activations | linear | memory: 28%-50%; time: 6%-20%
Paleo [141] | NVidia Titan X GPU cluster | TensorFlow | training & inference: time | forward & backward FLOPs, weights, activations, data, platform percent of peak | linear | 4%-30%
Gianniti et al. [56] | NVidia Quadro M6000 GPU | - | training: time | forward & backward FLOPs of all types of layers | linear | < 23%
SyNERGY [148] | Nvidia Jetson TX1 | Caffe | inference: energy | MACs | linear | < 17% (w/o MobileNet)
NeuralPower [13] | Nvidia Titan X & GTX 1070 | TensorFlow & Caffe | inference: time, power, energy | layer configuration hyper-parameters, memory access, FLOPs, activations, batch size | polynomial | time: < 24%; power: < 20%; energy: < 5%
HyperPower [160] | Nvidia GTX 1070 & Tegra TX1 | Caffe | inference: power, memory | layer configuration hyper-parameters | linear | RMSPE < 7%
Yang et al. [202] | ASIC Eyeriss [22] | - | inference: energy | MACs, memory access | - | -
DeLight [149] | Nvidia Tegra TK1 | Theano | training & inference: energy | layer configuration hyper-parameters | linear | -

3.1.3 Resource Modeling and Estimation. To provide more insights into how efficiently machine
learning algorithms can be run on a given edge device, it’s helpful to model and predict the
resource requirements of these algorithms. Even though performance profiling can provide accurate
evaluation of resource requirements, it can be costly as it requires the deployment of all models to
be profiled. If such requirements can be modeled before deployment, it will then be more helpful
for algorithm design and selection. Recently, there have been an increasing number of studies that
attempt to estimate energy, power, memory and runtime of DNNs based on measures such as matrix
dimension and MACs. There are many benefits if one can predict resource requirements without
deployment and runtime profiling. First, it can provide an estimate of training and deployment cost
of a model before actual deployment, which can reduce unnecessary implementation costs. Second,
the modeled resource requirements can be integrated into the algorithm design process, which
helps to efficiently create models that are tailored to specific user-defined resource constraints.
Third, knowing the resource consumption of different operation types can help to decide when
offloading and performance optimization are needed to successfully run learning models on edge
devices [119].
In Table 5, we summarize recent studies that propose approaches to model resource require-
ments of DNNs. Generally, a linear regression model is built upon common features such as
FLOPs for certain types of operations, activation size, matrix size, kernel size, layer configuration
hyper-parameters, etc., which are not hard to acquire. The Relative Error column shows how these
estimation models perform compared to the actual runtime profiling results of a network. We can
see that most models demonstrate a relative error between 20% and 30%, which indicates that
even though hardware dependency and various measures are taken into account, it can still be
challenging to make consistently accurate predictions on resource requirements. This large relative
error further shows that metrics such as FLOPs or MACs alone are very crude estimates that can
provide limited insights. Note that all models are platform and framework dependent, meaning that the coefficients of the regression model will change when the platform or framework changes.
Nevertheless, these studies provide a feasible approach to approximate resource requirements of
DNNs. It remains to be seen whether a similar methodology can be adopted to model the resource
requirements of machine learning algorithms other than DNNs.
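In the spirit of the linear models summarized in Table 5, the sketch below fits per-layer runtime as a linear function of FLOPs and activation count using ordinary least squares; the measurements are synthetic placeholders, and a real estimator would be refit for each platform and framework.

```python
import numpy as np

# Synthetic per-layer measurements: (FLOPs, activation count) -> runtime in ms.
features = np.array([
    [1.2e8, 4.0e5],
    [3.5e8, 8.1e5],
    [7.0e8, 1.6e6],
    [1.4e9, 3.1e6],
    [2.1e9, 4.9e6],
])
runtime_ms = np.array([0.9, 2.1, 4.2, 8.5, 12.4])

# Ordinary least squares with an intercept term.
X = np.hstack([features, np.ones((len(features), 1))])
coef, *_ = np.linalg.lstsq(X, runtime_ms, rcond=None)

# Predict the runtime of an unseen layer from its FLOPs and activation count.
new_layer = np.array([9.0e8, 2.0e6, 1.0])
print(f"predicted runtime ~ {new_layer @ coef:.2f} ms")
```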
3.1.4 New Metrics. Resource constrained machine learning models cannot be evaluated solely on
traditional performance metrics like accuracy for classification, perplexity for language models, or
mean average precision for object detection. In addition to these traditional metrics, there is a need
for system-specific performance metrics. Such metrics can be broadly categorized as follows:
(1) Analyze the multi-objective Pareto front of the designed model along the dimensions of
accuracy and resource footprint. Following [23], some commonly used metrics include

\text{Reward} = \begin{cases} \text{Accuracy}, & \text{if Power} \leq \text{Power}_{\text{budget}} \\ 0, & \text{otherwise} \end{cases} \qquad (1)

or

\text{Reward} = \begin{cases} 1 - \text{Energy}^{*}, & \text{if Accuracy} \geq \text{Accuracy}_{\text{threshold}} \\ 0, & \text{otherwise} \end{cases} \qquad (2)

where Energy* is a normalized energy consumption. Using this measure, a model can be
guaranteed to fulfill the resource constraints. However, it is hard to compare several models
as there is no single metric. In addition, it is also harder to optimize the model when the
reward goes to zero.
(2) Scalarization of the multi-objective metric into a unified reward [70], Figure of Merit (FoM)
[182], or NetScore [197] that merges all performance measures into one number, e.g.,

\text{Reward} = -\text{Error} \times \log(\text{FLOPs}), \qquad (3)

or

\text{FoM} = \frac{\text{Accuracy}}{\text{Runtime} \times \text{Power}}, \qquad (4)

or

\text{NetScore} = 20 \log\!\left(\frac{\text{Accuracy}^{\alpha}}{\text{Parameters}^{\beta}\, \text{MACs}^{\gamma}}\right), \qquad (5)

where α, β and γ are coefficients that control the influence of individual metrics. Using a
single value, a relative ordering between several models is possible. However, there is now
no separate threshold on the resource requirements. Hence, the optimized model may still
not fit the hardware limitations.
Note that, in addition to using the metrics in eq. (1) to (5) to evaluate the model performance,
the works in [23, 70, 182] also utilize these metrics to build and optimize their learning algorithms.
The main idea is to accurately maintain the system requirements and also improve the model’s
performance during training [170].
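A direct transcription of eqs. (1)–(5) into code is shown below; the threshold values, the log bases (natural for eq. (3), base 10 for NetScore), and the NetScore exponents and units are assumptions for illustration and should be checked against the cited works.

```python
import math

def reward_power_budget(accuracy, power, power_budget):         # eq. (1)
    return accuracy if power <= power_budget else 0.0

def reward_energy(accuracy, norm_energy, acc_threshold):        # eq. (2)
    return 1.0 - norm_energy if accuracy >= acc_threshold else 0.0

def reward_scalarized(error, flops):                            # eq. (3)
    return -error * math.log(flops)

def figure_of_merit(accuracy, runtime, power):                  # eq. (4)
    return accuracy / (runtime * power)

def netscore(accuracy, params, macs, alpha=2.0, beta=0.5, gamma=0.5):  # eq. (5)
    # accuracy in percent; params and MACs assumed to be expressed in millions here
    return 20 * math.log10(accuracy**alpha / (params**beta * macs**gamma))

# Example: a model with 71% top-1 accuracy, 3.5M parameters, 300M MACs.
print(f"NetScore ~ {netscore(71.0, 3.5, 300):.1f}")
```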

3.2 Resource Efficient Training


In this section, we review existing algorithm improvements on resource-constrained training.
Note that orthogonal advances have been made in specialized hardware architecture design
and dataflow/memory optimization to enable on-device learning [147, 166]. While algorithmic
and architectural approaches are tightly correlated and in most cases complement each other, we
maintain our focus on resource-efficient algorithms that can work across platforms.
In the machine learning task, the available resources are mainly consumed by three sources: 1)
the ML model itself; 2) the optimization process that learns the model; and 3) the dataset used for learning. For instance, a complex DNN has a large memory footprint while a kernel based method
may require high computational power. The dataset requires storage when it is collected and
consumes memory when it is used for training. Correspondingly, there are mainly three approaches
that adapt existing ML algorithms to a resource-constrained setting: 1) reducing model complexity
by incorporating resource constraints into the hypothesis class; 2) modifying the optimization
routine to reduce resource requirements during training; and 3) developing new data representations
that utilize less storage. Particularly, for data representation, there is a line of work that proposes
new learning protocols for resource-constrained scenarios where full data observability is not
possible [8, 156]. In the following, we broadly review resource efficient training methods according
to these categories.

3.2.1 Lightweight ML Algorithms. As shown in Table 2 in Section 3.1.1, some traditional ML


algorithms, such as Naive Bayes, Support Vector Machines (SVMs), Linear Regression, and certain
Decision Tree algorithms like C4.5, have relatively low resource footprints [2]. The lightweight
nature of these algorithms makes them good candidates for on-device learning applications. These
algorithms can be readily implemented on a particular embedded sensor system or portable device
with few refinements to fit the specific hardware architecture [66, 105, 117].

3.2.2 Reducing Model Complexity. Adding constraints to the hypothesis class or discovering
structures in the hypothesis class is a natural approach to selecting or generating simpler hypotheses.
This approach is mostly applied to large models, such as decision trees, ensemble-based models,
and DNNs.
Decision trees can have a large memory footprint due to the large number of tree nodes. To
avoid over-fitting and to reduce this memory footprint, pruning is a typical approach applied when
deploying decision trees [97]. Recently, there have also been studies that develop shallow and sparse tree
learners with powerful nodes that only require a few kilobytes of memory [98].
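As a small example of the pruning tradeoff, the scikit-learn snippet below compares an unpruned decision tree with a cost-complexity-pruned one; the dataset and the ccp_alpha value are arbitrary choices, unrelated to the methods cited above.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for alpha in (0.0, 0.01):  # 0.0 = no pruning; larger alpha prunes more aggressively
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_tr, y_tr)
    print(f"ccp_alpha={alpha}: {tree.tree_.node_count} nodes, "
          f"test accuracy={tree.score(X_te, y_te):.3f}")
```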
For ensemble-based algorithms such as boosting and random forest, a critical question is how to
select weak learners to achieve higher accuracy with lower computational cost. To address this
issue, [61] proposed a greedy approach that selects weak decision-tree-based learners such that a
prediction can be generated at any time, where accuracy increases if a larger latency budget allows
more weak learners to be processed.
For DNNs, there have been a plethora of studies that propose more lightweight model architec-
tures for on-device learning. While most of these models are designed for edge deployment, we still
include these approaches in efficient training because reductions in weights and connections in
these new architectures lead to fewer resource requirements for continued training. In Table 6, we
provide a comparative overview of some representative lightweight CNN architectures. Compared
to networks presented in Table 3, these networks are much smaller in terms of size and computation,
while still retaining fairly good accuracy compared to large CNNs. Unfortunately, there is not much
work developing lightweight models for DNN architectures other than CNNs.
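To show where savings of the kind reported in Table 6 come from, the sketch below compares the parameter and MAC counts of a standard convolution with a depthwise separable one (the factorization used by the MobileNet family); the layer shape is an arbitrary example.

```python
def standard_conv(c_in, c_out, k, h, w):
    params = k * k * c_in * c_out
    macs = params * h * w
    return params, macs

def depthwise_separable_conv(c_in, c_out, k, h, w):
    # Depthwise (k x k per input channel) followed by pointwise (1 x 1) convolution.
    dw_params = k * k * c_in
    pw_params = c_in * c_out
    params = dw_params + pw_params
    macs = (dw_params + pw_params) * h * w
    return params, macs

shape = dict(c_in=128, c_out=128, k=3, h=28, w=28)
std_p, std_m = standard_conv(**shape)
sep_p, sep_m = depthwise_separable_conv(**shape)
print(f"standard:  {std_p / 1e3:.0f}K params, {std_m / 1e6:.0f}M MACs")
print(f"separable: {sep_p / 1e3:.0f}K params, {sep_m / 1e6:.0f}M MACs "
      f"(~{std_m / sep_m:.1f}x fewer MACs)")
```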
Among the models presented in Table 6, MnasNet is a representative model generated via
automatic neural architecture search (NAS), whereas all other networks are designed manually.
NAS is usually based on reinforcement learning (RL) or evolutionary algorithms where a predefined
search space is explored to generate the optimal network architecture that achieves the best tradeoff
between accuracy and energy/runtime [195]. To speed up NAS, the resource modeling techniques
mentioned in Section 3.1.3 could be helpful, as they can replace the computationally intensive
process of measuring the actual model performance on device. For example, in ChamNet[31],
predictors for model accuracy, latency, and energy consumption are proposed that can guide the
search and reduce time to find desired models from many GPU hours to mere minutes.


Table 6. Comparison of lightweight CNNs.

Metric | MobileNet V1-1.0 [79] | MobileNet V2-1.0 [152] | SqueezeNet [85] | SqueezeNext-1.0-23 [55] | ShuffleNet 1× (g = 8) [210] | CondenseNet [83] | MnasNet [170]
Top-1 acc. | 70.9 | 71.8 | 57.5 | 59.0 | 67.6 | 71.0 | 74.0
Top-5 acc. | 89.9 | 91.0 | 80.3 | 82.3 | - | 90.0 | 91.8
Input size | 224×224 | 224×224 | 224×224 | 227×227 | 224×224 | 224×224 | 224×224
# of stacked CONV layers | 27 | 20 | 26 | 22 | 17 | 37 | 18
CONV weights | 3.24M | 2.17M | 1.25M | 0.62M | 3.9M | 2.8M | 3.9M
CONV activations | 5.2M | 1.46M | 4.8M | 4.7M | 3.2M | 1.1M | 3.9M
CONV MACs | 568M | 299M | 388M | 282M | 138M | 274M | 317M
# of FC layers | 1 | 1 | 0 | 1 | 1 | 1 | 1
FC weights | 1M | 1.3M | 0 | 0.1M | 1.5M | 0.1M | 0.3M
FC activations | 2K | 2.3K | 0 | 1.1K | 2.5K | 1.1K | 1.3K
FC MACs | 1M | 1.3M | 0 | 0.1M | 1.5M | 0.1M | 0.3M
Total weights | 4.24M | 3.47M | 1.25M | 0.72M | 5.4M | 2.9M | 4.2M
Total activations | 5.2M | 1.46M | 4.8M | 4.7M | 3.2M | 1.1M | 3.9M
Total MACs | 569M | 300M | 388M | 282M | 140M | 274M | 317M

Note that even though the goal of most NAS algorithms is to optimize for inference performance,
the resulting architectures usually contain fewer parameters and require fewer FLOPS, which could
lead to a lower resource footprint when trained on-device. In fact, a recent study [142] shows that
training time can be significantly reduced on the architectures found in their proposed search
space. If needed, on-device resource constraints that are specific to model training can be taken into
account by NAS. Keeping this in mind, we include NAS in the resource-efficient training
section, as NAS-generated models provide good candidates for on-device model training tasks.
Figure 3 shows a chronological plot of the evolution of CNN architectures. We include the models
in Table 3, Table 6, as well as recent NAS works including FBNet[198], FBNetV2[184], ChamNet[31],
ProxylessNAS[14], SinglePath-NAS[161], EfficientNet[171], MixNet[172], MobileNetV3[78], and
RegNet[142]. This plot shows the FLOPS each model requires, and only models that achieve more
than 60% top-1 accuracy on the ImageNet dataset are presented. Figure 3 shows a clear transition
from hand-designed CNN architectures to NAS over time, with the best performing models all
being generated using NAS techniques. An advantage of NAS over hand-designed architectures
is that different accuracy-resource tradeoffs can be made a part of the search space resulting in a
group of models that are suitable for platforms with various resource constraints.
Apart from designing new architectures from scratch, another approach to reducing DNN
model complexity is to exploit the sparse structure of the model architecture. One efficient way
to learn sparse DNN structures is to add a group lasso regularization term in the loss function
which encourages sparse structures in various DNN components like filters, channels, filter shapes,
layer depth, etc. [48, 104, 192]. Particularly in [192], when designing the regularization term, the
authors take into account how computation and memory access involved in matrix multiplication
are executed on hardware platforms such that the resulting DNN network can achieve practical
computation acceleration.
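Below is a minimal PyTorch-style sketch of the group lasso idea, treating each convolutional filter as one group and adding the sum of group L2 norms to the task loss; the network, data, and regularization weight are placeholders, and it does not reproduce the hardware-aware grouping of [192].

```python
import torch
import torch.nn as nn

def group_lasso_penalty(model, lam=1e-4):
    """Sum of L2 norms of each output filter in every Conv2d layer.

    Driving a whole group (filter) to zero allows that filter to be removed,
    yielding a structurally smaller network."""
    penalty = 0.0
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            # weight shape: (out_channels, in_channels, kH, kW); one group per output filter
            penalty = penalty + module.weight.flatten(1).norm(dim=1).sum()
    return lam * penalty

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 32, 3, padding=1))
criterion = nn.MSELoss()
x, target = torch.randn(4, 3, 32, 32), torch.randn(4, 32, 32, 32)

loss = criterion(model(x), target) + group_lasso_penalty(model)
loss.backward()  # gradients now include the sparsity-inducing term
```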
Another approach adopted for reducing model complexity involves quantization by using low- or
mixed precision data representation. Quantization was first applied to inference where only the
weights and/or activations are quantized post-training. This approach was then extended to training
where gradients are also quantized. Conventionally, all numerical values including weights, data,
and intermediate results are represented and stored using 32-bit floating point data format. There
are a plethora of emerging studies exploring the use of 16-bit or lower precision for storing some
or all of these numerical values in model training without much degradation in model accuracy

, Vol. 1, No. 1, Article . Publication date: July 2020.


On-Device Machine Learning: An Algorithms and Learning Theory Perspective 17

Fig. 3. Ball chart of the chronological evolution of model complexity. Top-1 accuracy is measured on the
ImageNet dataset. The model complexity is represented by FLOPS and reflected by the ball size. The accuracy
and FLOPS are taken from original publications of the models. The time of the model is when the associated
publication is first made available online.

[64, 128]. Most of these approaches involve modifications in the training routine to account for
the quantization error. A more detailed discussion on the training routine modifications of these
methods is provided in the next Section.
3.2.3 Modifying Optimization Routines. There are broadly two directions of research aimed at
improving the performance of quantized models.
Resource constrained Model-Centric Optimization Routines: Optimization or other numer-
ical computation involved in training certain traditional ML models can be very resource intensive.
For example, for kernel-based algorithms such as SVM, kernel computation usually consumes the
most energy and memory. Efficient algorithms have been proposed to reduce the computation
complexity and memory footprint of kernel computations, and are beneficial for both training and
inference [80, 91, 102].
The choice of optimization strategy is also crucial for DNN training, as a proper optimization technique is key to finding a good local minimum. While many efficient optimization algorithms such as Adam, as well as network initialization techniques, are employed in DNN training, their main objective is to improve convergence and consequently reduce training time; other resource constraints are not taken into account explicitly. In this part, we focus on efficient training algorithms that consider resource constraints beyond runtime and could potentially enable DNN training on devices with limited memory and power.
The most popular approach for reducing memory footprint of DNN training is quantization, as
mentioned in section 3.2.2. However, quantization inevitably introduces error which could make
training unstable. To resolve this issue, recent works propose minor modifications to the model
training process. A popular approach involves designing a mathematical model to characterize the
statistical error introduced when limiting the model’s precision through rounding or quantization
[25, 77, 92, 200]; and proposing improvements [76, 95, 121]. In fact, most of the recent research
in this line involves modifications in training updates either through stochastic rounding, weight
initialization or through introducing the quantization error into gradient updates. The work in
[65] uses fixed point representation and stochastic rounding to account for quantization error.
[28] further limits the precision to binary representation of weights using stochastic binarization
during forward and back propagation. However the full precision is maintained during gradient
updates. The authors claim that such unbiased binarization lends to additional regularization of the
network. A more detailed theoretical analysis of stochastic rounding and binary connect networks
can be found in [109]. The work in [116] introduced a quantized version of back-propagation by
representing the weights as powers of 2. This representation converts the multiplication operations
to cheaper bit-shift operations. An alternative noisy back propagation algorithm to account for
the error due to binarization of weights is also introduced in [94]. In a slightly different approach
[84] uses a deterministic quantization but introduces a ‘straight through estimator’ for gradient
computation during back propagation. They further introduce a shift-based batch normalization
and a new shift-based AdaMax algorithm. [143] introduces a new mechanism to compute binarized
convolutions and, additionally, introduces a scaling for the binarized gradients. [194] applies a
non-subtractive dither quantization function to gradients and shows that this function can induce
sparsity and non-zero values with low bitwidth for large enough quantization stepsize.
A more recent approach in [186] moves away from fixed-points to a new floating point repre-
sentation with stochastic rounding and chunk based accumulation during training. Finally, recent
work in [43, 89] simulates the effect of quantization during inference and adds correction to the
training updates by introducing quantization noise in the gradient updates. Slightly different from
[89] where all the weights are quantized, [43] quantizes a random subset of weights and back
propagates unbiased gradients for the complementary subset.
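For illustration, the sketch below shows the basic simulated-quantization / straight-through-estimator mechanic that several of the above works build on (a simplified PyTorch example of our own, not the exact procedure of any cited paper):

```python
# Minimal sketch of simulated (fake) quantization with a straight-through
# estimator: the forward pass uses quantized weights, while gradients flow
# through as if no quantization had been applied.
import torch

def fake_quantize(w, num_bits=8):
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (w.max() - w.min()).clamp(min=1e-8) / (qmax - qmin)
    zero_point = qmin - w.min() / scale
    q = torch.clamp(torch.round(w / scale + zero_point), qmin, qmax)
    w_q = (q - zero_point) * scale                 # de-quantized value
    # straight-through estimator: forward sees w_q, backward sees the identity
    return w + (w_q - w).detach()
```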
Apart from the theoretical development of methods that reduce quantization error, some other
works focus on making quantization more practical and generalizable. [211] generalizes the method
of binarized neural networks to allow arbitrary bit-widths for weights, activations, and gradients,
and stochastically quantizes the gradients during the backward pass. To generalize the quantization
technique to different types of models, [15] introduces two learnable statistics of DNN tensors,
namely, shifted and squeezed factors, which are used to adjust the tensors’ range in 8-bits for
minimizing quantization loss. They show that their method works out-of-the-box for large-scale
DNNs without much tuning. However, in most works, there is still some computation in training that requires full precision representation. To address this issue, [204] proposes a framework that
quantizes the complete training path including weights, activations, gradients, error, updates, and
batch-norm layers, and converts them to 8-bit integers. Different quantization functions are used
for different compute elements.
In a very different approach, [158] proposes expectation back propagation (EBP), an algorithm for learning the weights of a binary network using a Variational Bayes technique. The algorithm can be used to train the network such that each weight is restricted to binary or ternary values, and it differs fundamentally from the gradient-based back-propagation algorithms above. However, this approach assumes that the biases are real-valued and is not currently applicable to CNNs.
A chronological summary of the approaches modifying the training algorithms to account for
quantization error is provided in Table 7. While the above approaches (summarized in Table 7)
mainly target modifications to account for the limited bit representations (due to quantization), in
the following section, we highlight several other approaches that are not limited to quantization
corrections.

Table 7. The chronology of recent approaches that modify the training algorithm to account for quantization error.

Year | Approach | Keywords | Quantization^1 (forward / backward / parameter update) | Benchmark (data; model)
2014 | EBP [158] | expectation back propagation | 1 bit, FP / - / - | data used in [30]; proprietary MLP
2015 | Gupta et al. [65] | stochastic rounding | 16 bits / 16 bits / 16 bits (MNIST); 20 bits / 20 bits / 20 bits (CIFAR-10) | MNIST (proprietary MLP, LeNet-5); CIFAR-10 (CNN used in [75])
2015 | Binary Connect [28] | stochastic binarization | 1 bit / 1 bit / Float 32^2 | MNIST, CIFAR-10, SVHN (proprietary MLP, CNN)
2016 | Lin et al. [116] | stochastic binarization; no forward pass multiplication; quantized back propagation | 1 bit / 1 bit / Float 32 | MNIST, CIFAR-10, SVHN (proprietary MLP, CNN)
2016 | Bitwise Net [94] | weight compression; noisy back propagation | 1 bit / 1 bit / 1 bit (Float 32^3) | MNIST (proprietary MLP)
2016 | XNOR-Net [143] | binary convolution; binary dot-product; scaling binary gradient | 1 bit / 1 bit / 1 bit (Float 32^4) | ImageNet (AlexNet, ResNet-18, GoogLeNet)
2016 | DoReFa-Net [211] | stochastic gradient quantization; arbitrary bit-width | 1-8 bit / 1-8 bit / 2-32 bit | SVHN (proprietary CNN); ImageNet (AlexNet)
2017 | QNN [84] | deterministic binarization; straight through estimators to avoid saturation; shift based batch normalization; shift based AdaMax | 1 bit / 1 bit / 1 bit^5 (4 bit / 4 bit / 4 bit^6 for the LSTM experiments) | MNIST (proprietary MLP); CIFAR-10, SVHN (CNN from [28]); ImageNet (AlexNet, GoogLeNet); Penn Treebank (proprietary RNN/LSTM)
2018 | Wang et al. [186] | novel floating point format; chunk based accumulation; stochastic rounding | 8 bit / 8 bit / 8 bit^7 | CIFAR-10 (proprietary CNN, ResNet); BN50 [177] (proprietary MLP); ImageNet (AlexNet, ResNet18, ResNet50)
2018 | Jacob et al. [89] | training with simulated quantization; batch-norm layer quantization | 8 bit / 8 bit / 8 bit^8 | ImageNet (ResNet, Inception v3, MobileNet); COCO (MobileNet SSD); Flickr [79] (MobileNet SSD)
2019 | WAGEUBN [204] | 8-bit integer representation; combination of direct, constant and shift quantization | 8 bit / 8 bit / 8 bit | ImageNet (ResNet18/34/50)
2020 | S2FP8 [15] | shifted and squeezed FP8 representation of tensors; tensor distribution learning | 8 bit / 8 bit / 32 bit | CIFAR-10 (ResNet20/34/50); ImageNet (ResNet18/50); English-Vietnamese (Transformer-Tiny); MovieLens (Neural Collaborative Filtering, NCF)
2020 | Wiedemann et al. [194] | stochastic gradient quantization; induced sparsity; non-subtractive dither | 8 bit / 8 bit / 32 bit | MNIST (LeNet); CIFAR-10/100 (AlexNet, ResNet18, VGG11); ImageNet (ResNet18)
2020 | Quant-Noise [43] | training using quantization noise | 8 bit / 8 bit / 8 bit | Wikitext-103, MNLI (RoBERTa); ImageNet (EfficientNet-B3)

^1 minimum quantization for the best performing model reported.
^2 all real valued vectors are reported as Float 32 by default.
^3 involves tuning a separate set of parameters with floating point precision.
^4 becomes Float 32 if gradient scaling is used.
^5 except the first layer input of 8 bits.
^6 contains results with 2 bit, 3 bit and floating point precision.
^7 additional 16 bit for accumulation.
^8 uses 7 bit precision for some Inception v3 experiments.

Apart from quantization, there are mainly two other approaches targeted at reducing memory
footprint of DNN training, namely, layer-wise training [19, 60], and trading computation for
memory [62]. One major cause for the heavy memory footprint of DNN training is end-to-end
back-propagation which requires storing all intermediate feature maps for gradient calculation.
Sequential layer-by-layer training has been proposed as an alternative [19]. While this method was
originally proposed for better DNN interpretation, it requires less memory while retaining the
generalization ability of the network. Trading computation for memory reduces memory footprint
by releasing some of the memory allocated to intermediate results, and recomputing these results
as needed. This approach is shown to tightly fit within almost any user-set memory budget while
minimizing the computational cost [62]. Apart from these two approaches, some other studies exploit specific DNN architectures and target datasets, and propose methods using novel loss functions [21, 206] and training pipelines [21] to improve the memory and computation
efficiency of DNNs.
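For illustration, the sketch below shows the trade-computation-for-memory idea using activation checkpointing (it assumes PyTorch's checkpointing utility is available and is only a simplified stand-in for the method of [62]):

```python
# Minimal sketch of trading computation for memory via activation
# checkpointing: intermediate activations inside each segment are recomputed
# during the backward pass instead of being stored.
import torch
from torch.utils.checkpoint import checkpoint_sequential

# a deep stack of blocks whose activations would normally all be stored
model = torch.nn.Sequential(*[torch.nn.Sequential(torch.nn.Linear(512, 512),
                                                  torch.nn.ReLU())
                              for _ in range(20)])
x = torch.randn(64, 512, requires_grad=True)
out = checkpoint_sequential(model, 4, x)   # keep activations at 4 segment boundaries only
out.sum().backward()                       # inner activations are recomputed here
```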
While the aforementioned studies demonstrate promising experimental results, they provide little analysis of either the generalization error or the sample complexity under a limited memory budget. A few studies target this issue by developing memory-bounded learning techniques
with some performance guarantees. For example, [57, 101, 162, 169] proposed memory-bounded
optimization routines for sparse linear regression and provided upper bounds on regret in an
online learning setting. For PCA, [4, 108, 130, 203, 209] proposed memory or computation-efficient
optimization algorithms while guaranteeing optimal sample complexity.
Note that in an environment with a cluster of CPUs or GPUs, the most popular technique for
speeding up training is parallelization, where multiple computational entities share the computation
and storage for both data and model [38]. However, in IoT applications with embedded systems,
it is not common to have similar settings as a computer cluster. Usually the devices are scattered
and operated in a distributed fashion. To enable cooperation among a group of devices, distributed
learning techniques such as consensus, federated learning and many others have been proposed
[126, 175]. However, the primary concerns of these techniques are data privacy and the lack of a central control entity in the application field, which are out of the scope of this paper.
Resource constrained Generic Optimization Routines: While the methods in the previous
subsection are designed for specific algorithms, the approaches in this section target generic
optimization routines. One of the first works in this line of research is BuckWild! [37] which
introduces low precision SGD and provides theoretical convergence proofs of the algorithm. An
implementation of the Buckwild! algorithm through a computation model called "Dataset Model
Gradients Communicate" (DMGC) is provided in [35]. As an improvement to the Buckwild! approach,
the work in [36] proposes novel ‘bit-centering’ quantization for low precision SGD and stochastic
variance reduction gradient (SVRG). An alternative approach to improve upon low-precision SGD
in [37] is also introduced in [201]. This approach introduces the SWALP algorithm, an extension of
the stochastic weight averaging (SWA) scheme for SGDs for low precision arithmetic. While most of
the above approaches mainly analyze the effect of quantization on SGD and its variants, [164] adopts
an approach of improving the SGD algorithm under sparse representation. Finally, in a more recent
work, [112] provides the first dimension-free bound for the convergence of low precision SGD
algorithms. There has been significant research on analyzing traditional first order algorithms like
SGD, SVRG etc., in parallel and distributed settings with low precision / quantized bit representation
[3, 51, 115, 124, 165, 173, 191, 193, 199]. Even though these approaches generalize to single device
learning, the main focus of this research is on reducing the communication bottleneck in parallel
and distributed settings. This area is not within the scope of the current survey. A more detailed
survey on communication-efficient distributed optimization is available in [63, 110, 114, 174].
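For intuition, the following toy sketch runs SGD while keeping the iterates on a fixed-point grid, which is the basic mechanism analyzed in the low-precision SGD line of work (an illustrative example of our own, not the algorithm of [37] or [201]; the data is synthetic):

```python
# Minimal sketch of a low-precision SGD step: the weight vector is projected
# back onto a fixed-point grid after every update.
import numpy as np

def quantize(w, step=2 ** -8):          # fixed-point grid with spacing `step`
    return np.round(w / step) * step

def low_precision_sgd(grad_fn, w0, lr=0.1, iters=100):
    w = quantize(w0)
    for _ in range(iters):
        w = quantize(w - lr * grad_fn(w))
    return w

# example: least-squares on synthetic data (X, y are made up for the demo)
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 5)), rng.normal(size=200)
w = low_precision_sgd(lambda w: 2 * X.T @ (X @ w - y) / len(y), np.zeros(5))
```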

An alternative line of work involves implementing fixed point Quadratic Programs (QP) for
solving linear Model Predictive Control (MPC). Most of these algorithms involve modifying the fast
gradient methods developed in [139] to obtain a suboptimal solution in a finite number of iterations
under resource constrained settings. In fact these algorithms extend the generic Interior point
methods, Active-set methods, Fast Gradient Methods, Alternating Direction method of multipliers
and Alternating minimization algorithm, to handle quantization error introduced due to fixed point
implementations. However, all these approaches are targeted towards linear MPC problems, and
are not the main focus of this survey. A comprehensive survey on these fixed point QPs for linear
MPCs is available in [125].

3.2.4 Data Compression. Besides complexity in models or optimization routines, the dimensionality
and volume of training data significantly dictate the algorithm design for on-device learning. One critical aspect is the sample complexity^1, i.e., the amount of training data needed by the algorithm
to achieve a desired statistical accuracy. This aspect has been widely studied by the machine
learning community and has led to newer learning settings like Semi-Supervised learning [18],
Transductive learning [18, 179], Universum learning [41, 179], Adversarial learning [59], Learning
under Privileged Information (LUPI) [179], Learning Using Statistical Invariants (LUSI) [180] and
many more. These settings allow for the design of advanced algorithms that can achieve high
test accuracies even with limited training data. A detailed coverage of these approaches is outside
the scope of this survey but readers are directed to [18, 24, 41, 54, 59, 178–180, 214] for a more
exhaustive reading.
In this paper we focus on those approaches that reduce the resource footprint imposed by the
dimensionality and the volume of training data.
For example, the storage and memory requirements of k-Nearest Neighbor (kNN) approaches are large due to the fact that all training samples need to be stored for inference. Therefore, techniques
that compress training data are usually used to reduce the memory footprint of the algorithm
[99, 187, 208]. Data compression or sparsification is also used to improve DNN training efficiency.
Some approaches exploit matrix sparsity to reduce the memory footprint to store training data.
For example, [150] transforms the input data to a lower-dimensional embedding which can be
further factorized into the product of a dictionary matrix and a block-sparse coefficient matrix.
Training using the transformed data is shown to be more efficient in memory, runtime and energy
consumption. In [111], a novel data embedding technique is used in RNN training, which can
represent the vocabulary for language modeling in a more compressed manner.
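As a simple illustration of training on compressed data, the sketch below projects the inputs to a lower-dimensional space with an off-the-shelf random projection before fitting a model (a generic example; it is not the specific embedding of [150] or [111], and the data is synthetic):

```python
# Minimal sketch of reducing the training-data footprint with a random
# projection before fitting a classifier.
import numpy as np
from sklearn.random_projection import GaussianRandomProjection
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 512)), rng.integers(0, 2, size=1000)
X_small = GaussianRandomProjection(n_components=64, random_state=0).fit_transform(X)
clf = LogisticRegression(max_iter=200).fit(X_small, y)   # 8x less data stored per sample
```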

3.2.5 New Protocols for Data Observation. For most learning algorithms, a major assumption is
that full i.i.d. data is accessible in batch or streaming fashion. However, under certain constraints,
only a subset of features of some or all samples can be observed. For example, feature extraction
can be costly due to either expensive feature computation or labor-intensive feature acquisition.
Another instance is that memory is not sufficient to store a small batch of high-dimensional data
samples. This problem leads to studies on limited attribute observation [8, 17], where both memory
and computation are limited by restricted observation of sample attributes. Under this setting, a
few studies propose new learning algorithms and analyze sample complexity for applications such
as sparse PCA [156], sparse linear regression [17, 68, 88, 136], parity learning [163], etc. While this
new learning protocol provides valuable insights into the trade-offs of sample complexity, data
observability, and memory footprint, it remains to be explored how they can be used to design
resource-efficient algorithms for a wider range of ML models.
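The sketch below illustrates the limited-attribute-observation protocol itself: for each training sample the learner inspects only k of the d attributes (a toy example; the cited algorithms select and exploit the observed attributes far more carefully):

```python
# Minimal sketch of limited attribute observation: only k of the d features
# of a sample are revealed to the learner.
import numpy as np

def observe(x, k, rng):
    idx = rng.choice(len(x), size=k, replace=False)   # which attributes we pay for
    return idx, x[idx]

rng = np.random.default_rng(0)
x = rng.normal(size=100)            # a single d = 100 dimensional sample
idx, values = observe(x, k=5, rng=rng)
```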

^1 Formal introduction provided in Section 4.2.1

3.3 Resource Efficient Inference


While inference is not the main focus of this paper, for the sake of completeness, we provide a
brief overview of resource efficient approaches that enable on-device inference. Due to the fact
that most traditional ML algorithms are not resource intensive during inference, we will only focus
on methods proposed for DNNs.
Apart from designing lightweight DNN architectures as discussed in Section 3.2.2, another popular
approach is static model compression, where, during or after training, the model is compressed in
size via techniques such as network pruning [67, 118], vector quantization [58, 67, 185], distillation
[74], hashing [20], network projection [144], binarization [29], etc. These studies demonstrate
significant reduction in model size while retaining most of the network’s predictive capacity.
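As a concrete example of one such static compression step, the sketch below applies post-training magnitude pruning to a small PyTorch model (an illustrative example; the cited works combine pruning with retraining and other techniques):

```python
# Minimal sketch of post-training magnitude pruning of a small network.
import torch
import torch.nn.utils.prune as prune

model = torch.nn.Sequential(torch.nn.Linear(784, 256), torch.nn.ReLU(),
                            torch.nn.Linear(256, 10))
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.8)  # drop 80% of weights
        prune.remove(module, "weight")                            # make sparsity permanent
```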
However, since these models will not change after compression, their complexities cannot
further adapt to dynamic on-device resources or inputs. To address this issue, some novel adaptive
techniques for faster inference on embedded devices are explored in recent works. For example, [40]
uses dynamic layer-wise partitioning and partial execution of CNN based model inference, allowing
for more robust support of dynamic sensing rates and large data volume. Another approach proposed
in [122] involves developing a low cost predictive model which dynamically selects models from a
set of pre-trained DNNs by weighing desired accuracy and inference time as metrics for embedded
devices. [106] introduced Neuro.ZERO to provide energy and intermittence aware DNN inference
and training along with adaptive high-precision fixed-point arithmetic to allow for accelerated
run-time embedded hardware performance.
There have been observations that not all inputs require the same amount of computational
power to be processed. A simple model may be sufficient to classify samples that are easy to
distinguish, while a more complex model is only needed to process difficult samples. Based on
this reasoning, [11, 82, 189] focus on building and running adaptive models that dynamically scale
computation according to inputs. The specific methods proposed in these works include early
termination, exploration of cascaded models, and selectively using/skipping parts of the network.
These adaptive approaches are complementary to the compression approach, and further resource
efficiency can be expected if these two approaches are applied altogether.
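A minimal sketch of the early-termination idea is shown below, where a cheap model answers confident inputs and defers the rest to a larger model; `small_model` and `big_model` are hypothetical callables returning class-probability vectors:

```python
# Minimal sketch of input-adaptive inference with a two-stage cascade.
import numpy as np

def cascade_predict(x, small_model, big_model, threshold=0.9):
    probs = small_model(x)                  # assumed to return class probabilities
    if probs.max() >= threshold:            # confident: stop early, save compute
        return int(np.argmax(probs))
    return int(np.argmax(big_model(x)))     # otherwise pay for the big model
```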
Apart from the ‘model-centric’ approaches as discussed in Section 3.2.3, hardware and system
optimizations have also been exploited for efficient model deployment. To explore how the choice
of computing devices can impact performance of these models on the edge we direct the reader to
[137]. The survey discusses how various low power hardware such as ASICs, FPGAs, RISC-V, and
embedded devices are used for efficient inferencing of Deep Learning models. Another interesting
survey [188] expands on the various communication and computation modes for deep learning
models where edge devices and the cloud server work in tandem. Their work explores concepts
such as integral offloading, partial offloading, vertical collaboration, and horizontal collaboration to
allow efficient edge-based inference in collaboration with the cloud. [212] further advances these
ideas from the point of view of the network latency between the cloud server and edge device,
exploring techniques such as model partitioning, edge caching, input filtering, early exit strategies
etc., to provide efficient on device inference. Lately, the embedded ML research community has also
focused on interconnected and smart home devices for processing data privately and locally with
stronger privacy restrictions and lower latencies. [81] aggregates the processing capability of embedded devices available in the home, comparing processing on mobile phones with hardware designed for efficient DL processing, such as the Coral TPU and the NVIDIA Jetson Nano.

3.4 Challenges in Resource-Efficient Algorithm Development


From the discussions in this section, we observe the following challenges for resource characteriza-
tion of existing ML algorithms and development of new resource-efficient algorithms:
(1) Hardware dependency: As observed from existing hardware platform benchmarks and
resource requirement modeling approaches, resource characterization greatly depends on
the platform, framework, and the computing library used. For example, on the Nvidia Jetson
TX1 platform, the memory footprint of a CNN model can be very different based on the
frameworks (Torch vs. Caffe) used [16, 148]. Among all resource constraints, memory footprint
is particularly hard to quantify as many frameworks and libraries deploy unique memory
allocation and optimization techniques. For example, MXNet provides multiple memory
optimization mechanisms and the memory footprint is very different under each mechanism
[138]. Consequently, given specific on-device resource budgets and performance requirements,
it is challenging to choose the optimal algorithm that fits in the resource constraints.
With all the variability and heterogeneity in implementation, it is thereby important to draw
insights and discover trends on what is invariant across platforms. As discussed in [123],
cross-platform models which account for chip variability are needed to transfer knowledge
from one type of hardware to another. Furthermore, we postulate that cross-framework
or cross-library models are also important to enable resource prediction for all types of
implementations.
(2) Metric design: While a few novel metrics are proposed for on-device ML algorithm analysis,
they either cannot be used directly for algorithm comparison, or cannot provide guarantees
that the training process will fit the resource budget. In particular, metrics devised from hardware-agnostic measures such as the number of parameters or FLOPs rarely reflect realistic algorithm performance on a specific platform accurately. Therefore, more practical,
commensurable and interpretable metrics need to be designed to determine which algorithm
can provide the best trade-off between accuracy and multiple resource constraints.
(3) Algorithm focus: Most studies on on-device learning focus on DNNs, especially CNNs,
because their layer-wise architectures carry a declarative specification of computational
requirements. In contrast, there has been very little focus on RNNs and traditional machine
learning methods. This is mainly due to their complexity or heterogeneity in structure or
optimization method, and lack of a benchmarking dataset. However, as observed in Table
4, the edge platforms used for DNN profiling are generally mobile phones or embedded
computing cards which still have gigabytes of RAM and large computational power. For IoT
or embedded devices with megabytes or even kilobytes of RAM and low computational power,
DNNs can barely fit. Therefore, traditional machine learning methods are equally important
for edge learning, and their resource requirements and performance on edge devices need
to be better profiled, estimated and understood. In addition, designing advanced learning
algorithms that require only small amounts of training data to achieve high test accuracies is
a huge challenge. Addressing this area can significantly improve the resource footprint (i.e.
low memory footprint to store the training data), while maintaining highly accurate models.
(4) Dynamic resource budget: Most studies assume that the resource available for a given
algorithm or application is static. However, the resource budget for a specific application can be dynamic on platforms such as mobile phones due to events such as starting or closing applications and application priority changes [44]. Contention can occur when the available resources are smaller than the model requirements, and resources can be wasted when they are abundant. To
solve this issue, [44] proposed an approach for multi-DNN applications. When trained offline,
for each application, five models with diverse accuracy-latency tradeoffs are generated and
their resource requirements are profiled. The models are nested with shared weights to reduce
memory footprint. During deployment, a greedy heuristic approach is used to choose the best
models for multiple applications at runtime that achieve user-defined accuracy and latency
requirements while not exceeding the total memory of the device.
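For illustration, the sketch below gives a toy version of such a greedy runtime selection step, loosely inspired by [44]; the applications, accuracies, and memory numbers are hypothetical:

```python
# Minimal sketch of runtime model selection under a shared memory budget.
def select_models(candidates, budget_mb):
    # candidates: {app: [(accuracy, memory_mb), ...]}
    chosen = {app: min(models, key=lambda m: m[1]) for app, models in candidates.items()}
    used = sum(m[1] for m in chosen.values())
    improved = True
    while improved:
        improved = False
        for app, models in candidates.items():
            for acc, mem in sorted(models, reverse=True):   # try best models first
                delta = mem - chosen[app][1]
                if acc > chosen[app][0] and used + delta <= budget_mb:
                    used += delta
                    chosen[app] = (acc, mem)
                    improved = True
                    break
    return chosen

apps = {"detector": [(0.72, 40), (0.78, 90), (0.81, 160)],
        "keyword":  [(0.90, 10), (0.93, 25)]}
print(select_models(apps, budget_mb=150))   # -> detector: (0.78, 90), keyword: (0.93, 25)
```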

4 THEORETICAL CONSIDERATIONS FOR ON-DEVICE LEARNING

Table 8. Broad Categorization of resource constrained algorithms in section 3 with respect to their underlying
learning theories.

Theory | Algorithms
Traditional machine learning theory | Low-footprint ML algorithms (section 3.2.1); Reducing model complexity (section 3.2.2); Modifying optimization routines (section 3.2.3); Data compression (section 3.2.4)
Novel resource constrained machine learning theory | New protocols for data observation (section 3.2.5)

While Section 3 categorizes the approaches used for building resource-constrained machine
learning models, this section focuses on the computational learning theory that is used to design
and analyze these algorithms. Nearly every approach in the previous section is based on one of the
following underlying theories,
• Traditional machine learning theory: Most of the existing approaches used to build the
resource efficient machine learning models for inference (in section 3.3) follow what is
canonically referred to as traditional machine learning theory. In addition, approaches like,
low-footprint machine learning (section 3.2.1), reducing model complexity (section 3.2.2), data
compression (section 3.2.4), modifying optimization routines (section 3.2.3) used for resource
efficient training, also follow this traditional learning theory. For example, low-footprint
machine learning involves using algorithms with inherently low resource footprints designed
under traditional learning theory. Approaches like reducing model complexity and data
compression, incorporates additional resource constraints for algorithms designed under
traditional learning theory. Finally, the approach of modifying optimization routines simply
modifies the optimization algorithms used to solve the machine learning problem designed
under such theories.
• Novel resource constrained machine learning theory: These approaches highlight the
gaps in traditional learning theory and propose newer notions of learnability with resource
constraint settings. Most of the algorithms in new protocols for data observation (section 3.2.5)
fall in this category. Typically, such approaches modify the traditional assumption of i.i.d
data being presented in a batch or streaming fashion and introduce a specific protocol of
data observability. Additional algorithmic details are available in 3.2.5.
In this section we provide a detailed survey of the advancements in the above categories. Note
that there is an abundance of literature addressing traditional learning theories [131, 154, 179]. For
completeness, however, we still include a very brief summary of such traditional theories in section 4.2.1. This summary also acclimatizes the reader to the notation used throughout this section.
Section 4.1 formalizes the learning problem for both traditional as well as resource constrained
settings. Next, section 4.2.1 provides a brief survey of some of the popular traditional learning
theories. The advanced learning theories developed for resource-constrained settings are provided
in section 4.2.2. Finally, we conclude by discussing the existing gaps and challenges in section 4.3.

4.1 Formalization of the Learning problem


Before delving into the details of the learning theories, we first formalize the learning problem
for traditional machine learning algorithms. We then extend this formalization to novel machine
learning algorithms under resource constraint settings in section 4.1.2. For simplicity we focus
on supervised problems under inductive settings [24, 179]. Extensions to other advanced learning
settings like transductive, semi-supervised, universum etc., can be found in [131, 154, 179].
4.1.1 Traditional Machine Learning Problem. The supervised learning problem assumes an underlying label generating process y = g(x), where y ∈ Y and x ∈ X (data domain) ⊆ R^d, characterized by a data generating distribution D. Under the inductive learning setting, the machine learning problem can be formalized as,

Definition 1. Inductive Learning
Given. independent and identically distributed (i.i.d) training samples S = (x_i, y_i)_{i=1}^n from the underlying data generating distribution, S ∼ D^n, and a predefined loss function l : Y × Y → O (output domain).
Task. estimate a function/hypothesis ĥ : X → Y from a set of hypotheses H (a.k.a hypothesis class) that best approximates the underlying data generating process y = g(x), i.e. which minimizes

    E_D( l(y, ĥ(x)) ) = R(ĥ)    (6)

Here, E_D(·) = expectation operator under the distribution D, 1(·) = indicator function, and R(·) = true risk of a hypothesis.
Some popular examples following this problem setting are,
• Binary classification problems using the 0/1 loss, where Y = {−1, +1}, O = {0, 1} and l(y, ĥ(x)) = 1_{y ≠ ĥ(x)}.
• Regression problems with additive Gaussian noise (i.e. least-squares loss), where Y = R, O = R and l(y, ĥ(x)) = (y − ĥ(x))².
Typically users incorporate a priori information about the domain while constructing the hypothesis class. In the simplest sense, the hypothesis class gets defined through the methodologies used to solve the above problems. For example, solving the regression problem using,
• least-squares linear regression: the hypothesis class includes all linear models, i.e. H = {h : h(x) = w^T x + b; w ∈ R^d, b ∈ R}.
• lasso (L1-regularized) linear regression: the hypothesis class includes all linear models of the form H = {h : h(x) = w^T x + b; w ∈ R^d, b ∈ R, ||w||_1 ≤ B}.
4.1.2 Resource Constrained Machine Learning Problem. In addition to the goals highlighted in the
above section, building resource constrained machine learning algorithms introduces additional
constraints to the learning problem. The goal now is not just to minimize the true risk, but also
ensure that the resource constraints discussed in section 2 are met. Typically, these resource
constraints are imposed during inference or training phase of a machine learning pipeline. A
mathematical formalization of this problem can therefore be given as,
Definition 2. Resource Constrained Machine Learning
Given. i.i.d training samples S = (x_i, y_i)_{i=1}^n from the underlying data generating distribution, S ∼ D^n, a predefined loss function l : Y × Y → O, and a predefined resource constraint C(·).

Inference. estimate a function ĥ : X → Y that best approximates the underlying data generating process
i.e. eq. (6) and simultaneously satisfies C(ĥ).
Training. estimate a function ĥ : X → Y that best approximates the underlying data generating process
i.e. eq. (6) and simultaneously satisfies C(A(S)). Here, A : S → H is an algorithm (model
building process) that inputs the training data and outputs a hypothesis ĥ ∈ H .
As seen above the main difference of this setting with respect to the traditional inductive learning
setting is the additional constraint C(·) on the final model (during inference) or the Algorithm (for
training). As an example, consider the least squares linear regression example discussed above.
Typical resource constraints for such a problem may look like,
• A memory constraint during inference, enforced by requiring the model parameters (w, b) of the estimated model y = w^T x + b to be represented using float32 precision (∼ 32(d+1) bit memory footprint) or int16 precision (∼ 16(d+1) bit memory footprint), with x ∈ R^d. This is equivalent to a constraint on the final model, C_float32(ŵ, b̂) = {ŵ ∈ {−3.4E+38, . . . , 3.4E+38}^d; b̂ ∈ {−3.4E+38, . . . , 3.4E+38}} or C_int16(ŵ, b̂) = {ŵ ∈ {−32768, . . . , 32767}^d; b̂ ∈ {−32768, . . . , 32767}}.
• A computation constraint during inference can be enforced by requiring ŵ to be k-sparse with k < d. This would ensure k + 1 FLOPS per sample and equivalently constrains the final model as C(ŵ, b̂) = {ŵ ∈ R^d : ||ŵ||_0 ≤ k}.
Similar memory/computation constraints can be imposed onto the learning algorithm A(S) during training.
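A small numeric illustration of these constraints for a linear model with d = 100 features is given below (the numbers and the hard-thresholding step are our own example, not part of the formal definition):

```python
# Numeric illustration of the memory and computation constraints above for a
# linear model y = w^T x + b with d = 100 features.
import numpy as np

d = 100
mem_float32_bits = 32 * (d + 1)     # 3232 bits if every parameter is float32
mem_int16_bits   = 16 * (d + 1)     # 1616 bits under an int16 representation

def hard_threshold(w, k):           # enforce the k-sparsity (computation) constraint
    w_sparse = np.zeros_like(w)
    top = np.argsort(np.abs(w))[-k:]
    w_sparse[top] = w[top]
    return w_sparse                 # inference now costs about k + 1 multiply-adds per sample

w = np.random.default_rng(0).normal(size=d)
w_k = hard_threshold(w, k=10)
```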

4.2 Learning theories


In this section we discuss the learning theories developed for the problems discussed above.
4.2.1 Traditional Learning Theories. These theories target the traditional learning problem discussed in section 4.1. Most traditional learning theories provide probabilistic guarantees on the goodness of a model (in fact, guarantees for all models in the hypothesis class) with respect to the metric in eq. (6). Such theories typically decompose eq. (6) into two main components,

    R(h) − R* = ( min_{h∈H} R(h) − R* ) + ( R(h) − min_{h∈H} R(h) )    (7)

where the first term is the approximation error and the second term is the estimation error. Here, the Bayes error R* = min_{all measurable h} R(h), and min_{h∈H} R(h) is the smallest in-class error.
The approximation error in eq. (7), typically depends on the choice of the hypothesis class and
the problem domain. The value of this error is problem dependent. Traditional learning theories
then characterize the estimation error. One of the fundamental theoretical frameworks used for
such analyses is the PAC-learning framework.
Definition 3. (Efficient) Agnostic PAC-learning [131]
A hypothesis class H is agnostic PAC-learnable using an algorithm A(S), if for a given ϵ, δ ∈ (0, 1) and for any distribution D over X × Y (with X ⊆ R^d) there exists a polynomial function p_1(1/ϵ, 1/δ, d, |H|) such that for any sample size n ≥ p_1(1/ϵ, 1/δ, d, |H|) the following holds,

    P_D( R(h_S) − min_{h∈H} R(h) ≤ ϵ ) ≥ 1 − δ,  where h_S = A(S) ∈ H    (8)

Further, it is efficiently PAC-learnable if the computation complexity is O(p_2(1/ϵ, 1/δ, d, |H|)) for some polynomial function p_2.
As seen from Definition 3, there are two aspects of PAC-learning.
Sample Complexity characterizing Generalization Bounds: Although Definition 3 provides the learnability framework for a wide range of machine learning problems (regression, multi-class classification, recommendation, etc.), for simplicity in this section we mainly focus on the sample complexity (and equivalently the generalization bounds) for binary classification problems. Extensions of these theories to other learning problems can be found in [131, 154].
Definition 3 guarantees that a model estimated using the algorithm A(S) with sample size n ≥ p_1(1/ϵ, 1/δ, d, |H|) makes predictions within an error tolerance of ϵ with probability at least 1 − δ. One popular class of algorithms is the Empirical Risk Minimization (ERM) based algorithms, which return the hypothesis that minimizes an empirical estimate of the risk function in (6), given by,
    ERM estimate: h_S = ERM(S) = argmin_{h∈H} R_S(h)    (9)

    where the empirical risk R_S(h) = (1/n) Σ_{i=1}^n 1_{y_i ≠ h(x_i)}.
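For concreteness, the sketch below carries out ERM over a small finite hypothesis class of one-dimensional threshold classifiers on synthetic data (purely illustrative):

```python
# Minimal sketch of ERM over a finite hypothesis class of threshold
# classifiers h_t(x) = sign(x - t).
import numpy as np

def erm(S_x, S_y, thresholds):
    def emp_risk(t):
        preds = np.sign(S_x - t)
        return np.mean(preds != S_y)            # empirical 0/1 risk
    return min(thresholds, key=emp_risk)        # hypothesis with smallest empirical risk

rng = np.random.default_rng(0)
S_x = rng.uniform(-1, 1, size=200)
S_y = np.sign(S_x - 0.3)                        # true threshold at 0.3
t_hat = erm(S_x, S_y, thresholds=np.linspace(-1, 1, 41))
```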
A very interesting property of the ERM based algorithms is the uniform convergence property
(see [154]). This property dictates that for the ERM algorithm to return a good model from within
H, we need to bound the term |R_S(h) − R(h)|, ∀h ∈ H. In fact, most popular learning theories characterize the number of samples (a.k.a sample complexity) needed for bounding this term |R_S(h) − R(h)|, ∀h ∈ H, for any given ϵ, δ ∈ (0, 1). The canonical form adopted in most such learning
theories is provided next,
Definition 4. Canonical Forms

Generalization Bound: For a given hypothesis class H, training set S ∼ D^n, and ϵ, δ ∈ (0, 1), a typical generalization bound adopts the following form: with probability at least 1 − δ we have,

    ∀h ∈ H:  R(h) ≤ R_S(h) + O(p_1(1/δ, n, |H|))    (10)

where the O(p_1(1/δ, n, |H|)) term is referred to as the confidence term. A direct implication of (10) and the uniform convergence property is the sample complexity provided next.

Sample Complexity: A hypothesis class H is PAC learnable using ERM based algorithms with a training set S of sample complexity n ≥ p_2(1/ϵ, 1/δ, |H|).
The exact forms of the confidence term and the sample complexity term depend on the way the complexity of the hypothesis class is captured. A brief survey of some of the popular theories
used to capture the complexity of the hypothesis class is provided in Table 9.
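As a small numeric illustration of the finite-hypothesis (agnostic) forms listed in Table 9, the following snippet evaluates the confidence term and sample complexity for arbitrary illustrative values of |H|, δ, ϵ and n:

```python
# Numeric illustration of the finite-hypothesis agnostic PAC forms of Table 9
# (the constants |H|, delta, eps, n below are arbitrary example values).
import math

H_size, delta, eps, n = 1000, 0.05, 0.05, 5000
confidence_term = math.sqrt((math.log(H_size) + math.log(2 / delta)) / (2 * n))
sample_complexity = (math.log(H_size) + math.log(2 / delta)) / eps ** 2
print(confidence_term, math.ceil(sample_complexity))
```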
Note that, Table 9 provides a very brief highlight of some of the popular traditional learning
theories. For a more detailed coverage of advanced learning theories please see [131, 154]. In
addition, there are a few alternative theories which use different learning mechanisms, e.g.,
mistake bounds [131], learning by distance [9], statistical query learning [93]. Most such settings
are adapted for very specific learning tasks and have some connections to the PAC learning theory.
Interested readers are directed to the above references for further details.

Table 9. Popular Traditional Learning Theories. For each theory we list the confidence term p_1(1/δ, n, |H|) appearing in the generalization bound (10), the corresponding sample complexity p_2(1/ϵ, 1/δ, |H|) where available, and additional remarks.

Finite hypothesis class (|H| < ∞):
• Realizable case (min_{h∈H} R(h) = 0), PAC bound [176]. Confidence term: (log|H| + log(1/δ)) / n. Sample complexity: (1/ϵ)(log|H| + log(1/δ)).
• Non-realizable case (min_{h∈H} R(h) ≠ 0), Agnostic PAC [176]. Confidence term: sqrt( (log|H| + log(2/δ)) / (2n) ). Sample complexity: (log|H| + log(2/δ)) / ϵ².

Countably infinite hypothesis class (H = ∪_{n∈N} h_n):
• MDL [154]. Confidence term: sqrt( (|d(h)| + log(2/δ)) / (2n) ). Sample complexity: (log|d(h)| + log(2/δ)) / ϵ². Here d(h) = minimum description length of the hypothesis.

Infinite hypothesis class:
• VC (dimension) theory [181]. Confidence term: 2 sqrt( (log G_H(n) + log(2/δ)) / n ), or 2 sqrt( (d log(2en/d) + log(2/δ)) / n ) using Sauer's lemma. Sample complexity: (log G_H(n) + log(2/δ)) / ϵ². Here the growth function G_H(n) = sup_{z_1,...,z_n ∼ D^n} |H_{z_1...z_n}|, the VC dimension d = max{n ∈ N : G_H(n) = 2^n}, and z_i = (x_i, y_i).
• VC (entropy) theory [181]. Confidence term: 2 sqrt( (log E_D[N(H, 2n)] + log(2/δ)) / n ). Sample complexity: (log E_D[N(H, 2n)] + log(2/δ)) / ϵ². Here the size of the function class N(H, n) = |H_{z_1...z_n}|.
• Covering number [196]. Confidence term: sqrt( (log(4Γ_1(2n, ε/8, F)) + log(1/δ)) / (2n) ). Sample complexity: (log(4Γ_1(2n, ε/8, F)) + log(1/δ)) / ϵ². Here the covering number Γ_p(n, ε, H) = max{ N_in(ε, H, ||·||_{p,x}) : x ∈ X }, where N_in(ε, H, ||·||_{p,x}) is the smallest cardinality of an internal ε-cover of H and ||g||_{p,z} = ( (1/n) Σ_{i=1}^n |g(z_i)|^p )^{1/p}.
• Rademacher complexity [6]. Confidence term: 2R_n(H) + sqrt( log(1/δ) / (2n) ), or 2R_S(H) + 3 sqrt( log(2/δ) / (2n) ). Here the empirical Rademacher complexity R_S(H) = E_σ[ sup_{h∈H} (1/n) Σ_{i=1}^n σ_i h(x_i) ] and the Rademacher complexity R_n(H) = E_{S∼D^n}[ R_S(H) ]. For equivalence with Gaussian complexity, maximum discrepancy, etc., see [6].
• Algorithm stability [131]. Confidence term: β + (2mβ + M) sqrt( log(1/δ) / (2m) ). Applies to β-stable algorithms h_S = A(S); A is uniformly β-stable if ∀(x, y) ∈ X × {−1, +1}, |l(y, h_S(x)) − l(y, h_{S′}(x))| ≤ β, with S, S′ ∼ D^n.
• Compression bounds [154]. Confidence term: R_{S′}(A(S)) sqrt( 4k log(m/δ) / m ) + 8k log(m/δ) / m. Here S′ = S_I, where there exists an index set I and a compression scheme B : S_I → H such that A(S) = B(S_I) with |S| > 2|S_I|; the bound is on the algorithm A's output in terms of its error on the hold-out set S′.
Computation complexity: Another aspect of the PAC learning theory is the computational com-
plexity of the algorithm. Note that, there can be several algorithms to obtain the same solution. For
example, to sort an array of numbers both merge sort and binary sort will provide the sorted output. Although the outcome (solution) of the algorithms is the same, the worst-case computation complexity of each algorithm is different, i.e. binary sort O(n²) and merge sort O(n log n). PAC theory captures the efficiency of an algorithm in terms of its computation complexity. However, the computation time of any algorithm is machine dependent. To decouple such dependencies, most computation complexity analyses assume an underlying abstract machine, like a Turing machine over the reals, and provide a comparative machine-independent computational analysis in big-O notation. For example, an O(n log n) implementation of the merge sort algorithm would mean the actual runtime in seconds on any machine would follow (see [154]):
"there exist constants c and n_0, which can depend on the actual machine, such that, for any value of n > n_0, the runtime in seconds of sorting any n items will be at most c·n·log(n)".
A brief coverage of the computation times for several popular algorithms in big-O notation is
provided in Table 2. For a more in-depth discussion on several computational models readers are
directed to [93].
As seen in this section, most popular traditional learning theories mainly target the sample complexity, and in turn the generalization capability, of an algorithm learning a hypothesis class, followed by the computation complexity of the algorithm needed to learn such a hypothesis class. From a resource constrained perspective, although the computational (processing power) aspect of an algorithm is (asymptotically) handled in such theories, the interplay between computation and sample complexity is disjunctive. Even so, most of the existing methodologies adopted for resource constrained machine learning (discussed in sections 3.3, 3.2.1, 3.2.2, 3.2.4, and 3.2.3) follow this learning paradigm.

4.2.2 Resource-Constrained Learning Theories. Although the PAC learning theory is the dominant
theory behind majority of the ML algorithms, the notion of sample complexity and resource
complexity (computation, memory, communication bandwidth, power etc.) is disjunctive in such
theories. In fact, the works in [39, 151] raise some critical questions about the adequacy of the
PAC learning framework for designing ML algorithms with limited computation complexity. These
questions have led to a substantial amount of work modifying the PAC learning paradigm to
provide an explicit tradeoff between sample and computation complexity. Good surveys of sample-computation complexity theory can be found in [32, 33, 39, 45, 120, 153, 155, 159]; this area is not the main focus of this paper. Rather, in this category we target the more exclusive class of literature which additionally captures the effect of space complexity on the sample complexity for learnability. There is limited research that provides an explicit characterization of such an interplay for any generic algorithm. Most such works modify the traditional assumption of i.i.d data being presented in a batch or streaming fashion and introduce a specific protocol of data observability. Such theories limit the memory/space footprint through this restricted data observability. These advanced theories provide
a platform to utilize existing computationally efficient algorithms under memory constrained
settings to build machine learning models with strong error guarantees. We discuss a few examples
of such work next.
A seminal work in this line can be found in [8], where the authors introduce a new protocol
for data observation called Restricted focus of attention (RFA). This modified protocol is formalized
through a projection (or focusing) function and limits the algorithm’s memory (space) and compu-
tation footprint through selective observation of the available samples’ attributes/features. The
authors introduce a new notion of k-RFA or (k-weak-RFA) learnability, where k is the number of
observed bits (or features) per sample, and provide a framework to analyze the sample complexity
of any learning algorithm under such memory constrained observation. The interplay between
space and sample complexity of any algorithm designed in this RFA framework is provided in Table
10. Several follow-up works target a class of problems like l1/l2 linear regression, support vector regression etc., in a similar RFA framework and modify the projection function to obtain lower sample complexity [12, 68, 87, 88]. They also provide computationally efficient algorithms for these
specific methods.
Another approach to characterize the sample vs. space complexity is provided in [163]. Here
the authors introduce a memory efficient streaming data observation protocol and utilizes the
statistical query (SQ) based learning paradigm (originally introduced in [93]) to characterize the
sample vs. space complexity for learnability of finite hypothesis classes. Allowing the SQ algorithm
for improper learning, the authors justify applicability to infinite function classes through an ϵ -
cover under some metric. However, such extensions have not been provided in the paper. One major
advantage of this approach is that it opens up the gamut of machine learning algorithms developed
in the SQ paradigm for resource constrained settings [46, 47]. As an example, the authors illustrate
the applicability of their theory for designing resource efficient k-sparse linear regression algorithm
following [46, 47]. Additional details on the space/sample complexity under this framework is
provided in Table 10.
More recently [133, 134] introduced a graph based approach to model the version space of a
hypothesis class in the form of a hypothesis graph. The authors introduced a notion of d-mixing of
the hypothesis graph as a measure of (un)-learnability in bounded memory settings. This notion
of mixing was formalized as a complexity measure in [133] and further utilized to show the un-
learnability property of most generic neural network architectures. In another line of work a
similar (yet complementary) notion of separability was introduced in [135]. Rather than showing negative (unlearnability) results under bounded memory settings, this framework was used to characterize lower bounds on the sample vs. space complexity for learnability of a hypothesis class using an (n, b, δ, ϵ)-bounded memory algorithm. Here, the data generating distribution is assumed to be uniform over the domain, n = number of training samples, b = bits of memory, and δ, ϵ = confidence and tolerance values for PAC learnability. However, the framework supports a very limited set of machine learning algorithms, and its applicability for modifying popular machine learning algorithms has not yet been shown.
Finally, [145] proposes a new protocol for data access called a branching program and represents the learning algorithm under resource (memory) constraints as a matrix (as opposed to
a graph in [133]). The authors build a connection between the stability of the matrix norm (in the
form of an upper bound on its maximum singular value) and the learnability of the hypothesis class
with limited memory. This work provides the interplay between the memory-size and minimum
number of samples required for exact learning of the concept class. [7] extended this work for
the class of problems where the sample space of tests is smaller than the sample space of inputs.
[53] further improves upon the bounds proposed in [7, 145]. However, most existing work in this
framework focuses on analyzing the learnability of a problem, rather than providing an algorithm
guaranteeing learnability for resource constrained settings.
In addition to the above theories there are a few research works that target a niche yet interesting
class of machine learning problems like hypothesis testing [42, 71, 72, 100, 107] or function estima-
tion [5, 73, 132]. However, their extension to any generic loss function has not been provided and is
non-trivial. Hence, we do not delve into the details of such settings. However, one specific work on estimating biased coordinates [156] deserves special consideration, mainly because it covers a
set of very interesting problems like principal component analysis, singular value decomposition,
correlation analysis etc. This work adopts an information theory centric general framework to
handle most memory constrained estimation problems. The authors introduce a generic (b, n_t, T)
protocol of data observability for most iterative mini-batch based algorithms. Here, b = space
complexity of intermediate results, nt = size of a mini-batch (of i.i.d samples with dimension d) at
iteration t ∈ {1, . . . , T}, where T = number of epochs of the iterative algorithm. In a loose sense, the overall space complexity including the data and intermediate results becomes O(T(b + n_t d)). Under such a (b, n_t, T) protocol the authors cast most estimation problems like sparse PCA, covariance estimation, correlation analysis etc., into a ‘hide-and-seek’ problem. Using this framework the authors
explore the limitations in memory constrained settings and provide the sample complexity for
good estimation with high probability (see Table 10). However, although this provides a framework
to analyze specific estimation problems, there are no guidelines on how to cast/modify existing
algorithms to optimize the space vs. sample complexities within this framework.

4.3 Challenges in Resource-Efficient Theoretical Research


As presented in the above section there are some major challenges underlying the resource efficient
theories discussed in section 4.2.2.
(1) Most of the new theories are mainly developed to analyze the un-learnability of a hypothesis class H under resource constraints. Adapting such frameworks to develop resource efficient algorithms that guarantee the learnability of H is needed. A few theoretical frameworks like [8, 163] do provide guidelines towards developing resource efficient algorithms; however, their underlying assumptions rule out a wide range of hypothesis classes. Showing the practicality of such assumptions for developing a wide range of machine learning algorithms remains a huge challenge for these existing frameworks.
(2) In addition, as shown in Table 10 most of the existing theories deal with a specific aspect of
the space-complexity. For example, [8] mainly considers the per-sample space complexity
while [163] considers the space complexity of the intermediate state representation. A more
comprehensive analysis of the overall space/computation complexity of the algorithm A(S)
needs to be developed.
(3) A general limitation of most of the theoretical frameworks is the limited empirical analysis
of the algorithms designed using these frameworks. For example, most of the analyses are
asymptotic in nature. How such mathematical expressions translate into practice is still an
open problem.
(4) Most of the existing theories introduce error guarantees in terms of the hypothesis class. Selecting the optimal model through hyperparameter optimization and model selection routines in a resource constrained setting, and guaranteeing the correctness of the selected model, is missing.
(5) Finally, a comparison between the frameworks introduced in Table 10 is missing.

5 DISCUSSION
The previous sections provide a comprehensive look at the current state of on-device learning
from an algorithm and theory perspective. In this section we provide a brief summary of these
findings and elaborate on the research and development challenges facing the adoption of an edge
learning paradigm. We also highlight the effort needed for on-device learning using a few typical
edge-learning use cases and explain how research in the different areas (algorithms and theory)
has certain advantages and disadvantages when it comes to their usability.

5.1 Summary of the Current State-of-the-art in On-device Learning


We begin by briefly summarizing our findings from Sections 3 and 4. If you have read those sections,
you can skip this summary and move directly to subsection 5.2.

Table 10. Resource Constrained Learning Theories. For each theory we list the sample complexity, the space complexity, and additional remarks.

• k-RFA ([8], Corollary 4.3). Sample complexity: max{ (4r/ϵ) log(2r/δ), (8r² VCdim(H)/ϵ) log(13r/ϵ) }. Space complexity: O(k) per sample. Remarks: H = Ψ(F_i)_{i=1}^r, where the composition function Ψ(F_i)_{i=1}^r = {ψ(f_1, . . . , f_r) | f_i ∈ F} defines an ensemble of functions; F is PAC learnable over the domain X^k; k < d is a pre-selected subset of features; r is the (user-defined) number of functions used in the ensemble function Ψ. The result captures the per-sample space complexity and not the overall algorithm space complexity.

• Statistical query ([163], Theorem 7). Sample complexity: O( (⌈m_0/k⌉ log|H| / τ²) (log log|H| + log m_0 + log(1/δ)) ). Space complexity: O( log|H| (log m_0 + log log(1/τ)) + k log(1/τ) ) per state variable. Remarks: assumes H is (ϵ, 0)-learnable with m_0 statistical queries of tolerance τ; k is a user-defined trade-off parameter with inverse dependence on the sample complexity and direct dependence on the space complexity; |H| is the size of the finite hypothesis class of probability distributions. Following this framework provides a k-sparse linear regression implementation with sample complexity Õ( k⁸ log(1/δ) / ϵ⁴ ) and space complexity O( k log²(d/ϵ) ) bits, where Õ(·) is big-O that additionally hides the log terms.

• H-graph separability ([135], Theorem 7). Sample complexity: (k/α) log(|H|²/α). Space complexity: log(|H|²/α) + log(k/α) bits. Remarks: assumes the H-graph is (α, ϵ)-separable (see [163] for the definition); H is PAC learnable for given (ϵ, δ = e^{−kα²/8} + e^{−2kϵ²}); limited to uniform distributions and targets sample complexity lower bounds.

• Hide-n-Seek ([156], Theorem 3). Sample complexity: Ω( max{ d/(ρb), 1/ρ² } ). Space complexity: O( T(b + n_t d) ). Remarks: applicable only to estimation problems like PCA, SVD, CCA, etc., that fall within the hide-n-seek framework; b = space complexity of intermediate results; n_t = size of a mini-batch (of i.i.d samples with dimension d) at iteration t ∈ {1, . . . , T}; T = number of epochs of the iterative algorithm.
Algorithm Research (details in section 3): Algorithm research mainly targets the computational
aspects of model building under limited resource settings. The main goal is to design optimized
machine learning algorithms which best satisfy a surrogate software-centric resource constraint.
Such surrogate measures are designed to approximate the hardware constraints through asymp-
totic analysis (subsubsection 3.1.1), resource profiling (subsubsection 3.1.2) or resource modeling
(subsubsection 3.1.3). For a given software-centric resource constraint, the state-of-art algorithm
designs adopt one of the following approaches:

(1) Lightweight ML Algorithms (see subsubsection 3.2.1): This approach utilizes already available algorithms with low resource footprints, with no additional modifications for resource constrained model building. As such, when the device's available resources are smaller than the resource footprint of the selected lightweight algorithm, this approach will fail. Additionally, in most cases the lightweight ML algorithms result in models with low complexity that may fail to fully capture the underlying process.
(2) Reducing Model Complexity (see subsubsection 3.2.2): This approach controls the size (memory footprint) and computational complexity of the machine learning algorithm by adding constraints to the model architecture (e.g., by selecting a smaller hypothesis class). For a constrained model architecture pre-specified from the available resource constraints, this approach adopts traditional optimization routines. Apart from model building, this is one of the dominant approaches for deploying resource efficient models for model inference. Compared to the lightweight ML algorithms approach, model complexity reduction techniques can accommodate a broader class of ML algorithms and can more effectively capture the underlying process. A minimal sketch of one such technique, post-training weight quantization, is given after this list.
(3) Modifying optimization routines (see subsubsection 3.2.3): This approach designs specific optimization routines for resource efficient model building, where the resource constraints are incorporated during the model building (training) phase. Note that, as opposed to the previous technique of limiting the model architecture beforehand, this approach can adapt the optimization routines to fit the resource constraints for any given model architecture (hypothesis class). In certain cases, this approach can also dynamically modify the architecture to fit the resource constraints. Although this approach provides a wider choice of the class of models, the design process is still tied to a specific problem type (classification, regression, etc.) and the adopted method/loss function (e.g., linear regression or ridge regression for regression problems).
(4) Data Compression (see subsubsection 3.2.4): Rather than constraining the model size/complexity, this approach targets building models on compressed data. The goal is to limit memory usage via reduced data storage, and to limit computation through a fixed per-sample computation cost. In addition, a more generic approach adopts advanced learning settings that accommodate algorithms with smaller sample complexity. However, this is a broader research topic and is not limited to on-device learning; a detailed analysis of these approaches has been delegated to other existing surveys.
(5) New protocols for data observation (see subsubsection 3.2.5): This approach completely changes the traditional data observation protocol (such as the availability of i.i.d. data in batch or online settings) and builds resource efficient models under limited data observation. These approaches are guided by an underlying resource constrained learning theory (discussed in subsubsection 4.2.2) which captures the interplay between resource constraints and the goodness of the model in terms of its generalization capacity. Additionally, compared to the above approaches, this framework provides a generic mechanism to design resource constrained algorithms for a wider range of learning problems, applicable to any method/loss function targeting that problem type.
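To make the discussion above concrete, the following minimal sketch illustrates one model-complexity/size reduction technique of the kind surveyed in subsubsection 3.2.2: symmetric 8-bit post-training quantization of a layer's weights. The layer shape and the per-tensor (rather than per-channel) scaling are illustrative assumptions, not details of any specific surveyed method.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization of float32 weights to int8.

    Returns the int8 tensor and the scale needed to dequantize it. Storage
    drops from 4 bytes to 1 byte per weight, at the cost of a bounded
    rounding error controlled by the scale.
    """
    max_abs = np.abs(w).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float32 tensor for use at inference time."""
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=(256, 128)).astype(np.float32)  # toy layer
    q, s = quantize_int8(w)
    w_hat = dequantize(q, s)
    print("bytes fp32:", w.nbytes, "bytes int8:", q.nbytes)
    print("max abs error:", np.abs(w - w_hat).max(), "<= scale/2 =", s / 2)
```

The same storage-versus-precision trade-off underlies the quantization and pruning approaches discussed in Section 3; mainstream frameworks expose analogous post-training utilities.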
One of the major challenges in this research is the proper software-centric characterization of the hardware constraints and the appropriate use of this characterization for better metric design. Other important challenges include applicability to a wider range of algorithms and dynamic resource budgeting. A more detailed discussion is available in subsection 3.4.
Theory Research (details in section 4): Research into the theory underlying on-device learning focuses mainly on developing frameworks to analyze the statistical aspects (i.e., error guarantees) of a designed algorithm with or without associated resource constraints. Most of the existing resource constrained algorithms fall into two broad categories:
(1) Traditional Learning Theories (see subsubsection 4.2.1): Most of the existing resource constrained algorithm designs (summarized in 1−4 above) follow traditional machine learning theory. A limitation of this approach is that such theories are built mainly for analyzing the error guarantees of the algorithm used for model estimation; the effect of resource constraints on the generalization capability of the algorithm is not directly addressed through such theories. For example, algorithms developed by reducing the model complexity typically adopt a two-step approach. First, the size of the hypothesis class is constrained a priori to incorporate the resource constraints. Next, an algorithm is designed guaranteeing the best-in-class model within that hypothesis class. The direct interplay between the error guarantees and resource constraints is missing in such theoretical frameworks (for contrast, a standard resource-agnostic bound is recalled after this list).
(2) Resource constrained learning theories (see subsubsection 4.2.2): Modern research has shown that it may be impossible to learn a hypothesis class under resource constrained settings. To circumvent such inconsistencies in traditional learning theories, newer resource constrained learning theories have been developed. Such theories provide learning guarantees in light of the resource constraints imposed by the device. The algorithms designed using the approach summarized in point 5 above follow these newer learning theories. Although such theory motivated design provides a generic framework through which algorithms can be designed for a wide range of learning problems and any loss function addressing the problem type, to date very few algorithms based on these theories have been developed.
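For contrast with the resource-aware bounds in Table 10, it is worth recalling the standard sample complexity bound for a finite hypothesis class in the realizable PAC setting (see, e.g., [154]); neither memory nor compute appears anywhere in the statement, which is precisely the gap the theories in Table 10 address:

```latex
% Realizable PAC bound for a finite hypothesis class H (see, e.g., [154]).
% With probability at least 1-\delta, empirical risk minimization over H
% returns a hypothesis with true error at most \epsilon whenever the number
% of i.i.d. training samples satisfies
m \;\geq\; \frac{1}{\epsilon}\left(\ln|\mathcal{H}| + \ln\frac{1}{\delta}\right).
% No term accounts for the memory or compute available to the learner.
```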
Overall, the newer resource constrained theory research provides a generic framework for designing algorithms with error guarantees under resource constrained settings that apply to a broader range of problems. However, very few algorithms have so far been developed which utilize these frameworks, and developing additional algorithms within such advanced frameworks needs significant effort. Also, the application of such theories to a complete ML pipeline, including hyperparameter optimization, data preprocessing, etc., has not yet been addressed.

5.2 Research & Development Challenges


A study of the current state-of-the-art in on-device learning also provides an understanding of
the existing challenges in the field that prevent its adoption as a real alternative to cloud-based
solutions. In this section we identify a few directions for research that will allow us to build learning
algorithms that run on the edge. As before, we limit our discussion to the algorithm and theory
levels.
Algorithm: The challenges facing the research & development of resource constrained optimized algorithms fall into three areas:
(1) Software-centric Resource Constraints: A correct characterization of the software-centric resource constraints that best approximates the hardware resources is absolutely critical to developing resource constrained algorithms, because the optimal algorithm design is very specific to a particular device (i.e., its computational model) and its available resources.
(2) Understanding how traditional ML algorithms can contribute to edge learning: The majority of current on-device algorithm research focuses on adapting modern deep learning approaches such as CNNs and DNNs to run on the edge. Very little focus has been given to traditional machine learning methods. Traditional machine learning methods are of huge interest for building edge learning capability, especially when memory is limited (for devices with megabytes or even kilobytes of RAM) and computational power is low. These are precisely the areas where modern deep architectures are wholly unsuited due to their out-sized memory and compute requirements. Hence, there is a need to explore traditional machine learning approaches for learning on-device. In addition, designing advanced learning algorithms that require small training datasets to achieve high test accuracies is a huge challenge. Addressing this capability can significantly improve the resource footprint (i.e., a lower memory footprint for storing the training data).
(3) Dynamic resource budget: Most existing research assumes the availability of dedicated resources while training edge-learning algorithms. However, for many applications such an assumption is invalid. For example, in mobile phones an application's priority may change [44] based on the user's actions. Hence, there is a need for algorithm research incorporating the accuracy-latency trade-off under dynamic resource constraints; a minimal sketch of such budget-aware training is given after this list.
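To illustrate the dynamic-budget challenge in point (3), the sketch below re-sizes the training mini-batch from the memory currently left on the device. The 512-bytes-per-sample cost, the batch-size bounds, and the update_step callback are hypothetical placeholders; the use of psutil simply stands in for whatever telemetry a real edge runtime would provide.

```python
import psutil  # assumed available; used only to read free memory

BYTES_PER_SAMPLE = 512         # hypothetical per-sample activation/gradient cost
MIN_BATCH, MAX_BATCH = 8, 256  # hypothetical bounds on the mini-batch size

def pick_batch_size(budget_fraction=0.1):
    """Choose a mini-batch size from the memory currently available.

    Only a fraction of free RAM is claimed, since other applications may
    take priority at any time (cf. the multi-tenant setting of [44]).
    """
    free_bytes = psutil.virtual_memory().available
    batch = int(free_bytes * budget_fraction) // BYTES_PER_SAMPLE
    return max(MIN_BATCH, min(MAX_BATCH, batch))

def train(stream, update_step):
    """Re-query the budget at every step so training tracks a changing budget."""
    for chunk in stream:                   # chunk: list of samples
        batch = chunk[:pick_batch_size()]  # shrink/grow with the current budget
        update_step(batch)                 # hypothetical model update
```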

For building edge-learning capability for use cases such as object detection and user identification in products such as mobile phones and refrigerators, significant groundwork needs to be laid in terms of a complete analysis of existing approaches that cater to the specific requirements of the use cases and the products. This analysis also forms the basis for extending existing approaches and developing newer algorithms in those cases where current methods fail to perform satisfactorily.
Theory: The research on algorithms developed using (extensions of) advanced theoretical frameworks provides a mechanism to characterize the performance (error) guarantees of a model in resource constrained settings. Such research needs careful consideration of the following points:

(1) Analysis of Resource Constraints: Although advanced theories abstract resource constraints as a form of information bottleneck, how such abstractions practically map to a specific resource (say, memory vs. computation) needs to be properly analyzed.
(2) Applicability: With the appropriate abstraction in place, the applicability of the theory (and its underlying assumptions) to practical use cases needs to be analyzed.
(3) Algorithm development: Once newer learning theories supporting resource constraints have been developed, there is a need to develop new algorithms based on these theoretical frameworks.
(4) Extending existing theories: The existing theories mainly cater to a very specific aspect of the ML pipeline (i.e., training an algorithm for a pre-specified hypothesis class). Extensions of this approach to the entire model building pipeline (e.g., hyperparameter optimization, data pre-processing, etc.) remain to be developed. Also, a majority of the underlying theories target the un-learnability of a problem; extending such results into theories that support the design of practical algorithms remains an open problem.

Overall, the theory research provides a mechanism to characterize a wide range of edge-specific ML problems. However, extending research at this level needs significantly more effort compared to the previous algorithmic approaches.


6 CONCLUSION
On-device learning has so far remained in the purview of academic research, but with the increasing
number of smart devices and improved hardware, there is interest in performing learning on the
device rather than in the cloud. In the industry, this interest is fueled mainly by hardware manufac-
turers promoting AI-specific chipsets that are optimized for certain mathematical operations, and
startups providing ad hoc solutions to certain niche domains mostly in computer vision and IoT.
Given this surge in interest and corresponding availability of edge hardware suitable for on-device
learning, a comprehensive survey of the field from an algorithms and learning theory perspective
sets the stage for both understanding the state-of-the-art and for identifying open challenges and
future avenues of research. However, on-device learning is an expansive field with connections to a large number of related topics in AI and machine learning, including online learning, model adaptation, one/few-shot learning, and resource-constrained learning, to name just a few. Covering such a large number of research topics in a single survey is impractical but, at the same time,
ignoring the work that has been done in these areas leaves significant gaps in any comparison of
approaches. This survey finds a middle ground by reformulating the problem of on-device learning
as resource-constrained learning where the resources are compute and memory. This reformulation
allows tools, techniques, and algorithms from a wide variety of research areas to be compared
equitably.
We limited the survey to learning on single devices with the understanding that the ideas discussed can be extended in a natural fashion to the distributed setting via an additional constraint based on communication latency. We also focused the survey on the algorithmic and theoretical aspects of on-device learning, leaving out the effects of the systems level (hardware and libraries).
This choice was deliberate and allowed us to separate out the algorithmic aspects of on-device
learning from implementation and hardware choices. This distinction also allows us to identify
challenges and future research that can be applied to a variety of systems.
Based on the reformulation of on-device learning as resource constrained learning, the survey
found that there are a number of areas where more research and development is needed.
At the algorithmic level, it is clear that current efforts are mainly targeted at either utilizing
already lightweight machine learning algorithms or modifying existing algorithms in ways that
reduce resource utilization. There are a number of challenges we identified in the algorithm space
including the need for decoupling algorithms from hardware constraints, designing effective loss
functions and metrics that capture resource constraints, an expanded focus on traditional as well
as advanced ML algorithms with low sample complexity in addition to the current work on DNNs,
and dealing with situations where the resource budget is dynamic rather than static. In addition,
improved methods for model profiling are needed to more accurately calculate an algorithm’s
resource consumption. Current approaches to such measurements are abstract and focus on applying software engineering principles such as asymptotic analysis, or on low-level measures like FLOPS or MACs (Multiply-Add Computations). None of these approaches gives a holistic idea of resource requirements, and in many cases they capture only an insignificant portion of the total resources required by the system during learning.
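As a concrete illustration of why such low-level counts are incomplete, the helper below computes the parameter and MAC counts of a single 2D convolution using the standard formulas; the example layer shape is arbitrary. Nothing in these numbers reflects activation maps, gradients, optimizer state, or data movement, which often dominate the footprint during on-device training.

```python
def conv2d_cost(h, w, c_in, c_out, k, stride=1, pad=0):
    """Parameter count and multiply-add count (MACs) of one Conv2d layer.

    Captures only the arithmetic of the forward pass; activation maps,
    gradients, and optimizer state needed for on-device training are not
    reflected in these numbers.
    """
    h_out = (h + 2 * pad - k) // stride + 1
    w_out = (w + 2 * pad - k) // stride + 1
    params = c_out * (c_in * k * k + 1)          # weights + biases
    macs = h_out * w_out * c_out * c_in * k * k  # one MAC per weight per output
    return params, macs

# Example: a 3x3 convolution over a 224x224x3 input producing 64 channels.
print(conv2d_cost(224, 224, 3, 64, 3, stride=1, pad=1))
```

Note that the MAC count grows with the input resolution while the parameter count does not, so neither number alone characterizes the training-time resource requirement.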
Finally, current research in the field of learning theory for resource constrained algorithms is focused on the un-learnability of a problem under resource constraints. The natural step forward is to identify techniques that can instead provide guarantees on learnability and the associated estimation error. Existing theoretical techniques also mainly focus on the space (memory) complexity of these algorithms and not their compute requirements. Even in cases where an ideal hypothesis class can be identified that satisfies resource constraints, further work is needed to select the optimal model from within that class.


REFERENCES
[1] Robert Adolf, Saketh Rama, Brandon Reagen, Gu-Yeon Wei, and David Brooks. 2016. Fathom: Reference workloads
for modern deep learning methods. In Workload Characterization (IISWC), 2016 IEEE International Symposium on.
IEEE, 1–10.
[2] Furqan Alam, Rashid Mehmood, Iyad Katib, and Aiiad Albeshri. 2016. Analysis of eight data mining algorithms for
smarter Internet of Things (IoT). Procedia Computer Science 98 (2016), 437–442.
[3] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. 2017. QSGD: Communication-efficient
SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems. 1709–1720.
[4] Zeyuan Allen-Zhu and Yuanzhi Li. 2017. First efficient convergence for streaming k-pca: a global, gap-free, and
near-optimal rate. In Foundations of Computer Science (FOCS), 2017 IEEE 58th Annual Symposium on. IEEE, 487–492.
[5] Noga Alon, Yossi Matias, and Mario Szegedy. 1999. The space complexity of approximating the frequency moments.
Journal of Computer and system sciences 58, 1 (1999), 137–147.
[6] Peter L Bartlett and Shahar Mendelson. 2002. Rademacher and Gaussian complexities: Risk bounds and structural
results. Journal of Machine Learning Research 3, Nov (2002), 463–482.
[7] Paul Beame, Shayan Oveis Gharan, and Xin Yang. 2017. Time-Space Tradeoffs for Learning from Small Test Spaces:
Learning Low Degree Polynomial Functions. arXiv preprint arXiv:1708.02640 (2017).
[8] Shai Ben-David and Eli Dichterman. 1998. Learning with restricted focus of attention. J. Comput. System Sci. 56, 3
(1998), 277–298.
[9] Shai Ben-David, Alon Itai, and Eyal Kushilevitz. 1990. Learning by Distances.. In COLT. 232–245.
[10] Simone Bianco, Remi Cadene, Luigi Celona, and Paolo Napoletano. 2018. Benchmark Analysis of Representative
Deep Neural Network Architectures. IEEE Access (2018).
[11] Tolga Bolukbasi, Joseph Wang, Ofer Dekel, and Venkatesh Saligrama. 2017. Adaptive neural networks for efficient
inference. arXiv preprint arXiv:1702.07811 (2017).
[12] Brian Bullins, Elad Hazan, and Tomer Koren. 2016. The limits of learning with missing data. In Advances in Neural
Information Processing Systems. 3495–3503.
[13] Ermao Cai, Da-Cheng Juan, Dimitrios Stamoulis, and Diana Marculescu. 2017. Neuralpower: Predict and deploy
energy-efficient convolutional neural networks. arXiv preprint arXiv:1710.05420 (2017).
[14] Han Cai, Ligeng Zhu, and Song Han. 2018. Proxylessnas: Direct neural architecture search on target task and hardware.
arXiv preprint arXiv:1812.00332 (2018).
[15] Léopold Cambier, Anahita Bhiwandiwalla, Ting Gong, Mehran Nekuii, Oguz H Elibol, and Hanlin Tang. 2020. Shifted
and Squeezed 8-bit Floating Point format for Low-Precision Training of Deep Neural Networks. arXiv preprint
arXiv:2001.05674 (2020).
[16] Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. 2016. An analysis of deep neural network models for
practical applications. arXiv preprint arXiv:1605.07678 (2016).
[17] Nicolo Cesa-Bianchi, Shai Shalev-Shwartz, and Ohad Shamir. 2011. Efficient learning with partially observed attributes.
Journal of Machine Learning Research 12, Oct (2011), 2857–2878.
[18] Olivier Chapelle, Bernhard Scholkopf, and Alexander Zien. 2009. Semi-supervised learning (chapelle, o. et al., eds.;
2006)[book reviews]. IEEE Transactions on Neural Networks 20, 3 (2009), 542–542.
[19] Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training deep nets with sublinear memory cost.
arXiv preprint arXiv:1604.06174 (2016).
[20] Wenlin Chen, James Wilson, Stephen Tyree, Kilian Weinberger, and Yixin Chen. 2015. Compressing neural networks
with the hashing trick. In International Conference on Machine Learning. 2285–2294.
[21] Xie Chen, Xunying Liu, Yongqiang Wang, Mark JF Gales, and Philip C Woodland. 2016. Efficient training and
evaluation of recurrent neural network language models for automatic speech recognition. IEEE/ACM Transactions
on Audio, Speech, and Language Processing 24, 11 (2016), 2146–2157.
[22] Yu-Hsin Chen, Tushar Krishna, Joel S Emer, and Vivienne Sze. 2017. Eyeriss: An energy-efficient reconfigurable
accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits 52, 1 (2017), 127–138.
[23] An-Chieh Cheng, Jin-Dong Dong, Chi-Hung Hsu, Shu-Huan Chang, Min Sun, Shih-Chieh Chang, Jia-Yu Pan, Yu-Ting
Chen, Wei Wei, and Da-Cheng Juan. 2018. Searching Toward Pareto-Optimal Device-Aware Neural Architectures.
arXiv preprint arXiv:1808.09830 (2018).
[24] Vladimir Cherkassky and Filip M Mulier. 2007. Learning from data: concepts, theory, and methods. John Wiley & Sons.
[25] H Choi, WP Burleson, and DS Phatak. 1993. Fixed-point roundoff error analysis of large feedforward neural networks.
In Proceedings of 1993 International Conference on Neural Networks (IJCNN-93-Nagoya, Japan), Vol. 2. IEEE, 1947–1950.
[26] Cheng-Tao Chu, Sang K Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, Kunle Olukotun, and Andrew Y Ng. 2007.
Map-reduce for machine learning on multicore. In Advances in neural information processing systems. 281–288.
[27] Cody Coleman, Deepak Narayanan, Daniel Kang, Tian Zhao, Jian Zhang, Luigi Nardi, Peter Bailis, Kunle Olukotun,
Chris Ré, and Matei Zaharia. 2017. DAWNBench: An End-to-End Deep Learning Benchmark and Competition.
Training 100, 101 (2017), 102.


[28] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. 2015. Binaryconnect: Training deep neural networks
with binary weights during propagations. In Advances in neural information processing systems. 3123–3131.
[29] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. 2016. Binarized neural networks:
Training deep neural networks with weights and activations constrained to+ 1 or-1. arXiv preprint arXiv:1602.02830
(2016).
[30] Koby Crammer, Alex Kulesza, and Mark Dredze. 2013. Adaptive regularization of weight vectors. Machine learning
91, 2 (2013), 155–187.
[31] Xiaoliang Dai, Peizhao Zhang, Bichen Wu, Hongxu Yin, Fei Sun, Yanghan Wang, Marat Dukhan, Yunqing Hu, Yiming
Wu, Yangqing Jia, et al. 2019. Chamnet: Towards efficient network design through platform-aware model adaptation.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 11398–11407.
[32] Amit Daniely, Nati Linial, and Shai Shalev-Shwartz. 2013. From average case complexity to improper learning
complexity. arXiv preprint arXiv:1311.2272 (2013).
[33] Amit Daniely, Nati Linial, and Shai Shalev-Shwartz. 2013. More data speeds up training time in learning halfspaces
over sparse vectors. In Advances in Neural Information Processing Systems. 145–153.
[34] Saumitro Dasgupta and David Gschwend. [n.d.]. Netscope CNN Analyzer. https://dgschwend.github.io/netscope/
quickstart.html
[35] Christopher De Sa, Matthew Feldman, Christopher Ré, and Kunle Olukotun. 2017. Understanding and optimizing
asynchronous low-precision stochastic gradient descent. In Proceedings of the 44th Annual International Symposium
on Computer Architecture. 561–574.
[36] Christopher De Sa, Megan Leszczynski, Jian Zhang, Alana Marzoev, Christopher R Aberger, Kunle Olukotun, and
Christopher Ré. 2018. High-accuracy low-precision training. arXiv preprint arXiv:1803.03383 (2018).
[37] Christopher M De Sa, Ce Zhang, Kunle Olukotun, and Christopher Ré. 2015. Taming the wild: A unified analysis of
hogwild-style algorithms. In Advances in neural information processing systems. 2674–2682.
[38] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke
Yang, Quoc V Le, et al. 2012. Large scale distributed deep networks. In Advances in neural information processing
systems. 1223–1231.
[39] Scott E Decatur, Oded Goldreich, and Dana Ron. 2000. Computational sample complexity. SIAM J. Comput. 29, 3
(2000), 854–879.
[40] Swarnava Dey, Arijit Mukherjee, and Arpan Pal. 2019. Embedded Deep Inference in Practice: Case for Model
Partitioning. In Proceedings of the 1st Workshop on Machine Learning on Edge in Sensor Systems. 25–30.
[41] Sauptik Dhar, Vladimir Cherkassky, and Mohak Shah. 2019. Multiclass Learning from Contradictions. In Advances in
Neural Information Processing Systems. 8400–8410.
[42] Kimon Drakopoulos, Asuman Ozdaglar, and John N Tsitsiklis. 2013. On learning with finite memory. IEEE Transactions
on Information Theory 59, 10 (2013), 6859–6872.
[43] Angela Fan, Pierre Stock, Benjamin Graham, Edouard Grave, Remi Gribonval, Herve Jegou, and Armand Joulin. 2020.
Training with Quantization Noise for Extreme Model Compression. arXiv:cs.LG/2004.07320
[44] Biyi Fang, Xiao Zeng, and Mi Zhang. 2018. NestDNN: Resource-Aware Multi-Tenant On-Device Deep Learning
for Continuous Mobile Vision. In Proceedings of the 24th Annual International Conference on Mobile Computing and
Networking. ACM, 115–127.
[45] Vitaly Feldman et al. 2007. Efficiency and computational limitations of learning algorithms. Vol. 68.
[46] Vitaly Feldman, Elena Grigorescu, Lev Reyzin, Santosh Vempala, and Ying Xiao. 2013. Statistical algorithms and
a lower bound for detecting planted cliques. In Proceedings of the forty-fifth annual ACM symposium on Theory of
computing. ACM, 655–664.
[47] Vitaly Feldman, Cristóbal Guzmán, and Santosh Vempala. 2017. Statistical query algorithms for mean vector estimation
and stochastic convex optimization. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete
Algorithms. SIAM, 1265–1277.
[48] Jiashi Feng and Trevor Darrell. 2015. Learning the structure of deep convolutional networks. In Proceedings of the
IEEE international conference on computer vision. 2749–2757.
[49] Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. 2014. Do we need hundreds of classifiers
to solve real world classification problems? The Journal of Machine Learning Research 15, 1 (2014), 3133–3181.
[50] M Fernández-Delgado, MS Sirsat, Eva Cernadas, Sadi Alawadi, Senén Barro, and Manuel Febrero-Bande. 2019. An
extensive experimental survey of regression methods. Neural Networks 111 (2019), 11–34.
[51] Venkata Gandikota, Raj Kumar Maity, and Arya Mazumdar. 2019. vqSGD: Vector quantized stochastic gradient
descent. arXiv preprint arXiv:1911.07971 (2019).
[52] Wanling Gao, Jianfeng Zhan, Lei Wang, Chunjie Luo, Daoyi Zheng, Rui Ren, Chen Zheng, Gang Lu, Jingwei Li, Zheng
Cao, et al. 2018. BigDataBench: A Dwarf-based Big Data and AI Benchmark Suite. arXiv preprint arXiv:1802.08254
(2018).
[53] Sumegha Garg, Ran Raz, and Avishay Tal. 2018. Extractor-based time-space lower bounds for learning. In Proceedings
of the 50th Annual ACM SIGACT Symposium on Theory of Computing. ACM, 990–1002.
[54] Chuanxing Geng, Sheng-jun Huang, and Songcan Chen. 2020. Recent advances in open set recognition: A survey.
IEEE Transactions on Pattern Analysis and Machine Intelligence (2020).
[55] Amir Gholami, Kiseok Kwon, Bichen Wu, Zizheng Tai, Xiangyu Yue, Peter Jin, Sicheng Zhao, and Kurt Keutzer. 2018.
SqueezeNext: Hardware-Aware Neural Network Design. arXiv preprint arXiv:1803.10615 (2018).
[56] Eugenio Gianniti, Li Zhang, and Danilo Ardagna. [n.d.]. Performance Prediction of GPU-based Deep Learning
Applications. ([n. d.]).
[57] Daniel Golovin, D Sculley, Brendan McMahan, and Michael Young. 2013. Large-scale learning with less ram via
randomization. In International Conference on Machine Learning. 325–333.
[58] Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. 2014. Compressing deep convolutional networks using
vector quantization. arXiv preprint arXiv:1412.6115 (2014).
[59] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing adversarial examples.
arXiv preprint arXiv:1412.6572 (2014).
[60] Klaus Greff, Rupesh K Srivastava, and Jürgen Schmidhuber. 2017. Highway and residual networks learn unrolled
iterative estimation. In ICLR.
[61] Alex Grubb and Drew Bagnell. 2012. Speedboost: Anytime prediction with uniform near-optimality. In Artificial
Intelligence and Statistics. 458–466.
[62] Audrunas Gruslys, Rémi Munos, Ivo Danihelka, Marc Lanctot, and Alex Graves. 2016. Memory-efficient backpropaga-
tion through time. In Advances in Neural Information Processing Systems. 4125–4133.
[63] Renjie Gu, Shuo Yang, and Fan Wu. 2019. Distributed machine learning on mobile devices: A survey. arXiv preprint
arXiv:1909.08329 (2019).
[64] Yunhui Guo. 2018. A survey on methods and theories of quantized neural networks. arXiv preprint arXiv:1808.04752
(2018).
[65] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. 2015. Deep learning with limited
numerical precision. In International Conference on Machine Learning. 1737–1746.
[66] Karen Zita Haigh, Allan M Mackay, Michael R Cook, and Li G Lin. 2015. Machine learning for embedded systems: A
case study. Technical Report (2015).
[67] Song Han, Huizi Mao, and William J Dally. 2015. Deep compression: Compressing deep neural networks with pruning,
trained quantization and huffman coding. arXiv preprint arXiv:1510.00149 (2015).
[68] Elad Hazan and Tomer Koren. 2012. Linear regression with limited observation. arXiv preprint arXiv:1206.4678 (2012).
[69] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In
Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
[70] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. 2018. AMC: AutoML for Model Compression and
Acceleration on Mobile Devices. In The European Conference on Computer Vision (ECCV). 784–800.
[71] Martin Hellman. 1972. The effects of randomization on finite-memory decision schemes. IEEE Transactions on
Information Theory 18, 4 (1972), 499–502.
[72] Martin E Hellman, Thomas M Cover, et al. 1971. On memory saved by randomization. The Annals of Mathematical
Statistics 42, 3 (1971), 1075–1078.
[73] Monika Rauch Henzinger, Prabhakar Raghavan, and Sridhar Rajagopalan. 1998. Computing on data streams. External
memory algorithms 50 (1998), 107–118.
[74] Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the Knowledge in a Neural Network. In NIPS Deep
Learning and Representation Learning Workshop. http://arxiv.org/abs/1503.02531
[75] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. 2012. Improving
neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 (2012).
[76] Markus Hoehfeld and Scott E Fahlman. 1991. Learning with limited numerical precision using the cascade-correlation
algorithm. Citeseer.
[77] JL Holt and Jenq-Neng Hwang. 1991. Finite precision error analysis for neural network learning. In Proceedings of the
First International Forum on Applications of Neural Networks to Power Systems. IEEE, 237–241.
[78] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu,
Ruoming Pang, Vijay Vasudevan, et al. 2019. Searching for mobilenetv3. In Proceedings of the IEEE International
Conference on Computer Vision. 1314–1324.
[79] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto,
and Hartwig Adam. 2017. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv
preprint arXiv:1704.04861 (2017).
[80] Cho-Jui Hsieh, Si Si, and Inderjit S Dhillon. 2014. Fast prediction for large-scale kernel machines. In Advances in
Neural Information Processing Systems. 3689–3697.
[81] Zhiming Hu, Ahmad Bisher Tarakji, Vishal Raheja, Caleb Phillips, Teng Wang, and Iqbal Mohomed. 2019. Deephome:
Distributed inference with heterogeneous devices in the edge. In The 3rd International Workshop on Deep Learning for
Mobile Systems and Applications. 13–18.
[82] Gao Huang, Danlu Chen, Tianhong Li, Felix Wu, Laurens van der Maaten, and Kilian Q Weinberger. 2018. Multi-scale
dense networks for resource efficient image classification. In ICLR.
[83] Gao Huang, Shichen Liu, Laurens Van der Maaten, and Kilian Q Weinberger. 2018. Condensenet: An efficient densenet
using learned group convolutions. CVPR 3, 12 (2018), 11.
[84] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. 2017. Quantized neural
networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning
Research 18, 1 (2017), 6869–6898.
[85] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. 2016.
Squeezenet: Alexnet-level accuracy with 50x fewer parameters and< 0.5 mb model size. arXiv preprint arXiv:1602.07360
(2016).
[86] Andrey Ignatov, Radu Timofte, Przemyslaw Szczepaniak, William Chou, Ke Wang, Max Wu, Tim Hartley, and
Luc Van Gool. 2018. AI Benchmark: Running Deep Neural Networks on Android Smartphones. arXiv preprint
arXiv:1810.01109 (2018).
[87] Shinji Ito, Daisuke Hatano, Hanna Sumita, Akihiro Yabe, Takuro Fukunaga, Naonori Kakimura, and Ken-Ichi
Kawarabayashi. 2017. Efficient Sublinear-Regret Algorithms for Online Sparse Linear Regression with Limited
Observation. In Advances in Neural Information Processing Systems. 4099–4108.
[88] Shinji Ito, Daisuke Hatano, Hanna Sumita, Akihiro Yabe, Takuro Fukunaga, Naonori Kakimura, and Ken-Ichi
Kawarabayashi. 2018. Online Regression with Partial Information: Generalization and Linear Projection. In In-
ternational Conference on Artificial Intelligence and Statistics. 1599–1607.
[89] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and
Dmitry Kalenichenko. 2018. Quantization and training of neural networks for efficient integer-arithmetic-only
inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2704–2713.
[90] Licheng Jiao, Fan Zhang, Fang Liu, Shuyuan Yang, Lingling Li, Zhixi Feng, and Rong Qu. 2019. A Survey of Deep
Learning-Based Object Detection. IEEE Access 7 (2019), 128837–128868.
[91] Cijo Jose, Prasoon Goyal, Parv Aggrwal, and Manik Varma. 2013. Local deep kernel learning for efficient non-linear
svm prediction. In International conference on machine learning. 486–494.
[92] Patrick Judd, Jorge Albericio, Tayler Hetherington, Tor Aamodt, Natalie Enright Jerger, Raquel Urtasun, and Andreas
Moshovos. 2015. Reduced-precision strategies for bounded memory in deep neural nets. arXiv preprint arXiv:1511.05236
(2015).
[93] Michael J Kearns. 1990. The computational complexity of machine learning. MIT press.
[94] Minje Kim and Paris Smaragdis. 2016. Bitwise neural networks. arXiv preprint arXiv:1601.06071 (2016).
[95] Kuno Kollmann, K-R Riemschneider, and Hans Christoph Zeidler. 1996. On-chip backpropagation training using
parallel stochastic bit streams. In Proceedings of Fifth International Conference on Microelectronics for Neural Networks.
IEEE, 149–156.
[96] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural
networks. In Advances in neural information processing systems. 1097–1105.
[97] Vrushali Y Kulkarni and Pradeep K Sinha. 2012. Pruning of random forest classifiers: A survey and future directions.
In Data Science & Engineering (ICDSE), 2012 International Conference on. IEEE, 64–68.
[98] Ashish Kumar, Saurabh Goyal, and Manik Varma. 2017. Resource-efficient Machine Learning in 2 KB RAM for the
Internet of Things. In International Conference on Machine Learning. 1935–1944.
[99] Matt Kusner, Stephen Tyree, Kilian Weinberger, and Kunal Agrawal. 2014. Stochastic neighbor compression. In
International Conference on Machine Learning. 622–630.
[100] KB Lakshmanan and B Chandrasekaran. 1979. Compound hypothesis testing with finite memory. Information and
Control 40, 2 (1979), 223–233.
[101] John Langford, Lihong Li, and Tong Zhang. 2009. Sparse online learning via truncated gradient. Journal of Machine
Learning Research 10, Mar (2009), 777–801.
[102] Quoc Le, Tamás Sarlós, and Alex Smola. 2013. Fastfood-approximating kernel expansions in loglinear time. In
Proceedings of the international conference on machine learning, Vol. 85.
[103] Scikit Learn. [n.d.]. Decision Tree. http://scikit-learn.org/stable/modules/tree.html#complexity
[104] Vadim Lebedev and Victor Lempitsky. 2016. Fast convnets using group-wise brain damage. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. 2554–2564.
[105] Jongmin Lee, Michael Stanley, Andreas Spanias, and Cihan Tepedelenlioglu. 2016. Integrating machine learning in
embedded sensor systems for internet-of-things applications. In Signal Processing and Information Technology (ISSPIT),
2016 IEEE International Symposium on. IEEE, 290–294.
[106] Seulki Lee and Shahriar Nirjon. 2019. Neuro. ZERO: a zero-energy neural network accelerator for embedded sensing
and inference systems. In Proceedings of the 17th Conference on Embedded Networked Sensor Systems. 138–152.
[107] F Leighton and Ronald Rivest. 1986. Estimating a probability using finite memory. IEEE Transactions on Information
Theory 32, 6 (1986), 733–742.
[108] Chun-Liang Li, Hsuan-Tien Lin, and Chi-Jen Lu. 2016. Rivalry of two families of algorithms for memory-restricted
streaming pca. In Artificial Intelligence and Statistics. 473–481.
[109] Hao Li, Soham De, Zheng Xu, Christoph Studer, Hanan Samet, and Tom Goldstein. 2017. Training quantized nets: A
deeper understanding. In Advances in Neural Information Processing Systems. 5811–5821.
[110] Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. 2020. Federated learning: Challenges, methods, and
future directions. IEEE Signal Processing Magazine 37, 3 (2020), 50–60.
[111] Xiang Li, Tao Qin, Jian Yang, and Tieyan Liu. 2016. LightRNN: Memory and computation-efficient recurrent neural
networks. In Advances in Neural Information Processing Systems. 4385–4393.
[112] Zheng Li and Christopher M De Sa. 2019. Dimension-Free Bounds for Low-Precision Training. In Advances in Neural
Information Processing Systems. 11728–11738.
[113] Tjen-Sien Lim, Wei-Yin Loh, and Yu-Shan Shih. 2000. A comparison of prediction accuracy, complexity, and training
time of thirty-three old and new classification algorithms. Machine learning 40, 3 (2000), 203–228.
[114] Wei Yang Bryan Lim, Nguyen Cong Luong, Dinh Thai Hoang, Yutao Jiao, Ying-Chang Liang, Qiang Yang, Dusit
Niyato, and Chunyan Miao. 2020. Federated learning in mobile edge networks: A comprehensive survey. IEEE
Communications Surveys & Tutorials (2020).
[115] Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J Dally. 2017. Deep gradient compression: Reducing the
communication bandwidth for distributed training. arXiv preprint arXiv:1712.01887 (2017).
[116] Zhouhan Lin, Matthieu Courbariaux, Roland Memisevic, and Yoshua Bengio. 2015. Neural networks with few
multiplications. arXiv preprint arXiv:1510.03009 (2015).
[117] Ziheng Lin, Yan Gu, and Samarjit Chakraborty. 2010. Tuning Machine-Learning Algorithms for Battery-Operated
Portable Devices. In Asia Information Retrieval Symposium. Springer, 502–513.
[118] Jiayi Liu, Samarth Tripathi, Unmesh Kurup, and Mohak Shah. 2020. Pruning Algorithms to Accelerate Convolutional
Neural Networks for Edge Applications: A Survey. arXiv preprint arXiv:2005.04275 (2020).
[119] Zongqing Lu, Swati Rallapalli, Kevin Chan, and Thomas La Porta. 2017. Modeling the resource requirements of
convolutional neural networks on mobile devices. In Proceedings of the 2017 ACM on Multimedia Conference. ACM,
1663–1671.
[120] Mario Lucic. 2017. Computational and Statistical Tradeoffs via Data Summarization. Ph.D. Dissertation. ETH Zurich.
[121] GD Magoulas, MN Vrahatis, and GS Androulakis. 1996. A new method in neural network supervised training with
imprecision. In Proceedings of Third International Conference on Electronics, Circuits, and Systems, Vol. 1. IEEE, 287–290.
[122] Vicent Sanz Marco, Ben Taylor, Zheng Wang, and Yehia Elkhatib. 2020. Optimizing deep learning inference on
embedded systems through adaptive model selection. ACM Transactions on Embedded Computing Systems (TECS) 19,
1 (2020), 1–28.
[123] Diana Marculescu, Dimitrios Stamoulis, and Ermao Cai. 2018. Hardware-Aware Machine Learning: Modeling and
Optimization. arXiv preprint arXiv:1809.05476 (2018).
[124] Prathamesh Mayekar and Himanshu Tyagi. 2019. RATQ: A Universal Fixed-Length Quantizer for Stochastic Opti-
mization. arXiv:cs.LG/1908.08200
[125] Ian McInerney, George A Constantinides, and Eric C Kerrigan. 2018. A survey of the implementation of linear model
predictive control on fpgas. IFAC-PapersOnLine 51, 20 (2018), 381–387.
[126] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. 2017. Communication-
Efficient Learning of Deep Networks from Decentralized Data. In Artificial Intelligence and Statistics. 1273–1282.
[127] H. Brendan McMahan, Eider Moore, Daniel Ramage, and Blaise Agüera y Arcas. 2016. Federated Learning of Deep
Networks using Model Averaging. CoRR abs/1602.05629 (2016). arXiv:1602.05629 http://arxiv.org/abs/1602.05629
[128] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg,
Michael Houston, Oleksii Kuchaev, Ganesh Venkatesh, et al. 2017. Mixed precision training. arXiv preprint
arXiv:1710.03740 (2017).
[129] Shervin Minaee, Yuri Boykov, Fatih Porikli, Antonio Plaza, Nasser Kehtarnavaz, and Demetri Terzopoulos. 2020.
Image Segmentation Using Deep Learning: A Survey. arXiv preprint arXiv:2001.05566 (2020).
[130] Ioannis Mitliagkas, Constantine Caramanis, and Prateek Jain. 2013. Memory limited, streaming PCA. In Advances in
Neural Information Processing Systems. 2886–2894.
[131] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. 2012. Foundations of machine learning. MIT press.
[132] Robert Morris. 1978. Counting large numbers of events in small registers. Commun. ACM 21, 10 (1978), 840–842.
[133] Dana Moshkovitz and Michal Moshkovitz. 2017. Mixing implies lower bounds for space bounded learning. In
Conference on Learning Theory. 1516–1566.
[134] Dana Moshkovitz and Michal Moshkovitz. 2018. Entropy Samplers and Strong Generic Lower Bounds For Space
Bounded Learning. In LIPIcs-Leibniz International Proceedings in Informatics, Vol. 94. Schloss Dagstuhl-Leibniz-Zentrum
fuer Informatik.
[135] Michal Moshkovitz and Naftali Tishby. 2017. A General Memory-Bounded Learning Algorithm. arXiv preprint
arXiv:1712.03524 (2017).
[136] Tomoya Murata and Taiji Suzuki. 2018. Sample Efficient Stochastic Gradient Iterative Hard Thresholding Method for
Stochastic Sparse Linear Regression with Limited Attribute Observation. In Advances in Neural Information Processing
Systems. 5317–5326.
[137] MG Murshed, Christopher Murphy, Daqing Hou, Nazar Khan, Ganesh Ananthanarayanan, and Faraz Hussain. 2019.
Machine learning at the network edge: A survey. arXiv preprint arXiv:1908.00080 (2019).
[138] Apache MXNet. [n.d.]. Memory Cost of Deep Nets under Different Allocations. https://github.com/apache/incubator-
mxnet/tree/master/example/memcost#memory-cost-of-deep-nets-under-different-allocations
[139] Yurii Nesterov. 2013. Introductory lectures on convex optimization: A basic course. Vol. 87. Springer Science & Business
Media.
[140] Jakob Nielsen. 1993. Usability Heuristics. In Usability Engineering, JAKOB NIELSEN (Ed.). Morgan Kaufmann, San
Diego, 115–163. https://doi.org/10.1016/B978-0-08-052029-2.50008-5
[141] Hang Qi, Evan R Sparks, and Ameet Talwalkar. 2016. Paleo: A performance model for deep neural networks. (2016).
[142] Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. 2020. Designing Network Design
Spaces. arXiv preprint arXiv:2003.13678 (2020).
[143] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. 2016. Xnor-net: Imagenet classification
using binary convolutional neural networks. In European conference on computer vision. Springer, 525–542.
[144] Sujith Ravi. 2017. Projectionnet: Learning efficient on-device deep networks using neural projections. arXiv preprint
arXiv:1708.00630 (2017).
[145] Ran Raz. 2017. A time-space lower bound for a large class of learning problems. In Foundations of Computer Science
(FOCS), 2017 IEEE 58th Annual Symposium on. IEEE, 732–742.
[146] Baidu Research. [n.d.]. Benchmarking Deep Learning Operations on Different Hardwares. https://github.com/baidu-
research/DeepBench
[147] Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, and Stephen W Keckler. 2016. vDNN: Virtualized
deep neural networks for scalable, memory-efficient neural network design. In The 49th Annual IEEE/ACM International
Symposium on Microarchitecture. IEEE Press, 18.
[148] Crefeda Faviola Rodrigues, Graham Riley, and Mikel Luján. [n.d.]. Fine-Grained Energy and Performance Profiling
framework for Deep Convolutional Neural Networks. ([n. d.]).
[149] Bita Darvish Rouhani, Azalia Mirhoseini, and Farinaz Koushanfar. 2016. Delight: Adding energy dimension to deep
neural networks. In Proceedings of the 2016 International Symposium on Low Power Electronics and Design. ACM,
112–117.
[150] Bita Darvish Rouhani, Azalia Mirhoseini, and Farinaz Koushanfar. 2017. TinyDL: Just-in-time deep learning solution
for constrained embedded systems. In IEEE International Symposium on Circuits and Systems (ISCAS). 1–4.
[151] Daniil Ryabko. 2007. Sample complexity for computational classification problems. Algorithmica 49, 1 (2007), 69–77.
[152] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. MobileNetV2:
Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. 4510–4520.
[153] Rocco A Servedio. 2000. Computational sample complexity and attribute-efficient learning. J. Comput. System Sci. 60,
1 (2000), 161–178.
[154] Shai Shalev-Shwartz and Shai Ben-David. 2014. Understanding machine learning: From theory to algorithms. Cambridge
university press.
[155] Shai Shalev-Shwartz, Ohad Shamir, and Eran Tromer. 2012. Using more data to speed-up training time. In Artificial
Intelligence and Statistics. 1019–1027.
[156] Ohad Shamir. 2014. Fundamental limits of online and distributed algorithms for statistical learning and estimation. In
Advances in Neural Information Processing Systems. 163–171.
[157] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556 (2014).
[158] Daniel Soudry, Itay Hubara, and Ron Meir. 2014. Expectation backpropagation: Parameter-free training of multilayer
neural networks with continuous or discrete weights. In Advances in Neural Information Processing Systems. 963–971.
[159] Nathan Srebro and Karthik Sridharan. 2011. Theoretical basis for “more data less work”. In NIPS Workshop on
Computataional Trade-offs in Statistical Learning.
[160] Dimitrios Stamoulis, Ermao Cai, Da-Cheng Juan, and Diana Marculescu. 2018. HyperPower: Power-and memory-
constrained hyper-parameter optimization for neural networks. In Design, Automation & Test in Europe Conference &
Exhibition (DATE), 2018. IEEE, 19–24.
[161] Dimitrios Stamoulis, Ruizhou Ding, Di Wang, Dimitrios Lymberopoulos, Bodhi Priyantha, Jie Liu, and Diana
Marculescu. 2019. Single-path nas: Designing hardware-efficient convnets in less than 4 hours. arXiv preprint
arXiv:1904.02877 (2019).
[162] Jacob Steinhardt and John Duchi. 2015. Minimax rates for memory-bounded sparse linear regression. In Conference
on Learning Theory. 1564–1587.
[163] Jacob Steinhardt, Gregory Valiant, and Stefan Wager. 2016. Memory, communication, and statistical queries. In
Conference on Learning Theory. 1490–1516.
[164] Sebastian U Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. 2018. Sparsified SGD with memory. In Advances in
Neural Information Processing Systems. 4447–4458.
[165] Nikko Strom. 2015. Scalable distributed DNN training using commodity GPU cloud computing. In Sixteenth Annual
Conference of the International Speech Communication Association.
[166] Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S Emer. 2017. Efficient processing of deep neural networks: A
tutorial and survey. Proc. IEEE 105, 12 (2017), 2295–2329.
[167] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent
Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE conference on
computer vision and pattern recognition. 1–9.
[168] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception
architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition.
2818–2826.
[169] Kai Sheng Tai, Vatsal Sharan, Peter Bailis, and Gregory Valiant. 2018. Sketching Linear Classifiers over Data Streams.
In Proceedings of the 2018 International Conference on Management of Data. ACM, 757–772.
[170] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V Le. 2018. MnasNet: Platform-Aware Neural
Architecture Search for Mobile. arXiv preprint arXiv:1807.11626 (2018).
[171] Mingxing Tan and Quoc V Le. 2019. Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv
preprint arXiv:1905.11946 (2019).
[172] Mingxing Tan and Quoc V Le. 2019. Mixconv: Mixed depthwise convolutional kernels. In British Machine Vision
Conference.
[173] Hanlin Tang, Shaoduo Gan, Ce Zhang, Tong Zhang, and Ji Liu. 2018. Communication compression for decentralized
training. In Advances in Neural Information Processing Systems. 7652–7662.
[174] Zhenheng Tang, Shaohuai Shi, Xiaowen Chu, Wei Wang, and Bo Li. 2020. Communication-Efficient Distributed Deep
Learning: A Comprehensive Survey. arXiv:cs.DC/2003.06307
[175] Konstantinos I Tsianos, Sean Lawlor, and Michael G Rabbat. 2012. Consensus-based distributed optimization: Practical
issues and applications in large-scale machine learning. In 50th Annual Allerton Conference on Communication, Control,
and Computing (Allerton). IEEE, 1543–1550.
[176] Leslie G Valiant. 1984. A theory of the learnable. Commun. ACM 27, 11 (1984), 1134–1142.
[177] Ewout van den Berg, Bhuvana Ramabhadran, and Michael Picheny. 2017. Training variance and performance
evaluation of neural networks in speech. In 2017 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 2287–2291.
[178] Jesper E Van Engelen and Holger H Hoos. 2020. A survey on semi-supervised learning. Machine Learning 109, 2
(2020), 373–440.
[179] Vladimir Vapnik. 2006. Estimation of dependences based on empirical data. Springer Science & Business Media.
[180] Vladimir Vapnik and Rauf Izmailov. 2019. Rethinking statistical learning theory: learning using statistical invariants.
Machine Learning 108, 3 (2019), 381–423.
[181] V. N. Vapnik and A. Ya. Chervonenkis. 1971. On the Uniform Convergence of Relative Frequencies of Events to their
Probabilities. Theory of Probability and its Applications 16, 2 (1971), 264–280.
[182] Delia Velasco-Montero, Jorge Fernández-Berni, Ricardo Carmona-Galán, and Angel Rodríguez-Vázquez. 2018. Per-
formance Analysis of Real-Time DNN Inference on Raspberry Pi. In Proc. SPIE 10670, Real-Time Image and Video
Processing. Orlando, Florida, United States, 10670 – 10670 – 9.
[183] Stylianos I. Venieris, Alexandros Kouris, and Christos-Savvas Bouganis. 2018. Deploying Deep Neural Networks in
the Embedded Space. In 2nd International Workshop on Embedded and Mobile Deep Learning. Munich, Germany.
[184] Alvin Wan, Xiaoliang Dai, Peizhao Zhang, Zijian He, Yuandong Tian, Saining Xie, Bichen Wu, Matthew Yu, Tao Xu,
Kan Chen, et al. 2020. FBNetV2: Differentiable Neural Architecture Search for Spatial and Channel Dimensions. arXiv
preprint arXiv:2004.05565 (2020).


[185] Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. 2018. HAQ: Hardware-Aware Automated Quantization.
arXiv preprint arXiv:1811.08886 (2018).
[186] Naigang Wang, Jungwook Choi, Daniel Brand, Chia-Yu Chen, and Kailash Gopalakrishnan. 2018. Training Deep
Neural Networks with 8-bit Floating Point Numbers. In Advances in Neural Information Processing Systems. 7685–7694.
[187] Wenlin Wang, Changyou Chen, Wenlin Chen, Piyush Rai, and Lawrence Carin. 2016. Deep distance metric learning
with data summarization. In ECML PKDD.
[188] Xiaofei Wang, Yiwen Han, Victor CM Leung, Dusit Niyato, Xueqiang Yan, and Xu Chen. 2020. Convergence of edge
computing and deep learning: A comprehensive survey. IEEE Communications Surveys & Tutorials (2020).
[189] Xin Wang, Fisher Yu, Zi-Yi Dou, and Joseph E Gonzalez. 2017. Skipnet: Learning dynamic routing in convolutional
networks. arXiv preprint arXiv:1711.09485 (2017).
[190] Zhihao Wang, Jian Chen, and Steven CH Hoi. 2020. Deep learning for image super-resolution: A survey. IEEE
Transactions on Pattern Analysis and Machine Intelligence (2020).
[191] Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. 2018. Gradient sparsification for communication-efficient
distributed optimization. In Advances in Neural Information Processing Systems. 1299–1309.
[192] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. 2016. Learning structured sparsity in deep neural
networks. In Advances in Neural Information Processing Systems. 2074–2082.
[193] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. 2017. Terngrad: Ternary
gradients to reduce communication in distributed deep learning. In Advances in neural information processing systems.
1509–1519.
[194] Simon Wiedemann, Temesgen Mehari, Kevin Kepp, and Wojciech Samek. 2020. Dithered backprop: A sparse and
quantized backpropagation algorithm for more efficient deep neural network training. arXiv preprint arXiv:2004.04729
(2020).
[195] Martin Wistuba, Ambrish Rawat, and Tejaswini Pedapati. 2019. A survey on neural architecture search. arXiv preprint
arXiv:1905.01392 (2019).
[196] Michael M Wolf. 2018. Mathematical Foundations of Supervised Learning. (2018).
[197] Alexander Wong. 2018. NetScore: Towards Universal Metrics for Large-scale Performance Analysis of Deep Neural
Networks for Practical On-Device Edge Usage. (2018).
[198] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing
Jia, and Kurt Keutzer. 2019. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture
search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 10734–10742.
[199] Jiaxiang Wu, Weidong Huang, Junzhou Huang, and Tong Zhang. 2018. Error compensated quantized SGD and its
applications to large-scale distributed optimization. arXiv preprint arXiv:1806.08054 (2018).
[200] Yun Xie and Marwan A Jabri. 1992. Analysis of the effects of quantization in multilayer neural networks using a
statistical model. IEEE Transactions on Neural Networks 3, 2 (1992), 334–338.
[201] Guandao Yang, Tianyi Zhang, Polina Kirichenko, Junwen Bai, Andrew Gordon Wilson, and Christopher De Sa. 2019.
Swalp: Stochastic weight averaging in low-precision training. arXiv preprint arXiv:1904.11943 (2019).
[202] Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. 2017. Designing Energy-Efficient Convolutional Neural Networks
Using Energy-Aware Pruning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE,
6071–6079.
[203] Wenzhuo Yang and Huan Xu. 2015. Streaming sparse principal component analysis. In International Conference on
Machine Learning. 494–503.
[204] Yukuan Yang, Lei Deng, Shuang Wu, Tianyi Yan, Yuan Xie, and Guoqi Li. 2020. Training high-performance and
large-scale deep neural networks with full 8-bit integers. Neural Networks (2020).
[205] Yuanshun Yao, Zhujun Xiao, Bolun Wang, Bimal Viswanath, Haitao Zheng, and Ben Y Zhao. 2017. Complexity vs.
performance: empirical analysis of machine learning as a service. In Proceedings of the 2017 Internet Measurement
Conference. ACM, 384–397.
[206] Heiga Zen, Yannis Agiomyrgiannakis, Niels Egberts, Fergus Henderson, and Przemysław Szczepaniak. 2016. Fast,
compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices. arXiv
preprint arXiv:1606.06061 (2016).
[207] Chongsheng Zhang, Changchang Liu, Xiangliang Zhang, and George Almpanidis. 2017. An up-to-date comparison
of state-of-the-art classification algorithms. Expert Systems with Applications 82 (2017), 128–150.
[208] Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, and Ce Zhang. 2016. The zipml framework for training models
with end-to-end low precision: The cans, the cannots, and a little bit of deep learning. arXiv preprint arXiv:1611.05402
(2016).
[209] Kai Zhang, Chuanren Liu, Jie Zhang, Hui Xiong, Eric Xing, and Jieping Ye. 2017. Randomization or Condensa-
tion?: Linear-Cost Matrix Sketching Via Cascaded Compression Sampling. In Proceedings of the 23rd ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining. ACM, 615–623.


[210] X Zhang, X Zhou, M Lin, and J Sun. [n.d.]. ShuffleNet: An Extremely Efficient Convolutional Neural Network for
Mobile Devices. arXiv 2017. arXiv preprint arXiv:1707.01083 ([n. d.]).
[211] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. 2016. Dorefa-net: Training low bitwidth
convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160 (2016).
[212] Zhi Zhou, Xu Chen, En Li, Liekang Zeng, Ke Luo, and Junshan Zhang. 2019. Edge intelligence: Paving the last mile of
artificial intelligence with edge computing. Proc. IEEE 107, 8 (2019), 1738–1762.
[213] Hongyu Zhu, Mohamed Akrout, Bojian Zheng, Andrew Pelegris, Amar Phanishayee, Bianca Schroeder, and Gennady
Pekhimenko. 2018. TBD: Benchmarking and Analyzing Deep Neural Network Training. arXiv preprint arXiv:1803.06905
(2018).
[214] Xiaojin Jerry Zhu. 2005. Semi-supervised learning literature survey. Technical Report. University of Wisconsin-Madison
Department of Computer Sciences.
