survey

Open access

A Review on the emerging technology of TinyML

Authors:

Vasileios Tsoukas,

Anargyros Gkogkidis,

Eleni Boumpa,

Athanasios Kakarountas

ACM Computing Surveys, Volume 56, Issue 10

Article No.: 259, Pages 1 - 37

https://doi.org/10.1145/3661820

Published: 22 June 2024

Abstract

Tiny Machine Learning (TinyML) is an emerging technology proposed by the scientific community for developing autonomous and secure devices that can gather, process, and provide results without transferring data to external entities. The technology aims to democratize AI by making it available to more sectors and contribute to the digital revolution of intelligent devices. In this work, a classification of the most common optimization techniques for Neural Network compression is conducted. Additionally, a review of the development boards and TinyML software is presented. Furthermore, the work provides educational resources, a classification of the technology applications, and future directions and concludes with the challenges and considerations.

1 Introduction

Machine Learning (ML) is a rapidly growing research topic in academia and industry. When combined with the fields of the Internet of Things (IoT), Data Science, and the fourth industrial revolution (Industry 4.0), ML research entails a massive development of solutions with a variety of applications in diverse and multidisciplinary fields, including medical contexts, pattern recognition, finance, and environmental science [135]. The availability of large-scale data extraction paired with advancements in hardware technology has resulted in the development of a well-known and commonly used ML subset, Deep Learning (DL). The structure of DL models is loosely modeled on the brain’s neural networks, and two of the most significant capabilities of Neural Networks (NNs) are automated feature extraction and the analysis of unstructured data. The most widely used DL network types are Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), and Long Short-Term Memory Networks (LSTMs) [190].

The high computational complexity of DL networks has led researchers to concentrate their efforts on compression techniques and optimization methods capable of increasing the efficiency of the NNs while maintaining the same levels of accuracy and performance. These solutions are software-based. In contrast, other attempts focus on hardware design principles to accelerate the overall process, utilizing hardware components that support complex matrix operations [147, 186]. For example, Application Specific Integrated Circuits (ASICs) are designed with a specific architecture to accelerate ML workloads, providing support for fixed-point and floating-point (FP) calculations. Field-Programmable Gate Arrays (FPGAs), in turn, are reprogrammable integrated circuits that provide high performance per unit of energy compared to Graphics Processing Units (GPUs) and low latency compared to models executed on Central Processing Units (CPUs) [104]. The most notable challenges regarding DL implementations are the high cost of training models and the high computational requirements of the hardware needed to process data and extract results.

The wide usage of both ML and DL methods in various fields allows for analyzing information from a large amount of data and extracting valuable insights. In a typical IoT ecosystem, data is gathered, transferred, processed, and stored with the goal of monitoring values and preventing or forecasting emergencies. Data is acquired from interconnected devices, such as sensors and actuators, and then sent to the cloud, where it is processed and stored. IoT gateways are used to connect the sensors’ network to the cloud network. The cloud serves as the central point of the IoT ecosystem, processing huge amounts of data and delivering intelligent ML or DL-based applications [58, 188]. The majority of IoT applications require fast response times, wireless and high-speed data transmission, high network bandwidth, low latency, and data security.

While the cloud offers necessary processing power and decision-making mechanisms, transmitting data to it carries risks concerning data privacy and security. Notably, data on the cloud may not always be secure, exposing sensitive information to potential threats such as man-in-the-middle attacks, replay attacks, eavesdropping, and rogue access points [21, 91, 137, 164, 184].

In contrast, edge computing offers a safer alternative by providing cloud-like services directly at the network’s edge. This method, named for its decentralization of computational power, enables faster data processing, greater bandwidth, and data sovereignty [108, 230]. Furthermore, over the past decades, researchers concentrated on embedding ML models on edge devices and constrained hardware [176]. The Microcontroller Unit (MCU) is the hardware utilized for achieving the aforementioned operation [45]. The high level of interest in MCUs is attributed to their specifications, which include essential performance [134], low power consumption, and a tiny form factor [58].

Emerging trends that utilize ML to provide tailored and personalized results in IoT devices are federated learning and Tiny Machine Learning (TinyML). In federated learning, participants cooperatively train ML models by exchanging model parameters, allowing for personalized training for each participant while safeguarding their sensitive data. The TinyML approach is a hardware-software hybrid focused on embedding ML models and NNs on constrained hardware capable of processing them locally [103]. The most significant advantage of the TinyML approach is the real-time analysis of the collected data while overcoming latency and bandwidth limitations [23]. Additionally, the IoT devices that embed TinyML on their MCUs require less access to cloud services, resulting in cost and power reduction and improved data security and privacy [58]. All the benefits mentioned above have led to the development of several TinyML-based applications in various fields, such as healthcare, automotive, agriculture, security, and the industrial sector. Furthermore, TinyML’s popularity and quick expansion led to the introduction of several TinyML-compatible development boards, frameworks, libraries, and other toolkits.

Deep Neural Networks (DNNs) are constantly growing in size in pursuit of the best possible performance on complicated problems in fields such as robotics, Natural Language Processing (NLP), security, and Computer Vision (CV). Two of the most common sectors that utilize DNNs are CV and NLP. Transformer-based architectures for these sectors [54, 118, 126, 158, 177, 212, 229] typically have many layers, each with millions of parameters [30, 189]. The computational resources required for training, processing, and storing DNNs of that scale and complexity necessitate powerful scientific workstations with high-end GPUs and numerous CPUs. Additionally, with the advances of IoT, mobile, and wearable devices, a new need arises that motivates researchers to compress and optimize these networks so that they are compatible with constrained hardware with limited resources. Moreover, since networks and models keep expanding to offer better performance, even some high-end machines lack the power to handle them, which creates difficulties for ML professionals and raises the question of how the research community could respond with better optimization techniques. This review does not cover DNN implementations for constrained hardware due to the breadth of the topic and the page limitations of the journal.

The technology of TinyML is in its early stages of development; hence, it lacks a number of capabilities and has several limitations. It is difficult to provide a common framework for model training and inference compatible with every MCU. Due to the heterogeneity of the devices, models built expressly for one MCU architecture may not run on another, even if both offer comparable hardware specifications. Finally, researchers must comprehend the relationship between power consumption and processing speed, as well as their impact on algorithm accuracy [194].

The contribution of this study is to provide a review of the TinyML approach in IoT devices. The remainder of this work is organized as follows: Section 2 provides an overall discussion of the TinyML technology, showcasing an overview of compression ML techniques. Section 3 reveals TinyML’s impact over the years and provides useful educational resources. Section 4 consists of subsection 4.1, which presents a review of TinyML software, and subsection 4.2, which presents a review of TinyML hardware, while Section 5 provides a categorized review of TinyML applications and the future directions of the technology under consideration. Finally, Section 6 discusses the challenges of the technology of TinyML, while Section 7 concludes this study.

2 The Technology of TinyML

TinyML is a hybrid software-hardware technology with high scientific and industrial interest due to its potential for creating small, autonomous, and energy-efficient smart devices [70, 72]. TinyML can be identified as a cluster of technologies working harmoniously to provide the required result. It is probably the most significant paradigm of technologies requiring a hardware/software co-design flow to unleash the sector’s full potential. Inference on the edge requires a combination of frameworks [49], model reduction techniques [220], libraries [81, 116], architecture searches [70], and the design and deployment of models on the appropriate hardware [24]. The requirements of all these technologies introduce considerable complexity. Innovation and continuous progress, especially in the hardware field, are of great importance, with the latest advancements showing encouraging and promising results [63]. The first applications utilizing the technology under consideration have already been introduced [176], which is a positive indication of how TinyML could play a significant role in future applications. A brief description of the main benefits of the TinyML technology follows:

(1) Information security and latency: Nowadays, handheld or wearable devices require a secure connection to a cloud service to process data and provide results [188]. The first two issues that come to mind are latency and bandwidth constraints [2, 133, 156]. Additionally, due to their small form factor and their power and energy constraints, most wearable devices are built without a significant focus on security [184], while many also need real-time responses [130]. Devices such as medical or activity trackers may contain sensitive and private information and may be vulnerable to malicious users and attacks [136, 143, 191, 221]. Cloud providers offer the processing power required for result and decision extraction in a reasonable time frame. Data transmission, however, poses various risks, since private and sensitive information is exposed to malicious users and attacks such as Man In The Middle (MITM) attacks, replay attacks, eavesdropping [91, 137, 184], and rogue access point attacks [21, 164]. Most of the time, the transmitted data is not encrypted, and the devices in question have weak or no wireless security features. Furthermore, connecting a user’s personal device to business networks can be harmful. Because of its weaknesses, such a device may serve as the entry point for a network backdoor through which corporate data is obtained. Additionally, in sectors such as automotive or healthcare, where seconds are crucial, high latency or delays in response time are unacceptable. A TinyML device does not need to transmit data, and a blend of optimized software and hardware can offer almost instant results.

(2) Energy efficiency and overall cost: A TinyML-based device can be a small-form-factor, low-cost, and low-power autonomous device able to collect data from its sensors, process it, and produce results relatively quickly [49]. Moreover, Deep Reinforcement Learning (DRL) is another area of great scientific interest whose implementation in resource-constrained systems has started to be explored, as shown in recent literature [56, 117, 173, 198, 213]. The complexity of ML tasks requires substantial processing power, which translates into power-demanding and expensive equipment, most commonly high-end GPUs [112]. TinyML might change how we approach those tasks by offering almost the same results while running models and networks on MCUs. An MCU is a low-power device with typically low production cost that can make decisions and offer machine intelligence on the energy of a single battery. All this is achieved through the appropriate co-design of software and hardware and the aggressive optimization techniques implemented in the algorithms.

2.1 Optimization Methods

NNs tend to have a large number of parameters with significant redundancy, ultimately requiring more computation power and memory than necessary [52]. As stated above, TinyML is a paradigm that allows ML models to fit into constrained hardware without compromising their energy efficiency [55]. To achieve model inference in devices with limited resources, specifically MCUs, it is necessary to heavily optimize and compress the models being used. Optimizing algorithms and ML models is a challenging issue. As described previously, it is not just a software or a hardware problem; hardware and software co-design is a prerequisite for obtaining the targeted result [58]. Over the past decades, many methods, frameworks, and techniques have been proposed for compression. O’Neill’s work [147] provides an analytical review of those methods. Furthermore, recent research also reports the first attempts to optimize DRL for resource-constrained systems [198]. A brief description of the most commonly used techniques follows, aiming to highlight the preferred ones in TinyML.

2.2 Quantization

A network trained on high-end GPUs has values stored in 32-bit FP single precision. For faster inference and lower computational needs, researchers have focused on training lower-precision networks with 1-bit or 2-bit representations able to run on other types of hardware such as FPGAs and ASICs. The process of representing network values with fewer bits is known as quantization. One requirement of this method is to retain accuracy or to incur the smallest possible loss. A typical quantization technique lowers the values from FP32 to 8-bit representations [53, 86]. Gupta et al. [85] showed that stochastic rounding methods can maintain accuracy when an FP32 model is quantized to FP16. Cambier et al. [33] proposed a shifted and squeezed 8-bit format (S2FP-8) to avoid stochastic rounding, while Mellempudi et al. [138] showcased a different approach that assigns a different FP representation to each layer, e.g., FP64, FP32, or FP16, instead of the typical FP32 throughout. Park et al. [157] proposed 3-bit activations where weights are also quantized to 4-bit and 16-bit scaling factors for approximately 1% of the network, resulting in only 1% of accuracy loss. Different approaches were proposed by Migacz [140], who used relative entropy to measure information loss, and Banner et al. [26], who used noise and clipping distortion. Merolla et al. [139] tried to understand how different distortions affect the networks. Their results revealed that NNs are robust to distortions but have longer convergence times. References [54, 111] suggested that post-training quantization on small networks can hurt accuracy when utilizing 8-bit formats or lower. To address this issue, Zhou et al. [236] proposed quantizing gradients to 6 bits and propagating them stochastically using estimators. Hou et al. [98] proposed a Newton algorithm combined with a Hessian approximation that minimizes the loss with respect to the binarized weights.
Explicit Loss-aware Quantization (ELQ) is another quantization method, proposed by Zhou et al. [235], which focuses on minimizing the loss at very low precision. Similarly, Park et al. [157] achieved reduced precision by intensively narrowing the range of weight values. Stock et al. [197], to overcome quantization drift, used iterative quantization starting with the lower layers and performed gradient updates on the rest. In contrast, Jacob et al. [54] used an estimator to backpropagate through activations and weights during the training process instead of the iterative approach. Finally, Fan et al. [69] suggested that the previous two approaches are not advisable when operating with ternary or binary precision representations. They proposed simulating quantization noise on a random subset of the network in each forward pass and performing the backward pass through the remaining weights.

Quantization is one of the most often used optimization strategies for TinyML compression on MCUs. As described previously, it maps 64-bit or 32-bit weights to a smaller bit width so that the required models can be stored on MCUs [58]. It places no focus on the initial architecture and instead relies on techniques such as transfer learning after the final model is ready [3]. In Reference [168], the Quantised Latent Replay-based Continual Learning method is proposed, which relies on low-bitwidth quantization that not only reduces the memory requirements of the layer but also increases the overall speed of the network. This work is one of the first attempts at creating a hardware and software platform for TinyML continual learning.
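The mapping from floating-point values to 8-bit integers described above can be illustrated with a minimal sketch. It assumes a simple affine (asymmetric) scheme with a single scale and zero point for the whole tensor; the function names `quantize_int8` and `dequantize` are hypothetical and do not correspond to any framework reviewed in this work.

```python
def quantize_int8(values):
    """Affine quantization of a list of floats to signed 8-bit integers.

    Assumption: one scale and one zero point are shared by all values.
    """
    lo, hi = min(values), max(values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)    # keep 0.0 inside the covered range
    scale = (hi - lo) / 255.0 or 1.0       # fall back to 1.0 for all-zero input
    zero_point = round(-128 - lo / scale)  # integer that represents 0.0
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate floating-point values from the 8-bit integers."""
    return [(x - zero_point) * scale for x in q]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)
```

Each restored value differs from the original by at most about one quantization step (`scale`), which is the accuracy tradeoff that the methods reviewed above try to minimize.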

2.3 Pruning

Network pruning is one of the oldest and most used techniques for optimization. It is inspired by the human brain, where irrelevant and unimportant past experiences are removed to make room for newer ones [219]. Network pruning removes synapses and neurons that fall under a certain weight threshold [90] to improve performance and reduce the computational needs of a network without sacrificing a significant amount of accuracy. Han and Qiao [92] and Narasimha et al. [146] suggested a different perspective by combining different techniques, such as adding new neurons and pruning a predefined percentage of weights instead of setting a threshold. Magnitude-based Pruning (MBP) is widely used because it combines high performance with simplicity for ML and other tasks [185], and it even tends to outperform layer-wise MBP [90, 107, 120, 171]. In References [94, 119, 171], the authors measure the importance of the units and try to remove only those of minor importance, whose removal leads to the least loss. Mozer and Smolensky [144] employed skeletonization, a method to eliminate the least important units during training. Karnin [107] measured the sensitivity of the loss function so that the network can be efficiently pruned, while Engelbrecht [66] recommended assessing whether the variance of sensitivity is statistically different from zero before proceeding with weight removal. Additionally, LeCun et al. [119] proposed that weight importance can be estimated utilizing the Taylor series.

Based on this technique, Molchanov et al. [142] used a Taylor expansion to prune the weights whose removal changes the cost function the least. Theis et al. [205] extended Molchanov’s work by providing cost estimates for network pruning. Mallya and Lazebnik [132] suggested that a dynamic mask can learn to adapt a dense network to a different sparse network. A different approach is structured sparsity learning, which can be found in Reference [224]. Louizos et al. [129] suggested that Bayesian methods can also be utilized for structured pruning. Dai et al. [47] proposed an alternative method known as pruning via variational information bottleneck, where the authors aimed to remove mutual information between layers.

Another approach was introduced by Lin et al. [124] by applying a soft mask to minimize the mean squared error. This work is an extension of a previous work [123], in which the authors achieved optimization using binary masks and hard thresholding. Genetic Algorithms (GAs) are another approach that can be utilized for pruning: they keep the best-performing parameters they generate and recombine them until the desired result is achieved. Several researchers have used GAs to prune a network, and some examples can be found in References [34, 100, 225].

Noy et al. [150] tried to reduce the time required for neural architecture search by implementing pruning via simulated pruning. Their work is based on the DARTS approach suggested in Reference [125]. Particle filtering for pruning is another method where sequential Monte Carlo estimation is utilized to identify the crucial weights [9]. Particle swarm optimization for pruning was also proposed in Reference [210]. Furthermore, AutoML [95] was proposed to improve the compression performance and the overall efficiency in a more automated pruning approach. Recently, pruning before training, where the architecture is adequately initialized and designed, has been shown to produce networks that achieve the same accuracy as the unpruned network [73, 127].
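The two basic variants of magnitude-based pruning mentioned in this section, removing weights below a fixed threshold and removing a predefined percentage of weights, can be sketched as follows. This is a simplified illustration on a flat list of weights; the function names are hypothetical, and the sketch does not reproduce any of the specific methods cited above.

```python
def threshold_prune(weights, threshold):
    """Zero out every weight whose magnitude falls below the threshold."""
    return [0.0 if abs(w) < threshold else w for w in weights]


def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitudes."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    k = int(len(weights) * sparsity)  # number of weights to remove
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:k]:
        pruned[i] = 0.0
    return pruned


weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.1]
by_threshold = threshold_prune(weights, 0.2)
by_fraction = magnitude_prune(weights, 0.5)
```

In practice, pruning is usually followed by retraining so that the remaining weights can compensate for the removed ones; that step is omitted here.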

2.4 Weight Sharing

One of the first attempts to reduce network size is associated with layer weight sharing. A key difficulty of this method is that it is not always clear how many weights can be shared before the overall performance and accuracy of the network are affected. To overcome this issue, some recent works attempted to apply the technique after the initial training instead of prior to training [22, 36, 211]. Nowlan and Hinton [149] proposed a different approach, where the authors tried a Gaussian mixture to assign the weights.

An extension of this work was provided later by Ullrich et al. [211], where the optimization method was based on soft-weight sharing. A different approach was seen in the works of Zhang et al. [231] and Plummer et al. [162], who tried to learn which weights or parameters must be shared. Parameter hashing is another option, where parameters are grouped to share weights randomly [36] or to share weights of the same value [187, 223]. Weight sharing has also been applied to transformers, as Dabre and Fujita [46] and Xiao et al. [227] proposed. Bai et al. [22] proposed deep equilibrium models to find the network’s point where backpropagation can be utilized. Another form of parameter sharing reuses layers recursively by feeding the result of the output layer back into the input [64, 110, 178]. The work of Kim et al. [109] showcased the usage of a residual connection between the output and input layers. Tai et al. [202] proposed an extension of this work, in which an element-wise addition is applied to the intermediate outputs before the result is passed to the final layer. Relevant works that seek to overcome Vanishing or Exploding Gradients (VEGs) issues were proposed by Zhang et al. [233] and Guo et al. [84].
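The idea behind parameter hashing, where many positions of a layer are mapped to a small set of stored values, can be sketched as follows. The sketch assumes CRC32 from the Python standard library as the hash function, which is an illustrative choice and not necessarily the one used in the cited works; the class name `HashedWeights` is hypothetical.

```python
import zlib


class HashedWeights:
    """A virtual rows x cols weight matrix backed by only `buckets` stored values."""

    def __init__(self, rows, cols, buckets, init=0.0):
        self.rows, self.cols = rows, cols
        self.shared = [init] * buckets  # the only parameters kept in memory

    def bucket(self, i, j):
        # Deterministically map a matrix position to one shared parameter.
        key = f"{i},{j}".encode()
        return zlib.crc32(key) % len(self.shared)

    def get(self, i, j):
        return self.shared[self.bucket(i, j)]

    def set(self, i, j, value):
        # Updating one position updates every position in the same bucket.
        self.shared[self.bucket(i, j)] = value


layer = HashedWeights(rows=4, cols=4, buckets=3)
layer.set(0, 0, 1.5)
```

Here a 4 x 4 layer with 16 virtual weights stores only 3 values, so memory shrinks at the cost of forcing unrelated positions to take the same weight.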

2.5 Neural Architecture Search

The Neural Architecture Search (NAS) is another method utilized by data scientists who attempt to build NNs under stringent limitations, ready to be implemented in MCUs. It is a process used to select the model with the highest possible accuracy from a predefined space of CNNs [65]. The typical procedure of a NAS system includes the search algorithm, the search space, and an evaluation strategy [65]. According to Heim et al. [96], NAS techniques could be divided into four different categories: hardware and software co-design, hardware-aware and usage of perceptible metrics, no hardware influence, and finally, the usage of proxies characterized in advance.

Fedorov et al. introduced a NAS method, SpArSe, capable of finding dense and sparse CNNs for MCUs [70]. Another NAS method, TinyNAS, was proposed by Lin et al. [65]. It first optimizes the search space to fit the constrained hardware and then searches for a model within the optimized space. One more example was proposed by Heim et al. [96], whose method utilizes generic and hardware-based optimizations to decrease the memory requirements of NNs and reduce inference latency.
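The three components of a NAS system named above (search space, search algorithm, and evaluation strategy) can be illustrated with a toy random search under a parameter budget. Everything in the sketch is an assumption made for illustration: the search space, the parameter estimate, and in particular `proxy_score`, which stands in for a real accuracy measurement. It does not reproduce SpArSe, TinyNAS, or the method of Heim et al.

```python
import random

# Search space: number of layers and channels per layer (illustrative values).
SEARCH_SPACE = {"layers": [1, 2, 3, 4], "channels": [8, 16, 32, 64]}


def param_count(arch):
    # Rough parameter count for a stack of equal-width 3x3 convolutions.
    return arch["layers"] * arch["channels"] * arch["channels"] * 9


def proxy_score(arch):
    # Hypothetical stand-in for validation accuracy: larger models score higher.
    return arch["layers"] * 0.5 + arch["channels"] * 0.05


def random_search(budget_params, trials=50, seed=0):
    """Search algorithm: sample candidates and keep the best one that fits."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        arch = {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}
        if param_count(arch) > budget_params:
            continue  # candidate does not fit the device budget
        if best is None or proxy_score(arch) > proxy_score(best):
            best = arch
    return best


best = random_search(budget_params=20_000)
```

A real system would replace `proxy_score` with training and validating each candidate (or with a cheaper estimate) and would typically also constrain peak memory and latency, not only the parameter count.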

2.6 Hardware-based Optimizations

In addition to software-oriented optimizations, hardware acceleration is employed, where specialized hardware is optimized to improve speed and parallel processing [38]. The main focus of hardware accelerators is accelerating matrix operations [186].

The authors of Reference [198] presented a block-based SC architecture that improves energy and latency by dividing inputs into blocks that are then executed in parallel. The experiments show significant savings in power consumption while retaining high accuracy.

The researchers of Reference [39] proposed the PRIME framework, in which NN applications are accelerated by utilizing Resistive RAM (ReRAM), owing to its efficiency and its ability to perform matrix-vector multiplications. With the architecture and design provided by the authors, the overall performance is improved by 2,360\(\times\) while the energy consumption is decreased by 895\(\times\).

Another team [122] introduced SmartShuttle for off-chip memory accesses to run as an accelerator for DL applications. The scheme can switch among different data reuse schemes and achieve a match for different layers dynamically. The results revealed a better performance than other state-of-the-art approaches, such as Eyeriss [37], which is also an accelerator for CNNs and can optimize energy efficiency by including a hardware accelerator and a chip for Dynamic RAM (DRAM).

One more proposal regarding hardware can be found in Reference [145], where the authors try to overcome three main design issues: data access, energy consumption, and data movement. These three issues, especially data movement, can cause difficulties with bandwidth and latency. The authors’ main focus is to bring computation closer to data by utilizing a technique called Processing In Memory (PIM) [74]. With this method, the overall data movement between memory and the computation units can be greatly reduced or eliminated.

Exploiting FPGA acceleration for matrix multiplications, the authors in Reference [160] presented an implementation of the transformer model and proper hardware scheduling for DL models that also utilizes weight pruning. The results reveal low inference latency on the hardware, with speedups of up to 10.96\(\times\) compared to CPUs and 2.08\(\times\) compared to GPUs. Some more examples related to hardware acceleration designs and architectures can be studied in References [105, 163, 228].

2.7 Frameworks, Libraries, Tools, and Other Techniques

As mentioned earlier, frameworks, libraries, and other toolkits offer another popular way to train, optimize, and deploy models, or even to carry out all of these procedures together. Software-based solutions such as Google’s TensorFlow (TF) Lite for MCUs [49] and Microsoft’s Embedded Learning Library (ELL) [79] enable the design and deployment of ML models on power-efficient devices and even single-board computers such as Arduino and Raspberry Pi [165]. Additionally, the research community has already developed several software-based solutions, such as libraries and diverse sets of tools, for automatically converting pre-trained AI algorithms, such as NNs and ML models, and integrating the generated optimized code into researchers’ boards. A more detailed overview of the software-based solutions exploited to implement models into MCUs will be provided in Section 4. The infrastructure of the laboratory and datasets provided by the project “ParICT-CENG” were leveraged to evaluate the frameworks described in this work. Benchmarking results will be presented in future work.

Figure 1 showcases a summary of the technology of TinyML, with information regarding the technology’s main optimization methods, the main benefits and challenges, a framework example, some of the most valuable educational resources, and finally, three different boards fully compatible with TinyML.

Fig. 1.

3 TinyML Impact

TinyML has witnessed a significant increase in research interest and output in recent years. The following paragraphs provide a bibliometric analysis of the rising number of TinyML research publications from 2019 to 2022, demonstrating the expanding influence of this field [1].

Thirteen research papers were published on the nascent subject of TinyML in 2019, indicating initial interest and exploration in this area. The following year, 2020, saw a significant increase in TinyML-related publications, with 88 published papers. This nearly sevenfold increase in just one year demonstrates the rapid recognition of TinyML’s potential and applicability in various fields.

In 2021, the number of TinyML papers increased to 346, continuing the upward trajectory. This nearly fourfold increase from the previous year is indicative of the expanding significance of TinyML within the scientific community.

Finally, in 2022, the number of TinyML publications reached 724. This doubling of TinyML research output in just one year demonstrates the continuing pace of innovation and the expansion of TinyML research frontiers.

This bibliometric analysis demonstrates TinyML’s rising impact, as evidenced by the rapid growth in the number of research papers over the past four years. The expanding corpus of literature in TinyML highlights the rapid advancements in this field and demonstrates its growing importance in driving the future of intelligent devices. The increase in TinyML publications from 2019 to 2022 is depicted in Figure 2.

Fig. 2.
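The year-over-year growth factors behind this analysis can be checked with a short calculation. The sketch below uses only the publication counts quoted above (13, 88, 346, and 724); the variable names are illustrative.

```python
# Number of TinyML publications per year, as reported in the text.
papers = {2019: 13, 2020: 88, 2021: 346, 2022: 724}

years = sorted(papers)
# Growth factor of each year relative to the previous one.
growth = {year: papers[year] / papers[prev] for prev, year in zip(years, years[1:])}

for year, factor in growth.items():
    print(f"{year}: {factor:.2f}x the output of {year - 1}")
```

The factors come out at roughly 6.77, 3.93, and 2.09, so the absolute number of publications keeps rising each year while the relative growth rate declines.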

3.1 Educational Materials

Another interesting statistic is that MCUs are currently used in vehicles, consumer electronics, home devices, and industrial equipment. Their sales are expected to rise by the end of 2021, followed by higher increases by the year 2023 [101]. This translates to an environment that is mature and prepared to implement the necessary software to create even more innovative devices using TinyML technology. Additionally, one of the most common applications of TinyML is keyword spotting [232]. Many technology leaders such as Apple, Google, and Amazon already use a hybrid form of keyword spotting, running ML models on their smart home devices while also depending on the cloud for better results [6, 11, 83]. Another example of utilizing the technology is anomaly detection, a valuable mechanism for security applications [35]. Moreover, since the hardware needed for TinyML operations is publicly available, as stated above, and the recipe for ML applications is very close to the technology of TinyML, a new road for jobs and opportunities is opening. Educational programs for TinyML are therefore necessary to prepare people for these opportunities. One of the first collaborations to create educational material was conducted between Harvard University [93] and Google [82] on the edX platform [61]. The course aims to train and prepare students on ways to overcome many challenges, such as the insufficient hardware resources available to run ML models, and to close the growing gap between industry and academia caused by the industry’s rapid progress. The course teaches the fundamentals of TinyML technology, trains students on real-world TinyML applications, and also informs them about ethical and life-cycle considerations regarding development and deployment.

Additionally, a TinyML kit was co-designed with Arduino [16] to offer low-cost project-based learning to students. The kit contains the Arduino Nano 33 BLE Sense [14], a camera module [13], and a TinyML shield for simplified sensor integration [62, 170]. Another course worth mentioning is hosted on the Coursera learning platform [43]. The course was developed with some of the key actors of the TinyML sector: Edge Impulse [60], Arm [20], Arduino [16], and the TinyML Foundation [206]. The course starts with an introduction to ML and embedded systems, continues with NNs and how to train them, and concludes by teaching how to deploy those networks to MCUs. The concepts, demonstrations, and tutorials can be followed with the user’s smartphone or an Arduino Nano 33 BLE Sense board [44, 59]. Warden and Situnayake offer another great educational resource in the form of a book. The authors introduce ML and embedded systems, while the rest of the book enables the reader to create a series of TinyML projects. The book is ideal for hardware or software developers who want to run ML model inference on MCUs [207].

In conclusion, TinyML is the miniaturization of intelligent machinery that can collect data, process it, and extract valuable information in a safe, energy-efficient, reliable, and low-cost way, and it has applications in most scientific and industry sectors. There is still significant interest in improving NN capabilities and accuracy [51, 131, 168]. However, with the advances and opportunities TinyML is capable of offering, it is evident that there is now a need to keep accuracy at high levels while also optimizing and compressing the models so that they can be used on power-constrained hardware, namely MCUs. Finally, there is a great need to democratize AI technology. Current AI implementations are developed with businesses and end-consumers in mind and are built to run on research workstations, without considering how the same could be achieved in real-world applications and hardware, let alone constrained hardware. TinyML could bring AI algorithms to more sectors, connect communities, and achieve a digital revolution with more tailored results extracted from data collected near the source of each sector and user. As mentioned above, MCUs are among the best-suited hardware devices due to their low cost and worldwide accessibility [161, 217].

Table 1 provides additional educational material. These educational materials are categorized according to their type (book, course, tutorial, and so on). The content covered, the development boards utilized, and the creators are also mentioned.

Table 1.

Name	Type	Content	Development board	Creators
Computer Vision with Embedded Machine Learning [42]	Online course	CV, image classification, deployment, projects, MCUs	Raspberry Pi, Arduino Portenta	Edge Impulse, Coursera, OpenMV, Seeed, TinyML Foundation
Hello World of AI [181]	Online course	Codecraft, Edge Impulse, projects, data acquisition, hardware, model training, deployment	Wio Terminal	Seeed, Benjamin Cabé
Everything About TinyML – Basics, Courses, Projects & More [180]	Collection	Basics, courses, projects	Arduino Nano RP2040, Wio Terminal, Seeeduino XIAO	Seeed, Jonathan Tan
TinyML Study Group [80]	Collection	Basics, optimization, hardware, research paper reading	Neural Compute Stick, Edge TPU board	Archana Vaidheeswaran, Soham Chatterjee
TinyML Example: Anomaly Detection [77]	Tutorial	Data collection, training, Python, ML models, anomaly detection	Adafruit Feather ESP32	Shawn Hymel
Handwriting Recognition [88]	Tutorial	Model training, hardware, evaluation, inference	SparkFun 9DoF	Naveen
CurrentSense-TinyML [78]	Tutorial	Python, TensorFlow, MCU behavior detection	Arduino Nano 33 BLE Sense	Daniel Cuthbert, Thomas Roth, Mark C.
TapLock - A bike lock with machine learning [89]	Tutorial	Project, MCUs, Edge Impulse, applications	Arduino Nano 33 BLE Sense	Team TapLock
Building a TinyML Application with TF Micro and SensiML [204]	Tutorial	Sensors’ utilization, data collection, hardware, deployment	Arduino Nano 33 BLE Sense	Chris Knorowski
Easy TinyML on ESP32 and Arduino [87]	Tutorial	Project, Python, MCU, libraries, deployment	Arduino Nano 33, SparkFun Edge or STM32F746G discovery kit	Eloquent Arduino
Cough Detection with TinyML on Arduino [17]	Tutorial	Data collection, impulse creation, training, deployment	Arduino Nano 33 BLE Sense	UN, Hackster, Edge Impulse
AI Speech Recognition [41]	Tutorial	ML on MCUs, compile, deployment, Python	SparkFun Edge	Dansitu, Mnatraj
TinyML Cookbook [102]	Book	ML on MCUs, Arduino, applications, Python	Arduino Nano 33 BLE Sense, Raspberry Pi Pico	Gian Marco Iodice

Table 1. Additional Educational Materials about TinyML Technology

4 Hardware and Software Implementations with TinyML

4.1 TinyML-based Software

TinyML offers tangible solutions to the urgent energy consumption problems and computational limitations plaguing conventional ML deployment. TinyML employs tiny, energy-efficient processors to perform complex computations on the device itself, unlike conventional methods that often require powerful servers and significant energy resources. This eliminates the need for constant communication with central servers, thereby decreasing latency and energy consumption. Additionally, sensitive data can remain on the device, allowing for increased privacy and security. TinyML represents a significant paradigm shift in implementing and deploying ML, bridging the divide between cutting-edge technology and practical, real-world applications. To achieve this, various frameworks and techniques have been employed to allow the deployment of ML models on constrained hardware devices, specifically MCUs. These tools seek to optimize and efficiently execute ML models, considering these devices’ limited memory and power. The frameworks include model conversion, inference execution, hardware-specific optimizations, and support for multiple DL frameworks. Figure 3 is an infographic that visually classifies and illustrates these components of the TinyML ecosystem, providing a comprehensive overview of the various categories and frameworks. The available software can be categorized according to the primary functionality or purpose of each framework/tool as follows:

Fig. 3.

4.1.1 All-in-one Frameworks.

This subsection explores various frameworks designed for training, optimizing, and deploying ML models to MCUs, enabling embedded devices to perform deep learning tasks within specific memory constraints.

TensorFlow Lite for Microcontrollers is a port of Google’s TensorFlow Lite developed specifically for running DL on embedded devices with a few kilobytes of memory, hence significantly expanding the scope of ML.

The Embedded Learning Library (ELL) enables the development and deployment of ML models on platforms with limited resources. A prerequisite is a computer with sufficient resources to generate the machine code the embedded device will execute.

Edge Impulse [60] is a cloud-based framework that offers a complete pipeline to integrate ML models on MCUs. First, it allows the user to collect data directly from the device, label them, and create a dataset on the cloud. Edge Impulse provides prebuilt ML blocks that can be trained on the provided dataset and also lets users create new ML blocks based on the project’s needs. The user can test the model’s performance in a virtual simulation before deploying the model on a physical device. Moreover, while evaluating the model’s performance on devices, users can access live classification data to check whether the model behaves as expected in real time. Another notable feature of this framework is its broad support of boards and MCUs.

The NXP® eIQ® [152] ML software development environment supports the implementation of ML algorithms on NXP EdgeVerse™ MCUs and microprocessors, including i.MX RT crossover MCUs and i.MX family application processors. The eIQ ML software suite contains the eIQ Toolkit, an ML workflow tool, as well as inference engines, NN compilers, and optimized libraries. The eIQ Toolkit offers graph-level profiling with runtime insights to optimize NN topologies and simplifies ML development using the eIQ Portal and command-line host tools. The eIQ Portal streamlines ML development by providing an intuitive Graphical User Interface (GUI) that allows users to create, optimize, debug, convert, and export ML models.

A comprehensive, all-encompassing framework emerges as the best option for researchers making their initial forays into TinyML and edge devices. The primary justification for this recommendation is the framework’s streamlined efficiency: Researchers are equipped with a single tool that integrates their models’ training, optimization, and deployment processes. Inspecting these frameworks in greater detail reveals two distinct categories: code-based and UI-based frameworks.

TensorFlow Lite for Microcontrollers (TFLM) requires users to possess coding skills, as each phase, from initial model construction to final deployment, must be carried out programmatically. In contrast, frameworks such as Edge Impulse and NXP eIQ simplify the procedure and provide a more user-friendly approach. With these platforms, researchers can navigate multiple stages, from training to deployment, with a series of mouse clicks. Such intuitive interfaces can significantly reduce the technical barriers typically encountered in the early stages of research, enabling a more inclusive exploration of TinyML on edge devices.

4.1.2 Model Conversion.

This section focuses on the frameworks and tools that facilitate the conversion of previously trained models into formats suitable for constrained hardware, addressing the gap between complex ML algorithms and the limited resources available on embedded systems.

AIfES (Artificial Intelligence for Embedded Systems) is an open-source artificial intelligence (AI) software framework that can aid in the training and deployment of artificial neural networks (ANN) on a broad variety of hardware. It was created as a Maker project at the Fraunhofer Institute for Microelectronic Circuits and Systems IMS. Feedforward Neural Networks (FNN) are supported in the current release, and it is proposed that Convolutional Neural Networks (ConvNet) will also be implemented in the near future.

Tinymlgen is a Python library that allows TensorFlow models to be exported to C format with minimal code requirements. It accepts a TensorFlow model and returns the necessary C code to be integrated into an Arduino sketch.
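The core idea behind exporting a model to C can be illustrated without any ML dependency. The following is a minimal sketch, not Tinymlgen's actual implementation: it assumes the model has already been serialized to a byte string and renders those bytes as a C array that an Arduino sketch could include. The function name `bytes_to_c_array`, the array name `demo_model`, and the stand-in model bytes are hypothetical.

```python
def bytes_to_c_array(data: bytes, name: str = "model_data", per_line: int = 12) -> str:
    """Render a byte string as C source holding the same bytes."""
    lines = []
    for start in range(0, len(data), per_line):
        chunk = data[start:start + per_line]
        # Each byte becomes a two-digit hexadecimal literal.
        lines.append("  " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    body = "\n".join(lines)
    return (
        f"const unsigned char {name}[] = {{\n{body}\n}};\n"
        f"const unsigned int {name}_len = {len(data)};\n"
    )


# Stand-in for a serialized model (assumption); a real flow would use the
# bytes produced by the model converter instead.
fake_model = bytes(range(20))
c_source = bytes_to_c_array(fake_model, name="demo_model")
print(c_source)
```

Storing the model as a constant array lets the toolchain place it in flash rather than RAM, which matters on devices with only a few kilobytes of memory.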

Sklearn-porter is another tool capable of converting scikit-learn estimators to C, Java, and JavaScript, among others, to be utilized in embedded systems. Similarly, m2cgen is a lightweight library that is utilized to convert trained ML models into code that may be deployed to constrained devices. Additionally, another tool, weka-porter, is used to transpile trained models from Weka to ready-to-be-deployed code.
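To make the transpilation idea concrete, the sketch below turns a small decision tree into a C function made of nested if/else statements. It is an illustration under stated assumptions and does not reproduce the output format of sklearn-porter, m2cgen, or weka-porter: the tree is a hand-written dictionary (hypothetical keys `feature`, `threshold`, `left`, `right`, `label`) rather than a trained estimator, and the helper names are invented for this example.

```python
def tree_to_c(node, depth=1):
    """Recursively emit nested if/else C statements for a decision tree."""
    pad = "    " * depth
    if "label" in node:  # leaf node: return the class index
        return f"{pad}return {node['label']};\n"
    cond = f"x[{node['feature']}] <= {float(node['threshold'])!r}f"
    return (
        f"{pad}if ({cond}) {{\n"
        + tree_to_c(node["left"], depth + 1)
        + f"{pad}}} else {{\n"
        + tree_to_c(node["right"], depth + 1)
        + f"{pad}}}\n"
    )


def export_classifier(tree, name="predict"):
    """Wrap the generated statements in a C function definition."""
    return f"int {name}(const float *x) {{\n{tree_to_c(tree)}}}\n"


def predict(node, x):
    """Reference evaluation in Python, used to check the tree logic."""
    while "label" not in node:
        go_left = x[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


# Hand-written toy tree (assumption); a real tool would read a trained model.
toy_tree = {
    "feature": 0, "threshold": 2.5,
    "left": {"label": 0},
    "right": {
        "feature": 1, "threshold": 1.0,
        "left": {"label": 1},
        "right": {"label": 2},
    },
}
c_code = export_classifier(toy_tree)
print(c_code)
```

The generated function needs no runtime library and no dynamic memory, which is why this style of code generation suits bare-metal targets.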

The Python-based EmbML converts off-board-trained models into C++ or even C source code that can then be compiled and executed on low-power microcontrollers. The primary objective of EmbML is to generate classifier source code that runs specifically on hardware systems with limited resources, using bare-metal programming.

FANN-on-MCU is a toolkit based on the Fast Artificial Neural Network (FANN) library for deploying efficient NNs on MCUs based on the ARM Cortex-M series as well as the novel RISC-V-based Parallel Ultra-Low-Power (PULP) platform.

Apache TVM [10] is an open-source compiler framework that optimizes existing and new ML models for any hardware platform. The two main features of this framework are the compilation of the model to the minimum deployable modules and the optimization of more backend runtimes with better performance. As it is a compiling framework, the supported platforms include CPUs, GPUs, MCUs, FPGAs, and more. Furthermore, this framework is flexible and adaptable to various performance techniques such as quantization and memory planning, optimizing the compiled model to be used on as many platforms as possible.

ScaleDown [179] is an open-source platform for optimizing and deploying NN models to TinyML devices. This is accomplished by providing a framework-independent API that supports widely used DL frameworks. The framework is under development, and new features are constantly added. The optimization techniques offered are pruning, quantization, and knowledge distillation. Additionally, the authors claim to offer conversion of models between different frameworks such as TF, PyTorch, OpenVINO, and ONNX.
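As a brief illustration of pruning, one of the optimization techniques named above, the following sketch applies unstructured magnitude pruning to a flat list of weights. It is a generic textbook variant and makes no claim about how ScaleDown implements pruning; the function name `magnitude_prune` and the example weights are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitude."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by ascending magnitude; the first n_prune are removed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


# Example weights (assumption, chosen only for illustration).
weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, 0.4, -0.1]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)
```

In practice, pruning is usually followed by fine-tuning to recover accuracy, and the zeros only save memory if the weights are stored in a sparse or compressed format.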

AIfES, Tinymlgen, Sklearn-porter, and Apache TVM are crucial in bridging the gap between advanced ML models and hardware environments with limited resources. These tools may prove useful in areas such as automation, robotics, mobile applications, industrial control systems, and energy-efficient computing. They have facilitated the deployment of neural networks and ML models on a wide range of platforms, from Arduino devices to microcontrollers based on the ARM architecture. Essentially, they encourage innovation and facilitate the rapid integration of ML capabilities into embedded systems with limited resources.

4.1.3 Hardware Accelerators and Hardware-specific Tools.

This section examines a variety of TinyML-specific hardware accelerators and tools, focusing on optimization techniques that cater to constrained hardware requirements and enable the deployment of ML models on edge devices.

Arm NN is an ML inference engine for Android and Linux that accelerates ML on Arm Cortex-A CPUs and Arm Mali GPUs by leveraging a number of Arm architecture-related optimizations. This ML inference engine is an open-source software development kit that bridges the gap between current NN frameworks and energy-efficient Arm IP.

The CMSIS NN software library is a collection of efficient NN kernels designed to optimize the performance and memory requirements of NNs on Cortex-M processors. The core functions of the library include Convolution Functions, Activation Functions, Softmax Functions, and Basic Math Functions.

On STM32 microcontroller platforms, NanoEdge AI Studio enables the deployment of ML models with on-device learning capabilities. Its step-by-step instructions for collecting and validating data and for generating the C code to be incorporated into the final project enable the development of multiple applications, including anomaly detection and classification.

X-CUBE-AI is an STM32Cube Expansion Package that contributes to the STM32Cube.AI ecosystem. It can automatically convert pre-trained models into STM32CubeMX executables.

HANNAH is a hardware accelerator and NN search framework designed to solve the hardware-software co-design prerequisite of TinyML and meet constrained hardware requirements. As the authors of the framework suggest [29], HANNAH aims to automate the training, deployment, and optimization process of NN architecture and bring intelligent data processing to edge devices. The NPU architecture that is based on UltraTrail [27] and utilized at the deployment step embodies an array of multiply and accumulate units for convolutional layers and provides analytical models to avoid long times during simulation and synthesis of the hardware. During the optimization process, a joint search space is used to combine the NN and the NPU design space. There is the option to restrict the number of design choices, and the optimization function is assembled by metrics such as the estimated power and latency.

Hls4ml [68] is an open-source framework that provides software and hardware co-design for ML algorithms targeting FPGA devices and ASIC technology. The framework’s workflow can automatically translate NNs with specifications such as the model’s architecture and weights into a hardware accelerator ready to be processed with High-Level Synthesis (HLS) tools. The network can be designed with (Q)Keras and PyTorch before being translated into an HLS project. The workflow also includes quantization and pruning aware training for the essential optimization to prepare the network for device implementation.

As presented in Reference [201], the PULP framework can run non-neural ML algorithms such as K-means, KNN, and SVM on the constrained hardware of PULP MCUs. The authors developed the system to target the commercial chip GAP8 and the research platform PULP-OPEN, which is able to run on FPGA emulators. The main focus of the framework in consideration is to achieve the same level of performance when compared to NNs due to the parallel design of the algorithms. This design allows speedups of up to 7.64 \(\times\). Finally, the 8-core cluster of the PULP-OPEN platform can achieve up to a 15.85 \(\times\) performance improvement compared to a Cortex-M4.

Accelerators and tools such as Arm NN, CMSIS NN, NanoEdge AI Studio, and HANNAH are revolutionizing the deployment of ML models on edge devices. They provide optimization techniques tailored to constrained hardware requirements for applications ranging from real-time gesture recognition in wearables to anomaly detection in industrial machinery. They have contributed to the expansion of TinyML, which has enabled the efficient execution of DL algorithms on platforms such as Android devices, Cortex-M processors, STM32 microcontrollers, and FPGA devices. By aligning with the demands for energy efficiency and real-time processing, they are establishing a new standard for intelligent data processing at the edge and creating novel possibilities for ML applications in scenarios where hardware with limited resources is required.

4.1.4 Tools for Deployment.

This section describes the specialized tools that facilitate the deployment of ML models on microcontrollers and embedded devices, expediting the deployment process and ensuring that models are executed efficiently on constrained platforms.

emlearn is an inference engine for Machine Learning on Microcontrollers and Embedded Devices. The tool enables the training of the model using scikit-learn or Keras and then generates portable C99 code that can be deployed to constrained devices by including a single header file.

uTensor is a TensorFlow-based, Arm-optimized ML inference framework with a minimal footprint. It comprises a runtime library and an offline utility that performs most of the model translation work.

DORY [31] is a framework for automatically deploying DNNs on low-cost MCUs with less than 1 MB of available SRAM. The framework maximizes L1 memory use within the limitations given by each DNN layer. It generates ANSI C code to manage transfers and calculation phases. The authors’ experiments on the GreenWaves Technologies GAP8 revealed that DORY achieves up to 2.5 \(\times\) better MAC/cycle than the GreenWaves solution and 18.1 \(\times\) better MAC/cycle than the result on an STM32-H743 MCU. When utilizing the framework in consideration, GAP8 can achieve end-to-end inference of a 1.0-MobileNet-128 network at an average of 63 pJ/MAC @ 4.3 frames per second, which is 15.4 times faster than an STM32-H743.

The aforementioned tools are primarily designed for developers with higher expertise, especially those who have already trained and optimized their ML models. In addition, for professionals and researchers who choose well-known libraries, such as scikit-learn or Keras, for their model training, emlearn appears to be particularly appropriate. emlearn has received much attention and appears to be establishing itself as a preferred option among the academic community. This claim is supported by the growing number of scholarly works that cite or employ emlearn in their methodologies and findings.

4.1.5 Tools for Optimization.

This section explores the open-source libraries and frameworks that provide optimization techniques for deploying quantized neural networks on MCUs. It highlights innovations that reduce computational and resource costs, making them ideal for hardware environments with limited resources.

CMix-NN is an open-source mixed-precision library for MCU deployment of quantized neural networks. The library supports convolutional kernels with a bit precision of 8, 4, or 2 bits for any of the convolution operands.
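The memory benefit of 8-, 4-, and 2-bit operands can be shown with a small sketch. It assumes simple affine quantization to unsigned codes and a lowest-bits-first packing order, both chosen here for illustration; it is not a description of CMix-NN's kernels or storage layout, and all function names are hypothetical.

```python
def quantize(values, bits):
    """Affine-quantize floats to unsigned integer codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Map integer codes back to approximate float values."""
    return [c * scale + lo for c in codes]


def pack(codes, bits):
    """Pack 8-, 4-, or 2-bit codes into bytes, lowest bits first."""
    if bits not in (8, 4, 2):
        raise ValueError("bits must be 8, 4, or 2")
    per_byte = 8 // bits
    out = bytearray()
    for start in range(0, len(codes), per_byte):
        byte = 0
        for k, code in enumerate(codes[start:start + per_byte]):
            byte |= code << (k * bits)
        out.append(byte)
    return bytes(out)


def unpack(packed, bits, count):
    """Inverse of pack: recover `count` codes from the packed bytes."""
    mask = (1 << bits) - 1
    per_byte = 8 // bits
    return [(packed[i // per_byte] >> ((i % per_byte) * bits)) & mask
            for i in range(count)]


# Example weights (assumption, chosen only for illustration).
weights = [-1.0, -0.5, 0.0, 0.5, 1.0, 0.25, -0.25, 0.75]
sizes = {}
for bits in (8, 4, 2):
    codes, scale, lo = quantize(weights, bits)
    packed = pack(codes, bits)
    restored = dequantize(codes, scale, lo)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    sizes[bits] = len(packed)
    print(f"{bits}-bit: {len(packed)} bytes, max error {worst:.4f}")
```

Halving the bit width halves the storage, while the worst-case rounding error grows with the quantization step; mixed precision lets each layer or operand trade these off separately.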

PhiNets [155] is a scalable framework based on residual blocks designed to reduce the computational and resource cost for image processing on constrained hardware. The framework builds the network with a sequence of inverted blocks where each one is followed by a swish activation function. Additionally, the first layer is represented by 24a, where a is a hyperparameter, and the multiplication factor is doubled each time the feature map is downsampled. After a convolutional block, Squeeze-and-Excitation blocks [99] are inserted, and skip connections are utilized among the bottleneck layers of the same resolution. Across the network, five strided convolutions are implemented for the required down-sampling of the feature maps, giving a reduction by a factor of 32 \(\times\) between the input and output tensor. In the last convolutional block, there is a neck consisting of a 2 \(\times\) up-sampling layer and a skip connection to help with the performance degradation.

AutoML [192] is a framework that utilizes hybrid-block structured and Pattern Pruning (PP) to enable the efficient execution and reconfiguration of NLP models based on transformers on constrained hardware. This reconfigurability is critical for energy savings in battery-powered devices, which frequently employ the Dynamic Voltage and Frequency Scaling (DVFS) technique for hardware reconfiguration to extend battery life. The optimization strategy comes on two levels. First, it compresses the data using an efficient Block Pattern (BP) and then heuristically shrinks the search space based on the initial findings.

As described in Section 2, where various optimization techniques are presented, the libraries under consideration have exhibited remarkable proficiency. Specifically, they have been able to make significant progress in the field of model and neural network (NN) compression. These enhancements are crucial, because they facilitate the deployment of these models and NNs on edge devices, which are frequently characterized by severe resource limitations. This capability illustrates their potential to facilitate the wider adoption of ML solutions in environments with limited resources.

Moreover, it is worth mentioning that one of the first attempts at online learning and on-device training already exists. TinyOL (TinyML with Online-Learning) permits incremental training on-device for streaming data. TinyOL is founded on the concept of online learning and is appropriate for IoT devices with limited resources.
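The principle of incremental training on streaming data can be sketched as follows. This is a generic online stochastic gradient descent update for a single linear unit, shown only to illustrate the concept of online learning; it is not TinyOL's published algorithm, and the class name `OnlineLinear`, the learning rate, and the synthetic data stream are assumptions.

```python
import random


class OnlineLinear:
    """A single linear unit trained one sample at a time."""

    def __init__(self, n_inputs, lr=0.05):
        self.w = [0.0] * n_inputs
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def update(self, x, y):
        """One SGD step on the squared error of a single sample.

        Returns the prediction error measured before the update.
        """
        err = self.predict(x) - y
        for i, xi in enumerate(x):
            self.w[i] -= self.lr * err * xi
        self.b -= self.lr * err
        return err


# Synthetic stream (assumption): samples arrive one by one and are
# discarded after the update, so no dataset is kept in memory.
random.seed(0)
model = OnlineLinear(n_inputs=2)
for _ in range(2000):
    x = [random.uniform(-1, 1), random.uniform(-1, 1)]
    y = 2.0 * x[0] - 1.0 * x[1] + 0.5
    model.update(x, y)

print(model.w, model.b)
```

Because each sample is used once and then dropped, memory use stays constant regardless of how long the stream runs, which is the property that makes this style of training feasible on MCUs.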

Table 2 provides a summary of the aforementioned frameworks, tools, and libraries, including the algorithms and platforms utilized by each software, the supported programming languages and compatible libraries, as well as the software’s creators and open source status.

Table 2.

Name	Algorithm	Platforms	Languages	Libraries	Applications	Open Source	Creator	Type
TensorFlow Lite	NN	Cortex-M, Cadence Tensilica	C++, Java, Python, Swift, Objective-C	TF, TF Hub	image & text classification, object detection, pose estimation, question answering	Yes	Google	Framework
ELL	NN	Cortex-M, Cortex-A, micro:bit	C++, Python	CNTK, ONNX, Darknet	image & audio classification	Yes	Microsoft	Framework
ARM-NN	NN	Cortex-A, Mali GPU, Ethos NPU	C	TF Lite, ONNX	variety of applications	Yes	ARM	Tool
CMSIS-NN	NN	Cortex-M	C	TF, Caffe, PyTorch	N/A	Yes	ARM	Library
NanoEdge AI Studio	NN	Cortex-M	C	N/A	anomaly & outlier detection, classification, regression	No	STMicroelectronics	Framework
X-Cube AI	NN, classic ML	STM32	C	Keras, Caffe, ConvnetJS, Lasagne, ONNX	isolation forest, SVM, K-means, etc.	No	STMicroelectronics	Library-Tool
AIfES	FNN	Windows (DLL), Raspberry Pi, Arduino UNO, Nano 33 BLE Sense & Portenta H7, STM32 F4, ATMega32U4	C++, C#, Python, Java, VB.NET	TF, Keras, PyTorch	IoT sensors, medical wearables, smart environments, condition monitoring	Yes	Fraunhofer IMS	Framework
TinyMLgen	NN	Arduino Nano 33 BLE Sense, STM32, SparkFun Edge, ESP32	C	TF Lite	variety of applications	Yes	Individual	Library-Tool
MicroMLGen	Decision Tree, Random Forest, XGBoost, SEFR, GaussianNB, SVC, OneClassSVM, Relevant Vector Machines, PCA	Arduino Uno, Nano, Micro, etc., ESP32, any MCU supporting C	Python	SciKit-learn	classifiers	Yes	Individual	Library-Tool
emlearn	Decision trees, NNs, Naive Gaussian Bayes, Random Forest	AVR Atmega, ESP8266, ESP32, Cortex-M	Python	SciKit-learn, Keras	classifiers, outlier & anomaly detection	Yes	Individual	Library-Tool
sklearn-porter	Decision trees, NNs, Naive Gaussian Bayes	constrained platforms	Java, JS, C, Ruby, Go, PHP	SciKit-learn	classifiers, regression	Yes	Individual	Library-Tool
m2cgen	Linear & Logistic regression, NNs, SVM, Decision tree, Random Forest, LGBM, classifiers	constrained platforms	C, C#, F#, Dart, Go, Haskell, Java, JavaScript, R, PHP, Python, Ruby, Rust, Visual Basic	SciKit-learn	classification, regression	Yes	Individual	Library-Tool
weka-porter	Decision trees	constrained platforms	C, Java, Javascript	Weka	classification	Yes	Individual	Library-Tool
EmbML	Logistic regression, Decision trees, MLP, SVM	Arduino, Cortex-M4	C++, Java, Javascript	SciKit-learn, Weka	classification	Yes	Research group	Library-Tool
uTensor	NN	Arduino, Cortex-M4	C++	TF	N/A	Yes	Individual	Framework
TinyOL	NN	Cortex-M	C++	any NN	N/A	No	Siemens - Research group	Framework
FANN-on-MCU	NN	Cortex-M, RISC-V	C	FANN	variety of applications	Yes	Research group	Library-Tool
CMix-NN	NN	Cortex-M	C	MobileNet	N/A	Yes	Research group	Library-Tool
Edge Impulse	NN	MCU, CPU, accelerators	C++	N/A	Anomaly detection, classification	No	Edge Impulse	Framework
Apache TVM	NN	CPU, FPGA, GPU, MCU	C++, Rust, Java	Keras, CoreML, PyTorch, TF, MXNet, DarkNet	variety of applications	Yes	Apache	Framework
NNTOOL	NN	Cortex-M, Cortex-A	C, C++	TFLite, ONNX, DeepView RT	Classification	No	NXP	Library-Tool
DORY	NN	Cortex-M, STM32	ANSI C	MobileNet	classification	Yes	Research group	Framework
AutoML RT	NN	N/A	N/A	N/A	NLP	N/A	Research group	Framework
ScaleDown	NN	Constrained platforms	Python	TF, PyTorch, ONNX, OpenVINO	classification	Yes	Research group	Library
PULP	ML algorithms	PULP-OPEN hardware	C	N/A	N/A	Yes	Research group	Library
hls4ml	NN	FPGA	Python	Keras, ONNX, TF, QKeras, PyTorch	N/A	Yes	Individual	Library
PhiNets	NN	Constrained platforms	Python	PyTorch, Keras	image & signal processing	Yes	Research group	Framework
HANNAH	NN	Constrained platforms	C	N/A	wearable healthcare devices, key-word spotting	N/A	Research group	Framework
OctoML	NN	ARM, Xilinx, NVIDIA, Intel, Qualcomm	C	N/A	NLP, CV	No	OctoML	Framework

Table 2. TinyML Software

4.2 TinyML-based Hardware and Development Boards

The advantages of embedding ML models on hardware are numerous and, as mentioned earlier, security, instant and tailored results, and the creation of autonomous devices are some of the most prominent examples. Several researchers are also exploring this new sector and analyzing why the industry should start running model inference on hardware, how it is achievable, and finally, why ML technology must become more accessible to everyone [49, 176, 217, 232]. Two of the most popular boards capable of running any type of software, including the technology in consideration, are the Raspberry Pi and the Jetson Nano from Nvidia [151]. The Jetson is a great example of a tiny computer able to fit in a palm. It is powerful and most commonly used in automotive and robotics. In those two sectors, applications already require a large power source, which can also power the demanding board from Nvidia. For smaller, innovative, and less demanding applications such as wearable devices, the need for an external power supply makes using the board prohibitive [222]. The solution is the use of a different kind of constrained hardware, the MCU. An MCU is a hardware platform that operates below 1 mW and has a few kilobytes of RAM, typically with the same amount of flash memory for data storage. Nowadays, most MCUs have switched to 32-bit CPUs, thanks to ARM. To be able to operate on a single small battery, MCUs do not always have an operating system, are single-threaded, and avoid the use of dynamic memory allocation functions [24, 72, 199, 222]. With the advances in technology and the growing need for more ingenious IoT devices, intelligent machines, and wearable health-related devices, many manufacturers have developed small-form-factor and low-cost boards specifically designed for ML inference.

A popular board is the Edge from SparkFun, which uses an Ambiq Apollo3 MCU (the MCU’s schematic is visualized in Figure 4). The MCU is powered by an Arm Cortex-M4, operates at 48 MHz, and has a TurboSPOT burst mode for reaching 96 MHz. The board requires an ultra-low supply current of 6 \(\mu\)A/MHz when executing from flash at 3.3 V; it has 1 MB of flash memory and up to 384 KB of RAM [7]. The Nano 33 BLE Sense [14] is the board of preference for many educational courses, since it has a unique TinyML Kit equipped with the components and hardware required for students to start embedding ML models, and it is also used for research and small projects [18, 19, 48, 170, 172].

Fig. 4.

Table 3 summarizes the development boards mentioned above, as well as additional boards that use TinyML technology. The type of MCU and CPU, the CPU clock, the capacity of RAM and flash memory, and the dimensions of each board are reported.

Table 3.

Board	MCU	CPU	CPU Clock	Flash memory	SRAM	Dimensions	Applications
Seeeduino XIAO [182]	SAMD21G18	ARM Cortex-M0+	up to 48 MHz	256 KB	32 KB	20 \(\times\) 17.5 \(\times\) 3.5 mm	wearable devices, rapid prototyping
B-L475E-IOT01A Discovery kit [196]	STM32L4	ARM Cortex-M4	80 MHz	1 MB	128 KB	61 \(\times\) 89 \(\times\) 9 mm	applications with direct connection to the cloud
Syntiant TinyML [200]	NDP101	Cortex-M0+	48 MHz	256 KB	32 KB	24 \(\times\) 28 mm	speech and sensor applications
Arduino Nano 33 BLE Sense [14]	nRF52840	Cortex-M4	64 MHz	1 MB	256 KB	45 \(\times\) 18 mm	wake word and motion detection
Portenta H7 [15]	STM32H747	Cortex M7 and Cortex M4	480 MHz and 240 MHz	16 MB NOR Flash	8 MB SDRAM	62 \(\times\) 25 mm	computer vision, robotics controller, laboratory equipment
Sony Spresense [193]	CXD5602	ARM Cortex-M4F \(\times\) 6 cores	156 MHz	8 MB	1.5 MB	50 \(\times\) 20.6 mm	sensor analysis, image processing
Raspberry Pi 4 Model B [166]	BCM2711	Quad-core Cortex-A72	1.5 GHz	N/A	2 GB, 4 GB, or 8 GB SDRAM	56.5 \(\times\) 86.6 mm	robotics, smart home
Raspberry Pi Pico [167]	RP2040	Dual-core ARM Cortex-M0+	up to 133 MHz	2 MB	264 KB	51 \(\times\) 21 mm	wake-up words
Jetson Nano [151]	N/A	Quad-core ARM A57	1.43 GHz	N/A	4 GB LPDDR4	70 \(\times\) 45 mm	robotics, computer vision
SparkFun Edge [195]	Apollo3	ARM Cortex-M4F	up to 96 MHz	1 MB	384 KB	40.6 \(\times\) 40.6 mm	motion sensing
Adafruit EdgeBadge [4]	ATSAMD51J19	ARM Cortex-M4F	120 MHz	512 KB	192 KB	86.3 \(\times\) 54.3 mm	image processing
Wio Terminal [183]	ATSAMD51P19	ARM Cortex-M4F	120 MHz	4 MB	192 KB	72 \(\times\) 57 mm	remote control, monitoring
Himax WE-I [97]	HX6537-A	ARC 32-bit DSP	400 MHz	2 MB	2 MB	40 \(\times\) 40 mm	image processing, voice and ambient sensing
ESP32-S3-DevKitC [67]	ESP32-S3-WROOM-1	32 bit Xtensa dual core	240 MHz	N/A	512 KB	N/A	rapid prototyping
Arducam Pico4ML-BLE [12]	RP2040	Dual-core ARM Cortex-M0+	133 MHz	2 MB	264 KB	51 \(\times\) 21 mm	image processing, data collection

Table 3. Development Boards that Support TinyML
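The flash and SRAM figures in Table 3 can be used for a first feasibility check before choosing a board. The sketch below is only a rough estimate under stated assumptions: the flash and SRAM values for three boards are taken from Table 3 (with 1 MB counted as 1024 KB), while the firmware and runtime overheads, the example model size, and the working-memory size are hypothetical numbers that would have to be measured for a real project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Board:
    name: str
    flash_kb: int
    sram_kb: int


# Flash and SRAM capacities as listed in Table 3.
BOARDS = [
    Board("Seeeduino XIAO", 256, 32),
    Board("SparkFun Edge", 1024, 384),
    Board("Wio Terminal", 4096, 192),
]


def fits(board, model_kb, working_kb, firmware_kb=64, runtime_ram_kb=16):
    """Check whether a model plausibly fits on a board.

    model_kb:        size of the stored model (kept in flash)
    working_kb:      RAM needed for activations during inference
    firmware_kb:     assumed flash used by application code (hypothetical)
    runtime_ram_kb:  assumed RAM used by stack and buffers (hypothetical)
    """
    flash_ok = model_kb + firmware_kb <= board.flash_kb
    sram_ok = working_kb + runtime_ram_kb <= board.sram_kb
    return flash_ok and sram_ok


# Hypothetical model: 300 KB of weights, 100 KB of working memory.
candidates = [b.name for b in BOARDS if fits(b, model_kb=300, working_kb=100)]
print(candidates)
```

A check of this kind only rules boards out; clock speed, available peripherals, and power budget still have to be weighed separately.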

5 Applications with TinyML

This section provides a chronological overview of publications linked to TinyML. These works are classified according to their field of study.

5.1 Healthcare

Paul et al. [159] developed a real-time TinyML-based American Sign Language system. The proposed system is an efficient CNN-based fingerspelling recognizer deployed on a Cortex-M7 board, with a reported size of 158 KB. The solution reduces the accuracy drop caused by quantization and generalizes the model via interpolation augmentation. TinySpeech [226] is a family of low-precision DNNs designed for limited-vocabulary speech recognition. The experimental results showed significantly lower architectural and computational complexity than other DNNs for limited-vocabulary speech recognition.

The authors of Reference [71] implemented recurrent NNs for hearing aid hardware. Several TinyML techniques, such as pruning and integer quantization of weights/activations, were utilized.
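To make the pruning step mentioned above concrete, the following is a minimal sketch of unstructured magnitude pruning, one common variant of the technique. It is an illustration under stated assumptions and not the procedure used in Reference [71]: the function name `magnitude_prune`, the `sparsity` parameter, and the weight values are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    k = int(len(weights) * sparsity)  # number of weights to remove
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:k]:
        pruned[i] = 0.0
    return pruned


# Hypothetical layer weights, pruned to 50% sparsity.
weights = [0.9, -0.05, 0.3, 0.01, -0.7, 0.12, -0.02, 0.45]
pruned = magnitude_prune(weights, sparsity=0.5)
```

In practice, pruning is usually followed by fine-tuning to recover accuracy, and the memory saving materializes only if the zeros are stored in a sparse or compressed format.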

A wearable system based on regular shoes for foot gesture recognition was implemented by Orfanidis et al. [154]. The system tracks specific foot gestures of the user in a smart city environment and notifies the user’s relatives when the user is in danger. The overall process was implemented with an NN embedded in an MCU. The evaluation showed 98% accuracy across five different activities and foot gestures.

A wearable TinyML-based wristband for hand gesture recognition was proposed in Reference [28]. The experimental results show that the proposed TinyML-based wristband can recognize seven hand gestures with 96.4% accuracy. Also, a hand gesture recognition approach for wearable devices is presented in Reference [234]. The proposed approach uses accelerometer and electromyogram signals for a dual-stage classification in a memory-efficient way. The TinyML-based approach achieved a 93.34% classification accuracy, a 17.79% improvement of the trained model, and a significant decrease in the device’s memory footprint.

A system that combines cloud computing with TinyML for prognostics and health management was proposed in Reference [237]. The authors investigate sensor-based applications and predict health status, combining TinyML with cloud computing to adapt to the system’s diverse requirements, such as power, latency, and communication.

A non-invasive TinyML-based system for real-time activity tracking for elderly people and their nurses was proposed by T’Jonck et al. [208]. The system provides real-time information on concerns affecting elderly people, such as incontinence, night wandering, and pressure ulcers. Furthermore, a platform-agnostic framework for body-pose estimation in healthcare was proposed in Reference [215]. The proposed application monitors patients and raises an alert when they fall out of bed, have an accident, or experience restricted movement.

Ooko et al. [153] used TinyML-based NN models to predict chronic obstructive pulmonary disease in real time. The proposed approach improves the inference accuracy of portable and non-invasive self-diagnostic kits for respiratory diseases. Also, a low-cost mechanical ventilator, namely, A-Vent, was developed by Cabacungan et al. [32]. The proposed ventilator uses TinyML models to detect patient-ventilator asynchrony. The experimental results showed that TinyML models could be trained to detect breathing anomalies and provide low-cost, real-time remote monitoring.

Most of the aforementioned healthcare systems reported improved accuracy because of the use of TinyML. Furthermore, real-time and tailored results are crucial for healthcare devices and systems. Finally, because data are processed locally, sensitive data remain private and secure.

5.2 Automotive

In Reference [115] a wearable TinyML-based alcohol sensor was proposed. The proposed wearable system detects the level of alcohol consumed. The system’s ML model was trained on data collected by the sensors, and the TinyML model was uploaded to an nRF52840 MCU to predict on test data and account for the impact of environmental conditions on the alcohol sensor. The results indicate that the method is useful for refining the sensor response under a variety of heat and humidity conditions.

The authors of Reference [174] presented a novel approach for automating traffic scheduling based on the density of vehicles waiting in line. The proposed system detects vehicles with sensors embedded across the road’s lanes, and a TinyML model was built to predict the green signal timings and control the traffic system efficiently. In Reference [148] a TinyML-based sensor for classifying vehicle type and speed range was designed. The sensor is battery-powered, mounted in the pavement, and classifies vehicles with recurrent NNs. The experimental results show an accuracy of 96% in vehicle type classification and 89% in vehicle speed classification.

A TinyML approach for detecting road anomalies (potholes, bumps, and obstacles) from vehicles was proposed in Reference [8]. The unsupervised TinyML technique is embedded in an Arduino Nano 33 BLE.

Raza et al. [169] presented a novel approach for greater autonomy and intelligence in micro aerial vehicles. This is achieved using TinyML technology via OpenMV for lower latency, energy efficiency, offline inference, and data security in drones. The experimental results of this study revealed that integrating a TinyML-based MCU into the drones makes the system viable in a practical context.

The authors of Reference [50] presented the deployment of TinyML models for autonomous driving mini-vehicles. Using high-throughput TinyCNNs fed by an on-board camera, the authors control the mini-vehicles while minimizing the energy consumed by inference.

In the automotive field, TinyML-based systems achieve significant improvements in energy efficiency, a meaningful advantage for modern automotive applications.

5.3 Agriculture

Vuppalapati et al. [216] presented a novel framework for TinyML-based agriculture sensors. The proposed framework exploits Azure DevOps automation techniques and TinyML to provide high-quality products, in the most cost-efficient way, ensuring farmers’ cost savings. Also, Reference [218] proposed a TinyML-based framework for rural areas and small farmers. The system’s TinyML IoT edge devices collect real-time data from a real production environment, contributing to creating a sustainable food future.

The authors of Reference [5] presented a real-time embedded weather prediction system. The system was implemented using tiny DNNs in an STM32 MCU to predict environmental conditions in real time. The main advantage of this system over other proposals is that prediction is performed close to the environmental sensor, which avoids data traffic to the cloud.

Reference [209] illustrates an end-to-end strategy for enhancing the security of the food supply chain and, consequently, boosting the credibility of the food sector. The system seeks to increase the transparency of food supply chain monitoring systems by securing their constituent parts. A universal information monitoring strategy based on blockchain technology protects the integrity of gathered data, while a self-sovereign identification approach for all supply chain participants reduces single points of failure. Finally, monitoring devices are fitted with a security mechanism based on the fledgling technology of TinyML to mitigate much of the harmful activity by supply-chain actors.

TinyML-based agriculture systems provide farmers with tailored results and autonomous, low-cost systems. Moreover, using TinyML in this area reduces reliance on cloud services, which lowers network bandwidth use and costs.

5.4 Security

Dutta and Kant [57] designed a TinyML-based framework to predict potential threats that propagate to smart devices. The proposed framework addresses security challenges across the protocol layers of IoT devices. Furthermore, a TinyML-based framework for detecting physical anomalies in devices was proposed in Reference [128]. An Arduino Nano 33 BLE running DL-based TinyML models was mounted on a washing machine, and physical anomalies were detected by this battery-powered embedded device with no network connection.

In the security field, the TinyML approach can shield IoT devices in an automated way while requiring low power consumption.

5.5 Industry

Acharjee and Deb [3] used TinyML strategies, such as post-training quantization, to generate cartoonized versions of real-world images. Testing showed that the model implemented with post-training quantization achieves a high compression level. An embedded system for TinyML-based audio classification was presented in Reference [114]. The proposed technique improves operation speed and decreases execution time and energy consumption. The experiments also showed that co-designing hardware and software reduces execution overhead compared with optimizations applied only on the software side.
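As an illustration of why post-training quantization compresses a model, the sketch below maps 32-bit floating-point weights to 8-bit integers with an affine (scale and zero-point) scheme. It is a simplified example under stated assumptions, not the implementation of Reference [3]: the helper names `quantize_int8` and `dequantize` and the weight values are hypothetical, and production toolchains typically also calibrate activations on representative data.

```python
def quantize_int8(weights):
    """Affine post-training quantization of float weights to the int8 range."""
    # Extend the range to include 0.0 so that zero is represented exactly.
    w_min = min(min(weights), 0.0)
    w_max = max(max(weights), 0.0)
    scale = (w_max - w_min) / 255.0 or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(-128 - w_min / scale)
    q = [max(-128, min(127, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate float values from the int8 codes."""
    return [(v - zero_point) * scale for v in q]


# Hypothetical weights of a single layer.
weights = [-0.82, -0.11, 0.0, 0.37, 1.25]
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Each weight shrinks from 4 bytes (float32) to 1 byte (int8).
compression_ratio = 4 / 1
```

The rounding error per weight is bounded by roughly half of `scale`, which is why layers with a narrow weight range tend to tolerate 8-bit quantization well.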

Kamal et al. [106] proposed an architectural design for checking and quality auditing machine objects using a TinyML DNN. The proposed design consists of (i) the machine object module, (ii) the edge computing unit to connect, configure, and control the module, (iii) the edge server that embeds the TinyML model, and (iv) the cloud services for the quality audit and predictive maintenance of the machine objects. Giordano et al. [76] developed a TinyML-based wireless camera for face recognition. The TinyML face recognition algorithm was hosted in an ARM Cortex-M4F MCU for onboard data processing, while the information on recognized faces was sent via a long-range LoRa communication module. A working prototype of the system was evaluated for battery-less operation and self-sustainability and achieved an accuracy of up to 97%.

TinyML-based industry systems improve operational speed while providing tailored, real-time results. Also, as communication with cloud services is nearly eliminated, local data processing lowers power consumption and improves privacy.

5.6 Future Directions

Despite the wide variety of TinyML applications in both research and industry, the technology is not without its challenges. These challenges, which are elaborated upon in the next section, have been the subject of considerable research effort. The following list gives a succinct overview of prospective directions for TinyML. These directions encompass potential enhancements to the technology itself and ways in which it could be employed to augment deployed systems and applications.

(1) Reformable TinyML - Static TinyML refers to deployed models that perform inference only. After the TinyML model is trained and deployed, it cannot be updated through a network or by further training [121]. The study [214] discusses Reformable TinyML, which aims to resolve the limitations of static TinyML and offers strategies to update embedded models for optimal performance in certain environments. The approaches mentioned for improving the models include on-device offline learning, online learning, and network-reliant approaches.

(2) Autonomic Computing - The field of autonomic computing [75] is dedicated to exploring how systems can autonomously attain user-specified control outcomes, thus obviating the necessity for human involvement. Control theory has substantially influenced the foundational principles of autonomic computing, particularly in relation to closed- and open-loop systems. Incorporating AI and ML methodologies can facilitate realizing such precise autonomic and self-managed systems. Therefore, the combination of TinyML and DRL models has the potential to accelerate the development and adaptation of autonomic systems across a diverse array of sectors.

(3) Blockchain Integration - In the context of updating or retraining an existing TinyML model via a network, the security attributes and capabilities provided by blockchain technology appear well-suited to address the majority, if not all, potential security threats presented by malicious users. In References [25, 40, 113], the peer-to-peer propagation of firmware updates in IoT devices is investigated. In addition, Reference [203] provides an overview of how blockchain could be adapted to various IoT implementations.

(4) Edge offloading - The provision of sufficient computational capacity for intelligent edge applications has emerged as a formidable obstacle. Intelligent Edge, which propels intelligence to the Internet’s Edge, has been instrumental in facilitating intelligent decision-making across multiple aspects of edge computing, such as task offloading [175]. Edge offloading is a paradigm of distributed computing that provides computing services for edge caching, edge training, and edge inference. Incorporating techniques such as Distributed Machine Learning (DML), DRL, and Collaborative Machine Learning (CML) into edge computing is advantageous for managing the escalating communication and computational demands of emergent IoT applications [187]. Given the findings of these studies, blockchain could serve as a secure method of updating without the possibility of third-party interference or the necessity for local retraining of any TinyML model currently in use.
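The online learning approach mentioned under Reformable TinyML in item (1) can be pictured as a model that adjusts its parameters one sample at a time as data arrive on the device. The sketch below shows this for a small logistic-regression classifier updated by stochastic gradient descent. It is an assumption-laden illustration and not a method taken from Reference [214]: the function names, the learning rate `lr`, and the synthetic data stream are hypothetical.

```python
import math


def predict(weights, bias, x):
    """Probability of class 1 under a logistic-regression model."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def online_update(weights, bias, x, label, lr=0.1):
    """One stochastic-gradient step on the logistic loss for a single sample."""
    error = predict(weights, bias, x) - label
    new_weights = [w - lr * error * xi for w, xi in zip(weights, x)]
    return new_weights, bias - lr * error


# Start from an untrained model and adapt it to a synthetic sensor stream.
weights, bias = [0.0, 0.0], 0.0
stream = [([1.0, 0.2], 1), ([0.1, 0.9], 0)] * 50
for x, y in stream:
    weights, bias = online_update(weights, bias, x, y)

p_positive = predict(weights, bias, [1.0, 0.2])
p_negative = predict(weights, bias, [0.1, 0.9])
```

Each update touches only the current sample and the parameter vector, so no training set has to be stored on the device, which is the property that makes this style of learning attractive for memory-constrained MCUs.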

6 The Challenges of TinyML Technology

TinyML is at an early stage of its development, and, since it is a technology meant to be implemented on constrained hardware, the first issue is the resource and computational limits that its applications may face. Additionally, the lack of universal frameworks, device heterogeneity, the lack of models, datasets, and benchmarking tools, and several security issues are among the most critical challenges to overcome. A brief description of these challenges follows.

6.1 Resource and Computational Constraints

Until now, a typical ML model or NN was meant to be trained and run on powerful research workstations or in the cloud [176]. Implementing and deploying models on a device that lacks powerful CPUs and GPUs and instead relies on an MCU with limited memory, storage, and processing capability is a different story. The majority of the devices mentioned earlier have an average clock speed of about 100 MHz and less than 1 MB of flash memory. As indicated above, optimizing a model for deployment on such a device can be a complex process that primarily affects the model’s overall accuracy. There is a need for new and more aggressive optimization methods, along with hardware advances in memory and CPU speed.
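A back-of-the-envelope check shows how tight these limits are. The sketch below estimates whether a model’s weights fit in the flash left over after the firmware, for 32-bit and 8-bit parameter storage. All numbers and names are illustrative assumptions: the parameter count, the `reserved_fraction` set aside for application code, and the helper functions are hypothetical, while the 1 MB flash figure matches several boards in Table 3.

```python
KB = 1024


def model_footprint_bytes(num_params, bytes_per_param):
    """Storage needed for the weights alone (no runtime or activations)."""
    return num_params * bytes_per_param


def fits(num_params, bytes_per_param, flash_bytes, reserved_fraction=0.5):
    """Check whether the weights fit in the flash not reserved for firmware."""
    budget = flash_bytes * (1.0 - reserved_fraction)
    return model_footprint_bytes(num_params, bytes_per_param) <= budget


flash = 1024 * KB   # 1 MB of flash
params = 250_000    # hypothetical model size

fits_fp32 = fits(params, 4, flash)  # 1,000,000 bytes against a 524,288-byte budget
fits_int8 = fits(params, 1, flash)  # 250,000 bytes against the same budget
```

Under these assumptions the float32 model does not fit while its int8 counterpart does, which illustrates why quantization is usually a prerequisite for MCU deployment. SRAM for activations is a separate and often tighter constraint that this estimate ignores.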

6.2 Device Heterogeneity

Software and hardware heterogeneity is the second major issue. The diversity of hardware and algorithms makes it prohibitively difficult to develop a universal framework capable of training, optimizing, and deploying a model to several different devices with a single compilation. Since every system is different, even when two systems share the same components, a model trained for a specific device may not work on another one [172]. Furthermore, as stated earlier, the TinyML ecosystem requires careful software and hardware co-design. Building models universally could compromise accuracy, power consumption, and storage requirements.

6.3 Lack of Datasets, Models, and Benchmarking Tools

Heterogeneity also affects important principles of ML, such as the development of universal datasets, models, and tools for testing and benchmarking. Before creating ML applications, the first significant step for data scientists is to acquire appropriate data from sensors or datasets. A vast number of datasets are publicly available, but, since the technology is still evolving, those datasets have not been adapted to reflect the energy and hardware constraints found on MCUs. The community must also create pre-trained models that require low-to-moderate changes to be deployed on different devices. Finally, an essential step of the ML procedure is testing and benchmarking. Models and algorithms must be compared and evaluated to better understand when and how to apply them in different situations and sectors [194]. MLCommons is one of the first attempts to accelerate ML innovation by providing rules and requirements for benchmarking and datasets. It also provides best practices to help or, as stated on its site, empower researchers by exchanging models, experiments, and applications [141].

6.4 Security

Constrained devices are not built with a focus on security because of their limited resources. Moreover, more complex models and data storage needs may still force developers to upload some of the data to an external source, server, or the cloud. Such data transmission carries the same dangers found in simple IoT devices that depend on external sources for demanding operations such as AI and NN applications.

7 Conclusion

This work provides a brief review of the most common optimization techniques that led to the development of TinyML. Furthermore, because TinyML is a hybrid of hardware and software, a taxonomy of the development boards as well as the required frameworks, tools, and libraries is presented. This work also includes educational resources for the technology discussed as well as a brief overview of TinyML-based applications organized into five categories. Finally, the future directions of the technology are presented. This study seeks to demonstrate the benefits of TinyML technology and provide useful information to anyone interested in researching and working on the subject.

Incorporating ML into tiny, resource-constrained embedded devices is becoming increasingly important in light of future applications. To improve ML on edge devices, TinyML has been introduced as a way to build autonomous and secure devices that can gather, process, and provide results or decisions without having to share data with third parties. TinyML can be integrated into low-cost, low-power smart devices such as smartphones, microcontrollers, and IoT-edge systems. The presented technology intends to democratize AI by making it available to a wider variety of industries and communities, allowing everyone to participate in the digital revolution of intelligent devices.

Although TinyML technology has many applications in a variety of fields and contributes to the democratization of ML and AI applications, it poses many challenges that researchers must address to unlock its full potential and embed new features. TinyML is in its early stages of development, and, being a technology intended for constrained hardware, the first issue identified is the resource and computational limitations that applications may encounter. In addition, the absence of universal frameworks, device heterogeneity, the lack of models, datasets, and benchmarking tools, and various security vulnerabilities are among the most difficult obstacles to surmount.

Acknowledgments

We would like to thank Georgios Giannakas for assisting us in the research and the preparation of this work.

References

[1] TinyML in Publications - Dimensions. 2022. Retrieved from https://app.dimensions.ai/discover/publication?search_mode=content&search_text=TinyML&search_type=kws&search_field=full_search
