Open access

TinyNS: Platform-aware Neurosymbolic Auto Tiny Machine Learning

Authors: Swapnil Sayan Saha, Sandeep Singh Sandha, Mohit Aggarwal, Brian Wang, Liying Han, Julian De Gortari Briseno, Mani Srivastava

ACM Transactions on Embedded Computing Systems, Volume 23, Issue 3

Article No.: 43, Pages 1 - 48

https://doi.org/10.1145/3603171

Published: 11 May 2024

Abstract

Machine learning at the extreme edge has enabled a plethora of intelligent, time-critical, and remote applications. However, deploying interpretable artificial intelligence systems that can perform high-level symbolic reasoning and satisfy the underlying system rules and physics within the tight platform resource constraints is challenging. In this article, we introduce TinyNS, the first platform-aware neurosymbolic architecture search framework for joint optimization of symbolic and neural operators. TinyNS provides recipes and parsers to automatically write microcontroller code for five types of neurosymbolic models, combining the context awareness and integrity of symbolic techniques with the robustness and performance of machine learning models. TinyNS uses a fast, gradient-free, black-box Bayesian optimizer over discontinuous, conditional, numeric, and categorical search spaces to find the best synergy of symbolic code and neural networks within the hardware resource budget. To guarantee deployability, TinyNS talks to the target hardware during the optimization process. We showcase the utility of TinyNS by deploying microcontroller-class neurosymbolic models through several case studies. In all use cases, TinyNS outperforms purely neural or purely symbolic approaches while guaranteeing execution on real hardware.

1 Introduction

Tiny machine learning (TinyML) refers to hardware and software suites that enable always-on, ultra-low-power ( \(\le\) 1 mW), and on-device sensor data analytics on low-end ( \(\le\) 1–2 MB of SRAM and eFlash) Internet of Things (IoT) platforms [51, 126, 136, 148]. TinyML holds the key to making on-board intelligent inferences from unstructured data for time-critical and remote applications, such as aerial robotics [127], underwater navigation [134], picosatellite machine inference [45], and wildlife monitoring [47]. It is expected that 2.5 billion TinyML platforms will ship in 2030 [4].

An integral component in the TinyML workflow is neural architecture search (NAS) or AutoML, which automatically constructs the most performant neural network (NN) from a set of lightweight ML blocks [79, 81, 94, 162, 167, 173] and connection rules given target platform SRAM, eFlash, energy, and latency constraints [14, 60, 101, 103, 136, 139]. The NAS-generated model is compiled to the target device using TinyML compiler suites [30, 43, 65, 67, 95, 103], which perform operator and inference engine optimizations [27, 42, 103, 177], model compression [76], and code generation [136, 171]. After deployment, periodic fine-tuning of the model accounts for feature distribution shifts using on-device training [24, 100, 129] and federated learning [89, 110]. AutoML is preceded by data acquisition and analytics [131], and feature projection for dimensionality reduction [164] in the TinyML workflow [136].

The first generation efforts in TinyML focused on the exploration (lightweight model blocks), optimization (NAS, AutoML), and integration (compiler suites) of standalone NNs within the device platform constraints [136]. However, IoT applications in the wild need to obey specific rules, physics, and heuristics for provably correct operation, context awareness, and explainability [106, 136, 142, 174]. Examples include:

• A localization ML model regressing position from motion sensor data should not output displacements when rotational artifacts dominate translational movements [134].

• An aerial vehicle should not exceed a certain bank angle to remain stable [44].

• In nurse care settings, certain atomic events (e.g., washing hands) must precede other events (e.g., administering medicine to a patient) to comply with sanitary protocols [174] and not vice-versa.

• Certain spectral features (e.g., peak frequency) in the embedding manifold improve the accuracy and interpretability of wearable human activity recognition models [8].

While ML models have achieved superior performance on unstructured, multimodal, and noisy sensor inputs over human-engineered symbolic techniques, three issues plague the deployment of standalone ML models for context-aware sensor data analytics. First, even with large datasets, ML models cannot guarantee the learned feature representations obey all the rules, symmetries, and physics of the underlying system [37, 85, 134, 152]. Second, the contextual field of ML models (even transformers) is limited to a few minutes, making them unsuitable for high-level reasoning on atomic events that can span several hours (if not days) with spatial and temporal constraints [7, 114, 128, 166, 174]. Third, ML models lack transparency and interpretability, with the decision trace (e.g., causation versus correlation) and learned features difficult to understand [63, 109, 114, 121, 147, 176].

Neurosymbolic artificial intelligence (AI) is a potential bridge to connect the interpretability, verifiability, data efficiency, and context awareness of symbolic techniques with the scalability, flexibility, robustness, and performance of NNs [64, 70, 106, 108, 109, 118, 142, 146, 149, 174, 175]. Neurosymbolic AI integrates NNs with expert principles expressed as probabilistic reasoning modules, logical reasoning modules, knowledge graphs, question/answering engines, and constraint satisfaction functions [64, 142]. Concatenation of neural and symbolic reasoning has been successful in a broad spectrum of challenging problems. These include complex event recognition [7, 128, 166, 169, 174], commonsense reasoning [20, 141], visual question answering [109, 176], oceanographic forecasting [35, 58], autonomous driving [71, 150, 157], business management [19, 36], and bioinformatics [5, 92]. Thereby, neurosymbolic AI can enable rich, complex, and intelligent inferences at the extreme edge beyond the perception of atomic events [128, 136, 165]. However, real-time adoption of neurosymbolic frameworks on extremely resource-constrained platforms such as microcontrollers is challenging, as discussed next.

1.1 Challenges

Given the ultra-resource constraints of TinyML platforms, manually finding the optimal synergy between the hyperparameters of the NN and the symbolic program is arduous and challenging [128]. Deployment of hybrid programs requires AutoML platforms that can perform neurosymbolic optimization.

• Absence of Platform-aware AutoML Tools for Neurosymbolic Optimization: While AutoML and NAS frameworks have been proposed for optimizing NNs for TinyML platforms [14, 60, 83, 101, 103, 123, 124, 139], existing AutoML tools are not designed to perform platform-aware joint optimization of neural and symbolic components [136]. Platform-aware neurosymbolic optimization is necessary to not only fit the highest-performing program within the platform resource constraints but also discover previously unknown high-utility symbolic subroutines as seen in AlphaTensor [57].

• Fitting Neural and Symbolic Components Within Platform Constraints: TinyML hardware platforms have tight memory, power, and compute budgets [103]. A typical ARM Cortex-M4 microcontroller has only 128 kB of SRAM and 1 MB of eFlash, while a smartphone or cloud server can have RAM and storage on the order of tens of gigabytes and terabytes, respectively [14, 136]. While standalone NNs and standalone symbolic logic are capable of running on TinyML platforms [136], directly porting existing neurosymbolic frameworks to microcontrollers, in-sensor processors [31], and field-programmable gate arrays [83] is not computationally tractable.

1.2 Contributions

We introduce TinyNS, a platform-in-the-loop framework for automatic optimization and deployment of neurosymbolic programs on commodity microcontrollers. Given a search space containing the hyperparameters, logical association rules, and constraints of symbolic and ML (neural or non-neural) model operators, TinyNS automatically finds the best combination of symbolic and ML operators and hyperparameters within the target device memory, latency, and energy constraints. The ML models may be feedforward, residual, or recurrent. The framework provides recipes to map neurosymbolic program atoms from a prototyping language (e.g., Python) to a deployment language (e.g., C). To guarantee program deployability, TinyNS communicates with the target hardware during the optimization process to receive hardware and program runtime metrics instead of relying on proxies. The framework builds on top of a state-of-the-art, gradient-free, black-box Bayesian optimizer [138, 139] designed to optimize non-gradient-friendly and expensive objective functions within a few iterations. Using TinyNS, we showcase several previously unseen applications on microcontrollers. These include physics-aware inertial navigation [134], yielding adversarially robust TinyML models, picking the best model from a zoo of neural and non-neural models [135], and co-optimizing features, Kalman filters and NNs [50]. Our contributions are summarized as follows:

• Fast, Gradient-Free, and Black-Box Bayesian Optimizer: We present a fast, parallel, gradient-free, and application-agnostic Bayesian optimizer that can handle non-gradient-friendly objectives, categorical and conditional search spaces, and expensive objective functions, all while converging to near-global optima within a few iterations [138, 139]. The optimizer forms the basis for our search algorithm.

• Platform-in-the-Loop Neurosymbolic Architecture Search: To the best of our knowledge, we are the first to showcase a platform-in-the-loop neurosymbolic architecture search framework for microcontrollers. Our framework automatically synthesizes the most performant neurosymbolic program from a symbolic and ML operator search space within the target platform constraints.

• Recipes for Deploying Neurosymbolic Programs on Microcontrollers: Using case studies, we showcase recipes for defining the neurosymbolic program synthesis search space for all five neurosymbolic program categories [142]. Our framework includes parsers that automatically write microcontroller code according to these recipes.

• Pushing the Boundaries of Handcrafted Neurosymbolic Programs: We showcase several unseen TinyML applications made possible by joint optimization of neural and symbolic components.

TinyNS is available open-source at https://github.com/nesl/neurosymbolic-tinyml.

1.3 Organization

The rest of the article is organized as follows: Section 2 presents related work and background on porting ML models onto microcontrollers and neurosymbolic AI. Section 3 describes the Bayesian optimization algorithm. Section 4 details the platform-in-the-loop neurosymbolic architecture search space formulation and the recipes for deploying neurosymbolic programs. Afterward, Section 5 presents extensive experimental evaluations of our framework through six case studies. Finally, Section 6 provides concluding remarks and future directions.

2 Background and Related Work

In this section, we first discuss the workflow for porting ML models onto microcontrollers [136], which we modify to realize neurosymbolic TinyML (Section 2.1). Next, we discuss the features of existing NAS frameworks and their shortcomings in performing joint optimization of ML and symbolic operators (Section 2.2). Afterward, we provide a brief overview of the taxonomy, languages, and recent trends in neurosymbolic AI (Section 2.3). Finally, we provide a brief overview of existing Python to microcontroller code parsers (Section 2.4).

2.1 Machine Learning on Microcontrollers

Figure 1 illustrates the typical workflow for porting ML models to commodity microcontrollers [136]. First, in the model development phase, data engineering frameworks collect, analyze, clean, label, and store raw sensor data to produce an application-specific dataset suitable for training ML models [131]. These frameworks also include tools for targeted augmentation, outlier identification, unit tests, class balancing, and heuristic-assisted automated labeling. The additional tools ensure the trained models are free from bias and shortcuts while generalizing well on edge cases and unseen scenarios [111, 136]. Afterward, optional feature projection applies linear methods, non-linear methods, or domain-specific feature extraction for dimensionality reduction while preserving data variance [56]. Linear methods include matrix factorization [46, 99] and principal component analysis (PCA) [12, 34]. Non-linear methods are suitable for minimizing the distance between the non-linear high-dimensional input space and the prototype manifold. Common non-linear methods include autoencoders [132], t-distributed stochastic neighbor embedding [163], and kernel PCA [144]. Domain-specific feature extraction applies signal processing, statistical, and time-series functions to the input data depending on the application area [75]. Next, a model backbone is picked from a zoo of lightweight models geared toward embedded deployment, based on application and platform specifications. Examples include decision trees and k-nearest neighbor blocks with sparse projection matrices [74, 93], lightweight spatial convolution (e.g., squeeze and excitation modules [81] and depthwise-separable convolution [79]), low-rank, stabilized, and quantized recurrent networks [94, 158, 167], temporal convolutional networks [97, 162], and attention condensers [173]. The hyperparameters of the backbone are optimized using neural architecture search given a cost function and the hyperparameter search space based on target device constraints [11, 130, 180]. Search space representations can be layer-wise, cell-wise, or hierarchical [130]. Search strategies include reinforcement learning (RL), differentiable NAS, evolutionary algorithms (with or without weight sharing), and Bayesian optimization [54]. The hardware metrics can come from real measurements (slowest), lookup tables, prediction models, or analytical proxies (fastest) [54, 130].

Fig. 1.

The model deployment phase begins by generating embedded code to run the best-performing candidate model from the NAS algorithm on the device. This is done by compiler suites, some of which provide inference engines for resource management and model graph realization during execution [43, 103]. Compiler suites also perform operator fusion [30, 103], loop transformations [27, 42], data reuse [95], and model compression (pruning, quantization and encoding) [76] to improve memory usage and runtime latency [136]. Afterward, the model file system is flashed onto the microcontroller and occasionally fine-tuned to account for data distribution shifts using on-device training (e.g., transfer learning, incremental training, or continual learning) [24, 100, 129] or federated learning techniques [110].

Variations of this closed-loop workflow have been applied to various application domains, including image recognition, audio keyword spotting, visual wake words, anomaly detection, navigation, gesture recognition, mHealth, and face recognition [13, 136]. However, these applications assume that decisions are made by a standalone ML model, with no symbolic programs (apart from optional feature projection) present on the microcontroller for high-level reasoning [136]. TinyNS modifies the workflow to incorporate symbolic atoms from which programs can be constructed and optimized jointly with the model backbones.

2.2 Neural Architecture Search for Microcontrollers

Table 1 compares prominent NAS frameworks for microcontrollers against TinyNS. In particular, TinyNS adopts a black-box, Bayesian, gradient-free, and platform-in-the-loop search strategy to balance training infrastructure cost, NAS convergence time, guaranteed execution, application support, and neurosymbolic search space characteristics. iNAS [112] uses RL to formulate the NAS multi-objective optimization process as a Markov decision process, with the ability to support complex and discontinuous search spaces with thousands of dimensions [136]. However, RL has a long convergence time (e.g., 5 GPU years) with additional fine-tuning costs [23, 136]. MCUNet [102, 103] and \(\mu\) NAS [101] use evolutionary search on RL search spaces to achieve faster convergence. In particular, MCUNet uses weight sharing to decouple training from search, mutating and crossing Pareto-optimal sub-network populations from a “once-for-all” supernetwork [23]. This allows networks for several hardware targets to be optimized together. Nevertheless, evolutionary NAS with weight sharing requires GPU infrastructure capable of supernetwork training, suffers from fine-tuning costs, and has a convergence time of 3–8 GPU weeks [23, 136]. MicroNets [14] and UDC [60] use differentiable NAS (DNAS), which performs continuous gradient descent relaxation of weights and architectural encodings jointly with approximate gradients via path binarization [25, 104]. This reduces the convergence time to 1–3 GPU weeks [23]. However, DNAS cannot directly model loss contour discontinuities (e.g., categorical or conditional hyperparameters) and has high GPU memory usage owing to the over-parametrized network formulation [112, 136]. Bayesian optimization can handle discontinuous search spaces and cost functions while being executable on commodity GPU workstations [134, 135], further reducing the convergence time to 1–10 GPU days [59]. However, vanilla Bayesian optimization struggles in search spaces beyond a dozen hyperparameters and assumes a dense distribution of performant models in the search space [41, 60]. Since neurosymbolic search space dimensions can be orders of magnitude higher than NN search spaces, TinyNS uses Monte Carlo sampling with Upper Confidence Bound (UCB) as the acquisition function instead of the gradient-based approach of SpArSe [59] to perform exploration and exploitation similar to UDC [60]. This prevents TinyNS from getting stuck in local optima or evaluating invalid configurations [134, 138] even in complex RL search spaces. Moreover, TinyNS adopts a black-box approach similar to RL or evolutionary NAS. The black-box approach allows optimization of any scalar term beyond model performance and hardware metrics in the cost function and eventually permits the inclusion of both symbolic and any TensorFlow Lite Micro supported ML operators in the search space beyond convolutional operators. Further, TinyNS talks to the target hardware during the NAS process to get resource metrics instead of relying on proxies. Platform-in-the-loop not only guarantees the deployability of the neurosymbolic code but also allows TinyNS to ignore neurosymbolic programs that induce faults, runtime errors, compilation errors, or flash overflow, saving on convergence time. In fact, TinyNS automatically writes the C code of the neurosymbolic program from Python constructs using proposed neurosymbolic recipes without user intervention.
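The exploration and exploitation step described above can be illustrated with a minimal sketch. It assumes a surrogate model has already produced a predicted mean and standard deviation for each Monte Carlo sample; the search-space fields (`n_filters`, `kernel`, `use_physics_feature`), the function names, and the `beta` weight are hypothetical and are not taken from TinyNS or Mango.

```python
import numpy as np

def sample_candidates(rng, n):
    """Monte Carlo sampling of a mixed numeric/categorical space (hypothetical fields)."""
    return [
        {
            "n_filters": int(rng.integers(4, 65)),           # numeric, 4..64
            "kernel": int(rng.choice([3, 5, 7])),            # categorical
            "use_physics_feature": bool(rng.integers(0, 2)),  # symbolic on/off switch
        }
        for _ in range(n)
    ]

def ucb_scores(mean, std, beta=2.0):
    """Upper confidence bound: predicted value plus a bonus for uncertainty."""
    return np.asarray(mean, dtype=float) + beta * np.asarray(std, dtype=float)

def pick_next(candidates, mean, std, beta=2.0):
    """Return the sampled candidate with the highest UCB score."""
    return candidates[int(np.argmax(ucb_scores(mean, std, beta)))]

# Toy surrogate outputs standing in for GP predictions on three samples.
rng = np.random.default_rng(0)
candidates = sample_candidates(rng, 3)
toy_mean = [0.5, 0.6, 0.4]
toy_std = [0.0, 0.0, 0.2]
explore_choice = pick_next(candidates, toy_mean, toy_std, beta=2.0)
exploit_choice = pick_next(candidates, toy_mean, toy_std, beta=0.0)
```

With a large `beta` the uncertain third sample wins (exploration); with `beta = 0` the sample with the best predicted mean wins (exploitation).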

Method	Search Strategy	Profiler	Search Space	Cost Function Parameters	Inference Engine	Compression Awareness	Open Source
SpArSe [59]	Gradient-driven Bayesian	Analytical	Conv2D (regular, depthwise, downsampled)	Error, SRAM, Flash	uTensor	Pruning (structured, unstructured)	No
MCUNet [102, 103]	Evolutionary (with weight sharing)	Lookup tables, prediction models	Conv2D (elastic)	Error, SRAM, Flash, Latency	TinyEngine [103]	None	No
MicroNets [14]	One-shot DNAS	Analytical	Conv2D (MbNetv2, DS-CNN)	Error, SRAM, Flash, Latency	TFLite Micro [43], CMix-NN [26]	Quantization (sub-byte)	No
\(\mu\) NAS [101]	Evolutionary (no weight sharing)	Analytical	Conv2D (regular, depthwise)	Error, SRAM, Flash, Latency	TFLite Micro [43]	Structured Pruning	Yes
iNAS [112] \(^\wedge\)	Reinforcement Learning	Lookup tables, analytical	Conv2D, tile size, loop order, preservation batch size	Error, Flash, Latency \(^*\) , Volatile Buffer, Power-Cycle Energy \(^@\)	Accelerated intermittent	Quantization (2 bytes)	Yes
UDC [60]	DNAS with exploration and exploitation	Analytical	Conv2D, sparsity, bitwidth	Error, Flash	Vela NPU	Unstructured pruning, quantization (sub-byte)	No
TinyNS	Gradient-free Bayesian with exploration and exploitation	Real measurements, analytical	Any supported ML operator and symbolic program atoms	Any scalar term	TFLite Micro [43]	Quantization (1 byte)	Yes

Table 1. Qualitative Comparison of Existing NAS Frameworks for Microcontrollers Versus TinyNS

\(^\wedge\) intermittent-aware NAS.

\(^*\) sum of progress preservation, progress recovery, battery recharge, and compute cost.

\(^@\) sum of progress preservation, progress recovery, and compute cost.

2.3 Neurosymbolic Artificial Intelligence

Over the past decade, deep learning (DL) has been extensively used to make complex inferences from unstructured, noisy, and high-dimensional data, such as in computer vision, LIDAR point clouds, speech processing, drug discovery, time-series processing, and genetics [98]. However, traditional DL is data-hungry even for simple tasks, lacks interpretability and explainability, does not guarantee adherence to rules, physics, and constraints, fails on feature distribution shifts, and struggles to learn long-range temporal patterns [37, 63, 64, 121, 147]. The flipside is symbolic AI, which was the dominant trend in AI research several decades ago, before the prevalence of DL [116, 153]. Symbolic programs are data efficient, interpretable, and good at reasoning over the long term, but suffer when solving NP-hard problems and dealing with spatial and temporal uncertainties in the input data [142]. Neurosymbolic AI couples DL with symbolic methods to have fast computation time, deal with unstructured data and uncertainty effortlessly, maintain explainable models, and capture complex relations [64, 70, 86, 108, 118, 142]. Neurosymbolic learning is analogous to the two types of human reasoning [84]: type 1 reasoning is fast and intuitive, corresponding to pattern recognition in DL, and type 2 is slower and logical, corresponding to symbolic algorithms and logical reasoning.

2.3.1 Taxonomy of Neurosymbolic AI.

Neurosymbolic AI systems are categorized into five groups [86, 142], as illustrated in Figure 2:

Fig. 2.

• Symbolic Neuro Symbolic or Neural-after-Symbolic: This is the most common paradigm [86]. The inputs are symbolic, while the processing is purely neural. The neural component either learns the relations between the symbols or learns to focus on some specific symbols based on needs. Examples include inference over human-engineered features [87] and graph NN inference with pre-processed graph nodes [143]. While this technique allows applying human-engineered functions on the inputs, the synergy between neural and symbolic components is weak, with no high-level reasoning possible over the outputs.

• Neuro \(\rightarrow\) Symbol or Symbolic-after-Neural: In this approach, NNs process raw inputs and output structured data, which are fed to symbolic programs for further reasoning. Examples include DUA [114] and DeepProbLog [108]. In DUA, a symbolic meta-policy learning module with common sense background knowledge combines primitive actions from a deep RL agent. In DeepProbLog, NNs are trained to output probabilistic predicates, which are fed to a logic program to evaluate user-defined logic rules. The technique allows the flow of gradients from the symbolic output through the network but suffers from the high compute cost of the reasoning module.

• Neuro \(\cup\) Compile (Symbolic) or Symbolically constrained Neural: This technique adds a symbolic component to the learning process of a neural model to follow constraints, norms, or rules, which are compiled away during training [96]. An example is Pylon [3], where user-defined constraints on the output are converted to an additional loss added to the traditional error cost. While constraints are simple to express using this method, the network is not guaranteed to satisfy hard thresholds.

• Symbolic[Neuro] or Neurosymbolic Aggregation: In this method, a neural model and symbolic program aggregate their results to achieve more robust inference. The neural component models errors resulting from uncertainties of the symbolic program, or the symbolic program forces the NN to follow some constraints or rules. In STLnet [106], a neural student model learns to predict succeeding output sequences by learning temporal logic relations, while a symbolic teacher model generates an output sequence most similar to that prediction within the given relational constraints.

• Neuro[Symbolic] or Neurally accelerated Symbolic or Symbolically structured Neural: This is the preferred neurosymbolic paradigm [86], where the NN architecture is generated using (or has layers embedded with) symbolic reasoning. A neural model replaces slow or non-differentiable symbolic programs while keeping the latter’s functionality. Examples include Logic Tensor Networks [146], which compile a first-order logic language into TensorFlow computational graphs. Pix2rule [33] embeds a differentiable linear layer in a deep NN, which is biased to capture the semantics of AND and OR to extract spatial symbolic rules. Neuroplex [174] adopts a knowledge distillation approach to train a neural model that can replace the logic reasoner for complex event pattern detection. While allowing pure type 2 reasoning, this method may include special ML operators unsupported on TinyML hardware.
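As an illustration of the Neuro \(\rightarrow\) Symbol category, the sketch below applies a symbolic precedence rule to a sequence of atomic event labels, as a neural classifier might emit them. It reuses the sanitary-protocol example from the Introduction; the event names and the function are hypothetical and are not part of any cited framework.

```python
def violates_precedence(events, before, after):
    """Return True if `after` occurs without any earlier occurrence of `before`."""
    seen_before = False
    for event in events:
        if event == before:
            seen_before = True
        elif event == after and not seen_before:
            return True
    return False

# Atomic events as a neural classifier might label them over time (illustrative).
compliant = ["enter_room", "wash_hands", "administer_medicine"]
non_compliant = ["enter_room", "administer_medicine", "wash_hands"]

compliant_flag = violates_precedence(compliant, "wash_hands", "administer_medicine")
violation_flag = violates_precedence(non_compliant, "wash_hands", "administer_medicine")
```

The rule needs only a single pass and one bit of state, which is the kind of footprint a symbolic reasoning stage must have to run on a microcontroller.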

2.3.2 Neurosymbolic Language Tools.

Neurosymbolic language tools synthesize programs from user-defined rules. DeepProbLog [108] is a probabilistic logic programming language where users can define logical rules and network architectures. The symbolic reasoning module is differentiable, allowing backpropagation of target labels at the output of the logic program through the NN. Pylon [3] is a PyTorch framework that learns deep NNs with constraints. It automatically converts constraints defined by users into a constraint loss, and the NN is trained using the summation of this constraint loss and a regular loss function. Gen [40] is a probabilistic programming language designed for general-purpose neurosymbolic program synthesis. It can build generative models to represent data-generating processes, supports flexible DL and differentiable programming, and can make probabilistic inferences.
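The idea of summing a constraint loss and a regular loss can be sketched as follows. This is not Pylon's API; it is a plain NumPy illustration that borrows the bank-angle example from the Introduction, with a hypothetical 30-degree limit and a hypothetical penalty weight.

```python
import numpy as np

def mse(pred, target):
    """Regular loss: mean squared error."""
    pred, target = np.asarray(pred, dtype=float), np.asarray(target, dtype=float)
    return float(np.mean((pred - target) ** 2))

def bank_angle_penalty(pred_deg, limit_deg=30.0):
    """Constraint loss: zero inside the limit, growing linearly with the excess."""
    excess = np.maximum(np.abs(np.asarray(pred_deg, dtype=float)) - limit_deg, 0.0)
    return float(np.mean(excess))

def total_loss(pred_deg, target_deg, weight=0.1, limit_deg=30.0):
    """Sum of the regular loss and the weighted constraint loss."""
    return mse(pred_deg, target_deg) + weight * bank_angle_penalty(pred_deg, limit_deg)

# Predictions match the targets exactly, but one of them exceeds the limit.
loss = total_loss([10.0, 40.0], [10.0, 40.0])
```

Because the constraint enters only as a soft penalty, a trained network can still violate the limit, which matches the caveat that this category does not guarantee hard thresholds.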

2.3.3 Recent Trends in Neurosymbolic Artificial Intelligence.

Recent research in neurosymbolic AI focuses on handling domain shifts, performing error correction, increasing data efficiency, and improving the interpretability of ML systems [64, 142]. Symbolic background knowledge allows extrapolation when dealing with input distribution different from training data [105]. Error correction designs robust ML systems enabling streamlined recovery from wrong outputs without retraining on new data [18]. Symbolic reasoning allows NNs to be trainable with less data [142]. Improving the interpretability of ML systems makes NN decisions more transparent and explainable [115]. Unfortunately, the deployment of neurosymbolic programs on IoT platforms or for real-time inference has received little attention. \(\mu\) CEP [128] is the only framework that allows complex event processing on neural outputs using logical rules on commodity microcontrollers. However, \(\mu\) CEP is hard-coded for a single application (complex activity detection), few network architectures (fully connected and convolutional), and a specific neurosymbolic AI category (Neuro \(\rightarrow\) Symbol), with no notion of co-optimization of neural and symbolic components or platform-awareness. In contrast, our framework allows platform-aware automatic co-design of ML (neural or non-neural) and symbolic components regardless of application, choosing the best synergy of ML operators and symbolic hyperparameters within the tight resource bounds of TinyML platforms.

2.4 Python to Microcontroller Code Parsers

Parsers automate the porting of code written in a high-level language (e.g., Python) to a deployment-time language (e.g., C). There are two kinds of parsers relevant to this work.

2.4.1 TinyML Compiler Suites.

These software suites take an ML model trained in a high-level ML framework to generate embedded code and perform operator optimizations, model compression, and inference engine optimizations. The embedded file system is then flashed onto the microcontroller for inference. Some of these frameworks provide memory planners, intermittent computing, runtime interpreters, and operator resolver functionalities in the form of inference engines [136]. The frameworks use a template file system to map tensor manipulation operations, logging, and input/output handling from the high-level model schema to objects. TensorFlow Lite Micro (TFLM) [43], uTVM [30], Microsoft EdgeML [68, 69, 74, 93, 94, 133], CMSIS-NN [95], and EON compiler [80] are popular frameworks that automatically parse TensorFlow [1] and PyTorch [120] neural networks to C code mainly for deploying on ARM Cortex-M processors. STM32Cube.AI,¹ Eloquent ML,² and Sklearn Porter³ parse support vector machines, decision trees, naive Bayes, k-nearest neighbors, random forest, XGBoost, and regressors from Scikit-Learn [122] to C [136]. For model parsing, we adopt and modify TFLM for parsing neural networks to C. First, we add scripts to check for use of unsupported ML operators and detect compilation and memory overflow faults during neurosymbolic program optimization by talking to the target hardware. Second, our parser can automatically modify the TFLM file system to invoke only the necessary operators, take care of quantization and dequantization, assign appropriate arena and buffer sizes, and place .c and .h files in the appropriate directories. Third, our parser invokes the embedded C compiler directly from Python and flashes the compiled program on the target hardware.
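The quantization and dequantization handled by the parser can be pictured with the standard 8-bit affine scheme, in which a real value is approximated as `scale * (q - zero_point)`. The sketch below is a NumPy illustration under that assumption; the scale, zero point, and sample values are hypothetical, and the code is not taken from TFLM or TinyNS.

```python
import numpy as np

def quantize(x, scale, zero_point):
    """Map float values to int8 with an affine scale and zero point."""
    q = np.round(np.asarray(x, dtype=np.float32) / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize(q, scale, zero_point):
    """Recover approximate float values from int8 codes."""
    return (q.astype(np.float32) - zero_point) * scale

scale, zero_point = 0.01, 0          # hypothetical calibration values
x = np.array([-1.0, 0.0, 0.503, 1.0], dtype=np.float32)
q = quantize(x, scale, zero_point)
x_hat = dequantize(q, scale, zero_point)
max_error = float(np.max(np.abs(x - x_hat)))
```

For inputs inside the representable range, the round-trip error is bounded by half the scale, which is why choosing the scale from the observed tensor range matters.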

2.4.2 General Purpose Parsers.

These parsers convert general-purpose Python code to C. Shed Skin,⁴ Nuitka,⁵ Pyrex,⁶ Cython [16], SWIG [15], and BoostPython [90] are popular Python-to-C source-to-source translators. Most of these frameworks convert implicitly statically typed Python programs to C/C++, write boilerplate code using interface files through a shared library, perform compiler optimizations, and transmute data structures and types. However, these parsers lack support for runtime-interpreted program aspects and functions, cross-compilation, the standard library, and unrestricted function definitions. Recently, large conversational language models such as ChatGPT⁷ have been used as code translation assistants [172]. The generated code is often not error-free but helps save manual code conversion time for programmers. MicroPython [161] and Zerynth⁸ are software implementations of Python written in C for 32-bit microcontrollers. MicroPython supports features in the most popular Python modules, allows code portability due to the use of the hardware abstraction layer, offers modular programming, provides access to low-level hardware, and immediately executes commands. Similar to TFLM, MicroPython includes a runtime interpreter to interpret the bytecode. Unfortunately, MicroPython is \(10^1 \text{--} 10^2 \times\) slower than pure C/C++ [82], preventing its adoption in time-critical systems. In contrast, instead of providing direct source-to-source translation, TinyNS provides recipes to map the symbolic component for four of the five neurosymbolic paradigms from Python to pure C/C++. We assume the user has implemented the symbolic code in C either manually or using an existing source-to-source translator, and instead focus on activating and passing arguments to the C objects from Python. For symbolic neuro symbolic, we use the concept of an array of over-parametrized function pointers selected using a binary mask. For neuro \(\rightarrow\) symbol, we use ANTLR to port program trees from Python to C. For neuro \(\cup\) compile (symbolic), a physics extraction function is activated. For symbolic [neuro], the function arguments are sent to a Kalman update step. The recipes call for the use of CMSIS libraries for mathematical, tensor, and signal processing operations.

3 Mango: Fast, Parallel, and Gradient-Free Bayesian Optimizer

TinyNS adopts Mango [138, 139], which is an efficient realization of Bayesian optimization. Bayesian optimization provides a state-of-the-art approach to optimize expensive objective functions in a few iterations, approximated by a surrogate model.

3.1 Surrogate Model

Typical surrogate models used in Bayesian optimization libraries are Gaussian processes (GP), tree-structured Parzen estimators, and random forests. Among the available surrogate models, Mango uses the GP surrogate ( \(\mathcal {GP}\) ) over the search space ( \(\mathbf {\Omega }\) ) due to its ability to provide a tractable assessment of prediction uncertainty incorporating the effect of data scarcity [154]. The GP is a non-parametric machine learning model specified using a mean ( \(\mu\) ) and a kernel function ( \(k\) ):

\(\begin{equation} \hat{f}(\mathbf {\Omega }) \sim \mathcal {GP}(\mu (\mathbf {\Omega }), k (\mathbf {\Omega }, \mathbf {\Omega }{^{\prime }})). \end{equation}\)

(1)

Vanilla GP models work well on continuous search spaces but struggle with the discontinuities induced by categorical, mixed, and hierarchical search spaces. Naive rounding or one-hot encoding causes the GP to get stuck on the same candidate model. Therefore, Mango adopts the solution proposed by Garrido-Merchan et al. [66], which modifies the GP covariance function to account for regions in the search space where the objective function becomes constant due to one-hot encoding or rounding inside the objective function evaluator wrapper. This constant behavior cannot be modeled by a vanilla GP. We use a transformation of the input variables that rounds real-valued hyperparameters and performs one-hot encoding of categorical variables, causing the Cartesian distance between sample points with the same configuration to become 0. This allows the GP to indirectly model the expected constant behavior, as the transformation enforces maximum correlation between the function evaluations at the sample points with the same configuration under the GP.
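The effect of this input transformation can be illustrated with a minimal sketch. The two-variable configuration, the category names, and the squared-exponential kernel below are assumptions chosen for illustration; they are not taken from Mango's implementation.

```python
import math

def transform(config, categories):
    """Round the numeric entry and one-hot encode the categorical entry."""
    layers, kernel = config  # (real-valued proposal, category name)
    one_hot = [1.0 if kernel == c else 0.0 for c in categories]
    return [float(round(layers))] + one_hot

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential covariance between two transformed inputs."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * length_scale ** 2))

CATEGORIES = ["linear", "rbf", "poly"]  # hypothetical categorical choices

# Two raw proposals that map to the same configuration (3 layers, "rbf").
p1 = transform((2.6, "rbf"), CATEGORIES)
p2 = transform((3.4, "rbf"), CATEGORIES)
# A proposal that maps to a different configuration (4 layers, "poly").
p3 = transform((3.6, "poly"), CATEGORIES)

same = rbf_kernel(p1, p2)  # distance 0, so the covariance is maximal
diff = rbf_kernel(p1, p3)  # non-zero distance, so the covariance is smaller
print(same, diff)
```

Because both raw proposals collapse to the same transformed point, the kernel assigns them maximum correlation, which is how the surrogate can represent the flat regions of the objective.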

3.2 Acquisition Function

The exploration-exploitation tradeoff is handled using the upper confidence bound (UCB) [155, 156] as the acquisition function. In UCB, the next sample ( \(\mathbf {\Omega }_t\) ) at iteration \(t\) is selected from the search space ( \(\mathbf {\Omega }\) ) using the predicted mean ( \(\mu _{t-1}\) ) and the corresponding variance ( \(\sigma _{t-1}^2\) ) at iteration \(t-1\) . The exploration factor ( \(\beta\) ) balances the contributions of the mean and variance:

\(\begin{equation} \mathbf {\Omega } _t = \arg \max _{\mathbf {\Omega }}(\mu _{t-1}(\mathbf {\Omega }) + \beta ^{0.5}\sigma _{t-1}(\mathbf {\Omega })). \end{equation}\)

(2)

The first term (mean) in the acquisition function refers to the goodness of the current sampled point (exploitation), while the second term refers to the uncertainty of the sampled point (exploration). Mango adopts UCB for four reasons. First, UCB is robust to uncertainty and noise in the function evaluations without pre-processing. Second, UCB allows efficient sampling for cases where picking a suboptimal point may cause a time-consuming and expensive function evaluation. Third, UCB balances exploration and exploitation by sampling points that are not just likely to improve the final score (exploitation) but also sampling points that have high uncertainty (exploration). This not only prevents the optimizer from getting stuck in a local optimum but also provides both a coarse and a fine-grained view of the objective plane, allowing the score to achieve theoretical optimal values at the boundary of violating deployability constraints. Last, UCB uses an adaptive \(\beta\) with theoretical convergence guarantees within 90% of the optimal value [48, 155, 156]. \(\beta\) is heuristically decided based on the complexity of the search space (domain size) \(|\mathbf {\Omega }|\) , the current iteration count \(t\) , and the variance (uncertainty) \(\sigma _{t-1}^2(\mathbf {\Omega })\) at iteration \(t-1\) :

\(\begin{equation} {\beta = \alpha \cdot \exp (2\cdot C), \hspace{5.69046pt} \alpha = \sqrt {2\log (0.6\cdot |\mathbf {\Omega }|\cdot t^2\cdot \pi ^2)}, \hspace{5.69046pt} C = \frac{8}{\log (1+\frac{1}{\delta + \sigma _{t-1}(\mathbf {\Omega })})}, \hspace{5.69046pt} \delta = 10^{-6}}. \end{equation}\)

(3)

First, if the search space is bigger, then \(\alpha\) will increase logarithmically, leading to a bigger \(\beta\) . This will cause the acquisition function to be dominated by exploration. Second, as the search progresses, \(\alpha\) increases logarithmically. This impels the acquisition function to be exploration dominant in the later iterations. Third, sample points near already explored regions will return a lower value of \(\sigma _{t-1}^2(\mathbf {\Omega })\) , leading to a lower value of \(\beta\) . Last, if a region is invalid or bad, then \(\mu _{t-1}(\mathbf {\Omega })\) will be higher, causing the acquisition function to be dominated by exploration. If a region is valid or good or near the theoretical optimal boundary, then \(\mu _{t-1}(\mathbf {\Omega })\) will be lower, causing the acquisition function to be dominated by exploitation. The four factors cause Mango to perform what is known as sampling to find the boundaries in the objective plane. \(t\) ensures that exploration never stops in case Mango has not found a “hidden” region where global optima may reside. However, exploration dependent on \(t\) is logarithmic, leading to only a small increase in the \(\beta\) with each passing iteration. \(\sigma _{t-1}^2(\mathbf {\Omega })\) ensures that as more regions of the objective plane are explored, Mango moves from primarily exploration-driven to exploitation-driven sampling, which allows Mango to perform fine-grained sampling at later iterations. \(\mu _{t-1}(\mathbf {\Omega })\) ensures that this fine-grained sampling is being performed at the boundaries close to the theoretical optimal value with 90% probability. The entire formulation makes Mango explore all unexplored boundaries (coarse-grained sampling), and then find the points close to the theoretical optimal value (fine-grained sampling).
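The dependence of \(\beta\) on the domain size, the iteration count, and the predictive uncertainty can be checked numerically. The sketch below is a direct transcription of Equations (2) and (3) under two assumptions: the constant \(\delta\) is read as \(10^{-6}\), and the domain sizes, iteration counts, and uncertainties are hypothetical values picked for illustration. It is not Mango's source code.

```python
import math

DELTA = 1e-6  # assumption: the small constant delta in Equation (3), read as 10^-6

def exploration_factor(domain_size, t, sigma):
    """beta = alpha * exp(2 * C), following Equation (3)."""
    alpha = math.sqrt(2.0 * math.log(0.6 * domain_size * t ** 2 * math.pi ** 2))
    c = 8.0 / math.log(1.0 + 1.0 / (DELTA + sigma))
    return alpha * math.exp(2.0 * c)

def ucb(mu, sigma, beta):
    """Acquisition value of Equation (2): mean plus scaled uncertainty."""
    return mu + math.sqrt(beta) * sigma

# Hypothetical numbers for illustration only.
beta_explored = exploration_factor(domain_size=1000, t=10, sigma=0.01)
beta_unexplored = exploration_factor(domain_size=1000, t=10, sigma=0.2)
beta_later = exploration_factor(domain_size=1000, t=50, sigma=0.01)
beta_bigger = exploration_factor(domain_size=100000, t=10, sigma=0.01)
print(beta_explored, beta_unexplored, beta_later, beta_bigger)
```

With these inputs, a point in a well-explored region (small \(\sigma\)) receives a smaller \(\beta\) than a point in an unexplored region, while a larger domain or a later iteration raises \(\beta\) only through the logarithmic term in \(\alpha\).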

3.3 Handling Mixed Search Spaces

Traditionally, gradient-driven optimizers (e.g., GPyOpt [9] and Skopt [10]) are used to find the next promising sample, such as in SpArSe [59]. Sandha et al. [138, 139] showed that gradient-driven optimization in complex search spaces having discrete or categorical values can provide sub-optimal solutions by evaluating gradients at invalid configurations of the search space. Mango realizes a gradient-free optimizer for handling non-gradient-friendly values. Mango directly supports discrete integer values and continuous values and converts pure categorical values to a one-hot encoding. However, this comes with the challenge that the decision boundary of the acquisition function becomes discontinuous due to the discrete values. Further, one-hot encoding of categorical variables increases the dimensionality of the search. To handle the discontinuous decision boundary, Mango adopts a gradient-free optimizer that does not assume the continuity of the gradient in the acquisition function search space. This is based on the Monte Carlo optimization of the acquisition function. Since the evaluation of the acquisition function is very cheap, this approach scales to searching decision boundaries extensively and selecting the next optimal points in parallel. The acquisition function is evaluated at thousands of valid samples in the search space; thus, there is no mismatch between the proposed and actual evaluations. This approach also works directly for the one-hot encoded spaces by doing evaluations only at the valid regions of the one-hot encoding without sampling the intermediate regions between 1 and 0 where no valid real sample exists. Note that in a gradient-driven approach, the optimal point is finally converted to a valid sample by rounding off, which can degrade the search results; this is not the case in Mango.
This sampling-based approach also reduces the computational complexity [139] of the optimizer compared to the gradient-based methods used in other Bayesian optimization libraries [9, 10, 59].
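The sampling-based search described above can be sketched as follows. The three-variable search space and the toy acquisition function are hypothetical stand-ins; the point is only that every candidate is drawn from valid configurations, so no rounding is needed after the argmax.

```python
import random

ONE_HOT = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]  # the only valid categorical encodings

def sample_valid(rng):
    """Draw one valid configuration from a mixed search space."""
    dropout = rng.uniform(0.0, 1.0)  # continuous variable
    layers = rng.randint(1, 8)       # discrete integer, bounds inclusive
    kernel = rng.choice(ONE_HOT)     # categorical, sampled only at valid vertices
    return (dropout, layers, kernel)

def toy_acquisition(cfg):
    """Hypothetical stand-in for the acquisition value of a configuration."""
    dropout, layers, kernel = cfg
    return -(dropout - 0.3) ** 2 - 0.1 * (layers - 4) ** 2 + 0.5 * kernel[1]

def monte_carlo_argmax(acquisition, n_samples=5000, seed=0):
    """Evaluate the acquisition at many valid samples and keep the best one."""
    rng = random.Random(seed)
    candidates = [sample_valid(rng) for _ in range(n_samples)]
    return max(candidates, key=acquisition)

best = monte_carlo_argmax(toy_acquisition)
print(best)
```

A gradient-based search over the same space could propose a fractional layer count or a point between two one-hot vertices, which would then have to be rounded to a configuration the acquisition function was never evaluated at.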

To reduce the search space complexity even further, TinyNS proposes the use of slider matrices, enumerated trees, and ordinal masks. Instead of exposing Mango directly to the heterogeneous variables, for high-dimensional search spaces, TinyNS exposes Mango to the normalized slider matrix, inspired by the wrapper-based approach proposed in Garrido-Merchan et al. [66]. The slider matrix is a continuous formulation of the mixed parameter space normalized between 0 and 1. The one-hot encoding or rounding is performed inside the objective function evaluator wrapper, as proposed in Reference [66], via a mapping from the terms in the slider matrix to the mixed parameter space. For even more complicated search spaces, TinyNS uses tree enumeration algorithms to generate program tree candidates and exposes Mango to an ordinal mask that selects one of the trees.

3.4 Parallelization

Another challenge in solving Equation (2) is parallelizing the sequential search process, selecting a batch of values to ensure exploration or diversity in the batch. The straightforward approach of ranking the search choices according to the acquisition function and then selecting the top picks is sub-optimal due to limited exploration [48]. To enable parallel search, Mango provides a clustering search algorithm on the samples drawn from the acquisition function. The clustering search selects promising domain samples from different clusters based on their distance in the search space. The different clusters are far from each other in the hyperparameter space to enable exploration or diversity. The number of clusters is equal to the batch size and is flexible.
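The idea of picking one promising sample per cluster can be sketched on a 1D toy problem. The acquisition surface, the number of retained candidates, and the use of plain k-means are all assumptions made for illustration; Mango's actual clustering routine is not reproduced here.

```python
import math

def acquisition(x):
    """Hypothetical 1D acquisition surface with three separated peaks."""
    return (1.0 * math.exp(-4.0 * (x + 2.0) ** 2)
            + 0.8 * math.exp(-4.0 * x ** 2)
            + 0.9 * math.exp(-4.0 * (x - 2.0) ** 2))

def kmeans_1d(points, k, iterations=20):
    """Plain k-means on 1D points with evenly spread initial centroids (k >= 2)."""
    pts = sorted(points)
    centroids = [pts[(len(pts) - 1) * j // (k - 1)] for j in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return clusters

def select_batch(candidates, batch_size, keep=150):
    """Cluster the most promising candidates and take the best point per cluster."""
    promising = sorted(candidates, key=acquisition, reverse=True)[:keep]
    clusters = kmeans_1d(promising, batch_size)
    return sorted(max(c, key=acquisition) for c in clusters if c)

candidates = [i / 100.0 for i in range(-300, 301)]
batch = select_batch(candidates, batch_size=3)
print(batch)
```

Simply taking the top three candidates would return three neighboring points on the tallest peak, whereas the clustered selection returns one point per peak and therefore keeps the batch diverse.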

3.5 Addition to Mango

TinyNS expands the state-of-the-art Bayesian optimizer to perform neurosymbolic architecture search in three ways. First, while Mango internally handles categorical and continuous variables, the optimizer cannot deal with complex neurosymbolic search spaces on its own. We provide recipes to show how Mango can deal with neurosymbolic search spaces through the intelligent use of slider matrices, Boolean masks, and enumerated trees. This significantly increases the types of problems Mango can handle. Second, to prevent wasting valuable GPU hours and improve convergence time, we use a guided optimization strategy. Specifically, we do not train programs that violate deployability constraints or induce faults. We penalize Mango by a constant number when it makes wrong choices. Yet, we design the optimization function in such a way that Mango is still able to find the boundaries in the objective plane even in complex search spaces and achieve near-optimal results. Third, we make Mango platform-aware by allowing it to talk to the target hardware during deployment time. This allows guaranteed program deployment and accurate profiling. We discuss these additions in more detail in Section 4.

3.6 Evaluation: Parallel Search in Mango

We visualize the parallel search enabled by Mango in Figure 3 (Left). Four iterations of the clustering search algorithm are shown for a 1D function having multiple optimal points. The ground-truth function is represented by objective. The samples are the points that have been evaluated, and hence the true objective function values are known. A batch size of three is used, representing the parallel evaluation of three samples in each iteration. The Surrogate function shows the internal approximation of the ground-truth objective based on the evaluated samples. The acquisition function is based on the UCB. The three clusters created in different regions of the acquisition function are shown. The next sampling locations represent the points selected from each cluster for evaluation in the next iteration. We observe that the ground-truth max optimal is found by Mango in the fourth iteration, which occurs at \(-1.0\) and has a value of 4.72.

Fig. 3.

3.7 Evaluation: Comparison Against Other Bayesian Optimizers

We compare Mango for hyperparameter tuning with existing state-of-the-art Bayesian optimization libraries using the multiple criteria methodology proposed by Dewancker et al. [49]. Specifically, we measure the performance of an optimizer by considering the solution’s proximity to the optimal point (accuracy) and the number of iterations required to reach the optimum (speed). We compare the performance for hyperparameter tuning of three ML classifiers, XGBoost, K-Nearest Neighbor (KNN), and Support Vector Machine (SVM), to maximize the 3-fold cross-validation accuracy on the iris plants dataset, wine recognition dataset, and breast cancer Wisconsin (diagnostic) dataset taken from Scikit-learn [122], i.e., a total of nine tuning tasks (three classifiers trained on three datasets). The search space includes continuous, integer, and categorical hyperparameters, with the exact definitions available in Reference [137]. We tune each classifier for 80 iterations and repeat each tuning experiment 30 times. Results are shown in Figure 3 (Right). Mango performs better than all other libraries in six or more of the nine tasks in hyperparameter tuning for classifiers with mixed (continuous, integer, and categorical) hyperparameter spaces. Specifically, Mango outperforms HyperOpt (TPE surrogate), SMAC (random forest surrogate), Optuna (TPE surrogate), and GPyOpt (vanilla GP surrogate). Overall, Mango offers state-of-the-art optimization capabilities for handling complex search spaces.

4 Platform-aware Neurosymbolic Optimization

TinyNS treats neurosymbolic architecture search as nonlinear programming [17] over the search space \(\mathbf {\Omega }\) :

\(\begin{equation} \min {\bf f}(\mathbf {\Omega }), \hspace{5.69046pt} \text{s.t.}\hspace{5.69046pt} {\bf f}(\mathbf {\Omega }) \le {\bf b}, \end{equation}\)

(4)

where

\(\begin{equation} {\bf f}(\cdot) = \sum _{k=1}^{n}\lambda _k g_k(\mathbf {\Omega }), \hspace{5.69046pt} \mathbf {\Omega } = \lbrace \lbrace V,E\rbrace , [\theta _m, m, w], [\theta _s, s, u]\rbrace , \hspace{5.69046pt} \sum _{k=1}^{n} \lambda _k = 1, \hspace{5.69046pt} k \in [1,n]. \end{equation}\)

(5)

\(\mathbf {\Omega }\) contains both ML components and symbolic components. The ML components include the ML hyperparameters \(\theta _m\) , trainable ML parameters \(w\) (e.g., NN weights and biases), and ML operators \(m\) (e.g., convolution, pooling, support vector kernel, fully connected, etc.). The ML operators may be feedforward, residual, or recurrent. The symbolic components include the symbolic hyperparameters \(\theta _s\) , numerical parameters to be optimized \(u\) (e.g., Kalman filter gain), and symbolic program atoms \(s\) (e.g., predicates, terms, features, etc.). Candidate neurosymbolic programs constructed from \(\mathbf {\Omega }\) can be thought of as directed acyclic graphs \(q^{\mathbf {\Omega }}(\mathbf {X})\) with edges \(E\) , vertices \(V\) , and input tensor \(\mathbf {X}\) . The goal is to find a neurosymbolic program that satisfies the aggregate constraint \({\bf f}(\mathbf {\Omega }) \le {\bf b}\) . In other words, the objective function seeks a Pareto-frontier configuration \(\mathbf {\Omega }^*\) under competing objectives [59] such that

\(\begin{equation} {\bf f}_k(\mathbf {\Omega } ^*) \le {\bf f}_k(\mathbf {\Omega }) \hspace{5.69046pt} \forall k,\mathbf {\Omega } \hspace{5.69046pt} \wedge \exists j: {\bf f}_j(\mathbf {\Omega } ^*) \lt {\bf f}_j(\mathbf {\Omega }) \hspace{5.69046pt} \forall \mathbf {\Omega }\ne \mathbf {\Omega }^*. \end{equation}\)

(6)

The aggregate constraint function \({\bf f}(\cdot)\) is a linear combination of individual objectives \(g(\cdot)\) weighted by random scalarizers \(\lambda\) . Let \(\mathcal {A}\) be a complete Boolean algebra, \(\omega _{\omega }\) be the ordinal set, and \(\mathbb {A}\) be a fixed set of names. Then, \(g(\cdot)\) and \(\mathbf {\Omega }\) have the following properties:

•

\(d\vee \lnot d, \hspace{5.69046pt}\underbrace{d = \left(\exists g_k(\cdot) \wedge \exists c \in \mathbf {\Omega } \right)\Rightarrow \left(\nexists \lim _{x \rightarrow c} g_k(x) \vee \nexists g_k(c) \vee \lim _{x \rightarrow c} g_k(x) \ne g_k(c) \right)}_{\text{discontinuity condition}}\)

•

\(\begin{aligned}\exists z \in \mathbf {\Omega } \Rightarrow \big [\underbrace{z \in \mathbb {R}}_{\text{continuous, numeric}} \vee \underbrace{\left[ z \in \mathbf {B}, \mathbf {B} \subseteq \mathbb {R}, f:\mathbf {B} \rightarrow \mathbb {N} \right]}_{\text{discrete, numeric}} \vee \underbrace{\left[ ((\forall q \in \bar{q}) \pi q = q) \Rightarrow \pi \cdot z = z, \pi \in \text{Perm} \hspace{2.84544pt}\mathbb {A} \right]}_{\text{categorical, nominal}} \vee \underbrace{z \in \omega _{\omega }}_{\text{categorical, ordinal}} \big ]\end{aligned}\)

•

\(\underbrace{\begin{aligned}\exists x|a \in {\bf X}, x \in \mathbf {\Omega }, a, b \in \mathcal {A} \Rightarrow \big [(a = b \Rightarrow x|a = y|b) \wedge (x|b = y|b \Rightarrow x|a = y|a) \wedge \\ \big (\forall (a_i)_{i \in I} \in \mathcal {A}, \forall (x_i)_{i \in I} \in \mathbf {X}, \forall i\in I \Rightarrow \exists !x(x|a_i = x_i|a_i)\big)\big ]\end{aligned}}_{\text{conditional inclusion}}\)

The base formulation of Equations (4) and (5) is given as

\(\begin{equation} \text{min}\hspace{2.84544pt}f_{\text{opt}}, \hspace{5.69046pt} f_{\text{opt}} = \lambda _1f_{\text{error}}(\mathbf {\Omega }) + \lambda _2f_{\text{flash}}(\mathbf {\Omega }) + \lambda _3f_{\text{SRAM}}(\mathbf {\Omega }) + \lambda _4f_{\text{latency}}(\mathbf {\Omega }), \end{equation}\)

(7)

where

\(\begin{equation} f_{\text{flash}}(\mathbf {\Omega }) = {\left\lbrace \begin{array}{ll} \gamma _f \Leftrightarrow \left(|\gamma _f| \lt 1 \wedge \underbrace{\epsilon _{\text{flag}} = 0}_{\text{fault flag}}\right),\hspace{5.69046pt} \gamma _f = \left(\underbrace{-\frac{||h_{\text{FB}}(w,\lbrace V,E\rbrace)||_0}{\text{flash}_{\max }}}_{\text{model proxy}} + \underbrace{\xi _f}_{\text{slack for symbolic}} \vee \underbrace{-\frac{\text{Compiler-reported flash}}{\text{flash}_{\max }}}_{\text{real measurement}}\right), \\ \alpha _f, \hspace{5.69046pt} \alpha _f \gg \text{flash}_{\max } \end{array}\right.} \end{equation}\)

(8)

\(\begin{equation} f_{\text{SRAM}}(\mathbf {\Omega }) = {\left\lbrace \begin{array}{ll} \gamma _s \Leftrightarrow \left(|\gamma _s| \lt 1 \wedge \underbrace{\epsilon _{\text{flag}} = 0}_{\text{fault flag}}\right),\hspace{5.69046pt} \gamma _s = \left(\underbrace{ -\frac{\max _{l \in [1, L]} \lbrace ||x_l||_0 + ||a_l||_0 \rbrace }{\text{SRAM}_{\max }}}_{\text{model proxy}} + \underbrace{\xi _s}_{\text{slack for symbolic}} \vee \underbrace{-\frac{\text{Compiler-reported SRAM}}{\text{SRAM}_{\max }}}_{\text{real measurement}}\right), \\ \alpha _s, \hspace{5.69046pt} \alpha _s \gg \text{SRAM}_{\max } \end{array}\right.} \end{equation}\)

(9)

\(\begin{equation} f_{\text{latency}}(\mathbf {\Omega }) = {\left\lbrace \begin{array}{ll} \underbrace{\frac{\text{FLOPS}}{\text{FLOPS}_{\text{target}}}}_{\text{model proxy}} \vee \underbrace{\frac{\text{RTOS-reported latency}}{\text{latency}_{\text{target}}}}_{\text{real measurement}} \Leftrightarrow \underbrace{\epsilon _{\text{flag}} = 0}_{\text{fault flag}}, \\ \alpha _l, \hspace{5.69046pt} \alpha _l \gg \text{FLOPS}_{\text{target}} \vee {\text{latency}_{\text{target}}} \end{array}\right.}. \end{equation}\)

(10)

The goal of the base formulation is to find a Pareto-optimal neurosymbolic program that has the lowest possible runtime latency while using as much of the device’s SRAM and flash capacity as possible without inducing overflow or faults. The performance of a candidate neurosymbolic program on the validation dataset at each iteration in the search provides \(f_{\text{error}}(\mathbf {\Omega })\) . When the target hardware is connected to the training server, the compiler provides the program SRAM consumption \(f_{\text{SRAM}}(\mathbf {\Omega })\) and flash consumption \(f_{\text{flash}}(\mathbf {\Omega })\) , while the onboard real-time operating system (RTOS) reports the program runtime latency \(f_{\text{latency}}(\mathbf {\Omega })\) . The measurements are conditioned on the absence of faults, indicated by \(\epsilon _{\text{flag}}\) . Based on prior work [134, 135], we set \(\lambda _1\) to 1.0, \(\lambda _2\) to 0.01, \(\lambda _3\) to 0.01, and \(\lambda _4\) to 0.05. TinyNS has the following fault detection capabilities:

•

Flash, SRAM, or model arena buffer overflow (the program is too big to fit).

•

Use of unsupported ML operators.

•

Compilation errors.

•

Runtime RTOS faults.

If \(\epsilon _{\text{flag}} = 0\) , then the hardware metrics are normalized by the device SRAM and flash capacities ( \(\text{SRAM}_{\max }\) , \(\text{flash}_{\max }\) ), and target latency ( \(\text{latency}_{\text{target}}\) ) to a common scale. If \(\epsilon _{\text{flag}} \ne 0\) , then the hardware metrics are set to a value much larger than the device capacity or target latency. We set \(\alpha _f\) = 125, \(\alpha _s\) = 125, \(\alpha _l\) = 50, resulting in \(f_{\text{opt}}\) being 5.0 whenever deployability constraints are violated ( \(0.01 \cdot 125 + 0.01 \cdot 125 + 0.05 \cdot 50 = 5.0\) ). This policy, called hard thresholding, achieves full device capability exploitation. Since violating deployability constraints always returns an \(f_{\text{opt}}\) of 5, after sufficient iterations, TinyNS can observe and exploit the small but valid linear region of SRAM and flash usage between \(-1\) and 0 ( \(\gamma _f\) and \(\gamma _s\) are valid between \(-1\) and 0), striving to move \(\gamma _f\) and \(\gamma _s\) toward \(-1\) . Yet, TinyNS is aware that certain choices of ML operators and symbolic atoms would make \(\gamma _f\) and \(\gamma _s\) more negative (hence, the objective should ideally be minimized even further) but are invalid. In other words, the optimizer is penalized by a large constant number when it picks candidate models that do not fit within the device or induce faults, which encourages the acquisition function not to pick too many points in the regime where the violation may occur. After sampling sufficient points in the small but valid linear region and the invalid regions, the surrogate function smooths out sufficiently to match the linear region in the objective plane where the accuracy improvement is proportional to memory usage without inducing faults. Hard thresholding is possible thanks to the adoption of the parallel version [48] of GP-UCB [155, 156]. During exploitation, GP-UCB picks candidate models that are likely to minimize \(f_{\text{opt}}\) .
The sample points in this phase will be close to one or more of the “successful” points in the linear/valid region found during previous iterations. Exploitation, thereby, provides a finer-grained view of the objective plane. During exploration, GP-UCB will pick points in either the valid or the invalid region to make sure the optimizer is not stuck in local optima. Exploration, thereby, provides a coarse-grained view of the objective plane. With sufficient iterations, the acquisition function moves from being exploration driven to exploitation driven, converging near the theoretical optimal value at the boundary of violating deployability constraints. The parallel implementation allows the optimizer to have access to more “batches of sample points” at each iteration. The policy of hard thresholding is not possible to implement with gradient-based optimizers due to discontinuous penalization. For those optimizers, one would have to train the model to get the accuracy even if GPU hours are wasted, calculate the memory usage, and penalize in a continuous fashion proportional to the memory usage (referred to as coupling of deployability and performance). Since we do not train a candidate model once deployability constraints have been violated, hard thresholding (combined with fault detection) also prevents TinyNS from training a candidate model that does not satisfy all the constraints, saving valuable GPU hours by as much as 50% over gradient-based optimizers.
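The hard-thresholding policy can be summarized in a short sketch of Equations (7) to (10) using the stated values of \(\lambda\) and \(\alpha\). The function name, its arguments, and the device numbers in the usage lines are hypothetical, and the sketch assumes that the error term contributes nothing when a candidate is rejected, since such a candidate is not trained.

```python
LAMBDAS = {"error": 1.0, "flash": 0.01, "sram": 0.01, "latency": 0.05}
PENALTY = {"flash": 125.0, "sram": 125.0, "latency": 50.0}  # alpha_f, alpha_s, alpha_l

def f_opt(error, flash_used, sram_used, latency,
          flash_max, sram_max, latency_target, fault=False):
    """Scalarized objective of Equation (7) with hard thresholding."""
    gamma_f = -flash_used / flash_max  # valid only in (-1, 0]
    gamma_s = -sram_used / sram_max    # valid only in (-1, 0]
    violated = fault or abs(gamma_f) >= 1.0 or abs(gamma_s) >= 1.0
    if violated:
        # Assumption: the candidate is not trained, so the error term is 0
        # and every hardware term is replaced by its constant penalty.
        return (LAMBDAS["flash"] * PENALTY["flash"]
                + LAMBDAS["sram"] * PENALTY["sram"]
                + LAMBDAS["latency"] * PENALTY["latency"])
    return (LAMBDAS["error"] * error
            + LAMBDAS["flash"] * gamma_f
            + LAMBDAS["sram"] * gamma_s
            + LAMBDAS["latency"] * latency / latency_target)

# Hypothetical device: 1,024,000 B flash, 256,000 B SRAM, 10 ms latency target.
DEVICE = dict(flash_max=1_024_000, sram_max=256_000, latency_target=10.0)
fits = f_opt(error=0.08, flash_used=512_000, sram_used=128_000,
             latency=5.0, **DEVICE)
too_big = f_opt(error=0.0, flash_used=2_000_000, sram_used=128_000,
                latency=5.0, **DEVICE)
print(fits, too_big)
```

For a deployable candidate, using more flash or SRAM makes the hardware terms more negative and lowers the objective, so the optimizer is rewarded for filling the device. Any violation jumps the objective to the constant 5.0 regardless of how far the budget was exceeded.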

Note that SpArSe [59] treats \(\lambda\) as a super-hyperparameter being drawn from a random distribution at each iteration. However, realizing \(\lambda\) as a super-hyperparameter in complex neurosymbolic search spaces with a gradient-free and black-box optimizer is challenging as compared to the gradient-based optimizer in SpArSe. For the same program candidate, different values of \(\lambda\) will yield different values of \(f_{\text{opt}}\) at each iteration, resulting in a large number of iterations needed to achieve acceptable performance. We are aware that our choices of \(\lambda\) and \(\alpha\) may not provide the optimal neurosymbolic program for each application, but, as we will showcase, they are able to guarantee high-utility and deployable neurosymbolic programs that significantly outperform the state-of-the-art.

When the target device is absent, TinyNS relies on well-known analytical proxies to provide device resource usage estimates. \(f_{\text{flash}}(\mathbf {\Omega })\) is given by the size of the flatbuffer model schema \(h_{\text{FB}}(\cdot)\) [43]. \(f_{\text{SRAM}}(\mathbf {\Omega })\) is given by the standard NN SRAM usage model, with intermediate layer-wise activation maps and tensors stored in the SRAM [59]. \(f_{\text{latency}}(\mathbf {\Omega })\) is provided by the FLOPS count [14]. Assuming the ML component dominates resource usage over symbolic components, a static slack constant \(\xi\) is added to the SRAM and flash proxies to account for SRAM and flash usage by the symbolic program. There are, however, several issues with this profiling approach:

•

Proxies are inaccurate and do not work for a wide variety of ML operators (e.g., well-known proxies were developed only for convolutional models) [134, 135]. Proxies do not even exist for symbolic programs.

•

Model proxies tend to overestimate device capabilities without considering overhead from symbolic programs, runtime inference engines, RTOS, or data stacks [134, 135].

•

Proxies cannot capture all the faults that the platform-in-the-loop approach can. Hence, the correctness of the neurosymbolic program is not guaranteed.

•

Proxies cannot take into account compiler suite optimizations at the execution level, often yielding sub-optimal models compared to the platform-in-the-loop approach.

For each candidate neurosymbolic program, TinyNS automatically writes embedded C code for microcontrollers from Python constructs using parsers. The recipes used by the parsers are discussed next.

4.1 Symbolic Neuro Symbolic

Problem Formulation (Symbolic). Consider a vector of independent domain-engineered functions \({\bf z}(\cdot)\) constructed from \(s\) in \(\mathbf {\Omega }\) that operate on \({\bf X}\) . During the search process, each function in \({\bf z}(\cdot)\) can be accessed through a binary mask \(c\) , signifying the activation and deactivation of a collection of elements of \({\bf z}(\cdot)\) :

\(\begin{equation} {\bf X}_i^{\text{feat}} = z_i^{{\bf U}_i}({\bf X}) \Leftrightarrow c_i = 1, \hspace{5.69046pt} i \in [1,n], \hspace{5.69046pt} c_i \in \lbrace 0, 1\rbrace , \end{equation}\)

(11)

where \({\bf U}\) is a 2D hyperparameter data structure for \({\bf z}(\cdot)\) . The \(i{\text{th}}\) row of \({\bf U}\) corresponds to the hyperparameters for \(z_i\) . The number of columns of \({\bf U}\) equals \(e\) , the number of hyperparameter arguments of the function in \({\bf z}(\cdot)\) that takes the most arguments. Each element in \({\bf U}\) corresponds to the range of possible floating-point numbers in the search space for the \((i,j){\text{th}}\) hyperparameter, expressed as a list. Boolean hyperparameters are converted to (0.0, 1.0), and nominal variables are converted to ordinal choices (e.g., 1.0, 2.0, 3.0, 4.0, 5.0). The length of each element in \({\bf U}\) varies:

\(\begin{equation} {\bf U} = \begin{bmatrix} [\alpha _1^{1,1}, \alpha _2^{1,1}, \ldots , \alpha ^{1,1}_{\gamma ^1_1}] & [\alpha _1^{1,2}, \alpha _2^{1,2}, \ldots , \alpha ^{1,2}_{\gamma ^1_2}] & ... & [\alpha _1^{1,e}, \alpha _2^{1,e}, \ldots , \alpha ^{1,e}_{\gamma ^1_e}] \\ {[}\alpha _1^{2,1}, \alpha _2^{2,1}, \ldots , \alpha ^{2,1}_{\gamma ^2_1}] & [\alpha _1^{2,2}, \alpha _2^{2,2}, \ldots , \alpha ^{2,2}_{\gamma ^2_2}] & ... & [\alpha _1^{2,e}, \alpha _2^{2,e}, \ldots , \alpha ^{2,e}_{\gamma ^2_e}] \\ . & . &... &. \\ . & . & ...&. \\ . & . &... &. \\ {[}\alpha _1^{n,1}, \alpha _2^{n,1}, \ldots , \alpha ^{n,1}_{\gamma ^n_1}] & [\alpha _1^{n,2}, \alpha _2^{n,2}, \ldots , \alpha ^{n,2}_{\gamma ^n_2}] & ... & [\alpha _1^{n,e}, \alpha _2^{n,e}, \ldots , \alpha ^{n,e}_{\gamma ^n_e}] \\ \end{bmatrix}. \end{equation}\)

(12)

An example of \({\bf U}\) is shown below. There are three feature functions in \({\bf z}\) . The first feature takes four hyperparameter arguments, the second takes one, and the third takes two. All the functions are programmed to accept four arguments, but a function need not use all four (the unused entries are set to [0.0]). The arguments are internally processed by each function to the correct form:

\(\begin{equation} {\bf U}_{\text{sample}} = \begin{bmatrix} [0.0, 1.0] & \text{range}(3.0,64.0) & \text{uniform}(-5.0, 10.0) & [1.2,5.2] \\ {[}0.2,0.5,0.8,1.5,2.3] & [0.0] & [0.0] & [0.0] \\ {[}1.0,2.0,3.0,4.0] & \text{linspace}(-22.0,22.0,100) & [0.0] & [0.0] \\ \end{bmatrix}. \end{equation}\)

(13)

To normalize each element in \({\bf U}\) to the same scale and make the search tractable, TinyNS exposes the optimizer to a slider matrix \({\bf U}_{\text{slider}}\) during the search process instead of exposing it directly to \({\bf U}\) :

\(\begin{equation} {\bf U}_{\text{slider}} = \begin{bmatrix} \zeta _{1,1} & \zeta _{1,2} & ... & \zeta _{1,e} \\ \zeta _{2,1} & \zeta _{2,2} & ... & \zeta _{2,e} \\ . & .& ... & . \\ . & .& ... & . \\ . & .& ... & . \\ \zeta _{n,1} & \zeta _{n,2} & ... & \zeta _{n,e} \\ \end{bmatrix}, \hspace{5.69046pt} \zeta _{i,j} = {\left\lbrace \begin{array}{ll} \text{linspace}(0, 1, \delta) \Leftrightarrow \left|\left[\alpha _1^{i,j}, \alpha _2^{i,j}, \ldots , \alpha ^{i,j}_{\gamma ^i_j}\right]\right| \ne 1,\\ 0 \end{array}\right.} \end{equation}\)

(14)

where \(\delta\) represents the granularity factor, which controls how finely each element in \({\bf U}\) can be chosen. Ideally, \(\delta\) should be equal to the length of the largest array in \({\bf U}\) . Let \(\eta _{i,j}\) be a value in an array element in \({\bf U}\) . The mapping between \(\zeta _{i,j}\) and \(\eta _{i,j}\) is

\(\begin{equation} \eta _{i,j} = \alpha ^{i,j}_{\kappa }, \hspace{5.69046pt} \kappa = \text{round}\left(\zeta _{i,j} \cdot \left|\left[\alpha _1^{i,j}, \alpha _2^{i,j}, \ldots , \alpha ^{i,j}_{\gamma ^i_j}\right]\right|\right), \hspace{5.69046pt} \zeta _{i,j}\in [0,1]. \end{equation}\)

(15)

The search space for the symbolic components, thereby, is composed of the binary mask \(c\) and \({\bf U}_{\text{slider}}\) .
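The decoding of a slider value into a concrete hyperparameter, together with the binary mask, can be sketched as follows. The matrix contents and helper names are hypothetical, rows are indexed from 0 as is usual in Python, and the clamping of \(\kappa\) to a valid 1-based index is an assumption, because Equation (15) does not state how \(\zeta _{i,j} = 0\) is handled.

```python
def slider_to_value(zeta, choices):
    """Map a slider value zeta in [0, 1] to one entry of a list in U (Equation (15))."""
    if len(choices) == 1:
        return choices[0]  # single-valued entries keep their slider fixed at 0
    kappa = round(zeta * len(choices))
    # Assumption: clamp kappa to the 1-based index range [1, len(choices)].
    kappa = min(max(kappa, 1), len(choices))
    return choices[kappa - 1]

def active_hyperparameters(mask, slider, U):
    """Decode the hyperparameter rows of the features whose mask bit is 1."""
    return {
        i: [slider_to_value(z, col) for z, col in zip(slider[i], U[i])]
        for i, bit in enumerate(mask) if bit == 1
    }

# Discretized stand-in for a hypothetical U with n = 2 features and e = 2 columns.
U = [
    [[0.0, 1.0], [3.0, 4.0, 5.0, 6.0]],
    [[0.2, 0.5, 0.8, 1.5, 2.3], [0.0]],
]
slider = [[1.0, 0.5], [0.6, 0.0]]  # values an optimizer might propose

decoded_all = active_hyperparameters([1, 1], slider, U)
decoded_first = active_hyperparameters([1, 0], slider, U)
print(decoded_all, decoded_first)
```

The optimizer only ever sees numbers in [0, 1] plus the mask bits, while the wrapper translates them back into lists of different lengths and value ranges before the feature functions are called.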

Problem Formulation (Neural). Consider a collection of \(k\) model backbones \(\mathbf {\phi }\) constructed from \(m\) in \(\mathbf {\Omega }\) . During each iteration in the search process, only one of the models is considered via an ordinal mask \(d\) :

\(\begin{equation} \text{model}_{\text{iteration}_t} = \phi _i, \hspace{5.69046pt} i \in d, \hspace{5.69046pt} d = [1,2,\ldots ,k]. \end{equation}\)

(16)

Each model will have its own optimization hyperparameters (e.g., number of convolutional layers, kernel size, support vector kernel type, etc.). We modify the concept of hyperparameter data structure and slider matrix from the symbolic search space to account for ordinal model choice. Let \({\bf V}\) be the 2D hyperparameter data structure for \(\mathbf {\phi }\) . The structure of \({\bf V}\) remains the same as that of \({\bf U}\) , now with \(k\) rows of hyperparameters. The number of columns of \({\bf V}\) equals \(f\) , the number of hyperparameter arguments of the backbone in \(\mathbf {\phi }\) that takes the most arguments:

\(\begin{equation} {\bf V} = \begin{bmatrix} [\beta _1^{1,1}, \beta _2^{1,1}, \ldots , \beta ^{1,1}_{\gamma ^1_1}] & [\beta _1^{1,2}, \beta _2^{1,2}, \ldots , \beta ^{1,2}_{\gamma ^1_2}] & ... & [\beta _1^{1,f}, \beta _2^{1,f}, \ldots , \beta ^{1,f}_{\gamma ^1_f}] \\ {[}\beta _1^{2,1}, \beta _2^{2,1}, \ldots , \beta ^{2,1}_{\gamma ^2_1}] & [\beta _1^{2,2}, \beta _2^{2,2}, \ldots , \beta ^{2,2}_{\gamma ^2_2}] & ... & [\beta _1^{2,f}, \beta _2^{2,f}, \ldots , \beta ^{2,f}_{\gamma ^2_f}] \\ . & . &... &. \\ . & . & ...&. \\ . & . &... &. \\ {[}\beta _1^{k,1}, \beta _2^{k,1}, \ldots , \beta ^{k,1}_{\gamma ^k_1}] & [\beta _1^{k,2}, \beta _2^{k,2}, \ldots , \beta ^{k,2}_{\gamma ^k_2}] & ... & [\beta _1^{k,f}, \beta _2^{k,f}, \ldots , \beta ^{k,f}_{\gamma ^k_f}] \\ \end{bmatrix}. \end{equation}\)

(17)

An example of \({\bf V}\) is shown below. The first row corresponds to the hyperparameters for a temporal convolutional network (TCN) [162], and the second row corresponds to the hyperparameters for Bonsai [93]:

\(\begin{equation} {\bf V}_{\text{sample}} = \begin{bmatrix} \underbrace{\text{range}(2,64)}_{\text{kernel size}} & \underbrace{[1.0,2.0,5.0]}_{_{\text{stack count}}} & \underbrace{[[1,2,4],[1,2,4,8],[1,4,8,32]]}_{\text{dilation factors}} & \underbrace{\text{uniform}(0.0,1.0)}_{\text{dropout}} \\ \underbrace{\text{range}(40,60)}_{\text{prototype count}} & \underbrace{\text{range}(1,4)}_{\text{sigmoid parameter}} & \underbrace{\text{range}(1,6)}_{\text{depth}} & [0.0] \end{bmatrix}. \end{equation}\)

(18)

Since \(d\) is ordinal, \({\bf V}_{\text{slider}}\) takes a vector form:

\(\begin{equation} {\bf V}_{\text{slider}} = \begin{bmatrix} \chi _{1,1} & \chi _{1,2} & ... & \chi _{1,f} \\ \end{bmatrix}, \hspace{5.69046pt} \chi _{i,j} = {\left\lbrace \begin{array}{ll} \text{linspace}(0, 1, \delta) \Leftrightarrow \left|\left[\beta _1^{i,j}, \beta _2^{i,j}, \ldots , \beta ^{i,j}_{\gamma ^i_j}\right]\right| \ne 1.\\ 0 \end{array}\right.} \end{equation}\)

(19)

The search space for the neural components, thereby, is composed of the ordinal mask \(d\) and \({\bf V}_{\text{slider}}\) . Note that when \(k=1\) , the elements in \({\bf V}\) are directly fed to the search algorithm.

Parsing (Symbolic). The Python constructs for each function in \({\bf z}(\cdot)\) have equivalent C constructs, declared in a .h file and defined in a .cc file. The .cc file also includes an extract_symbolic(raw_data[], output_feat[], mask[], params[]) function, which takes the windowed and raw sensor data as input (raw_data[]), picks functions according to a binary mask array (mask[]), applies the corresponding hyperparameters to the chosen functions (params[]), and outputs the processed data (output_feat[]). TinyNS writes the Pareto-optimal mask \(c^*\) as mask[], the Pareto-optimal values in the 2D hyperparameter data structure \({\bf U}^*\) as a flattened array params[], and the maximum number of arguments each function can take (MAX_PARAM_COUNT) to the .cc file. Algorithm 1 provides an example implementation of the extract_symbolic() function. All of the functions are programmed to take a hyperparameter array of length MAX_PARAM_COUNT, internally processing the arguments to the correct form as in Python. An array of function pointers of type f allows flexible addition, removal, and access to functions, retaining the same order of functions from Python and allowing sequential application of each function to the raw input data. The output channel count for each function is variable and defined in func_output_size[].
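The masked dispatch performed by extract_symbolic() can be mirrored on the Python side as follows. This is a minimal sketch under stated assumptions: the two feature functions (`window_mean`, `clipped_range`), the value of `MAX_PARAM_COUNT`, and the concatenation of outputs are hypothetical stand-ins and do not reproduce Algorithm 1.

```python
MAX_PARAM_COUNT = 2  # hypothetical: largest number of arguments any function takes


def window_mean(x, params):
    """Mean of the window; this function ignores its parameter slots."""
    return [sum(x) / len(x)]


def clipped_range(x, params):
    """Range of the window after clipping to [params[0], params[1]]."""
    lo, hi = params[0], params[1]
    clipped = [min(max(v, lo), hi) for v in x]
    return [max(clipped) - min(clipped)]


# Function table in the same order as the binary mask (mirrors the C pointer array).
FUNCTIONS = [window_mean, clipped_range]


def extract_symbolic(raw_data, mask, params):
    """Apply every function whose mask bit is set and concatenate the outputs."""
    output_feat = []
    for i, func in enumerate(FUNCTIONS):
        if mask[i]:
            start = i * MAX_PARAM_COUNT  # params is a flattened 2D structure
            output_feat.extend(func(raw_data, params[start:start + MAX_PARAM_COUNT]))
    return output_feat


window = [0.0, 2.0, 4.0, 10.0]
flat_params = [0.0, 0.0, 1.0, 5.0]
features = extract_symbolic(window, mask=[1, 1], params=flat_params)
```

Giving every function a fixed-length parameter slot keeps the indexing into the flattened array trivial, at the cost of padding for functions that take fewer arguments.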

Parsing (Neural). TinyNS uses the TensorFlow Lite Micro (TFLM) [43] Mbed RTOS C file system for real-time model inference on microcontrollers. Algorithm 2 shows the main.cc file of the file system. We choose TFLM as the runtime inference engine due to its widespread public use, portable design philosophy, heterogeneous hardware support, memory-efficient paradigms, static memory allocation, and pathways for easy model replacement [43, 136]. First, the model backbone in Python is constructed using Keras [72] or Keras/TensorFlow wrappers for Scikit-learn [122] with TensorFlow backend [1]. Next, the Keras model is converted to a .tflite model, with appropriate quantization schemes applied during conversion (e.g., no quantization or full integer quantization using a representative dataset). The parser now needs to check if the operators in the .tflite file are present in the TFLM operator resolver list. The steps are:

•

Read the .tflite file as a flatbuffer byte array.

•

Decode the value at the start of the flatbuffer using packer type flatbuffers.packer.uoffset to create a model object.

•

Unpack the model object into a graph of flatbuffer objects.

•

Convert the hierarchy of flatbuffer objects to a nested opcode dictionary.

•

Match the opcode keys in the model to the opcode names in the BUILTIN_OPCODE2NAME dictionary provided with the TFLite API.

•

Check if the resulting set of names is present in the AVAILABLE_TFLM_OPS list.
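The last two steps of the list above reduce to set operations on operator names. The sketch below assumes the opcodes have already been extracted from the flatbuffer; the contents of `AVAILABLE_TFLM_OPS`, the toy opcode table with placeholder integer codes, and the function name `check_tflm_support` are illustrative assumptions rather than the real TFLite schema or TinyNS code.

```python
# Hypothetical excerpt; the real list is derived from the TFLM sources.
AVAILABLE_TFLM_OPS = {"CONV_2D", "DEPTHWISE_CONV_2D", "FULLY_CONNECTED",
                      "RESHAPE", "SOFTMAX"}

# Toy opcode table with placeholder integer codes (not the real schema values).
TOY_OPCODE2NAME = {0: "CONV_2D", 1: "FULLY_CONNECTED", 2: "SOFTMAX",
                   3: "MY_CUSTOM_OP"}


def check_tflm_support(model_opcodes, opcode2name, available_ops):
    """Return (supported, missing) operator-name sets for a model."""
    names = {opcode2name[code] for code in model_opcodes}
    return names & available_ops, names - available_ops


supported, missing = check_tflm_support([0, 1, 2, 0], TOY_OPCODE2NAME,
                                        AVAILABLE_TFLM_OPS)
deployable = not missing  # the model can be ported only if nothing is missing
```

The `supported` set is also what the parser would later register with the operator resolver, so a single pass yields both the deployability check and the list of operators to link.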

If all the operators in the model are supported by TFLM, then the .tflite file is converted to a flatbuffer model schema using Linux hex dump, generating a .cc file of the model. The parser opens the main.cc file and makes the following changes:

•

Declare the TFLM arena size depending on target hardware constraints. The arena is a stack in the SRAM used for initialization and runtime variable storage.

•

Declare the arrays for storing raw data and processed output from extract_symbolic, which is also the input to the model. The arrays can be float or int depending on model quantization. In TFLM, flattened input arrays are internally reshaped to match the input tensor shape of the model.

•

Declare a TFLM interpreter instance (MicroModelRunner), which resolves the model graph during runtime. The data types should be the input and output data types of the model, and the last number indicates the number of unique ML operators that need to be called by the operator resolver.

•

Declare the TFLM operator resolver instance (MicroMutableOpResolver), which links only the essential ML operators to the model graph.

•

Add the operators necessary to resolve the graph from the intersection of the set of model opcode names and the AVAILABLE_TFLM_OPS list.

•

Pass the flatbuffer model schema, the operator resolver, and the arena to the interpreter.

•

Dequantize the outputs if the model output is quantized.
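Two of the mechanical steps in this part of the parser are easy to sketch: turning the .tflite bytes into a C array (the hex-dump conversion) and dequantizing integer outputs. The snippet below is an illustration under assumptions: the array name `g_model`, the helper names, and the sample bytes are hypothetical, and the dequantization uses the standard affine rule real = scale × (q − zero_point).

```python
def bytes_to_c_array(data, name="g_model", per_line=12):
    """Render a byte string as C source, similar in spirit to a hex dump."""
    lines = []
    for i in range(0, len(data), per_line):
        chunk = data[i:i + per_line]
        lines.append("  " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    body = "\n".join(lines)
    return (f"const unsigned char {name}[] = {{\n{body}\n}};\n"
            f"const unsigned int {name}_len = {len(data)};\n")


def dequantize(q_values, scale, zero_point):
    """Affine dequantization: real = scale * (q - zero_point)."""
    return [scale * (q - zero_point) for q in q_values]


# Four sample bytes stand in for the contents of a .tflite file.
source = bytes_to_c_array(b"TFL3", name="g_model")
# int8 outputs with hypothetical quantization parameters.
probs = dequantize([-128, 0, 127], scale=1 / 256, zero_point=-128)
```

In a real deployment the scale and zero point would be read from the output tensor's quantization parameters rather than hard-coded.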

Figure 4 summarizes the parser operation between the Python file system and the TFLM Mbed RTOS C file system.

Examples. An example includes finding the best set of features for on-device wearable human activity recognition. Another example includes finding the best model among a set of models for on-device wearable fall detection under 2 kB of memory. We showcase the examples in Sections 5.2 and 5.3. In the first example, the search algorithm is given a model backbone and several temporal, statistical, and spectral features that can operate on the raw, windowed data. The goal is to find the best model hyperparameters and features that work well to give maximal activity detection accuracy within the hardware constraints. In the second example, the goal is to find the best model and its corresponding hyperparameters that can detect falls within a tight memory budget.

Fig. 4.

4.2 Neuro \(\rightarrow\) Symbol

Problem Formulation. There are two ways to realize this paradigm. First, if a static domain-engineered function \(z(\cdot)\) with hyperparameter data vector u operates on the output of the model to produce high-level reasoning, then the symbolic search space only contains u:

\(\begin{equation} {\bf u} = \begin{bmatrix} \left[\alpha _1^{1,1}, \alpha _2^{1,1}, \ldots , \alpha ^{1,1}_{\gamma ^1_1}\right] & \left[\alpha _1^{1,2}, \alpha _2^{1,2}, \ldots , \alpha ^{1,2}_{\gamma ^1_2}\right] & ... & \left[\alpha _1^{1,e}, \alpha _2^{1,e}, \ldots , \alpha ^{1,e}_{\gamma ^1_e}\right] \\ \end{bmatrix}. \end{equation}\)

(20)

u is similar in form to \({\bf U}\) from Section 4.1, but only corresponds to the optimization hyperparameter space for a single function. The neural search space is the same as that shown in Section 4.1.

Second, consider a collection of logical (e.g., AND, OR, NOT) operators \(\Lambda\) , relational (e.g., equivalence, less than or equal to, greater than or equal to) operators \(\Re\) , arithmetic (e.g., add, multiply) operators \(\Xi\) , and conditional (e.g., if else then) operators \(\Upsilon\) , expressed in a Domain-Specific Language (DSL) [118]. Given maximum tree depth \(\wp\) and a finite number of trees \(N\) , the symbolic atoms can be combined to synthesize candidate program graphs (or program decision trees) that can perform high-level reasoning over several neural output timesteps:

\(\begin{equation} {\bf G}= \text{GenerateProgramTree}(\lbrace \Lambda , \Re , \Xi , \Upsilon \rbrace , \wp , N). \end{equation}\)

(21)

Figure 5 shows an example program supergraph generated from the DSL operator space, from which candidate trees can be extracted. The GenerateProgramTree() is an enumeration algorithm [118, 160] that generates all possible combinations of program graphs \({\bf G}\) given \(\wp\) and \(N\) using context-free grammar. The rules of connection are fixed by the DSL. Ideally, the path cost of the program graph should be low for interpretability and resource savings, yet have high accuracy. In other words, in Figure 5, the goal is to find the top-performing shortest path to Decision A and Decision B. The symbolic search space is an ordinal mask \(j\) that represents one of \(N\) program subgraphs extracted from the program supergraph:

\(\begin{equation} \text{program}_{\text{iteration}_t} = G_i, \hspace{5.69046pt} G_i \in {\bf G}, \hspace{5.69046pt} i \in j, \hspace{5.69046pt} j = [1, 2, \ldots , N]. \end{equation}\)

(22)
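A toy version of the enumeration behind Equation (21) is sketched below to show how the depth bound \(\wp\) and the tree budget \(N\) shape the symbolic search space. It covers only logical operators over Boolean inputs; relational, arithmetic, and conditional operators, as well as the DSL grammar itself, are omitted. All function names and the nested-tuple tree encoding are assumptions for illustration, not the enumeration algorithm of [118, 160].

```python
from itertools import product


def tree_size(tree):
    """Number of nodes in a tree, used here as a proxy for path cost."""
    if isinstance(tree, str):
        return 1
    return 1 + sum(tree_size(child) for child in tree[1:])


def generate_program_trees(leaves, unary_ops, binary_ops, max_depth, max_trees):
    """Enumerate trees up to max_depth, smallest first, capped at max_trees."""
    trees = list(leaves)
    for _ in range(max_depth):
        grown = list(trees)
        grown += [(op, t) for op in unary_ops for t in trees]
        grown += [(op, l, r) for op in binary_ops
                  for l, r in product(trees, repeat=2)]
        trees = list(dict.fromkeys(grown))  # de-duplicate while keeping order
    trees.sort(key=tree_size)
    return trees[:max_trees]


def evaluate(tree, env):
    """Evaluate a tree against a dict of Boolean inputs."""
    if isinstance(tree, str):
        return env[tree]
    op, *args = tree
    vals = [evaluate(a, env) for a in args]
    if op == "NOT":
        return not vals[0]
    if op == "AND":
        return vals[0] and vals[1]
    if op == "OR":
        return vals[0] or vals[1]
    raise ValueError(f"unknown operator: {op}")


G = generate_program_trees(["a", "b"], ["NOT"], ["AND", "OR"],
                           max_depth=1, max_trees=100)
```

The ordinal mask \(j\) then simply indexes into `G`; sorting by size places the cheapest candidate programs at the low end of the mask.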

The neural search space is the same as that shown in Section 4.1.

Parsing. Neuro \(\rightarrow\) Symbol follows the same model parsing strategy discussed in Section 4.1, Algorithm 2 and Figure 4. For symbolic parsing, in the first case, the symbolic parser passes the Pareto-optimal \({\bf u}^*\) as hyperparameter_vector[] to the main.cc file, where the function \(z({\cdot })\) is defined as symbolic_function(). This function operates on the output of the model. An example of this case is shown in Algorithm 3. In the second case, the program decision tree along with the grammar and the parser runtime are ported as header files. The steps to port a program tree generated using ANTLR [119] are:

Fig. 5.

•

Port the graph as a .txt or .h file, expressed in DSL.

•

Define the lexer rules in a .g4 file. The lexer rules are necessary to tokenize the DSL program tree.

•

Run the ANTLR runtime engine with the lexer.g4 file in the target language (Python or C) to create the necessary lexer files.

•

Define the grammar in another .g4 file. The grammar defines the relations between the class of tokens, assigning labels using the DSL operator space.

•

Run the ANTLR runtime engine again, but with the grammar.g4 file to create the parser files, which process the program graph to create a hierarchical abstract syntax tree. Specify the -visitor flag when running the engine to have control over the query traversal.

•

Create a visitor, which will traverse the tree according to the parser grammar.

•

Pass the DSL graph from the .txt or .h file to the lexer as a string argument. The tokenized tree is passed to the parser to generate the syntax tree, which is finally passed to the visitor for traversal.

Examples. An example of the first approach includes joint optimization of a symbolic object tracker with a neural object detector using the CenterNet algorithm [179]. We showcase this example in Section 5.4. The object detector backbone is a ResNet-34 + Deformable Convolutional Network, with the optimization hyperparameters being the number of convolutional stacks, the kernel size, whether to use layer-wise activations, and the head convolutional value. Given an input image \(I^t \in \mathbb {R}^{W\times H\times 3}\) , the model outputs the center points \(\hat{D}_{{\bf p}_i}\) and bounding box dimensions \(\hat{S}_{{\bf p}_i}\) of the detected objects, as well as a heatmap of the centroid of the objects ( \(\hat{Y}_{xyc}, \hat{Y} \in [0,1]^{ \frac{W}{R} \times \frac{H}{R} \times C}\) ) based on the rendering function \(\mathcal {R}\) with Gaussian kernel \(\sigma _i\) for each class \(c \in \lbrace 0, 1, \ldots , C-1\rbrace\) :

\(\begin{equation} \mathcal {R}_q(\lbrace {\bf p}_0, {\bf p}_1, \ldots \rbrace) = \max _i \exp \left(-\frac{({\bf p}_i - {\bf q})^2}{2\sigma _i^2} \right), {\bf q} \in \mathbb {R}^2, {\bf p} \in \mathbb {R}^2. \end{equation}\)

(23)

q is a position on the image. To track and associate objects across frames, the network is also fed the previous frame \(I^{t-1}\) and prior detection heatmaps \(\mathcal {R}({\bf p}^{t-1})\) . The network then outputs the 2D offset of the object \({\bf d}^t\) , with associations performed using greedy matching. Thus, the network is trained via a weighted sum of the focal loss \(\mathcal {L}_k\) (based on ground-truth heatmap \(Y_{xyc}, Y \in [0,1]^{ \frac{W}{R} \times \frac{H}{R} \times C}\) ), the size \(\mathcal {L}_{\text{size}}\) (based on ground-truth bounding box dimensions \({\bf s}\) ), and the local location regression \(\mathcal {L}_{\text{off}}\) (based on ground-truth object positions \({\bf p}_i\) ):

\(\begin{equation} \mathcal {L}_k = \frac{1}{N}\sum _{xyc}{\left\lbrace \begin{array}{ll} (1-\hat{Y}_{xyc})^2 \log (\hat{Y}_{xyc}) \Leftrightarrow Y_{xyc} = 1\\ (1-Y_{xyc})^4(\hat{Y}_{xyc})^{2} \log (1-\hat{Y}_{xyc}) \end{array}\right.}, \end{equation}\)

(24)

\(\begin{equation} \mathcal {L}_{\text{size}} = \frac{1}{N}\sum _{i=1}^N|\hat{S}_{{\bf p}_i}- {\bf s}_i|, \end{equation}\)

(25)

\(\begin{equation} \mathcal {L}_{\text{off}} = \frac{1}{N}\sum _{i=1}^N \left| \hat{D}_{{\bf p}_i^t} - ({\bf p}_i^{t-1} - {\bf p}_i^t) \right|. \end{equation}\)

(26)

A filter is used to discard heatmaps below a certain rendering threshold \(\tau\) or objects whose detection confidence scores \(w, w \in [0,1]\) are below a certain threshold \(\theta\) . These thresholds form the optimization hyperparameters for the symbolic component (the filter). The error metric is the sum of the multi-object tracking accuracy (MOTA) and the minimal cost change from the predicted identification of objects to the correct identification (IDF1) [179].
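The Gaussian rendering function and the confidence filter can be sketched in a few lines of Python. The rendering is written with the usual negative exponent of a Gaussian kernel so that values stay in \([0,1]\); the function names, the dictionary layout of a detection, and the sample values are assumptions for illustration and are not taken from the tracker implementation.

```python
import math


def render(q, centers, sigmas):
    """Heatmap value at position q: max_i exp(-|p_i - q|^2 / (2 sigma_i^2))."""
    best = 0.0
    for (px, py), sigma in zip(centers, sigmas):
        dist2 = (px - q[0]) ** 2 + (py - q[1]) ** 2
        best = max(best, math.exp(-dist2 / (2.0 * sigma ** 2)))
    return best


def filter_detections(detections, theta):
    """Keep only detections whose confidence score w is at least theta."""
    return [d for d in detections if d["w"] >= theta]


# Hypothetical detections: object id and confidence score w in [0, 1].
detections = [{"id": 0, "w": 0.9}, {"id": 1, "w": 0.2}, {"id": 2, "w": 0.5}]
kept = filter_detections(detections, theta=0.4)
peak = render((3.0, 4.0), centers=[(3.0, 4.0)], sigmas=[2.0])
```

The same thresholding pattern applies to the rendering threshold \(\tau\); both thresholds are plain scalars, which is what makes them convenient hyperparameters for the black-box search.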

4.3 Neuro \(\cup\) Compile (Symbolic)

Problem Formulation. There are two ways to realize this paradigm. First, if the rules are non-differentiable, the rules are characteristic of certain architectural encodings post-training, or the rules cannot be explicitly expressed in the model learning algorithm, then the constraints can be expressed as regularizer terms in Equation (7):

\(\begin{equation} \begin{array}{l} \text{min}\hspace{2.84544pt}f_{\text{opt}}, \hspace{5.69046pt} f_{\text{opt}} = \lambda _1f_{\text{error}}(\mathbf {\Omega }^{\prime }) + \lambda _2f_{\text{flash}}(\mathbf {\Omega }^{\prime }) + \lambda _3f_{\text{SRAM}}(\mathbf {\Omega }^{\prime }) + \lambda _4f_{\text{latency}}(\mathbf {\Omega }^{\prime })\\ \qquad \qquad +\ \lambda _5f_{\text{rule 1}}(\mathbf {\Omega }^{\prime }) + \lambda _6f_{\text{rule 2}}(\mathbf {\Omega }^{\prime }) + \cdots \end{array} \end{equation}\)

(27)

\(\mathbf {\Omega }^{\prime }\) contains only the ML components (i.e., \(\mathbf {\Omega }^{\prime } = \lbrace \lbrace V,E\rbrace , \theta _m, m, w\rbrace\) ), reducing the neurosymbolic architecture search to a NAS problem, regularized by additional scalar rules. The rules can form soft constraints that do not form piecewise penalization functions, or hard constraints like SRAM and flash consumption to strongly penalize the search algorithm beyond a small, valid region of \(\mathbf {\Omega }^{\prime }\) . Second, if the rules are differentiable, or the rules can be compiled away during training as input-output pairs, then the constraints can be included as physics metadata channels in the learning algorithm as inputs to the model graph \(q\) :

\(\begin{equation} \text{min}\hspace{2.84544pt}f_{\text{opt}}, \hspace{5.69046pt} f_{\text{opt}} = \lambda _1f_{\text{error}}(\mathbf {\Omega }^{\prime }) + \lambda _2f_{\text{flash}}(\mathbf {\Omega }^{\prime }) + \lambda _3f_{\text{SRAM}}(\mathbf {\Omega }^{\prime }) + \lambda _4f_{\text{latency}}(\mathbf {\Omega }^{\prime }), \end{equation}\)

(28)

where

\(\begin{equation} f_{\text{error}}(\mathbf {\Omega }^{\prime }) = \mathcal {L}_{\text{validation}}({\bf Y}^{\prime }, {\bf Y}), \hspace{5.69046pt} {\bf Y}^{\prime } = q^{\mathbf {\Omega }^{\prime }}({\bf X}, {\bf x}_{\text{physics metadata channel}}). \end{equation}\)

(29)

Parsing. In the first case, the parsers only need to map the model from Python to C, following the recipe of model parsing in Section 4.1, Algorithm 2, and Figure 4. In the second case, since the rules and hyperparameters are static and operate on the input data, there is no concept of symbolic optimization or symbolic parsing. Rather, there exists a function called extract_physics() in main.cc that operates on the raw data to generate the physics metadata channel, shown in Algorithm 5. The channel is appended to the end of the raw data, which is then fed to the model as an input tensor.
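The rule-regularized objective of Equation (27) amounts to a weighted sum of scalar scores. The sketch below shows one way to assemble it; the function names, the default weights, and the piecewise form of the hard budget penalty are assumptions for illustration and do not reproduce the scoring functions defined earlier in the article.

```python
def hard_budget(usage, budget, penalty=1e3):
    """Hypothetical hard constraint: zero inside the budget, large outside."""
    return 0.0 if usage <= budget else penalty


def f_opt(error, flash, sram, latency, rules=(),
          weights=(1.0, 1.0, 1.0, 1.0), rule_weights=()):
    """Weighted sum of the base terms plus one regularizer per rule (Eq. 27)."""
    base = (error, flash, sram, latency)
    score = sum(lam * f for lam, f in zip(weights, base))
    score += sum(lam * f for lam, f in zip(rule_weights, rules))
    return score


# Candidate with 10% error that fits the (hypothetical) 64 kB SRAM budget,
# plus one rule score of 0.5 weighted by lambda_5 = 2.
score = f_opt(error=0.1, flash=0.0,
              sram=hard_budget(usage=48_000, budget=64_000),
              latency=0.0, rules=(0.5,), rule_weights=(2.0,))
```

Because every term is a scalar returned by a black-box evaluation, adding a new rule only extends the tuple of rule scores and weights; the search algorithm itself is unchanged.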

Examples. An example of the first technique includes finding adversarially robust TinyML models, where \(f_{\text{rule 1}}(\mathbf {\Omega }^{\prime })\) denotes the white-box adversarial robustness score from RobustBench [38] or AutoAttack [39] benchmarks on a perturbed validation set (e.g., perturbed using fast gradient sign method (FGSM) or projected gradient descent (PGD)) versus the clean validation set:

\(\begin{equation} f_{\text{rule 1}}(\mathbf {\Omega }^{\prime }) = 1-\frac{1}{N}\sum _{i=0}^Nq_i, \hspace{5.69046pt} q_i ={\left\lbrace \begin{array}{ll}1 \Leftrightarrow y^{\prime x_i} = y^{\prime x_i, \text{perturbed}} \\ 0 \end{array}\right.}\!\!\!\!\!\!\!, \end{equation}\)

(30)

where

\(\begin{equation} x_{i, \text{perturbed}} = \underbrace{\left[x_i + \varepsilon \cdot \text{sign} \left(\nabla _{x_i} \mathcal {L}_{\text{validation}}(q^{\mathbf {\Omega }^{\prime }}(x_i),y^i)\right)\right]}_{\text{FGSM}} \vee \underbrace{\left[ \text{clip}_{\varepsilon }\left(x_i^t + \alpha \cdot \text{sign} \left(\nabla _{x_i} \mathcal {L}_{\text{validation}}(q^{\mathbf {\Omega }^{\prime }}(x_i)^t,y^i)\right)\right) \right]}_{\text{PGD}}. \end{equation}\)

(31)

\(\alpha\) and \(\varepsilon\) are attack strength hyperparameters in Equation (31). An example of the second technique includes supplying a neural inertial navigation model with a local-variance step detector binary mask or mean Fourier transform coefficients of accelerometer readings \({}^{I}{}{\hat{{\bf a}}}\) , signifying transportation modes. The goal is to prevent the network from outputting invalid displacements when the object is static [134]:

\(\begin{equation} {\bf x}_{\text{physics metadata channel}} = c({}^{I}{}{\hat{{\bf a}}}), \hspace{5.69046pt} c_j({}^{I}{}{\hat{{\bf a}}}) = \underbrace{{\left\lbrace \begin{array}{ll} 1 \Leftrightarrow \hat{{\bf a}}^I_{L,\Delta t} \gt \zeta \cdot \sqrt {\frac{\sum _{k \in \Delta t}\left(\hat{{\bf a}}^I_{L,k} - \overline{\hat{{\bf a}}^I_{L,\Delta t}}\right)^2}{n}}\\ 0 \end{array}\right.}}_{\text{step detector}} \vee \underbrace{|\overline{\text{FFT}(|\hat{{\bf a}}^I_{\Delta t}|)|}|}_{_{\text{Fourier transform}}}, \end{equation}\)

(32)

where \(j\) is the measurement epoch, \(\Delta t\) is the length of the current time window, \(\hat{{\bf a}}^I_{L, \Delta t} = G_{5,f_c}(|\hat{{\bf a}}^I_{\Delta t}|) - G_{5,f_c}(\overline{|\hat{{\bf a}}^I_{\Delta t}|})\) , \(\zeta\) is a tunable parameter, and \(G_{5,f_c}(\cdot)\) represents a fifth-order low-pass filter with cutoff \(f_c\) . The model is expected to output zero displacements when the physics metadata channel value drops below a threshold \(\tau\) :

\(\begin{equation} \mathbb {E}(y_j^{\prime }) \rightarrow 0 \hspace{2.84544pt}| \hspace{2.84544pt}x_{j,\text{physics metadata channel}} \lt \tau . \end{equation}\)

(33)

We showcase the examples in Sections 5.5 and 5.6.
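The Fourier-transform variant of the physics metadata channel in Equation (32) can be sketched as follows. The code uses a naive \(O(n^2)\) discrete Fourier transform from the standard library to stay self-contained; an on-device implementation would use an optimized FFT. The function names, window contents, and threshold value are assumptions for illustration.

```python
import cmath
import math


def mean_fft_magnitude(accel_window):
    """Mean DFT magnitude of |a| over a window of 3-axis accelerometer samples."""
    mags = [math.sqrt(ax * ax + ay * ay + az * az) for ax, ay, az in accel_window]
    n = len(mags)
    spectrum = [
        sum(mags[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return abs(sum(abs(c) for c in spectrum) / n)


def expect_zero_displacement(channel_value, tau):
    """Eq. (33): the output should tend to zero when the channel is below tau."""
    return channel_value < tau


# Hypothetical window: eight identical samples of unit magnitude.
window = [(0.0, 0.0, 1.0)] * 8
channel = mean_fft_magnitude(window)
```

The resulting scalar is appended to the raw sensor window before inference, so the model sees the rule as an extra input channel rather than as a constraint on its output.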

4.4 Symbolic[Neuro]

Problem Formulation. Consider a dynamical system such that \(g : \hat{{\bf x}}_{k+1|k} \rightarrow {\bf u}_{k+1}\) , \(\hat{{\bf x}}_{k} \hspace{5.69046pt}|\hspace{5.69046pt} g\) \(\text{is non-linear}\) . \(\hat{{\bf x}}_{k+1|k}\) represents the state at epoch \(k+1\) , \(\hat{{\bf x}}_{k}\) represents the state at epoch \(k\) , \(g(\cdot)\) is a neural network backbone, and \({\bf u}_{k+1}\) represents the control input (sensor measurements) at epoch \(k+1\) . The neural system evolution is given as follows:

\(\begin{equation} \hat{{\bf x}}_{k+1|k} = g_v(\hat{{\bf x}}_{k}, {\bf u}_{k+1}, {\bf w}_{k+1}). \end{equation}\)

(34)

\({\bf w}_{k+1}\) is the additive white Gaussian process noise with covariance Q. Now, consider measurement updates \({\bf z}_{k+1}\) coming from a symbolic observation model \(h(\cdot)\) via complementary sensor measurements:

\(\begin{equation} \hat{{\bf x}}_{k+1|k+1} = \hat{{\bf x}}_{k+1|k} + {\bf K}_{k+1}\left(\underbrace{{\bf z}_{k+1} - h_{u}(\hat{{\bf x}}_{k+1|k}, {\bf v}_k)}_{\text{measurement residual}}\right). \end{equation}\)

(35)

\({\bf v}_{k}\) is the additive white Gaussian measurement noise with covariance R, and \({\bf K}_{k+1}\) is a gain factor. The goal is to optimally fuse the neural system model and the symbolic measurement model. Assuming the Markov property, modeling the uncertainty in \(g(\cdot)\) and \(h(\cdot)\) using Kalman filter theory allows optimal fusion [50]:

\(\begin{equation} {\bf P}_{k+1|k} = {\bf A}_{k+1}{\bf P}_k{\bf A}_{k+1}^T + {\bf B}_{k+1}{\bf U}_{k}{\bf B}_{k+1}^T + {\bf Q}_{k}, \hspace{5.69046pt}{\bf A}_{k+1} = \frac{\partial g}{\partial x}\bigg |_{\hat{{\bf x}}_{k},{\bf u}_{k+1},{\bf w}_{k+1}}, \hspace{5.69046pt}{\bf B}_{k+1} = \frac{\partial g}{\partial u}\bigg |_{\hat{{\bf x}}_{k},{\bf u}_{k+1},{\bf w}_{k+1}}, \end{equation}\)

(36)

\(\begin{equation} {\bf P}_{k+1|k+1} = \left({\bf I} - {\bf K}_{k+1} {\bf H}_{k+1}\right){\bf P}_{k+1|k}, \hspace{5.69046pt} {\bf H}_{k+1} = \frac{\partial h}{\partial x}\bigg |_{\hat{{\bf x}}_{k+1|k}, {\bf v}_k}, \end{equation}\)

(37)

where

\(\begin{equation} {\bf K}_{k+1} = {\bf P}_{k+1|k}{\bf H}_{k+1}^T\left(\underbrace{{\bf H}_{k+1}{\bf P}_{k+1|k}{\bf H}_{k+1}^T + {\bf R}_{k+1}}_{\text{innovation covariance}}\right)^{-1}. \end{equation}\)

(38)

\({\bf A}_{k+1}\) and \({\bf B}_{k+1}\) represent the linearized Jacobian of the neural network w.r.t. the past state and control inputs, while \({\bf H}_{k+1}\) represents the linearized partial derivative of the observation model w.r.t. the past state. The predicted process covariance \(\hat{{\bf P}}\) is given by the Lyapunov equation and updated during measurements using algebraic Riccati recursion [151]. The goal of the search algorithm is to find the optimal hyperparameters of \(g(\cdot)\) and \(h(\cdot)\) , given by hyperparameter vectors \({\bf v}\) and u, respectively:

\(\begin{equation} {\bf u} = \begin{bmatrix} [\alpha _1^{1,1}, \alpha _2^{1,1}, \ldots , \alpha ^{1,1}_{\gamma ^1_1}] & [\alpha _1^{1,2}, \alpha _2^{1,2}, \ldots , \alpha ^{1,2}_{\gamma ^1_2}] & ... & [\alpha _1^{1,e}, \alpha _2^{1,e}, \ldots , \alpha ^{1,e}_{\gamma ^1_e}] \\ \end{bmatrix}, \end{equation}\)

(39)

\(\begin{equation} {\bf v} = \begin{bmatrix} [\beta _1^{1,1}, \beta _2^{1,1}, \ldots , \beta ^{1,1}_{\gamma ^1_1}] & [\beta _1^{1,2}, \beta _2^{1,2}, \ldots , \beta ^{1,2}_{\gamma ^1_2}] & ... & [\beta _1^{1,f}, \beta _2^{1,f}, \ldots , \beta ^{1,f}_{\gamma ^1_f}] \\ \end{bmatrix}. \end{equation}\)

(40)

Parsing. The model parsing follows the same recipe shown in Section 4.1, Algorithm 2, and Figure 4. The symbolic parser sends the optimal \({\bf u}^*\) to main.cc. Algorithm 6 shows an example of the main.cc. The program extensively uses matrix operations (obtainable through the CMSIS-NN library [95] available through TFLM) to compute the Kalman hyperparameters. CMSIS-NN matrix operation constructs are used in the reshape_jacobian(), lyapunov_eq(), measurement_update(), get_pd(), compute_kalman_gain(), and ricatti() functions to accelerate matrix operations through vector processors found in some Cortex-M microcontrollers. However, a key challenge in realizing the Symbolic[Neuro] form is the lack of on-board Jacobian computation support (GetJacobian()).
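The predict and update recursion of Equations (34) to (38) is easiest to follow in the scalar case, where every matrix reduces to a number. The sketch below is such a simplification and is not Algorithm 6. Because on-board Jacobian support is the stated bottleneck, the derivatives here are approximated with central finite differences; that substitution, the stand-in functions `g` and `h`, and all numeric values are assumptions for illustration.

```python
def numerical_derivative(fn, x, eps=1e-4):
    """Central finite difference, used here in place of an analytic Jacobian."""
    return (fn(x + eps) - fn(x - eps)) / (2.0 * eps)


def predict(x, P, u, g, U, Q):
    """Propagate state and covariance (Eqs. 34 and 36), scalar case."""
    A = numerical_derivative(lambda s: g(s, u), x)
    B = numerical_derivative(lambda c: g(x, c), u)
    x_pred = g(x, u)
    P_pred = A * P * A + B * U * B + Q
    return x_pred, P_pred


def update(x_pred, P_pred, z, h, R):
    """Measurement update (Eqs. 35, 37, and 38), scalar case."""
    H = numerical_derivative(h, x_pred)
    K = P_pred * H / (H * P_pred * H + R)   # Kalman gain
    x_new = x_pred + K * (z - h(x_pred))    # correct with the residual
    P_new = (1.0 - K * H) * P_pred
    return x_new, P_new


def g(x, u):
    """Stand-in for the neural backbone: a simple additive model."""
    return x + 0.1 * u


def h(x):
    """Stand-in observation model: the state is observed directly."""
    return x


x_pred, P_pred = predict(x=0.0, P=1.0, u=1.0, g=g, U=0.0, Q=0.0)
x_new, P_new = update(x_pred, P_pred, z=2.1, h=h, R=1.0)
```

With equal prior and measurement variances the gain is one half, so the corrected state lands midway between the prediction and the measurement, and the covariance halves.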

Examples. We showcase a Neural-Kalman filter that fuses GPS measurements with a neural inertial odometry model to regress an object’s position [50]. The example is shown in Section 5.7. The neural network regresses the object’s 2D velocity \(v_x, v_y\) from accelerometer \(\hat{{\bf a}}^I\) , gyroscope \(\hat{{\bf w}}^I\) , and magnetometer \(\hat{{\bf m}}^I\) readings:

\(\begin{equation} (v_{x,k}, v_{y,k}) = g\big ({\bf v}^I(0), {\bf g}_0^I, {\bf N}_0^I, \hat{{\bf a}}^I_{q:q+n},\hat{{\bf w}}^I_{q:q+n}, \hat{{\bf m}}^I_{q:q+n}, c_k({}^{I}{}{\hat{{\bf a}}})\big), \hspace{5.69046pt} c_k({}^{I}{}{\hat{{\bf a}}}) = \left| \overline{| \text{FFT}(|\hat{{\bf a}}^I_{q:q+n}|)|} \right|. \end{equation}\)

(41)

The system propagation is given as follows:

\(\begin{equation} \hat{{\bf x}}_{k+1|k} = {\bf A}\hat{{\bf x}}_{k} + f({\bf u}_{k+1}), \end{equation}\)

(42)

\(\begin{equation*} {\bf P}_{k+1|k} = {\bf A}{\bf P}_k{\bf A}^T + {\bf B}_{k+1}{\bf U}_{k}{\bf B}_{k+1}^T, \hspace{5.69046pt}{\bf B}_{k+1} = \frac{\partial f}{\partial u}\bigg |_{\hat{{\bf x}}_{k},{\bf u}_{k+1}}, \end{equation*}\)

where

\(\begin{equation} \hat{{\bf x}} = \begin{bmatrix}\hat{L}_x \\ \hat{L}_y\\ v_x\\ v_y \end{bmatrix}, \hspace{5.69046pt} {\bf u} = \begin{bmatrix}{\bf a}^I_{q:q+n} \\ {\bf w}^I_{q:q+n} \\ {\bf m}^I_{q:q+n} \\ c({\bf a}^I_{q:q+n}) \end{bmatrix}, \hspace{5.69046pt} {\bf A} = \begin{bmatrix}{\bf I}_{2 \times 2} & {\bf 0}_{2 \times 2} \\ {\bf 0}_{2 \times 2} & {\bf 0}_{2 \times 2} \\ \end{bmatrix}, \hspace{5.69046pt} {\bf B}_{k+1} = \begin{bmatrix}\frac{\Delta t\partial g_{v}(\cdot)_x}{\partial {\bf a}^I_{q:q+n}} & \frac{\Delta t\partial g_{v}(\cdot)_x}{\partial {\bf w}^I_{q:q+n}} & \frac{\Delta t\partial g_{v}(\cdot)_x}{\partial {\bf m}^I_{q:q+n}} & \frac{\Delta t\partial g_{v}(\cdot)_x}{\partial c({\bf a}^I_{q:q+n})} \\ \frac{\Delta t\partial g_{v}(\cdot)_y}{\partial {\bf a}^I_{q:q+n}} & \frac{\Delta t\partial g_{v}(\cdot)_y}{\partial {\bf w}^I_{q:q+n}} & \frac{\Delta t\partial g_{v}(\cdot)_y}{\partial {\bf m}^I_{q:q+n}} & \frac{\Delta t\partial g_{v}(\cdot)_y}{\partial c({\bf a}^I_{q:q+n})} \\ \frac{\partial g_{v}(\cdot)_x}{\partial {\bf a}^I_{q:q+n}} & \frac{\partial g_{v}(\cdot)_x}{\partial {\bf w}^I_{q:q+n}} & \frac{\partial g_{v}(\cdot)_x}{\partial {\bf m}^I_{q:q+n}} & \frac{\partial g_{v}(\cdot)_x}{\partial c({\bf a}^I_{q:q+n})} \\ \frac{\partial g_{v}(\cdot)_y}{\partial {\bf a}^I_{q:q+n}} & \frac{\partial g_{v}(\cdot)_y}{\partial {\bf w}^I_{q:q+n}} & \frac{\partial g_{v}(\cdot)_y}{\partial {\bf m}^I_{q:q+n}} & \frac{\partial g_{v}(\cdot)_y}{\partial c({\bf a}^I_{q:q+n})} \\ \end{bmatrix}, \end{equation}\)

(43)

\(\begin{equation} f(\cdot) = \begin{bmatrix} \Delta t\cdot {\bf I}_{2 \times 2}\\ {\bf I}_{2 \times 2} \end{bmatrix}\cdot g_{v}(\cdot), \hspace{5.69046pt}\Delta t = \frac{s}{n-s}, \hspace{5.69046pt} s = \text{stride}, n = \text{window size.} \end{equation}\)

(44)

\({\bf U}\) consists of Allan variance parameters [52] of the inertial measurement unit. The measurement updates \({\bf z}\) come from the GPS module. \(h\) denotes the inverse mapping from longitude-latitude to 2D Cartesian coordinates. The hyperparameters of the neural network and the Kalman filter are optimized jointly.

4.5 Neuro[Symbolic]

Problem Formulation and Parsing. This paradigm is equivalent to a model with special operators or layers. The search space, therefore, contains the hyperparameters of the model backbone to be optimized. The model parsing follows the same recipe shown in Section 4.1, Algorithm 2, and Figure 4, with no symbolic parsing. However, the special layers must be added as custom operators first to TFLite, and then to TFLM. The steps are as follows:

•

Create the custom operator in TensorFlow.

•

Clone the TensorFlow repository.

•

Define the init(), free(), prepare(), and eval() functions for the operator in the OPERATOR_NAME.cc file in tensorflow/lite/kernels/ directory.

•

Register the operator in tensorflow/lite/kernels/register.cc and register_ref.cc. Add the registration under namespace custom and BuiltinRefOpResolver::BuiltinRefOpResolver(). In the BUILD file, under cc_library(name = "builtin_op_kernels", add the operator .cc file names under srcs. Add the dependencies under deps.

•

Configure, build, and install the modified TensorFlow. Load the model with the custom operator in the TFLite interpreter in Python to verify the correct operation.

•

From tensorflow/lite/core/api/flatbuffer_conversions.cc, under ParseOpDataTfLite, extract the code for parsing the operator into a function.

•

Extract the reference for the operator to a standalone header from tensorflow/lite/kernels/internal/reference/. Add the new header to tensorflow/lite/kernels/internal/BUILD.

•

Copy the operator code from tensorflow/lite/kernels/OPERATOR_NAME.cc to tensorflow/lite/micro/kernels/OPERATOR_NAME.cc. Remove TFLite-specific code. Add the operator registrations in micro_ops.h, micro_mutable_op_resolver.h, and all_op_resolver.cc.

5 Evaluation

In this section, we evaluate the performance of TinyNS on six different case studies resembling four neurosymbolic architecture search recipes (Sections 5.2 to 5.7). We also validate the viability of TinyNS for generating performant microcontroller-class models on the industry-standard MLPerf Tiny v0.5 Inference Benchmark [13] in Section 5.1.

5.1 MLPerf Tiny v0.5 Inference Benchmark

The MLPerf Tiny v0.5 Benchmark Suite contains four classification tasks and quality target metrics representing a wide array of TinyML applications [13, 136]. The tasks include image classification (CIFAR10 dataset [91]), unsupervised anomaly detection (ToyADMOS dataset [88]), keyword spotting (Google Speech Commands dataset [170]), and visual wake words detection (Visual Wake Words dataset [32]). We benchmark TinyNS on the first three tasks.

5.1.1 Dataset Splits and Pre-processing.

We use the standard dataset splits and pre-processing functions provided by the benchmark suite. For CIFAR10, 50,000 32 \(\times\) 32 \(\times\) 3 images are used for training, and 10,000 images are used for testing. The dataset has 10 output classes. For ToyADMOS, 3,600 and 400 non-anomalous sound samples from four toy cars mixed with ambient noise are used for training and validation, respectively, and 2,500 anomalous and non-anomalous sound samples from the same four toy cars are used for testing. The pre-processor extracts the Mel-scaled power spectrogram from the raw WAVE files using 128 Mel bands, five frames, an FFT window length of 1,024, and a hop length of 512. The spectrogram is converted to log Mel energy, clipped to keep the central portion, and concatenated with other frames to generate features. Each input tensor is a vector of length 640. For Google Speech Commands, the 100,503 1-s keywords from 2,618 speakers are divided into 85,511, 10,102, and 4,890 utterances for training, validation, and testing, respectively. The dataset has 12 output classes. The pre-processor extracts the log Mel-frequency cepstral coefficient (MFCC) fingerprints from the raw 16 kHz WAVE files after decoding, volume scaling, random time-shifting (100 ms), and adding background noise to the raw audio data. The window size is 30 ms and the stride is 20 ms. Ten MFCC coefficients are used, resulting in each model input being a 49 \(\times\) 10 \(\times\) 1 tensor.
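The input tensor sizes quoted above follow directly from the framing parameters, as the short check below shows. It assumes that frames must lie entirely inside the clip (no padding); the function and variable names are illustrative.

```python
def num_frames(duration_ms, window_ms, stride_ms):
    """Number of full windows that fit in a clip, without padding."""
    return (duration_ms - window_ms) // stride_ms + 1


# Keyword spotting: 1-s clips, 30 ms windows, 20 ms stride, 10 MFCCs per frame.
kws_frames = num_frames(1000, 30, 20)
kws_input_shape = (kws_frames, 10, 1)

# Anomaly detection: 128 Mel bands concatenated over five frames.
ad_input_length = 128 * 5
```

Under this assumption the keyword-spotting input has 49 frames, matching the 49 \(\times\) 10 \(\times\) 1 tensor, and the anomaly-detection input has 640 elements.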

5.1.2 Model Backbones, Training Details, and Search Space Definition.

For image recognition, we optimize the ResNet [77] backbone provided in the benchmark suite. Following the settings in the MLPerf Tiny v0.5 Benchmark [13] and state-of-the-art NAS frameworks for microcontrollers [14, 54, 60, 101, 103, 124], we train each candidate model for a fixed 500 epochs. While green AI advocates treating the number of training epochs as a hyperparameter to be optimized [145], the additional hyperparameter may lengthen NAS convergence time because more candidate models must be trained to reach acceptable accuracy, which offsets most of the reduction in the total number of training epochs. In addition, TinyML neural architectures are either well-known (e.g., ResNet [77], MobileNets [79], or SqueezeNet [81]) or compact (e.g., FastGRNN [94], Bonsai [93], ProtoNN [74], or temporal CNN [97]), allowing the use of known and fixed training epochs or a small number of training epochs to achieve acceptable performance [136]. We use the Adam optimizer with a learning rate scheduler having an initial learning rate of 0.001 and decaying by a factor of 0.99 with each passing epoch. The batch size is 32, the loss is categorical cross-entropy, and the NAS error metric is training accuracy. The optimization hyperparameters include:

•

Number of convolutional stacks: range (1, 5)

•

Kernel size: [1, 3, 5, 7]

•

Number of filters (initial layer): [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24]

•

Use batch normalization: [True, False]

•

Use activations: [True, False]
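The ResNet search space listed above can be written down as a plain dictionary to get a feel for its size. This is a minimal sketch, not the TinyNS code: it assumes that "range (1, 5)" follows Python's half-open convention (values 1 to 4), and the key names and the `sample` helper are ours. It draws candidates uniformly at random, whereas TinyNS uses a Bayesian optimizer.

```python
import math
import random

# Search space for the ResNet backbone as listed above.
# Assumption: "range (1, 5)" is half-open, as in Python (1, 2, 3, 4).
resnet_space = {
    "num_stacks": list(range(1, 5)),
    "kernel_size": [1, 3, 5, 7],
    "num_filters": list(range(2, 25, 2)),  # 2, 4, ..., 24
    "use_batch_norm": [True, False],
    "use_activations": [True, False],
}


def sample(space, rng):
    """Draw one candidate configuration uniformly at random."""
    return {name: rng.choice(choices) for name, choices in space.items()}


# Size of the full grid: the product of the number of choices per hyperparameter.
num_candidates = math.prod(len(choices) for choices in resnet_space.values())
candidate = sample(resnet_space, random.Random(0))
print(num_candidates, candidate)
```

Under these assumptions the grid holds 768 configurations. At 500 training epochs per candidate, exhaustive evaluation would be expensive, which is the motivation for a sample-efficient optimizer.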

For anomaly detection, we optimize a temporal convolutional autoencoder (denoted as 1D-CNN in the rest of the article) backbone inspired by Thill et al. [159]. The encoder is a TCN [97, 162] without dilated kernels, followed by a 1D convolutional layer (linear activation) with a quarter of the number of filters and one-third of the kernel size of the TCN layer. The decoder includes the same layers but in reverse, followed by a fully connected layer with 640 units and linear activation. Each candidate model is trained for 350 epochs, using the AMSGrad variant of the Adam optimizer with a learning rate of 0.001, \(\beta _1\) of 0.9, \(\beta _2\) of 0.999, and \(\epsilon\) of 1e-8. The batch size is 1,024, the loss is the mean-squared error, and the NAS error metric is validation loss. The search space is as follows:

•

Number of layers per stack: range (3, 8)

•

Number of TCN stacks: [1, 2, 3]

•

Number of filters in the TCN layers: range (3, 64)

•

Kernel size in the TCN layers: range (3, 16)

•

Skip connections in TCN: [True, False]

For keyword spotting, we optimize a TCN, which can handle spatial and temporal features hierarchically without the explosion of hyperparameter count [97, 162]. The TCN layer is followed by a dense layer with 12 units and softmax activation. Each candidate model is trained for 60 epochs, using the Adam optimizer with a step function learning rate scheduler. The batch size is 1,000, the loss is sparse categorical cross-entropy, and the NAS error metric is sparse categorical accuracy. The search space is as follows:

•

Number of layers per stack: range (3, 8)

•

Number of TCN stacks: [1, 2, 3]

•

Number of filters in the TCN layers: range (2, 64)

•

Kernel size in the TCN layers: range (2, 16)

•

Skip connections in TCN: [True, False]

•

Dilation factor choices: [1, 2, 4, 8, 16, 32, 64, 128, 256]

5.1.3 Overall Performance.

Figures 6 (Left) and 7 showcase the Pareto-optimal frontier generated by TinyNS versus competing frontiers and microcontroller models. TinyNS exceeds the benchmark accuracy by 4.3% and 5.5% for image recognition and anomaly detection, respectively, while consuming 1.14 \(\times\) –3.09 \(\times\) less flash. For image recognition, TinyNS outperforms models generated by SpArSe [59] and \(\mu\) NAS [101] by 4.5%–17.5% while converging 1.7 \(\times\) –7.7 \(\times\) faster (shown in Figure 6 (Right)). Compared to LEMONADE [53], TinyNS provides 2.2 \(\times\) smaller models at the cost of 1.3% accuracy loss. TinyNS converges faster than gradient-based or evolutionary NAS due to two key properties. First, TinyNS can eliminate infeasible candidate models in the search space without training, thanks to accurate hardware profiling using real microcontrollers during the search process. Proxies are unable to take into account the compiler runtime optimizations and the dynamic overhead from the RTOS, data stacks, and model interpreters. For all three tasks, the models generated by proxied TinyNS not only have sub-optimal accuracy (1.6%–5.5% lower) and flash usage (4.2 \(\times\) higher) compared to proxy-less TinyNS but also take longer to converge (2.3 \(\times\) higher convergence time). Second, the exploration-exploitation philosophy of the acquisition function, coupled with parallel search capabilities and the computationally tractable sampling-based approach, allows TinyNS to approach the global optimum without requiring evaluation of thousands of candidate architectures. Each model in the Pareto-frontier is generated within 10–50 iterations. For anomaly detection, TinyNS outperforms attention-based OutlierNets [2] by 6.3% and guarantees deployability over MobileNetv2 [140], but underperforms MicroNets [14] models. We hypothesize that flattening the log MFCC in the 1D-CNN backbone loses spatial correlation across the feature coefficients.
This phenomenon also generates sub-optimal TinyNS models for keyword spotting, failing to cross the benchmark accuracy of 90% as shown in Figure 7 (Right). This showcases the importance of performing NAS not just over a single model backbone, but over multiple model backbones. In Section 5.3, we showcase how TinyNS operating on a search space with multiple models can generate models with the lowest flash usage and highest accuracy. Regardless, given an ideal model backbone, TinyNS can generate models with the highest accuracy and guaranteed deployability within a few evaluations without requiring expensive training infrastructure.
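The first property, rejecting infeasible candidates before any training happens, can be illustrated with a short sketch. It is not the article's score function: `evaluate`, `fake_profile`, and `fake_train` are hypothetical names, and the stub profiler merely stands in for compiling a candidate and measuring it on the target board.

```python
def evaluate(config, profile, train, sram_limit_kb, flash_limit_kb):
    """Score one candidate; infeasible candidates are rejected before training.

    `profile` and `train` are hypothetical callables supplied by the caller:
    profile(config) -> (sram_kb, flash_kb) measured for the target board,
    train(config)   -> accuracy in [0, 1].
    """
    sram_kb, flash_kb = profile(config)
    if sram_kb > sram_limit_kb or flash_kb > flash_limit_kb:
        return float("-inf")  # does not fit: skip training entirely
    return train(config)


trained = []  # records which candidates actually consumed training time


def fake_profile(config):
    # Stand-in for compiling and measuring on real hardware.
    return 4.0 * config["num_filters"], 20.0 * config["num_filters"]


def fake_train(config):
    trained.append(config["num_filters"])
    return 0.5 + 0.01 * config["num_filters"]


# Limits of the F446RE board from Table 2: 128 kB SRAM, 512 kB flash.
scores = [
    evaluate({"num_filters": n}, fake_profile, fake_train, 128, 512)
    for n in (8, 24, 40)
]
print(scores, trained)
```

In this toy run the 40-filter candidate exceeds the SRAM limit, so it is scored without being trained. The more accurate the profiler, the fewer training runs are wasted on models that cannot be deployed.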

Fig. 6.

Fig. 7.

5.1.4 Architectural Adaptation Based on Resource Availability.

Tables 2–4 show the hyperparameters of the model backbones for the three tasks generated by TinyNS for four different STM32 microcontrollers with varying SRAM and flash limits. In general, as the device capabilities increase, TinyNS generates models that have higher FLOPS and higher SRAM and flash usage. Instead of providing the smallest model with the highest accuracy, TinyNS adapts hyperparameters such as the number of kernels, size of kernels, and the number of convolutional stacks with increasing device capabilities to maximize accuracy. Figures 8 and 9 show visual examples of such architectural adaptation for three of the four microcontrollers. As the SRAM and flash capacity increases, TinyNS automatically adjusts the number of layers per stack, the number of stacks, the kernel size, and the number of filters depending on whether SRAM or flash has increased. For example, a model with more parameters but a smaller kernel size and filter count is likely to benefit from an increase in flash but no change in SRAM. Likewise, when dilated convolutions are used, TinyNS assigns a small dilation factor to earlier layers and a large dilation factor to later layers when it cannot increase the number of layers due to resource limits. This allows a TCN with a limited layer count to have the same receptive field (albeit less fine-grained) as a TCN with more layers, capturing both short-term local context and long-term global time-series inter-dependencies. Tables 2–4 further showcase the problem with proxies as opposed to real-hardware profiling. The proxy-profiled models have a higher number of parameters but a lower number of filters and kernel size than proxy-less models. Since proxies are unable to take into account compiler optimizations, the generated models underestimate the available SRAM and overestimate the flash usage, yielding models with poor accuracy.
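The receptive-field argument can be made concrete with a small calculation. The sketch below assumes one dilated causal convolution per layer, so that a layer with dilation d widens the field by (kernel size - 1) × d; real TCN implementations may place more than one convolution in each residual block, which scales the result. The function name is ours.

```python
def receptive_field(kernel_size, dilations, num_stacks=1):
    """Receptive field (in time steps) of stacked dilated causal convolutions.

    Assumes one convolution per layer; each layer with dilation d widens the
    field by (kernel_size - 1) * d.
    """
    return 1 + num_stacks * (kernel_size - 1) * sum(dilations)


# Four layers with sparse, fast-growing dilations versus eight densely dilated layers.
shallow = receptive_field(3, [1, 8, 64, 128])
deep = receptive_field(3, [1, 2, 4, 8, 16, 32, 64, 128])
print(shallow, deep)
```

With half as many layers, the sparsely dilated stack still covers most of the span of the eight-layer stack (403 versus 511 time steps here), at the cost of skipping intermediate time scales.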

Table 2.

Device	Profiling	SRAM Usage (kB)	Latency (s) or FLOPS	Number of filters	Kernel size	Number of stacks	Batch normalization	Activations
F446RE (128, 512)	Real	107	0.58 (L)	10	5	4	True	True
F446RE (128, 512)	Proxy	95.8	12.9M (F)	4	7	4	True	True
L476RG (128, 1,024)	Real	87.8	3.13 (L)	24	5	2	True	True
L476RG (128, 1,024)	Proxy	56.5	3.82M (F)	6	3	3	True	True
F746ZG (320, 1,024)	Real	308	1.39 (L)	22	7	2	True	True
F746ZG (320, 1,024)	Proxy	286	55.9M (F)	24	3	3	True	True
L4R5ZI_P (640, 2,048)	Real	608	1.13 (L)	20	3	4	True	True
L4R5ZI_P (640, 2,048)	Proxy	309	40.9M (F)	18	3	4	False	True

Table 2. Chosen ResNet Model Hyperparameters for Each Target Hardware by TinyNS on the CIFAR10 Dataset

The SRAM and flash limits of the hardware are given in parentheses in kB in the form (SRAM, Flash).

Table 3.

Device	Profiling	SRAM Usage (kB)	Latency (s) or FLOPS	No. of filters	Kernel size	No. of layers per stack	No. of stacks	Skip connections
F446RE (128, 512)	Real	87.8	0.01 (L)	50	3	5	1	True
F446RE (128, 512)	Proxy	81.3	0.32M (F)	16	10	4	1	True
L476RG (128, 1,024)	Real	88.2	0.06 (L)	38	10	6	1	True
L476RG (128, 1,024)	Proxy	62.0	0.24M (F)	26	3	5	1	True
F746ZG (320, 1,024)	Real	288	0.01 (L)	42	4	4	3	True
F746ZG (320, 1,024)	Proxy	78.1	0.31M (F)	30	4	3	1	True
L4R5ZI_P (640, 2,048)	Real	608	0.03 (L)	63	3	5	1	True
L4R5ZI_P (640, 2,048)	Proxy	444	1.77M (F)	57	6	4	2	True

Table 3. Chosen 1D-CNN Model Hyperparameters for Each Target Hardware by TinyNS on the ToyADMOS Dataset

The SRAM and flash limits of the hardware are given in parentheses in kB in the form (SRAM, Flash).

Table 4.

Device	Profiling	SRAM Usage (kB)	Latency (s) or FLOPS	No. of filters	Kernel size	Dilations, no. of layers per stack	No. of stacks	Skip connections
F446RE (128, 512)	Real	106	0.31 (L)	51	9	[1,8,64,128], 4	2	True
F446RE (128, 512)	Proxy	77.8	21.6M (F)	27	9	[1,2,16,32,64,128], 6	2	True
L476RG (128, 1024)	Real	95.4	0.65 (L)	44	7	[1,2,4,8,16,128], 6	2	True
L476RG (128, 1024)	Proxy	79.4	22.0M (F)	30	9	[1,2,8,16,128], 5	2	True
F746ZG (320, 1024)	Real	286	0.04 (L)	45	4	[1,4,16,64,128], 5	1	True
F746ZG (320, 1024)	Proxy	147	32.4M (F)	56	4	[1,4,8,64], 4	3	True
L4R5ZI_P (640, 2048)	Real	606	1.66 (L)	63	8	[1,4,8,16,32,64,128,256], 8	3	True
L4R5ZI_P (640, 2048)	Proxy	210	68.2M (F)	55	8	[1,16,128], 3	3	True

Table 4. Chosen TCN Model Hyperparameters for Each Target Hardware by TinyNS on the Google Speech Commands Dataset

The SRAM and flash limits of the hardware are given in parentheses in kB in the form (SRAM, Flash).

Fig. 8.

Fig. 9.

5.1.5 Convergence Time of Proxyless versus Proxied TinyNS.

Figure 10 shows the number of iterations needed to reach the best optimization score for proxy-less and proxied TinyNS for all three tasks. Mango allows both random initialization and an initial set of evaluation points to warm up the optimizer. The user can either customize the initial evaluation points to guide the optimization process or choose random sampling to mitigate randomness effects [138]. We report the average of three independent runs for each algorithm to account for the effect of randomness. For both profiling techniques, tighter hardware constraints (lower SRAM and flash capacities) require more iterations for convergence. However, proxy-less TinyNS converges 3.2 \(\times\) –12.6 \(\times\) faster to the highest performing model compared to proxied TinyNS. Intuitively, platform-in-the-loop should be slow while analytical proxies should be fast, as real measurements have compilation time and profiling time overhead and are not immediate. However, since proxies are inaccurate and do not reflect the execution-level dynamics, more infeasible model candidates are trained rather than discarded, wasting valuable computing time and increasing the search completion time. In our evaluation, we found the platform-in-the-loop approach to be 50% faster than using proxies for hardware profiling. Even though proxied TinyNS achieves a higher score than proxy-less TinyNS, the deployability of models generated by proxied TinyNS is not guaranteed due to high flash consumption. Further, we have seen earlier that these models do not fully exploit the SRAM capabilities and have lower accuracy than proxy-less models. The increased score achieved by proxied TinyNS comes from model candidates with a high flash footprint.

Fig. 10.

5.2 Optimization of Features and Neural Weights (Symbolic Neuro Symbolic)

In this case study, we showcase how TinyNS provides the best combination of features and neural network hyperparameters for various target hardware.

5.2.1 Dataset and Task Description.

We use the UCI-HAR dataset [6] for this case study. The task is to classify 6 human activities (walking, walking upstairs, walking downstairs, sitting, laying, and standing) from a single waist-mounted x-axis accelerometer data sampled at 50 Hz from 30 volunteers. The dataset is split with leave-7 out, i.e., data from 21 volunteers are in the training set, and data from the remaining 7 volunteers are in the test set. As suggested by the dataset authors, we use a window size of 128 (2.56 s) with a stride of 64. 10% of the training data is used for validation.
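The windowing described above can be sketched in a few lines. The helper `window_starts` and the 500-sample recording are illustrative assumptions; only the window size, stride, and sampling rate come from the text.

```python
def window_starts(num_samples, window=128, stride=64):
    """Start indices of all full windows over a recording."""
    return list(range(0, num_samples - window + 1, stride))


window_seconds = 128 / 50      # duration of one window at 50 Hz
starts = window_starts(500)    # hypothetical 10-s recording (500 samples)
print(window_seconds, starts)
```

A stride of 64 gives 50% overlap between consecutive windows, so a 10-s recording yields six full windows of 2.56 s each.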

5.2.2 Model Backbones, Training Details, and Search Space Definition.

The model backbone consists of a TCN. The TCN layer is followed by a dense layer with 6 units and softmax activation. Each candidate model is trained for 150 epochs, using the Adam optimizer with default parameters. The loss is categorical cross-entropy, and the NAS error metric is validation accuracy. The search space for the model is as follows:

•

Number of layers per stack: range (3, 8)

•

Number of TCN stacks: [1, 2, 3]

•

Number of filters in the TCN layers: range (3, 64)

•

Kernel size in the TCN layers: range (3, 16)

•

Skip connections in TCN: [True, False]

•

Dilation factor choices: [1, 2, 4, 8, 16, 32, 64, 128]

The feature space consists of 12 features listed in Table 5. There are six statistical features, three temporal features, and three spectral features to choose from. The search space for the features is defined using the binary mask technique shown in Section 4.1.
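A binary mask over the feature list can be sketched as follows. This is an illustration of the general idea rather than the implementation from Section 4.1: only five of the 12 features are shown, and the names `FEATURES` and `extract` are ours.

```python
import statistics

# A few of the candidate features; the full set has 12 entries (Table 5).
FEATURES = {
    "mean": statistics.fmean,
    "maximum": max,
    "median": statistics.median,
    "variance": statistics.pvariance,
    "peak_to_peak": lambda w: max(w) - min(w),
}


def extract(window, mask):
    """Apply only the features whose mask bit is 1, in a fixed order."""
    return [fn(window) for fn, bit in zip(FEATURES.values(), mask) if bit]


window = [0.0, 1.0, 4.0, 1.0, -1.0]
# The mask is what the optimizer searches over: one bit per feature.
features = extract(window, mask=[1, 0, 0, 1, 1])
print(features)
```

With one bit per feature, a 12-feature set yields 2^12 = 4,096 possible subsets, which are searched jointly with the model hyperparameters.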

Table 5.

Device	Features
ISPU (8, 32)	Mean	IQR	Maximum	Median	Variance	MAD	Abs. Energy	Entropy	Peak-to-Peak	FFT Mean Coeff.	Fundamental Frequency	Max. Power Spectrum
F446RE (128, 512)	Mean	IQR	Maximum	Median	Variance	MAD	Abs. Energy	Entropy	Peak-to-Peak	FFT Mean Coeff.	Fundamental Frequency	Max. Power Spectrum
L476RG (128, 1024)	Mean	IQR	Maximum	Median	Variance	MAD	Abs. Energy	Entropy	Peak-to-Peak	FFT Mean Coeff.	Fundamental Frequency	Max. Power Spectrum
F746ZG (320, 1024)	Mean	IQR	Maximum	Median	Variance	MAD	Abs. Energy	Entropy	Peak-to-Peak	FFT Mean Coeff.	Fundamental Frequency	Max. Power Spectrum
L4R5ZI_P (640, 2048)	Mean	IQR	Maximum	Median	Variance	MAD	Abs. Energy	Entropy	Peak-to-Peak	FFT Mean Coeff.	Fundamental Frequency	Max. Power Spectrum

Table 5. Chosen Features (Shaded) for Each Target Hardware for Neurosymbolic Optimization of Input Feature Choices and Model Backbone

The SRAM and flash limits of the hardware are given in parentheses in kB in the form (SRAM, Flash).

5.2.3 Target Hardware.

We perform neurosymbolic optimization for the same four microcontrollers from Section 5.1. In addition, we perform optimization for an integrated sensor processing unit (ISPU) from STMicroelectronics. The ISPU is an ultra-low-power 10 MHz 32-bit RISC processor (architecture: STRED) embedded within the LSM6DSOIS and ISM330IS 6DoF MEMS inertial sensors. The processor uses a proprietary version of TFLM (called q2c) to run on-chip neural networks without needing a power-hungry microcontroller in the loop and uses the STRED/ISPU toolchain to compile C++ programs. The processor has 8 kB SRAM and 32 kB flash [107].

5.2.4 Overall Performance.

Figure 11 (Left) shows the Pareto-frontier generated by TinyNS versus using all the features and directly operating on the raw accelerometer data. On average, TinyNS provides up to 2% improvement in accuracy over the same model operating on raw data or operating on all the features. Extracting all the features is computationally intensive (especially for the ISPU), while operating on raw data without a gyroscope, magnetometer, or other axes of the accelerometer results in performance degradation. Tables 5 and 6 show the chosen features and model hyperparameters for each target hardware. Surprisingly, TinyNS learns to pick only the most important features (e.g., peak-to-peak, FFT mean coefficients, entropy, and variance) for the ISPU and the microcontrollers with the lowest SRAM and flash capacities. These features are well-known to have the highest effect on classifier performance in human activity recognition literature [8, 168]. As the device capabilities increase, TinyNS selects other features in the feature set. TinyNS also performs the architectural adaptation and device capability exploitation seen in Section 5.1, increasing the number of filters, the kernel size, and the number of stacks of the model candidates. To prevent the exploding and vanishing gradient problems, TinyNS learns to add skip connections to deeper TCN models. The SRAM usage and FLOPS count of the models steadily increase with increasing device capabilities as shown in Figures 11 (Center) and 11 (Right). The median SRAM saturation is around 20%, with the saturation being higher for devices with higher flash availability, showing full resource exploitation by TinyNS for each target hardware. Overall, choosing the best synergy of features and model hyperparameters makes it possible to run models on extremely resource-constrained platforms beyond microcontrollers like the ISPU.

Table 6.

Device	Number of filters	Kernel size	Number of stacks	Dilations, number of layers per stack	Skip connections
ISPU (8, 32)	3	5	1	[1,2,4,32,64,128], 6	False
F446RE (128, 512)	5	3	3	[1,2,16,32,128], 5	False
L476RG (128, 1024)	7	7	2	[1,2,4,32,128], 5	False
F746ZG (320, 1024)	3	10	3	[1,2,8,16,32], 5	True
L4R5ZI_P (640, 2048)	29	6	1	[1,4,16,64,128], 5	True

Table 6. Chosen Model Hyperparameters for Each Target Hardware for Neurosymbolic Optimization of Input Feature Choices and Model Backbone

The SRAM and flash limits of the hardware are given in parentheses in kB in the form (SRAM, Flash).

Fig. 11.

5.3 Fall Detection under 2 kB and Activity Recognition (Symbolic Neuro Symbolic)

In this case study, we showcase how TinyNS picks the best model backbone (neural or non-neural) and its hyperparameters out of a zoo of TinyML model backbones.

5.3.1 Dataset and Task Description.

We use the Auritus dataset [135] for this case study. There are two tasks. The first task is to distinguish between fall and non-fall activities under a 2 kB memory constraint (suitable for the ISPU) using an ear-mounted 6DoF inertial measurement unit called an earable. The second task is to classify nine human activities (walking, jogging, standing, sitting, laying, turning left, turning right, jumping, and falling). The dataset is sampled at 100 Hz from 45 volunteers. We split the dataset in two ways: a split with no unseen participants and a leave-1-out split. In the first splitting technique, we use 80% of the data for training, 10% for validation, and 10% for testing. In the second splitting technique, we perform 10-way cross-validation by leaving a random participant out of the training set. The data from the remaining 44 participants are split 90:10 for training and validation. The stride was set to 0.5 s and the window size was optimized as a hyperparameter.

5.3.2 Model Backbones, Training Details, Target Hardware, and Search Space Definition.

We set five different model backbones (three neural, two non-neural) in the search space, each with its own set of optimization hyperparameters:

•

TCN (neural) [97, 162]—number of filters in the TCN layers: range (2, 64); kernel size in the TCN layers: range (2, 16); skip connections in TCN: [True, False]; the number of layers per stack: range (3, 8); dilation factor choices: [1, 2, 4, 8, 16, 32, 64, 128, 256].

•

FastGRNN (neural) [94]—number of hidden units: range (20, 60).

•

FastRNN (neural) [94]—number of hidden units: range (20, 60).

•

Bonsai (non-neural) [93]—projection dimension: range (10, 70); sigmoid parameter: uniform (1.0, 4.0); depth: range(1, 6).

•

ProtoNN (non-neural) [74]—projection dimension: range (10, 70); \(\gamma\) : uniform (0.0015, 0.05); the number of prototypes: range (10, 70).

In addition, for all the models, the search space for the window size is [1, 2, 3, 5] s. For TCN, we generate the Pareto-frontier for four different STM32 microcontrollers (F446RE, L476RG, F407VET6, and F746ZG) and the Qualcomm CSR8670 microcontroller found inside the earable. We use proxies for profiling the CSR processor as it does not support firmware modification. For the STM32 microcontrollers, we use platform-in-the-loop profiling. For Bonsai and ProtoNN, we apply five features on the accelerometer and gyroscope vector sums: maxima, minima, range, variance, and standard deviation. The rest of the models operate directly on the raw data. The loss is categorical cross-entropy for all the models, except for Bonsai, which uses multi-class hinge loss. The NAS error metric is validation accuracy for TCN and training accuracy for the rest of the classifiers.
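A search space of this kind is conditional: the backbone is chosen first, and only that backbone's hyperparameters are then sampled. The sketch below encodes the ranges listed above in a small, hypothetical tuple format of our own and draws candidates uniformly at random, which is a stand-in for the Bayesian optimizer. It assumes half-open integer ranges and omits the TCN dilation choices for brevity.

```python
import random

# Per-backbone search spaces as listed above (TCN dilation choices omitted).
SPACE = {
    "TCN": {
        "num_filters": ("range", 2, 64),
        "kernel_size": ("range", 2, 16),
        "skip_connections": ("choice", [True, False]),
        "layers_per_stack": ("range", 3, 8),
    },
    "FastGRNN": {"hidden_units": ("range", 20, 60)},
    "FastRNN": {"hidden_units": ("range", 20, 60)},
    "Bonsai": {
        "projection_dim": ("range", 10, 70),
        "sigmoid_param": ("uniform", 1.0, 4.0),
        "depth": ("range", 1, 6),
    },
    "ProtoNN": {
        "projection_dim": ("range", 10, 70),
        "gamma": ("uniform", 0.0015, 0.05),
        "num_prototypes": ("range", 10, 70),
    },
}
WINDOW_SIZES_S = [1, 2, 3, 5]


def draw(spec, rng):
    """Sample one value from a ("range" | "uniform" | "choice", ...) spec."""
    kind = spec[0]
    if kind == "range":
        return rng.randrange(spec[1], spec[2])  # half-open: an assumption
    if kind == "uniform":
        return rng.uniform(spec[1], spec[2])
    return rng.choice(spec[1])


def sample_candidate(rng):
    """Pick a backbone, then sample only the hyperparameters it defines."""
    backbone = rng.choice(sorted(SPACE))
    params = {name: draw(spec, rng) for name, spec in SPACE[backbone].items()}
    return {"backbone": backbone, "window_s": rng.choice(WINDOW_SIZES_S), **params}


candidates = [sample_candidate(random.Random(seed)) for seed in range(5)]
for candidate in candidates:
    print(candidate)
```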

5.3.3 Overall Results.

Figure 12 summarizes the accuracy and model size for the highest performing models for each of the five backbones against competing models, while Table 7 shows the hyperparameters of the said models. TinyNS achieves state-of-the-art improvement in both accuracy and model size reduction, providing earable activity detection models that are 98 \(\times\) –740 \(\times\) smaller yet 3%–6% more accurate than competing models. The activity recognition models are as small as 6–13 kB. Further, TinyNS achieves 98% earable fall detection accuracy with a model as small as 2.3 kB. The case study illustrates the importance of optimizing several model backbones rather than a single backbone, particularly in unseen domains void of expert knowledge. Notably, models with more parameters do not necessarily provide higher accuracies. Appropriate architectural encodings make it possible to achieve the same or better accuracy with a lower parameter count (e.g., a CNN is likely to outperform a fully connected neural network due to the ability to extract spatial relations, even though the latter may have more parameters). Even if one architecture performs poorly, the search algorithm has other architectures to choose from. Exploring various architectures is therefore important for squeezing highly performant models onto platforms beyond microcontrollers, such as the ISPU.

Table 7.

Model Backbone	Device	Hyperparameters
Model Backbone	Device	Number of filters	Kernel size	Dilations, number of layers per stack	Skip connections
TCN	F446RE (128, 512)	18	2	[2, 4, 8, 16, 32, 64, 128, 256], 8	Yes
	L476RG (128, 1024)	13	7	[1, 4, 16, 32], 4	No
	eSense earable (128, 16000)	15	2	[1, 2, 4, 8, 32, 128, 256], 7	Yes
	F407VET6 (192, 512)	17	3	[2, 4, 32, 128, 256], 5	No
	F746ZG (320, 1024)	21	2	[2, 8, 16, 64, 128, 256], 6	Yes
FastGRNN	None (hardware-agnostic)	Hidden Units
FastGRNN		50
FastRNN		32
Bonsai		Projection Dimension		Sigmoid Parameter	Depth
Bonsai		22		1.0	3
ProtoNN		Projection Dimension		Prototypes	\(\gamma\)
ProtoNN		70		70	0.004

Table 7. Chosen Model Hyperparameters for Each Backbone Found by TinyNS when Optimizing Several Model Backbones for Earable Activity Detection

The SRAM and flash limits of the hardware are given in parentheses in kB in the form (SRAM, Flash).

Fig. 12.

5.4 Optimization of Neural Detector Weights and Symbolic Object Tracker (Neuro \(\rightarrow\) Symbol)

In this case study, we show the ability of TinyNS to jointly optimize neural and symbolic modules, where the symbolic module makes high-level reasoning over the neural outputs.

5.4.1 Dataset and Task Description.

We use the MOT17 dataset [113] for this case study. The goal is to develop multiple people tracking algorithms from a single camera feed under model size constraints. The dataset is pre-processed using the ByteTrack library [178].

5.4.2 Model Backbones and Search Space Definition.

We use the ByteTrack library [178] to implement the CenterNet algorithm [179], which was discussed in Section 4.2. Each candidate model is trained for 70 epochs with a batch size of 16. The search space for the ResNet + Deformable Convolutional Network and the tracking filter is as follows:

•

Number of convolutional stacks: range (1, 5)

•

Kernel size: [1, 3, 5, 7, 9,..., 23]

•

Layer-wise activations: [True, False]

•

Head convolutional value: [50, 100, 150,..., 300]

•

Rendering threshold: linspace (0.1, 0.9, 9)

•

Confidence threshold: linspace (0.1, 0.9, 9)
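The grids in this search space mix neural hyperparameters with thresholds of the symbolic tracking filter. The sketch below expands the abbreviated lists and shows one plausible role of the confidence threshold as a post-filter on detections. The `keep_confident` helper and the detection records are hypothetical and do not reproduce the ByteTrack API.

```python
# Expanded grids from the list above.
kernel_sizes = list(range(1, 24, 2))                     # 1, 3, ..., 23
head_conv_values = list(range(50, 301, 50))              # 50, 100, ..., 300
thresholds = [round(0.1 * i, 1) for i in range(1, 10)]   # linspace(0.1, 0.9, 9)


def keep_confident(detections, confidence_threshold):
    """Symbolic post-filter: drop detections scoring below the threshold."""
    return [d for d in detections if d["score"] >= confidence_threshold]


# Hypothetical detector outputs for one frame.
detections = [
    {"id": 1, "score": 0.92},
    {"id": 2, "score": 0.45},
    {"id": 3, "score": 0.30},
]
kept = keep_confident(detections, confidence_threshold=0.4)
print(len(kernel_sizes), head_conv_values, thresholds, kept)
```

Because the thresholds change which detections reach the tracker, their best values depend on how well the detector is calibrated, which is one reason to tune them jointly with the network rather than separately.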

5.4.3 Overall Results.

Table 8 shows the performance, resource usage, and hyperparameters of the CenterNet algorithm under hard memory constraints compared to the handcrafted algorithm with default hyperparameters. Note that the MOTA and IDF1 for all the models are low as no pre-training or fine-tuning on additional data is performed. The 250 MB model achieves MOTA and IDF1 within 1% of the handcrafted model, while the 500 MB model exceeds the MOTA and IDF1 by 4.5%. The case study showcases that TinyNS can automatically match the performance of neurosymbolic models hand-tuned over hundreds of human hours, and even exceed that performance when device constraints relax. Compared to a human designer, TinyNS can find models whose hyperparameters may be counter-intuitive (e.g., reducing the head convolutional value from 150 to 100 and removing layer-wise activations for the 500 MB model) but provide superior performance.

Table 8.

Constraint	Flash Usage (MB)	Performance		Model hyperparameters				Filter hyperparameters (thresholds)
Constraint	Flash Usage (MB)	MOTA	IDF1	Kernel size	Stack count	Head convolution value	Activations	Rendering	Confidence
Handcrafted (none)	238	36.5	55.0	1	1	128	True	0.4	0.5
250 MB limit	238	36.1	54.6	1	1	150	True	0.3	0.4
500 MB limit	270	38.0	57.2	9	1	100	False	0.7	0.5

Table 8. Chosen Object Detector and Tracking Filter Hyperparameters for CenterNet Algorithm under Different Size Limits

5.5 Improving Adversarial Robustness of TinyML Models (Neuro \(\cup\) Compile (Symbolic))

In this case study, we showcase how TinyNS can find model architectures that follow some coveted architecture-dependent constraints.

5.5.1 Dataset and Task Description.

We use the Auritus dataset in this case study (the same dataset used in Section 5.3). The goal and the dataset splits are the same as those in Section 5.3, except that now we want TinyML models that not only have the highest accuracy within the device constraints but are also adversarially robust to white-box attacks (discussed in Section 4.3).

5.5.2 Model Backbones, Training Details, Target Hardware, and Search Space Definition.

We use the TCN, Bonsai, and ProtoNN backbones with the same model search space defined in Section 5.3. The window size is fixed to 5 s. For the TCN, we generate the Pareto-frontier for F446RE, L476RG, and F746ZG. The rest of the training details are the same as in Section 5.3.

5.5.3 Overall Results.

Figure 13 shows the test accuracy, adversarial accuracy, and the model size of TinyNS-generated models with adversarial robustness optimization, versus handcrafted models and models generated by TinyNS with no adversarial robustness optimization. TinyNS generates models that are 1%–26% (9% on average) more adversarially robust than competing models while maintaining or exceeding the accuracy on the main task. This comes at the cost of increased model size, albeit well within the flash constraints of the target hardware. This is because larger models have more parameters and are therefore more robust to small input perturbations. In addition, models generated by TinyNS without adversarial robustness optimization are more sensitive to small perturbations compared to handcrafted models. This is probably due to high loss smoothness and low gradient variance in the loss contour of NAS-generated models [117].

Fig. 13.

5.6 Physics-aware Neural Inertial Localization (Neuro \(\cup\) Compile (Symbolic))

In this case study, we showcase how TinyNS can force models to follow some coveted constraints via the inclusion of physics channels.

5.6.1 Dataset and Task Description.

We use 5 inertial odometry datasets spanning 4 applications for this case study. These include two datasets for human tracking, namely, OxIOD [29] and RoNIN [78], AQUALOC [61] for unmanned underwater vehicle (UUV) tracking, EuRoC MAV [22] for unmanned aerial vehicle (UAV) tracking, and GunDog [73] for animal tracking. The split information for all the datasets is shown in Table 9. The goal is to train a model to predict the position of an object using inertial sensor data without GPS updates while mitigating the position error explosion innate in inertial sensors due to bias and drift. The model must be able to detect when sufficient translational movement has not happened, thereby not updating the position (physics-aware).
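The desired physics-aware behaviour, holding the position when the object is not translating, can be illustrated with an explicit gate. This is only a sketch of the target behaviour: the article obtains it through physics channels supplied to the model, whereas the `moving_flags` input and the `integrate` helper here are hypothetical.

```python
def integrate(displacements, moving_flags, start=0.0):
    """Accumulate 1-D position, ignoring predictions while the object is static."""
    position, track = start, []
    for step, moving in zip(displacements, moving_flags):
        if moving:
            position += step  # only trust the displacement when motion is detected
        track.append(position)
    return track


# Hypothetical per-window displacement predictions (m) and motion flags.
predicted = [0.5, 0.5, 0.1, 0.1, 0.5]
moving = [True, True, False, False, True]
track = integrate(predicted, moving)
print(track)
```

Without the gate, the two spurious 0.1 m predictions during the stationary period would accumulate into the position estimate, which is how bias and drift make inertial position error grow over time.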

Table 9.

Dataset	Sampling Rate (Hz)	Window Size	Stride	Splits (Tr, Val, Te) (%)	Model Epochs
OxIOD	100	200	10	85, 5, 10	900
RoNIN	200	400	20	70, 5, 25	900
AQUALOC	200	400	20	80, 5, 15	300
EuRoC MAV	200	50	5	80, 10, 10	300
GunDog	40	10	10	45 \(^*\) , 5 \(^*\) , 50	300

Table 9. Window Size, Stride, Training-validation-test Splits, and Training Epochs Used in the Inertial Odometry Datasets

\(^*\) Training trajectory split into two parts for train and validation splits.

5.6.2 Model Backbones, Training Details, Target Hardware, and Search Space Definition.

We use a TCN backbone. The outputs of the TCN are reshaped, pooled, and flattened, and then fed to a 32-unit dense layer with linear activations. The loss is a mean-squared error, the optimizer is Adam with a learning rate of 0.001, and the NAS error metric is validation loss. The search space for the model is as follows:

•

Number of layers per stack: range (3, 8)

•

Dropout: uniform (0.0, 1.0)

•

Normalization: [Weight, Layer, Batch]

•

Number of filters in the TCN layers: range (2, 64)

•

Kernel size in the TCN layers: range (2, 16)

•

Skip connections in TCN: [True, False]

•

Dilation factor choices: [1, 2, 4, 8, 16, 32, 64, 128, 256]

We generate the Pareto-frontier for the 4 STM32 microcontrollers outlined in Section 5.3.

5.6.3 Overall Results.

Figure 14 shows the odometric resolution of models found by TinyNS (called TinyOdom) versus handcrafted state-of-the-art neural and symbolic models. TinyNS models outperform purely neural and purely symbolic models on all four applications by 1.15 \(\times\) while being 31 \(\times\) –134 \(\times\) smaller. In other words, TinyNS not only exceeds the resolution of human-designed neural and symbolic models but also ensures the deployability of the models on microcontrollers. The superior performance is possible partly due to the inclusion of the physics channel, which improves the resolution by \(1.1\times\) on average, as showcased in Table 10. The physics channel ensures that lightweight and under-parameterized models such as those generated by TinyNS are able to follow the underlying system physics as well as over-parametrized baselines. Figure 15 visualizes the architectural adaptation and device capability exploitation by TinyNS when generating the Pareto-frontier. As observed in previous sections, TinyNS changes the appropriate hyperparameters to improve device resource usage and resolution.

Table 10.

Dataset	Absolute Trajectory Error (m)		Relative Trajectory Error (m)
Dataset	With Physics	Without Physics	With Physics	Without Physics
OxIOD	3.35	3.86	0.90	1.24
AQUALOC	3.36	3.71	2.44	2.53
Agrobot (Phase 1)	7.85	9.13	1.10	1.33

Table 10. Effect of Removing the Physics Channel of Proposed Neural-inertial Odometry Models on Three Inertial Odometry Datasets
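As a rough cross-check of the improvement attributed to the physics channel, the ratios can be recomputed from the absolute trajectory error columns of Table 10. The average reported in the text may have been computed over a different set of metrics, so the figure below is indicative only.

```python
# (with physics, without physics) absolute trajectory error in metres, from Table 10.
ate = {
    "OxIOD": (3.35, 3.86),
    "AQUALOC": (3.36, 3.71),
    "Agrobot (Phase 1)": (7.85, 9.13),
}

# Ratio > 1 means the model without the physics channel has the larger error.
ratios = {name: without / with_ for name, (with_, without) in ate.items()}
mean_ratio = sum(ratios.values()) / len(ratios)
print({name: round(r, 2) for name, r in ratios.items()}, round(mean_ratio, 2))
```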

Fig. 14.

Fig. 15.

5.7 Neural-Kalman Sensor Fusion (Symbolic[Neuro])

In this case study, we showcase how TinyNS can optimally combine a neural system model with a symbolic measurement model using Kalman filter theory.

5.7.1 Dataset and Task Description.

We use the AgroBot dataset [50] in this case study. The goal is to perform precision localization of an agricultural robot using neural inertial localization with intermittent GPS updates. The underlying system must fuse the smoothness and short-term resolution of neural inertial localization with the long-term precision of GPS. The dataset contains 6.5 h and 4.5 km of inertial and GPS data. We use 80% of the dataset for training and 20% for testing.
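To make the fusion idea concrete, the following is a minimal one-dimensional sketch, not the article's implementation. A stubbed "neural" model supplies per-step displacements for the Kalman prediction step, and a GPS position, when available, drives the correction step. The class name, the noise values, and the synthetic data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Kalman1D:
    """Scalar Kalman filter over position along one axis (illustrative only)."""
    x: float  # position estimate (m)
    p: float  # estimate variance (m^2)
    q: float  # process noise variance added per prediction step (m^2)
    r: float  # GPS measurement noise variance (m^2)

    def predict(self, displacement: float) -> None:
        # System model: the displacement comes from a neural network (stubbed
        # by the caller); uncertainty grows by q at every step.
        self.x += displacement
        self.p += self.q

    def update(self, gps_position: Optional[float]) -> None:
        # Measurement model: GPS observes position directly. Skip if no fix.
        if gps_position is None:
            return
        gain = self.p / (self.p + self.r)
        self.x += gain * (gps_position - self.x)
        self.p *= 1.0 - gain


def run_filter(displacements, gps_fixes, q=0.05, r=1.0):
    """Run predict/update over the sequence and return (estimate, variance) pairs."""
    kf = Kalman1D(x=0.0, p=1.0, q=q, r=r)
    track = []
    for d, z in zip(displacements, gps_fixes):
        kf.predict(d)
        kf.update(z)
        track.append((kf.x, kf.p))
    return track


# Hypothetical data: the stubbed displacements over-estimate a true 1.0 m step
# by 10%, and a GPS fix arrives only on every fifth step.
neural_steps = [1.1] * 10
gps = [float(k + 1) if (k + 1) % 5 == 0 else None for k in range(10)]
track = run_filter(neural_steps, gps)
```

Between fixes the estimate follows the smooth neural displacements while its variance grows. Each GPS fix pulls the estimate back toward the measured position and shrinks the variance. In this sketch, `q` and `r` play the role of the noise hyperparameters that a search procedure would tune.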

5.7.2 Model Backbones, Training Details, Target Hardware, and Search Space Definition.

We used the same model backbone and search space outlined in Section 5.6. In addition, we optimized the noise hyperparameters in the Kalman filter Allan variance matrix:

•

Accelerometer noise variance: linspace (0, 1, 10,000)

•

Gyroscope noise variance: linspace (0, 1, 10,000)

•

Magnetometer noise variance: linspace (0, 1, 10,000)
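The three noise grids above can be written out as follows. This is a sketch that assumes linspace (0, 1, 10,000) means 10,000 evenly spaced values with both endpoints included, as in `numpy.linspace`. The dictionary keys are hypothetical names, not identifiers from TinyNS.

```python
def linspace(start, stop, num):
    """Evenly spaced grid with both endpoints included (numpy.linspace semantics)."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


# Hypothetical parameter names for the three noise variances listed above.
noise_space = {
    "accel_noise_var": linspace(0.0, 1.0, 10_000),
    "gyro_noise_var": linspace(0.0, 1.0, 10_000),
    "mag_noise_var": linspace(0.0, 1.0, 10_000),
}

# Number of distinct settings if the three grids were searched exhaustively.
n_combinations = 1
for grid in noise_space.values():
    n_combinations *= len(grid)

print(f"{n_combinations:.0e} noise settings before any architecture choices")
```

Under this assumption, the noise hyperparameters alone span 10^12 combinations, which is far too many to evaluate exhaustively when each evaluation involves training a model. That size is one practical reason for using a sample-efficient optimizer rather than a grid search.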

The batch size, optimizer, and training epochs were set to 256, Adam (learning rate: 0.001), and 3,000, respectively. The NAS error metric is the absolute trajectory error during training. The model size constraint is set to 2 MB.

5.7.3 Overall Results.

Table 11 compares the TinyNS-generated neurosymbolic model with human-engineered state-of-the-art neural and symbolic localization approaches. Compared to competing neural models, the TinyNS model without GPS lowers model size and absolute trajectory error by 1.5\(\times\)–27\(\times\) and 1.4\(\times\)–5.8\(\times\), respectively. Compared to competing symbolic models, the TinyNS model with GPS lowers absolute trajectory error and relative trajectory error by 1.2\(\times\)–11\(\times\) and 1.1\(\times\)–3.8\(\times\), respectively. The neural-Kalman fusion exploited by TinyNS combines the long-term precision of symbolic models with the short-term robustness and resolution of neural networks within the 2 MB limit set for this case study.

Table 11. Odometric Resolution and Flash Usage of Proposed Neural-Kalman GPS-INS Fusion for Locating Precision Agricultural Robots Versus State-of-the-art Neural and Symbolic Approaches

Paradigm	Method	Code Size (MB)	Absolute Trajectory Error (m)	Relative Trajectory Error (m)
Neural	IONet [28]	1.71	5.58 | 10.1	0.92 | 0.57
Neural	L-IONet [29]	0.55	8.11 | 18.6	0.91 | 1.40
Neural	AbolDeepIO [55]	12.5	7.24 | 20.5	0.96 | 0.93
Neural	VeTorch [62]	29.6	2.86 | 15.6	0.44 | 0.84
Symbolic	UKF-M INS+GPS [21]	0.192	5.50	0.49
Symbolic	EKF INS+GPS [125]	0.077	3.31	0.58
Symbolic	GPS only	n/a	1.89	0.42
Neurosymbolic	Ours (no GPS, w physics)	1.10	1.76 | 9.12	0.28 | 1.55
Neurosymbolic	Ours (w GPS, w physics)	1.12	1.02 | 1.81	0.28 | 0.64

Where two error values are given, the first is on the seen trajectory and the second is on the unseen trajectory; a single value is on the unseen trajectory.

6 Conclusion, Limitations, and Future Work

Neurosymbolic AI provides a pathway for making context-aware, physics-aware, robust, interpretable, and performant AI systems. TinyNS is a stepping stone toward automating the deployment of neurosymbolic frameworks onto ultra-resource-constrained IoT devices such as microcontrollers and ISPUs. The Bayesian optimization formulation provides an inexpensive method to iterate over complex neurosymbolic search spaces, yielding Pareto-optimal models that depend on resource availability. GP-UCB and the hard thresholding policy allow fine-grained exploration and exploitation of the search space and improve convergence time. Through TinyNS, we have showcased state-of-the-art performance in various unseen applications. Several lessons, limitations, and directions for future work for our framework are as follows:

•

General-purpose parsers, lexers, and visitors for realizing symbolic program graphs on microcontrollers do not yet exist. We need tools similar to TFLM, but for parsing program decision trees.

•

The process of porting a custom symbolic layer from TF to TFLM is convoluted, and support is mostly limited to the layers available in TFL. Running such custom layers requires a user-friendly framework that automatically ports custom TF operators to TFLM.

•

So far, our framework supports only TFLM for model parsing. Support for other inference engines still needs to be added.

Footnotes

¹ https://www.st.com/en/embedded-software/x-cube-ai.html.

² https://eloquentarduino.com/.

³ https://github.com/nok/sklearn-porter.

⁴ https://shedskin.github.io/.

⁵ https://www.nuitka.net/.

⁶ https://www.csse.canterbury.ac.nz/greg.ewing/python/Pyrex/.

⁷ https://openai.com/blog/chatgpt.

⁸ https://zerynth.com/blog/python-and-c-hybrid-programming-on-a-microcontroller-with-zerynth/.

References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard et al. 2016. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI’16). 265–283.
