1. Introduction
Deep learning (DL) has emerged as an influential tool that drives innovation efforts for the advancement of mission-critical applications such as the prediction of environmental phenomena [
1], object detection and classification [
2], natural language understanding and generation [
3], and autonomous navigation and steering [
4]. DL is inspired by how the human brain is structured and functions, with the foundation and premise attributed to Deep Neural Networks (DNNs). In DNNs, multiple layers of interconnected artificial neurons form a network of hierarchies aimed at enhancing the network’s ability to automatically derive low-dimensional representations from unstructured and high-dimensional data, such as human-written text and images [
5].
With the last decade featuring several breakthroughs in the area of generative AI, such as Generative Adversarial Networks (GANs) [
6], transformers [
7], and Large Language Models (LLMs) [
8,
9], model scaling is gaining significant interest. Model scaling refers to the expansion of the underlying structure of the DNN [
10]. Early research advocated for the addition of more layers in DNNs, leading to increased network depth [
11]. By scaling the DNN, users can select among several model variants to enhance utility for different application aspects and deployment sites (i.e., data centers or edge computing) [
12]. More recent examples have suggested increasing multiple dimensions of the DNN structure together, such as the popular EfficientNet model architecture for object detection that features eight model variants with different network depths, widths, and input resolutions [
13]. However, while expanding the underlying DNN structure increases the representational power of the DNN, improving accuracy and generalization over newly introduced tasks, it also increases the computational complexity of the network [
14]. A recent study by OpenAI revealed that since 2012, the amount of computational effort required by DL models has increased exponentially and doubled every 4 months [
15]. Nonetheless, while IoT and edge computing hardware are vastly improving, they are not doing so at a rate capable of catering to the advancements in deep learning [
16,
17]. Hence, although the AI community was, in the past, focused solely on improving model accuracy metrics, a paradigm shift is now being observed where runtime performance during inference and energy efficiency are gaining industry interest [
18,
19].
This brings us to the focal point and motivation that support our work. With so many different scaling strategies for DNNs, it becomes extremely difficult for users to select a model variant that will meet different QoS requirements. Models can scale in terms of depth, width, and clarity of input data, and different parameters can be set (e.g., batch size and training epochs). All these configurations can significantly impact the performance of a DL application. The majority of existing works that empirically evaluate model scaling focus on either a single model structure [
13,
20] or introduce findings that are tailored solely to evaluating accuracy and overhead during the model training stage [
21,
22]. We deviate from the norm and focus on the impacts of different model scaling strategies during runtime inference. Specifically, we focus on three key performance axes: classification accuracy, computational overhead, and latency. By offering a detailed examination of trade-offs during model inference, our work aims to guide AI practitioners in selecting and tuning DL models for efficient and effective deployment of intelligent IoT services in resource-constrained environments.
The main contributions of our work are summarized as follows:
We introduce a high-level description of a benchmarking framework for the assessment of the various trade-offs that occur when scaling the underlying network of a DL model across different dimensions (i.e., depth and width). This modular and extensive framework provides performance insights on the impact of DL model scaling during runtime inference in light of classification accuracy and model loss, computational overhead, and the model’s latency footprint.
We demonstrate the utility of our framework by introducing an empirical study evaluating DL model scaling strategies and their impacts on the QoS of runtime inference. We employ several model structures from three popular DL application domains. For our work, we utilize publicly available and pre-trained model architectures (BERT, EfficientNet, and MLP), and for inference workloads, we employ openly available datasets (i.e., GLUE-MRPC and ImageNet).
The rest of this article is structured as follows:
Section 2 presents a brief overview of model scaling for DNNs as the premise of our work.
Section 3 elaborates on the benchmark methodology employed for the evaluation and empirical study.
Section 4 introduces three different DL use cases and the datasets used during validation.
Section 5 documents the performance metrics used during the empirical study, while
Section 6 presents a comprehensive overview of the results.
Section 7 provides an overview of recent related works, while
Section 8 concludes the article and outlines future work.
2. Background
The following subsections introduce an overview of the architecture pattern encompassing deep neural networks (DNNs) and the premise of model scaling for DNNs.
2.1. Deep Neural Networks
Deep learning is synonymous with many-layered artificial neural networks that are well-recognized and referred to as deep neural networks (DNNs). It is in these networks that data are hierarchically represented, achieving excellent results in high-level abstraction and intricate pattern recognition when working with tasks connected to unstructured data such as images, audio, and text [
23].
An exemplary image of a DNN is depicted in
Figure 1. The key constituents of deep learning are neurons, layers, weights, biases, and activation functions. Neurons are the basic units that process input and produce output through an activation function. Layers are collections of neurons and can be classified into three types: input layers, hidden layers, and output layers. The input layer first accepts the initial data, such as pixel values in image data or word embeddings in text. The neurons of this layer represent features of the data. Hidden layers process the flow of data by identifying complex patterns. Neural networks with one hidden layer are called shallow neural networks (SNNs), and networks with more than one hidden layer are called deep neural networks (DNNs) [
24]. The neurons within layers receive inputs, process them, and transmit information to successive layers. The output layer delivers the results, such as class labels in a classification task or numerical values in regression, with the number of neurons representing the number of categories or outputs. Weights and biases are parameters integrated into the model to learn during training and make the model accurate in making predictions. Activation functions introduce non-linearity into the network for the learning of complex patterns.
2.2. Model Scaling for Deep Neural Networks
Model scaling is the method where the capacity of the DNN is enlarged by expanding the underlying architectural structure to meet the varying requirements of different workloads [
20,
25].
Structure-wise, the baseline of a neural network can be expanded as follows:
In terms of depth by adding more layers to capture a richer set of features and generalize when introduced to new tasks;
In terms of width by embedding more neurons per layer to capture more fine-grained features and intricate relationships within the data;
In terms of data clarity (i.e., resolution for images) to potentially capture more fine-grained patterns in the given input.
An example of a neural network model structure that can be scaled depth-wise to introduce new variants is the popular family of Residual Networks (ResNets) [
11]. The naming convention of a ResNet (i.e., ResNet-50 or ResNet-152) includes a number denoting the number of layers in the network.
While scaling towards one dimension can increase model accuracy, as a network deepens (or widens), training speed is impacted due to the vanishing gradient problem, where convergence slows down as the gradients become very small while being propagated back through the layers [
26]. To compensate,
compound scaling is considered the norm today, where multiple dimensions are scaled together. An example of a family of neural networks adopting compound scaling is EfficientNet models for object detection, where depth, width, and input resolution are uniformly scaled based on a compound coefficient. By doing so, EfficientNet models can achieve accuracy similar to that of other DNNs (i.e., ResNets) while utilizing fewer computational resources during model training and extracting meaningful cost savings [
13].
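The compound scaling rule can be illustrated with a short sketch. The base coefficients below (1.2 for depth, 1.1 for width, 1.15 for resolution) are the values reported in the EfficientNet paper by Tan and Le and are not taken from this article; the function name `compound_scaling` is hypothetical. The released EfficientNet variants may not follow these values exactly, so the sketch only shows how a single compound coefficient scales all three dimensions together.

```python
# Base scaling coefficients reported by Tan and Le for EfficientNet
# (assumption: quoted from that paper, not from this article).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution


def compound_scaling(phi):
    """Return depth, width, resolution and approximate FLOPs multipliers
    for a compound coefficient phi, relative to the baseline network."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # FLOPs grow linearly with depth and quadratically with width and resolution.
    flops = (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi
    return depth, width, resolution, flops


for phi in range(4):
    d, w, r, f = compound_scaling(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, FLOPs x{f:.2f}")
```

Because the product of the depth coefficient with the squared width and resolution coefficients is close to 2, each unit increase of the compound coefficient roughly doubles the FLOPs of the network.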
3. DNN Model Scaling Benchmark Framework
To perform the benchmarking of a diverse set of DNN use cases, we created an automated process where a user needs to specify a dataset, the selected DNN use case, and the user’s preferences about the model scaling parameters.
Figure 2 depicts a high-level overview of our pipeline. The framework provides a lightweight Python library through which users can submit their preferences by either filling in the library’s programming interfaces or by passing their parameters written in a configuration file (in YAML format).
An example of the configuration file is depicted in Listing 1. In this example, the user selects MLP as the DNN use case (use_case). Then, the user must define the parameter space (parameter_space) of the respective use case, which includes the available parameters, along with the selected evaluation values. In this example, the user sets three different values for network depth (network_depth: [4, 5, 7]), two values each for the network width and batch size parameters (network_width: [32, 64]; batch_size: [32, 128]), and three values for epochs (epochs: [1, 3, 5]). The system generates all possible combinations of these parameter values, resulting in a search space of 3 × 2 × 2 × 3 = 36 unique runs for this use case.
Next, the user selects the monitoring metrics that the system should expose. The available metrics include computational_overhead, accuracy, and latency. We then examine a set of execution-related parameters, including the boolean parameter training, which specifies whether the system retrains the model; dataset, which indicates the dataset location; and times, which defines the number of experimental executions. It is worth noting that how the dataset is loaded depends on the specific use-case implementation, so the dataset parameter could refer to either a file or a folder where the dataset resides (e.g., a folder of images for object detection and classification use cases).
After that, the framework generates a set of model scaling parameters based on user preferences and combines them with a specified ML use case. If users wish to create their own use cases, they should extend the DNN Model Scaling Benchmark programming interface and add them to the
use-case repository. By default, our system includes a
use-case repository with three predefined DL use cases—
natural language understanding,
regression analysis, and
object detection and classification—each with its respective scaling parameters (see
Section 4). With the parameters and chosen use case defined (via the configuration file or programmatically), the system generates a model whose structure reflects the specified
model parameter set, resulting in a
compiled model.
Listing 1. DNN parameters of the model scaling benchmark framework.
use_case: MLP
parameter_space:
  network_depth: [ 4, 5, 7 ]
  network_width: [ 32, 64 ]
  batch_size: [ 32, 128 ]
  epochs: [ 1, 3, 5 ]
metrics:
  - computational_overhead
  - accuracy
  - latency
training: True
dataset: "..."
times: 1
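The expansion of the parameter space in Listing 1 into individual runs can be sketched as a Cartesian product. This is a minimal illustration, not the framework's actual implementation; the function name `expand_parameter_space` is hypothetical.

```python
from itertools import product

# Parameter space copied from Listing 1.
parameter_space = {
    "network_depth": [4, 5, 7],
    "network_width": [32, 64],
    "batch_size": [32, 128],
    "epochs": [1, 3, 5],
}


def expand_parameter_space(space):
    """Return one dict per combination of parameter values (Cartesian product)."""
    names = list(space)
    return [dict(zip(names, values)) for values in product(*space.values())]


runs = expand_parameter_space(parameter_space)
print(len(runs))  # 3 * 2 * 2 * 3 = 36
print(runs[0])
```

Each dictionary in `runs` corresponds to one model parameter set that would be compiled, optionally trained, and evaluated.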
The next phase of our pipeline is
execution. This phase has two steps, namely
training and
inference. Specifically, the pipeline splits the dataset into training and testing subsets. This split is configurable with default values of 80% for training and 20% for the testing set. After that, the pipeline performs the training, updating the weights of the
compiled model. We should note that the training step is optional, since the users may introduce their own pre-trained models. Next, the trained model is utilized to perform inference on the testing subset, extracting the performance metrics, namely
inference latency,
accuracy/error, and
computational overhead. Since the test subset is already annotated, we can easily extract accuracy for classification tasks or error for regression tasks. For the inference latency, the framework keeps the timestamp before the inference process starts and the timestamp when it is over, subtracts them, and divides by the number of data points. Finally, for the computational overhead, our system extracts the number of Floating-Point Operations (FLOPs) from the compiled ML model. A more detailed description of the metrics can be found in
Section 5. The pipeline stores the
extracted metrics in
metric storage; then, the system selects another set of model parameters and repeats the process until there are no more parameter sets to be examined.
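The split and the latency measurement described above can be sketched as follows. This is a simplified illustration under stated assumptions: `split_dataset`, `mean_inference_latency`, and `dummy_predict` are hypothetical names, and the dummy function stands in for a compiled model.

```python
import time


def split_dataset(samples, train_fraction=0.8):
    """Split samples into training and testing subsets (80/20 by default)."""
    cut = int(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]


def mean_inference_latency(predict, test_samples):
    """Time one inference pass and return (predictions, seconds per data point)."""
    start = time.perf_counter()                 # timestamp before inference
    predictions = [predict(x) for x in test_samples]
    elapsed = time.perf_counter() - start       # timestamp after, minus start
    return predictions, elapsed / len(test_samples)


def dummy_predict(x):
    """Hypothetical stand-in for the inference call of a compiled model."""
    return 2.0 * x


train, test = split_dataset(list(range(100)))
predictions, latency = mean_inference_latency(dummy_predict, test)
print(len(train), len(test), latency)
```

A monotonic clock such as `time.perf_counter` is used here because wall-clock timestamps can jump when the system time is adjusted.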
Once all model parameter sets (and trials) for a given ML use case have been evaluated, users can conduct a performance trade-off evaluation. This step enables the system to automatically facilitate various post-processing tasks, such as fitting trend lines, plotting performance metrics, detecting outliers, and generating the respective reports. Additional post-execution metrics are also made available, such as model complexity, which reflects the model’s FLOPs relative to the most complex model configuration in the use case. Thus, users can intuitively analyze the effects of different parameter configurations, observing trends across a range of ML parameters and model structures. By visualizing these trade-offs, users gain insights into the balance between computational demands, accuracy, and other performance metrics, supporting informed decision making for optimal model selection in alignment with their specific goals and constraints.
4. DL Use Cases and Validation Datasets
Our DL repository comprises three use cases originating from different application domains and covering three out of the seven important domains for advancing AI as identified by MLCommons [
27]. All use cases employ open model structures with publicly available variants, and the datasets utilized for the inference workloads are also publicly available. Moreover, we note that although, inherently, any set of values can be used to alter the problem dimensions of a model structure, we embrace the reference value sets suggested by the model designers (e.g., Google Research) that lead to optimal fine tuning of the model accuracy for a given computational budget. These configurations enable all experimental results of the empirical analysis to be realistic, reproducible, and verifiable.
Table 1 provides a summary of the DNN model architectures for the DL use cases, as well as their scaling dimensions and their parameterization, for quick reference.
4.1. Natural Language Understanding
This use case involves natural language understanding, where the DL application setup is responsible for processing a series of text-based inference tasks using a language model to semantically interpret the provided text. The chosen model architecture for this use case is the well-known Bidirectional Encoder Representations from Transformers (BERT) model. BERT, as a model structure, was first introduced in 2018 by Google and is considered one of the most influential model structures for NLP tasks, language reasoning, and conversational AI. In brief, BERT is structured with stacked transformer encoder layers, where each layer comprises self-attention mechanisms and feed-forward neural networks [
3]. BERT was intentionally designed to pre-train deep bidirectional representations, adopting a large English text corpus (i.e., Wikipedia), by jointly conditioning on both left and right contexts in all layers [
28]. With this, BERT models can capture intricate patterns and dependencies in text-based datasets, making them powerful for understanding the nuances of language.
For the empirical study, we utilize 20 BERT model variants that have been introduced by Google Research (
https://github.com/google-research/bert/ accessed on 7 November 2024) and referenced as suitable for edge computing. These models vary in network depth (D = {2, 4, 6, 8, 10}) and hidden embedding sizes (H = {128, 256, 512, 768}). For the workload of inference tasks, we use the widely recognized GLUE-MRPC dataset released by Microsoft Research (
https://www.microsoft.com/en-us/download/details.aspx?id=52398 accessed on 7 November 2024), with the validation set containing 1125 sentence pairs automatically extracted from online news sources, along with human annotations indicating whether the sentences in each pair are semantically equivalent (paraphrase).
4.2. Object Detection and Classification
This use case centers on object detection, where the DL application setup is configured to handle a series of inference tasks to detect objects within a set of given images and classify the objects by assigning appropriate labels from a pre-defined label set. The chosen model architecture for this purpose is the widely used EfficientNet convolutional neural network [
13]. EfficientNet was first introduced in 2019 by Google AI to provide object detection for mobile and IoT services run at the network edge. For EfficientNet, the baseline network (denoted as B0) exploits a multi-objective neural architecture search that optimizes for both accuracy and FLOPs by adopting a hierarchy of mobile inverted bottleneck convolution layers and the use of squeeze-and-excitation optimization to ultimately enhance representational capacity [
29]. However, the most popular feature of EfficientNet is the introduction of compound scaling, where the network structure can uniformly scale in depth, width, and resolution by taking advantage of a fixed set of scaling coefficients tailored to increase the network’s computational efficiency without impacting representational accuracy.
The publicly released variants of EfficientNet are code-named Bx, with x denoting the model complexity rank (x = 0, ..., 7) and each varying in network depth, width, resolution, and dropout rate. For this empirical study, we utilize the first six models (B0 to B5), which are all openly available in the Keras pre-trained model library (
https://keras.io/api/applications/efficientnet/ accessed on 7 November 2024). Hence, we note that, as shown in
Table 1, for EfficientNet, the value of each dimension is expressed as a coefficient factor of the base variant (B0), and each of the six variants is formed by taking the corresponding value from the value set of each dimension. For the inference workload, we use images extracted from the well-known and publicly available Tiny ImageNet dataset hosted on Hugging Face (
https://huggingface.co/datasets/zh-plus/tiny-imagenet accessed on 7 November 2024), which includes a validation set of 10,000 color images (64 × 64 pixels) categorized into 200 label classes.
4.3. Regression Analysis for Predicting Numerical Outcome
For this use case, we tested 120 Multi-Layer Perceptron (MLP) model variants. In brief, MLPs are fully connected feed-forward artificial neural networks. As MLPs are fully connected, each neuron in a given layer is connected to every neuron in the subsequent layer. For this use case, the MLP networks’ goal is to ingest data points and, through a regression analysis, predict the numerical value of a target variable. A leaky Rectified Linear Unit (ReLU) is used as the neuron activation function. These 120 different model configurations differ by network depth (D = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}), neurons per layer (W = {32, 64, 128}), batch size (B = {32, 64, 128, 256}), and epochs (E = {1, 3, 5}).
As our inference workload, we use the open and publicly available California Housing dataset from StatLib provided via Keras (
https://keras.io/api/datasets/california_housing/ accessed on 7 November 2024). Specifically, the dataset captures 20K samples, with each composed of eight attributes (e.g., house age and number of rooms), and the target variable for reference is the median house value per California district, expressed in USD.
5. Benchmark Framework Metrics for Comparison
The benchmark framework supports the evaluation of the reference DL use cases against three performance axes, all evaluated during model inference tasks. Specifically, all use-case model variants are evaluated in terms of accuracy, computational overhead, and inference latency.
The following details how each metric is measured.
5.1. Classification Accuracy and Prediction Error During Inference
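For use cases 1 and 2, accuracy is measured as the fraction of correct predictions in the DL classification tasks per use case. In brief, classification accuracy is defined as the ratio of the number of correct predictions to the total number of predictions, where correct predictions include true positives (TP) and true negatives (TN), and incorrect predictions include false positives (FP) and false negatives (FN), as shown below:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$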
For use case 3, where regression is employed, prediction quality is measured by computing the Mean Square Error (MSE) for the predictions output by the regression analysis of the model variant. Specifically, the MSE computes the average squared difference between the observed ($y_i$) and predicted ($\hat{y}_i$) values of a sample comprising $N$ values, as shown below:
$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2$$
For the comparison of the model variants in terms of accuracy, the outcome for each input data point of the validation dataset is recorded during runtime inference. Upon the completion of the experimental run, the accuracy metric per model variant is computed and can be used to compare against experimental runs utilizing other model variants.
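Both quality metrics can be computed from the recorded outcomes with a few lines of code. The following is a minimal sketch with hypothetical function names and made-up example values; it is not the framework's own implementation.

```python
def accuracy(y_true, y_pred):
    """Ratio of correct predictions to the total number of predictions."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels and values, for illustration only.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))        # 3 of 4 correct -> 0.75
print(mean_squared_error([3.0, 5.0], [2.0, 7.0]))  # (1 + 4) / 2 -> 2.5
```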
5.2. Computational Overhead During Inference
The computational overhead of a DL model is attributed to the processing effort required by the model to perform the requested inference task (e.g., classification or regression) and output a result for a given data point. To measure the computational overhead imposed by a DL model variant during an experimental run, our benchmark framework counts floating-point operations, denoted as FLOPs. In brief, FLOPs measure the total number of mathematical operations involving floating-point values across the computational path of the DNN during runtime inference to output a prediction after receiving a new data point as input. FLOPs are a key metric for evaluating computational efficiency for DL models, as they provide measurable and hardware-independent results suitable for comparison among DL model variants [
30]. To measure the FLOPs for each model variant, our benchmark framework utilizes the open and popular
flopth (
https://github.com/vra/flopth accessed on 7 November 2024) Python library, which supports the measurement of the FLOPs performed during DNN inference.
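To give an intuition of why FLOPs are hardware-independent, the count for a fully connected network can be derived by hand. The sketch below is an analytic approximation and does not reproduce how flopth counts operations; it assumes one multiplication and one addition per weight (bias additions included) and ignores activation functions. The function names are hypothetical.

```python
def dense_layer_flops(n_inputs, n_outputs):
    """Approximate FLOPs of one fully connected layer for a single data point.

    Each output needs n_inputs multiplications, n_inputs - 1 additions and
    one bias addition, i.e. 2 * n_inputs operations (activations ignored).
    """
    return 2 * n_inputs * n_outputs


def mlp_flops(n_features, depth, width, n_outputs=1):
    """Approximate FLOPs of an MLP with `depth` hidden layers of `width` neurons."""
    sizes = [n_features] + [width] * depth + [n_outputs]
    return sum(dense_layer_flops(a, b) for a, b in zip(sizes, sizes[1:]))


# Eight input attributes, as in the California Housing dataset.
for depth in (1, 4, 10):
    print(depth, mlp_flops(n_features=8, depth=depth, width=32))
```

The count depends only on the layer sizes, which is why it stays the same regardless of the device that executes the model.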
5.3. Inference Latency
The third performance evaluation metric of our benchmark framework is inference latency. Inference latency for DL models is defined as the total amount of time required by the DNN variant to output a prediction after receiving a new data point as input and perform the necessary (mathematical) operations imposed throughout the computational path of the layered neural network during runtime inference. A customized monitoring probe has been created to extract the inference latency from open and popular ML backends, including TensorFlow and PyTorch, as used in the two DL use-cases.
5.4. Model Complexity
To ease the depiction of the results and to easily identify models among a large corpus of variants per use case, we introduce a new metric definition. Model complexity is defined as the proportion of a model variant relative to the most complex model under consideration.
For instance, assuming a set of N models denoted as M, where all variants differ only by network depth (layers), if the most complex model has a network depth of 10 layers, then a model with a depth of three layers has a model complexity of 3/10 = 0.3. For the introduced use cases where multiple DNN dimensions can be scaled, a normalized model complexity is computed and used as a reference for each model variant. For this, first, each scaling dimension is normalized; next, all normalized dimensions are multiplied; and finally, the intermediate product is divided by the corresponding product of the most complex model.
As an example, a normalized model complexity vector is computed for the EfficientNet variants of use case 2. With this vector, we can place the variants in order and denote proportional differences to estimate the computational overhead and latency of inference, as we see from the experimentation. Finally, we note that in all introduced plots, we employ the model complexity as the independent variable and show how accuracy, computational overhead, and latency are affected when scaling the model’s complexity.
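One possible reading of the model complexity definition can be sketched as the product of per-dimension ratios relative to the most complex variant. The function name `model_complexity` and the dictionary layout are hypothetical, and the reference variant (10 layers, hidden size 768) is an assumption taken from the BERT value sets listed in Section 4.1.

```python
def model_complexity(variant, most_complex):
    """Product of per-dimension ratios relative to the most complex variant."""
    value = 1.0
    for dimension, reference in most_complex.items():
        value *= variant[dimension] / reference
    return value


# Depth-only example from the text: 3 layers relative to 10 layers.
print(round(model_complexity({"depth": 3}, {"depth": 10}), 3))

# Two-dimensional example with the BERT value sets of Section 4.1
# (assumption: the most complex variant has depth 10 and hidden size 768).
bert_max = {"depth": 10, "hidden": 768}
shallow_wide = model_complexity({"depth": 2, "hidden": 768}, bert_max)
deep_narrow = model_complexity({"depth": 6, "hidden": 256}, bert_max)
print(round(shallow_wide, 3), round(deep_narrow, 3))
```

Under this reading, a shallow and wide variant and a deeper and narrower variant can land on the same complexity value, which is what makes the comparison of scaling strategies at a fixed complexity possible.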
6. Evaluation
In this section, we embrace our benchmark framework to conduct an empirical analysis to understand various trade-offs during runtime inference and assess their impacts on the introduced DL use cases.
6.1. Testbeds
For experimentation, we opted for two testbeds: a public and a private cloud infrastructure. Our aim is to enhance the replication of the empirical study so that the results from DL use-case benchmarking can be reproduced and verified by anyone interested in this study.
To this end, all experiment runs for the BERT (UC1) and MLP-Regression (UC3) use cases were executed on a testbed supported by Google Colaboratory (
https://colab.google/ accessed on 7 November 2024). Specifically, this testbed provides the benchmark framework with a dedicated DL execution environment featuring 16 GB of memory for model storage and performance of the computational experimentation, as well as an Nvidia T4 GPU featuring 2560 CUDA cores and 320 Turing tensor cores that are specialized for neural network training and inference. In turn, the experimental runs for the EfficientNet use case (UC2), which requires significant memory for both the model variants and the ImageNet workload, were run on an HP Proliant DL380 G9 server with an Intel Xeon E5-2680 processor embedding 44 cores clocked at 2.50 GHz and 176 GB memory. Finally, we note that all model variants embraced for UC1 and UC2 are the pre-trained versions of these models provided by Google Research and made publicly available in the HuggingFace and Keras repositories, respectively (see
Section 4.1 and
Section 4.2). This enables both the verification of the study findings (e.g., classification accuracy) and the reproduction of the experimental settings with minimal human effort and computational overhead.
6.2. Experimental Analysis
Figure 3,
Figure 4 and
Figure 5 depict the results of the empirical study in which several model variants per DL use case were run and evaluated during runtime inference. It should be noted that all plots are presented with the x axis (independent variable) capturing the normalized complexity of each model variant as introduced in
Section 5.4.
6.3. Inference Quality
Let us start the discussion of the empirical study with the plots presented in
Figure 3. These plots depict the classification accuracy (UC1 and UC2) and MSE (UC3) for the introduced model variants for each use case. First, it is evident that in all use cases, given enough data during training, quality can be improved by increasing the complexity of the model. In the case of the BERT (
Figure 3a) and EfficientNet (
Figure 3b) models, the classification accuracy increases, while in the case of the MLP regression models (
Figure 3c), the mean square error decreases. Hence, scaling the DNN structure’s dimensionality (e.g., network depth and width) by creating more computationally complex variants is beneficial for improving the inference quality during runtime inference.
The second insight from the empirical study is that, for all use cases, the learning ratio of simpler models plateaus at some point. An illustrative example is provided by the plots in
Figure 3a,b. Scaling the EfficientNet-B0 structure by the first compound scaling coefficient increases the model’s classification accuracy by more than 2.2%, but scaling from the B4 variant to the B5 variant introduces minimal gains in accuracy that are less than 0.5%. This can also be confirmed by looking at the MLP-Regression plot (
Figure 3c), where further analysis establishes the non-linear nature of the error and the diminishing error reduction through a Ramsey RESET test (p-value = 0.000826).
Third, as is evident in the MLP-Regression plot (
Figure 3c), the global minimum is achieved at an intermediate model complexity, and afterwards, more complex models present a slight increase in their reported MSE values. This occurs as the DNN model structure scales and becomes too complex, which results in overfitting to the random effects that are present only in the specific (California Housing) dataset used for training. In this regard, although the expressive power of a model depends, to some extent, on model complexity, a mathematical expression for estimating inference quality irrespective of the DL use-case domain does not exist [
31]. This is inherently due to the phenomena introduced in the second (learning plateau) and third (overfitting) insights.
6.4. Computational Overhead of Inference
The plots in
Figure 4 depict the effect of model scaling on the computational overhead imposed on the underlying model-serving infrastructure during runtime inference for each of the referenced DL use cases.
From these plots, it is immediately evident that when the model complexity increases, so does the computational overhead. Moreover, the model complexity and the computational overhead present an almost linear relationship. For example, the coefficients of determination (R squared) obtained for the BERT (Figure 4a) and EfficientNet (Figure 4b) use cases indicate a well-fit linear relationship. Furthermore, for the MLP-Regression case (Figure 4c), we observe two distinct phases of linearity, the second of which presents a higher linear coefficient than the first phase.
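The goodness of a linear fit between model complexity and computational overhead can be quantified with an ordinary least-squares line and its coefficient of determination. The sketch below uses made-up data points for illustration only (they are not results of the study), and the function name `linear_fit_r_squared` is hypothetical.

```python
def linear_fit_r_squared(x, y):
    """Fit y = slope * x + intercept by least squares and return
    (slope, intercept, R squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Made-up (complexity, GFLOPs) pairs, NOT measurements from the study.
complexity = [0.1, 0.2, 0.4, 0.8, 1.0]
gflops = [1.0, 2.1, 3.9, 8.2, 9.8]
slope, intercept, r2 = linear_fit_r_squared(complexity, gflops)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, R^2={r2:.4f}")
```

An R squared value close to 1 indicates that a straight line explains almost all of the variation in the overhead, whereas two distinct linear phases, as observed for the MLP-Regression case, would be better described by fitting each phase separately.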
6.5. Inference Latency
The plots in
Figure 5 depict the effect of model scaling on the latency observed by users submitting inference tasks to the DL model serving the application during runtime inference for each of the referenced DL use cases. In these plots, we observe that the linear relationship between model complexity and inference latency holds, especially for the BERT (Figure 5a) and EfficientNet (Figure 5b) use cases, although the fit is weaker than that observed for the computational overhead. The reduction can be attributed to the fact that although latency is highly correlated with the computational overhead (FLOPs) of the DNN required to output a result, other factors contribute to increased latency. Specifically, latency also depends on data pre-processing to shape the input data points to meet the processing requirements of the DNN and on DNN optimizations that can affect memory access patterns. While these factors can (slightly) influence the overall latency, the inference delay observed in the MLP-Regression use case (Figure 5c) is highly affected by overfitting after the global minimum is achieved, distorting the depicted linear relationship between model complexity and latency.
Finally, a key observation among the three performance axes must be made. A specific model complexity and computational budget (in FLOPs) can be achieved by utilizing different scaling policies (i.e., depth only, compound coefficient on depth/width, etc.). For example, in the case of the BERT models, in which two dimensions are available for scaling (network layers (d) and embedding size (e)), the same model complexity can be achieved with (d, e) pairs equal to (2, 768) and (6, 256), since 2 × 768 = 6 × 256. In the first case, the classification accuracy is 74.2%, and the latency is 2.2 s, while in the second case, the accuracy is 77.8%, and the latency is 2.79 s. Therefore, model scaling can have a significantly different effect on the classification accuracy and inference latency. Hence, benchmarking various model scaling strategies to evaluate the performance outcome during DL model inference is of utmost importance when striving to achieve cost savings in geo-distributed deployments and under resource constraints.
7. Related Work
This section reviews related scientific work in two subsections: the first briefly discusses DL model scaling, and the second covers empirical evaluations of DL model scaling.
7.1. DL Model Scaling
Several DL model scaling techniques have been proposed over the last decade, according to which the underlying structure of the DNN can be expanded in terms of depth [
11] or width [
26], or the input can be provided at a higher resolution [
32]. However, when the DNN structure grows along only a single dimension (e.g., depth), the computational costs become prohibitive, and model quality can also suffer due to parameter redundancy [
33]. To overcome these challenges, Tan et al. [
13] and Han et al. [
34] proposed the use of compound scaling with the introduction of the EfficientNet and TinyNet models, respectively. Compound scaling, as previously described, suggests that the DNN structure can expand uniformly along multiple dimensions by using a suitable scaling coefficient. With this, DNNs can achieve accuracy similar to that of other (competitive) DNNs while utilizing fewer computational resources during model training, thus yielding meaningful cost savings.
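Compound scaling can be sketched as follows, assuming the EfficientNet formulation in which depth, width, and input resolution are multiplied by α^φ, β^φ, and γ^φ for a compound coefficient φ. The constants α = 1.2, β = 1.1, and γ = 1.15 are the values reported in the EfficientNet paper for its baseline network; they are not stated in this article and are used here only for illustration, as is the function name `compound_scale`.

```python
# Constants reported for the EfficientNet baseline (assumption: not given in this article).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Return multipliers for depth, width, resolution, and the approximate FLOPs factor."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # FLOPs of a convolutional network grow roughly linearly with depth and
    # quadratically with width and resolution.
    flops = (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi
    return {"depth": depth, "width": width, "resolution": resolution, "flops": flops}


for phi in range(4):
    m = compound_scale(phi)
    print(
        f"phi={phi}: depth x{m['depth']:.2f}, width x{m['width']:.2f}, "
        f"resolution x{m['resolution']:.2f}, FLOPs x{m['flops']:.2f}"
    )
```

Because α·β²·γ² is close to 2 under these constants, each unit increase of φ roughly doubles the computational budget, which makes the coefficient a convenient single knob for targeting a given FLOPs budget.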
7.2. Empirical Evaluations of DL Model Scaling
This subsection presents recent and notable scientific works that empirically evaluate DL model scaling.
Lin et al. [
20] investigated the pitfalls of one-dimensional model scaling for convolutional neural networks (CNNs) and proposed a scaling method for CNNs that utilizes dimensional relationship and runtime proxy constraints to improve accuracy and inference latency. In turn, Hestness et al. [
35] introduced a study where they empirically projected the computational requirements for future DL applications based on the representational growth achieved (and targeted) by the AI community.
Dollár et al. [
14] presented an evaluation study to demonstrate how different compound scaling strategies affect the model parameters, activations, and, consequently, training time for different CNN model structures, focusing primarily on EfficientNet and RegNet models. With a focus on language models that adopt a transformer-based architecture, Kaplan et al. [
21] empirically showed that performance (measured by test loss) improves following a power-law relationship as the model size, dataset size, and amount of compute used for training are scaled. When only one factor is scaled to achieve the referenced performance, the other two must not become bottlenecks. For optimal performance, the authors showed that all three factors must be scaled up in tandem.
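To illustrate what a power-law relationship between test loss and model size looks like in practice, the sketch below fits loss = c · N^(−α) by linear regression in log-log space. The functional form is the generic power law; the constants, the synthetic data, and the function name `fit_power_law` are hypothetical and are not values reported by Kaplan et al.

```python
import math


def fit_power_law(sizes, losses):
    """Fit loss = c * size ** (-alpha) via least squares on (log size, log loss)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    # In log-log space the slope is -alpha and the intercept is log(c).
    return -slope, math.exp(intercept)


# Synthetic data generated from made-up constants (NOT from the cited study).
TRUE_ALPHA, TRUE_C = 0.08, 10.0
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [TRUE_C * n ** (-TRUE_ALPHA) for n in sizes]

alpha, c = fit_power_law(sizes, losses)
print(f"alpha = {alpha:.3f}, c = {c:.3f}")
```

A small exponent such as the one used here means that each tenfold increase in model size reduces the loss by only a constant factor, which matches the diminishing returns discussed throughout this article.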
In a recent study, Bahri et al. [
22] capitalized on the previously mentioned work to show how scaling the model and dataset size can be explained under variance-limited and/or resolution-limited scaling regimes. In the end, the authors provided a theoretical framework that can be used to estimate upper and lower bounds for model loss depending on the aforementioned scaling regimes. Furthermore, Aach et al. [
36] presented an empirical analysis of three open and popular distributed deep learning frameworks to assess their performance and scalability during model training, with experiments using ResNet models and the ImageNet dataset as the examined workload. Finally, Wang et al. [
37] introduced an empirical study that examined a large corpus of DL model variants investigating unit test adoption and if/how metrics beyond accuracy are used to improve model robustness and reliability.
Based on the existing works that empirically evaluate model scaling, it is evident that the majority of the aforementioned studies focus on a single model structure (e.g., ResNet, RegNet, or CNN models) and/or introduce findings tailored solely to evaluating accuracy and overheads during the model training stage. We deviate from this norm and focus on the impact that different model scaling strategies have during runtime inference. By offering a detailed examination of trade-offs during model inference, our work can help guide AI practitioners in selecting and tuning DL models for efficient and effective deployment of intelligent IoT services in resource-constrained environments. Moreover, we note that although a direct comparison against other empirical studies cannot be performed, as the body of related work focuses on performance and accuracy during the model training phase, several of our findings are consistent with the aforementioned studies. For example, Lin et al. [
20] showcased that compound model scaling positively impacts classification accuracy for DNNs, while Bahri et al. [
22] confirmed this but also noted (as we do) that the gains in terms of accuracy diminish and can plateau as a model attempts to scale using large compound coefficients. In turn, Dollár et al. [
14] showed that although two model variants can present the same computational complexity, the output classification accuracy can (significantly) differ.
8. Conclusions and Future Work
This article presents an empirical benchmarking study focused on understanding the correlation between DL model scaling and three performance axes during runtime inference. These axes cover accuracy, computational effort, and latency. This distinguishes our work from other studies that solely studied model accuracy and training duration during the training phase for different model variants. To this end, we designed and presented a prototype of a DL model scaling benchmark framework, along with details about its execution and functionalities. Moreover, we introduced three diverse DL use cases from the domains of natural language understanding, object detection, and regression analysis. All three use cases embrace open DNN model architectures, multiple and different scaling techniques, and the use of popular datasets for workload generation.
With the three developed DL use cases, we then performed an empirical analysis using our benchmark framework to uncover insights and trade-offs across the three performance axes during runtime inference when examining multiple model variants per use case. Some key observations from the analysis are the following. Across all three DNN use cases, increasing model complexity generally led to improved inference quality (higher accuracy for classification tasks and lower MSE for regression tasks), but only up to a point. As model complexity continued to increase, the improvements began to plateau. Moreover, scaling the model complexity beyond a certain point, which is both use-case- and dataset-dependent, may lead to the model overfitting to the training data and capturing noise rather than meaningful patterns, which can degrade performance on new data. We also observed an almost linear relationship between model complexity and computational overhead (measured in FLOPs). In addition, the inference latency increased with model complexity, correlating with computational overhead, although the relationship was not perfectly linear due to additional factors such as data pre-processing effort and memory access patterns influenced by model optimizations. Finally, our study highlights that models with the same complexity but different configurations can differ significantly in accuracy and latency.
Our future work includes several key areas of development to enhance the capabilities and applicability of our benchmarking framework. First, we plan to introduce new use cases covering more real-world problems that will include more complex DNN models, such as LLMs or, more generally, generative AI models. To that end, we plan to integrate our system with online model repositories (e.g., HuggingFace) in order to further automate our analysis. Second, we envision an extension of our monitoring subsystem to capture not only DNN performance metrics but also underlying utilization metrics, e.g., compute utilization (GPU/CPU), memory usage, and disk I/O. This will reveal hidden correlations between the structures of specific DNN models and their effects on the execution environment. Furthermore, our plans include the introduction of an intuitive user interface, along with a visualization library that will include advanced post-experimentation plotting and analysis functions. This new module will alleviate the difficulties faced by end users in extracting correlations between different parameters, finding the sweet spot for their deployment, and extracting useful insights. Finally, we would like to create a recommendation service that suggests (near-)optimal configurations for DNN-enabled applications without requiring every possible permutation to be tested. This service will be especially beneficial for beginners, providing a strong starting point to reduce evaluation time and help them choose the best-fit DNN model for their requirements.