Article

Evaluating DL Model Scaling Trade-Offs During Inference via an Empirical Benchmark Analysis

by Demetris Trihinas 1,*, Panagiotis Michael 1 and Moysis Symeonides 2
1 Department of Computer Science, School of Sciences and Engineering, University of Nicosia, Nicosia CY-2417, Cyprus
2 Department of Computer Science, University of Cyprus, Nicosia CY-2109, Cyprus
* Author to whom correspondence should be addressed.
Future Internet 2024, 16(12), 468; https://doi.org/10.3390/fi16120468
Submission received: 8 November 2024 / Revised: 6 December 2024 / Accepted: 10 December 2024 / Published: 13 December 2024

Abstract

With generative Artificial Intelligence (AI) capturing public attention, the appetite of the technology sector for larger and more complex Deep Learning (DL) models is continuously growing. Traditionally, the focus in DL model development has been on scaling the neural network’s foundational structure to increase computational complexity and enhance the representational expressiveness of the model. However, with recent advancements in edge computing and 5G networks, DL models are now aggressively being deployed and utilized across the cloud–edge–IoT continuum for the realization of in situ intelligent IoT services. This paradigm shift creates a growing need for AI practitioners to focus on inference costs, including latency, computational overhead, and energy efficiency, a focus that is long overdue. This work presents a benchmarking framework designed to assess DL model scaling across three key performance axes during model inference: classification accuracy, computational overhead, and latency. The framework’s utility is demonstrated through an empirical study involving various model structures and variants, as well as publicly available datasets for three popular DL use cases covering natural language understanding, object detection, and regression analysis.

1. Introduction

Deep learning (DL) has emerged as an influential tool that drives innovation efforts for the advancement of mission-critical applications such as the prediction of environmental phenomena [1], object detection and classification [2], natural language understanding and generation [3], and autonomous navigation and steering [4]. DL is inspired by how the human brain is structured and functions, with the foundation and premise attributed to Deep Neural Networks (DNNs). In DNNs, multiple layers of interconnected artificial neurons form a network of hierarchies aimed at enhancing the network’s ability to automatically derive low-dimensional representations from unstructured and high-dimensional data, such as human-written text and images [5].
With the last decade featuring several breakthroughs in the area of generative AI, such as Generative Adversarial Networks (GANs) [6], transformers [7], and Large Language Models (LLMs) [8,9], model scaling is gaining significant interest. Model scaling refers to the expansion of the underlying structure of the DNN [10]. Early research advocated adding more layers of neurons to DNNs, leading to increased network depth [11]. By scaling the DNN, users can select among several model variants to enhance utility for different application aspects and deployment sites (i.e., data centers or edge computing) [12]. More recent examples have suggested increasing multiple dimensions of the DNN structure together, such as the popular EfficientNet model architecture for object detection that features eight model variants with different network depths, widths, and input resolutions [13]. However, as the underlying DNN structure expands to increase the representational power of the DNN, improve accuracy, and generalize better over newly introduced tasks, so does the computational complexity of the network [14]. A recent study by OpenAI revealed that since 2012, the amount of computational effort required by DL models has increased exponentially, doubling every 4 months [15]. Nonetheless, while IoT and edge computing hardware are vastly improving, they are not doing so at a rate capable of catering to the advancements in deep learning [16,17]. Hence, although the AI community was, in the past, focused solely on improving model accuracy metrics, a paradigm shift is now being observed where runtime performance during inference and energy efficiency are gaining industry interest [18,19].
This brings us to the focal point and motivation of our work. With so many different scaling strategies for DNNs, it becomes extremely difficult for users to select a model variant that will meet different QoS requirements. Models can scale in terms of depth, width, and clarity of input data, and different parameters can be set (e.g., batch size and training epochs). All these configurations can significantly impact the performance of a DL application. The majority of existing works that empirically evaluate model scaling either focus on a single model structure [13,20] or introduce findings tailored solely to evaluating accuracy and overhead during the model training stage [21,22]. We deviate from the norm and focus on the impacts of different model scaling strategies during runtime inference. Specifically, we focus on three key performance axes: classification accuracy, computational overhead, and latency. By offering a detailed examination of trade-offs during model inference, our work aims to guide AI practitioners in selecting and tuning DL models for efficient and effective deployment of intelligent IoT services in resource-constrained environments.
The main contributions of our work are summarized as follows:
  • We introduce a high-level description of a benchmarking framework for the assessment of the various trade-offs that occur when scaling the underlying network of a DL model across different dimensions (i.e., depth and width). This modular and extensive framework provides performance insights on the impact of DL model scaling during runtime inference in light of classification accuracy and model loss, computational overhead, and the model’s latency footprint.
  • We demonstrate the utility of our framework by introducing an empirical study evaluating DL model scaling strategies and their impacts on the QoS of runtime inference. We employ several model structures from three popular DL application domains. For our work, we utilize publicly available and pre-trained model architectures (BERT, EfficientNet, and MLP), and for inference workloads, we employ openly available datasets (i.e., GLUE-MRPC and ImageNet).
The rest of this article is structured as follows: Section 2 presents a brief overview of model scaling for DNNs as the premise of our work. Section 3 elaborates on the benchmark methodology employed for the evaluation and empirical study. Section 4 introduces three different DL use cases and the datasets used during validation. Section 5 documents the performance metrics used during the empirical study, while Section 6 presents a comprehensive overview of the results. Section 7 provides an overview of recent related works, while Section 8 concludes the article and outlines future work.

2. Background

The following subsections introduce an overview of the architecture pattern encompassing deep neural networks (DNNs) and the premise of model scaling for DNNs.

2.1. Deep Neural Networks

Deep learning is synonymous with many-layered artificial neural networks that are well-recognized and referred to as deep neural networks (DNNs). It is in these networks that data are hierarchically represented, achieving excellent results in high-level abstraction and intricate pattern recognition when working with tasks connected to unstructured data such as images, audio, and text [23].
An exemplary image of a DNN is depicted in Figure 1. The key constituents of deep learning are neurons, layers, weights, biases, and activation functions. Neurons are the basic units that process input and produce output through an activation function. Layers are collections of neurons and can be classified into three types: input layers, hidden layers, and output layers. The input layer accepts the initial data, such as pixel values in image data or word embeddings in text, with its neurons representing features of the data. Hidden layers process the flow of data by identifying complex patterns. Neural networks with one hidden layer are called shallow neural networks (SNNs), and networks with more than one hidden layer are typically called deep neural networks (DNNs) [24]. The neurons within layers receive inputs, process them, and transmit information to successive layers. The output layer delivers the results, such as class labels in a classification task or numerical values in regression, with the number of neurons representing the number of categories or outputs. Weights and biases are parameters that the model learns during training so that it makes accurate predictions. Activation functions introduce non-linearity into the network for the learning of complex patterns.
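The constituents described above can be illustrated with a minimal forward pass through a small fully connected network. This is only a sketch: the function names (`leaky_relu`, `dense`, `forward`), the weights, and the biases are hypothetical values chosen for illustration, and the leaky ReLU activation is picked simply because it is the one used later for the MLP use case.

```python
def leaky_relu(x, alpha=0.01):
    """Activation function: identity for positive inputs, small slope otherwise."""
    return x if x > 0 else alpha * x


def dense(inputs, weights, biases, activation=None):
    """One fully connected layer; weights[j][i] links input i to neuron j."""
    outputs = []
    for w_row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(w_row, inputs)) + b  # weighted sum plus bias
        outputs.append(activation(z) if activation else z)
    return outputs


def forward(x, layers):
    """Propagate a data point through the layers in order."""
    for weights, biases, activation in layers:
        x = dense(x, weights, biases, activation)
    return x


# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
layers = [
    ([[0.5, -1.0], [1.0, 1.0]], [0.0, -1.0], leaky_relu),  # hidden layer
    ([[1.0, 2.0]], [0.5], None),                           # output layer (regression)
]

prediction = forward([1.0, 2.0], layers)
print(prediction)
```

Adding entries to `layers` corresponds to scaling depth, while adding rows to a layer's weight matrix corresponds to scaling width.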

2.2. Model Scaling for Deep Neural Networks

Model scaling is the method where the capacity of the DNN is enlarged by expanding the underlying architectural structure to meet the varying requirements of different workloads [20,25].
Structure-wise, the baseline of a neural network can be expanded as follows:
  • In terms of depth by adding more layers to capture a richer set of features and generalize when introduced to new tasks;
  • In terms of width by embedding more neurons per layer to capture more fine-grained features and intricate relationships within the data;
  • In terms of data clarity (i.e., resolution for images) to potentially capture more fine-grained patterns in the given input.
An example of a neural network model structure that can be scaled depth-wise to introduce new variants is the popular Residual Network (ResNet) family [11]. The naming convention of a ResNet (e.g., ResNet-50 or ResNet-152) includes a number denoting the number of layers in the network.
While scaling towards one dimension can increase model accuracy, as a network deepens (or widens), training speed is impacted due to the vanishing gradient problem, where convergence slows down as the gradients become very small while being propagated back through the layers [26]. To compensate, compound scaling is considered the norm today, where multiple dimensions are scaled together. An example of a family of neural networks adopting compound scaling is EfficientNet models for object detection, where depth, width, and input resolution are uniformly scaled based on a compound coefficient. By doing so, EfficientNet models can achieve accuracy similar to that of other DNNs (e.g., ResNets) while utilizing fewer computational resources during model training and extracting meaningful cost savings [13].
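For reference, the compound scaling rule of EfficientNet [13] can be written compactly as below. Here, the compound coefficient φ is chosen by the user according to the available resources, while the constants α, β, and ρ are determined once by a small grid search on the baseline network. The symbol ρ is used for the resolution base as a notational assumption of this sketch, so that it does not clash with the model complexity symbol γ used in Section 5.4.

```latex
\begin{aligned}
\text{depth:}      \quad & d = \alpha^{\phi} \\
\text{width:}      \quad & w = \beta^{\phi} \\
\text{resolution:} \quad & r = \rho^{\phi} \\
\text{subject to}  \quad & \alpha \cdot \beta^{2} \cdot \rho^{2} \approx 2,
                           \qquad \alpha \ge 1,\ \beta \ge 1,\ \rho \ge 1
\end{aligned}
```

Because the FLOPs of a convolutional network grow roughly in proportion to d · w² · r², the constraint means that increasing φ by one approximately doubles the computational cost.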

3. DNN Model Scaling Benchmark Framework

To perform the benchmarking of a diverse set of DNN use cases, we created an automated process where a user needs to specify a dataset, the selected DNN use case, and the user’s preferences about the model scaling parameters.
Figure 2 depicts a high-level overview of our pipeline. The framework provides a lightweight Python library through which users can submit their preferences by either filling in the library’s programming interfaces or by passing their parameters written in a configuration file (in YAML format).
An example of the configuration file is depicted in Listing 1. In this example, the user selects MLP as the DNN use case (use_case). Then, the user must define the parameter space (parameter_space) of the respective use case, which includes the available parameters, along with the selected evaluation values. In this example, the user sets three different values for network depth (network_depth: [4, 5, 7]), two values each for the network width and batch size parameters (network_width: [32, 64]; batch_size: [32, 128]), and three values for epochs (epochs: [1, 3, 5]). The system generates all possible combinations of these parameter values (3 × 2 × 2 × 3), resulting in a search space of 36 unique runs for this use case.
Next, the monitoring metrics that the system should expose are selected. The available metrics include computational_overhead, accuracy, and latency. We then set a number of execution-related parameters, including the boolean parameter training, which specifies whether the system retrains the model; dataset, which indicates the dataset location; and times, which defines the number of experimental executions. It is worth noting that how the dataset is loaded depends on the specific use-case implementation, so the dataset parameter could refer to either a file or a folder where the dataset resides (e.g., a folder of images for object detection and classification use cases).
After that, the framework generates a set of model scaling parameters based on user preferences and combines them with a specified ML use case. If users wish to create their own use cases, they should extend the DNN Model Scaling Benchmark programming interface and add them to the use-case repository. By default, our system includes a use-case repository with three predefined DL use cases—natural language understanding, regression analysis, and object detection and classification—each with its respective scaling parameters (see Section 4). With the parameters and chosen use case defined (via the configuration file or programmatically), the system generates a model whose structure reflects the specified model parameter set, resulting in a compiled model.
Listing 1. DNN parameters of the model scaling benchmark framework.
 use_case: MLP
 parameter_space:
    network_depth: [ 4, 5, 7 ]
    network_width: [ 32, 64 ]
    batch_size: [ 32, 128 ]
    epochs: [ 1, 3, 5 ]
 metrics:
    - computational_overhead
    - accuracy
    - latency
 training: True
 dataset: "..."
 times: 1
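The expansion of the parameter space of Listing 1 into individual runs can be sketched as follows. This is an illustration under stated assumptions, not the framework's actual implementation: the configuration is written as a plain Python dictionary instead of being parsed from YAML, and the helper name `expand_parameter_space` is hypothetical.

```python
from itertools import product

# Listing 1 expressed as a dictionary (assumption: the framework would
# normally load this from the YAML configuration file).
config = {
    "use_case": "MLP",
    "parameter_space": {
        "network_depth": [4, 5, 7],
        "network_width": [32, 64],
        "batch_size": [32, 128],
        "epochs": [1, 3, 5],
    },
    "metrics": ["computational_overhead", "accuracy", "latency"],
    "training": True,
    "times": 1,
}


def expand_parameter_space(parameter_space):
    """Return one dict per run, covering the Cartesian product of all values."""
    names = list(parameter_space)
    value_lists = [parameter_space[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


runs = expand_parameter_space(config["parameter_space"])
print(len(runs))  # 3 * 2 * 2 * 3 = 36
print(runs[0])
```

Each dictionary in `runs` corresponds to one model parameter set that the pipeline would compile, optionally train, and evaluate.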
The next phase of our pipeline is execution. This phase has two steps, namely training and inference. Specifically, the pipeline splits the dataset into training and testing subsets. This split is configurable with default values of 80% for training and 20% for the testing set. After that, the pipeline performs the training, updating the weights of the compiled model. We should note that the training step is optional, since the users may introduce their own pre-trained models. Next, the trained model is utilized to perform inference on the testing subset, extracting the performance metrics, namely inference latency, accuracy/error, and computational overhead. Since the test subset is already annotated, we can easily extract accuracy for classification tasks or error for regression tasks. For the inference latency, the framework keeps the timestamp before the inference process starts and the timestamp when it is over, subtracts them, and divides by the number of data points. Finally, for the computational overhead, our system extracts the number of Floating-Point Operations (FLOPs) from the compiled ML model. A more detailed description of the metrics can be found in Section 5. The pipeline stores the extracted metrics in metric storage; then, the system selects another set of model parameters and repeats the process until there are no more parameter sets to be examined.
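A minimal sketch of the dataset split and the latency computation described above is given below. The names `split_dataset`, `mean_inference_latency`, and `toy_model` are hypothetical, and the toy model is a stand-in for a compiled DNN; the actual framework extracts latency through its own monitoring probe.

```python
import random
import time


def split_dataset(samples, train_fraction=0.8, seed=0):
    """Shuffle and split into training and testing subsets (default 80/20)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def mean_inference_latency(predict, test_inputs):
    """Timestamp before and after inference, subtract, divide by data points."""
    start = time.perf_counter()
    predictions = [predict(x) for x in test_inputs]
    elapsed = time.perf_counter() - start
    return elapsed / len(test_inputs), predictions


def toy_model(x):
    """Hypothetical stand-in for a trained model's predict function."""
    return 2.0 * x + 1.0


train_set, test_set = split_dataset(range(100))
latency, predictions = mean_inference_latency(toy_model, test_set)
print(len(train_set), len(test_set))
print(f"mean latency per data point: {latency:.9f} s")
```

`time.perf_counter` is used here because it is a monotonic, high-resolution clock, which makes it better suited to measuring short intervals than wall-clock time.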
Once all model parameter sets (and trials) for a given ML use case have been evaluated, users can conduct a performance trade-off evaluation. This step enables the system to automatically facilitate various post-processing tasks, such as fitting trend lines, plotting performance metrics, detecting outliers, and generating the respective reports. Additional post-execution metrics are also made available, such as model complexity ( γ ), which reflects the model’s FLOPs relative to the most complex model configuration in the use case. Thus, users can intuitively analyze the effects of different parameter configurations, observing trends across a range of ML parameters and model structures. By visualizing these trade-offs, users gain insights into the balance between computational demands, accuracy, and other performance metrics, supporting informed decision making for optimal model selection in alignment with their specific goals and constraints.

4. DL Use Cases and Validation Datasets

Our DL repository comprises three use cases originating from different application domains and covering three out of the seven important domains for advancing AI as identified by MLCommons [27]. All use cases employ open model structures with publicly available variants, and the datasets utilized for the inference workloads are also publicly available. Moreover, we note that although, inherently, any set of values can be used to alter the problem dimensions of a model structure, we embrace the reference value sets suggested by the model designers (e.g., Google Research) that lead to optimal fine tuning of the model accuracy for a given computational budget. These configurations enable all experimental results of the empirical analysis to be realistic, reproducible, and verifiable. Table 1 provides a summary of the DNN model architectures for the DL use cases, as well as their scaling dimensions and their parameterization, for quick reference.

4.1. Natural Language Understanding

This use case involves natural language understanding, where the DL application setup is responsible for processing a series of text-based inference tasks using a language model to semantically interpret the provided text. The chosen model architecture for this use case is the well-known Bidirectional Encoder Representations from Transformers (BERT) model. BERT, as a model structure, was first introduced in 2019 by Google and is considered one of the most novel model structures for NLP tasks, language reasoning, and conversational AI. In brief, BERT is structured with stacked quantized transformer encoder layers, where each layer comprises self-attention mechanisms and feed-forward neural networks [3]. BERT was intentionally designed to pre-train deep bidirectional representations, adopting a large English text corpus (i.e., Wikipedia), by jointly conditioning on both left and right contexts in all layers [28]. With this, BERT models can capture intricate patterns and dependencies in text-based datasets, making them powerful for understanding the nuances of language.
For the empirical study, we utilize 20 BERT model variants that have been introduced by Google Research (https://github.com/google-research/bert/ accessed on 7 November 2024) and referenced as suitable for edge computing. These models vary in network depth (D = {2, 4, 6, 8, 10}) and hidden embedding sizes (H = {128, 256, 512, 768}). For the workload of inference tasks, we use the widely recognized GLUE-MRPC dataset released by Microsoft Research (https://www.microsoft.com/en-us/download/details.aspx?id=52398 accessed on 7 November 2024), with the validation set containing 1125 sentence pairs automatically extracted from online news sources, along with human annotations indicating whether the sentences in each pair are semantically equivalent (paraphrase).

4.2. Object Detection and Classification

This use case centers on object detection, where the DL application setup is configured to handle a series of inference tasks to detect objects within a set of given images and classify the objects by assigning appropriate labels from a pre-defined label set. The chosen model architecture for this purpose is the widely used EfficientNet convolutional neural network [13]. EfficientNet was first introduced in 2019 by Google AI to provide object detection for mobile and IoT services run at the network edge. For EfficientNet, the baseline network (denoted as B0) exploits a multi-objective neural architecture search that optimizes for both accuracy and FLOPs by adopting a hierarchy of mobile inverted bottleneck convolution layers and the use of squeeze-and-excitation optimization to ultimately enhance representational capacity [29]. However, the most popular feature of EfficientNet is the introduction of compound scaling, where the network structure can uniformly scale in depth, width, and resolution by taking advantage of a fixed set of scaling coefficients tailored to increase the network’s computational efficiency without impacting representational accuracy.
The publicly released variants of EfficientNet are code-named Bx, with x denoting the model complexity rank (B0 to B7) and each varying in network depth, width, resolution, and dropout rate. For this empirical study, we utilize the first six models (B0 to B5), which are all openly available in the Keras pre-trained model library (https://keras.io/api/applications/efficientnet/ accessed on 7 November 2024). We note that, as shown in Table 1, for EfficientNet, the value of each dimension is expressed as a coefficient factor of the base variant (B0), and the six variants are formed by taking the i-th value from each dimension to form its dimensional value set. As an example, B4 takes values of D = 1.8, W = 1.4, R = 380, and O = 0.4. For the inference workload, we use images extracted from the well-known and publicly available Tiny ImageNet dataset hosted on Hugging Face (https://huggingface.co/datasets/zh-plus/tiny-imagenet accessed on 7 November 2024), which includes a validation set of 10,000 color images (64 × 64 pixels) categorized into 200 label classes.

4.3. Regression Analysis for Predicting Numerical Outcome

For this use case, we tested 120 Multi-Layer Perceptron (MLP) model variants. In brief, MLPs are fully connected feed-forward artificial neural networks. As MLPs are fully connected, each neuron in layer D − 1 is connected to every neuron in the subsequent layer (D). For this use case, the MLP networks’ goal is to ingest data points and, through a regression analysis, predict the numerical value of a target variable. A leaky Rectified Linear Unit (ReLU) is embraced for the neuron activation function. These 120 different model configurations differ by network depth (D = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}), neurons per layer (W = {32, 64, 128}), batch size (B = {32, 64, 128, 256}), and epochs (E = {1, 3, 5}).
As our inference workload, we use the open and publicly available California Housing dataset from StatLib provided via Keras (https://keras.io/api/datasets/california_housing/ accessed on 7 November 2024). Specifically, the dataset captures 20K samples, with each composed of eight attributes (e.g., house age and number of rooms), and the target variable for reference is the median house value per California district, expressed in USD.

5. Benchmark Framework Metrics for Comparison

The benchmark framework supports the evaluation of the reference DL use cases against three performance axes, all evaluated during model inference tasks. Specifically, all use-case model variants are evaluated in terms of accuracy, computational overhead, and inference latency.
The following details how each metric is measured.

5.1. Classification Accuracy and Prediction Error During Inference

For use cases 1 and 2, accuracy is measured as the classification accuracy (CA) of the DL classification tasks per use case. In brief, classification accuracy is defined as the ratio of the number of correct predictions to the total number of predictions, where correct predictions include true positives (TP) and true negatives (TN), and incorrect predictions include false positives (FP) and false negatives (FN), as shown below:
$$CA = \frac{TP + TN}{TP + TN + FP + FN}$$
For use case 3, where regression is employed, prediction error is measured by computing the Mean Squared Error (MSE) of the predictions output by the regression analysis of the model variant. Specifically, the MSE computes the average squared difference between the observed (y) and predicted (ŷ) values of a sample comprising N values, as shown below:
$$MSE = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$
For the comparison of the model variants in terms of accuracy, the outcome for each input data point of the validation dataset is recorded during runtime inference. Upon the completion of the experimental run, the accuracy metric per model variant is computed and can be used to compare against experimental runs utilizing other model variants.
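The two quality metrics of this subsection can be computed from the recorded outcomes as sketched below. The function names and the sample labels and values are hypothetical. The binary formulation with TP, TN, FP, and FN matches a two-class task such as paraphrase detection; for a multi-class task, the same ratio of correct predictions to total predictions applies.

```python
def classification_accuracy(y_true, y_pred, positive=1):
    """CA = (TP + TN) / (TP + TN + FP + FN) for a binary classification task."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred != positive and truth != positive:
            tn += 1
        elif pred == positive:
            fp += 1
        else:
            fn += 1
    return (tp + tn) / (tp + tn + fp + fn)


def mean_squared_error(y_true, y_pred):
    """MSE = (1/N) * sum of squared differences between observed and predicted."""
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n


# Hypothetical recorded outcomes for illustration only.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
predicted_labels = [1, 0, 0, 1, 0, 1, 1, 0]
print(classification_accuracy(labels, predicted_labels))  # 6 correct out of 8

observed = [3.0, 5.0, 2.5]
predicted = [2.5, 5.0, 4.0]
print(mean_squared_error(observed, predicted))  # (0.25 + 0 + 2.25) / 3
```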

5.2. Computational Overhead During Inference

The computational overhead of a DL model is attributed to the processing effort required by the model to perform the requested inference task (e.g., classification or regression) and output a result for a given data point. To measure the computational overhead imposed by a DL model variant during an experimental run, our benchmark framework counts floating-point operations, denoted as FLOPs. In brief, FLOPs measure the total number of mathematical operations involving floating-point values across the computational path of the DNN during runtime inference to output a prediction after receiving a new data point as input. FLOPs are a key metric for evaluating computational efficiency for DL models, as they provide measurable and hardware-independent results suitable for comparison among DL model variants [30]. To measure the FLOPs for each model variant, our benchmark framework utilizes the open and popular flopth (https://github.com/vra/flopth accessed on 7 November 2024) Python library, which supports the measurement of the FLOPs performed during DNN inference.
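To give an intuition of how FLOPs grow with depth and width, the sketch below counts them by hand for a fully connected network. It does not use or reproduce the flopth API. It assumes the common convention that a dense layer with n_in inputs and n_out outputs costs about 2 · n_in · n_out operations (multiplications plus additions, including the bias), and it ignores activation functions; profiling tools may follow different conventions, so absolute numbers can differ. The function names are hypothetical, and the eight input features mirror the California Housing attributes.

```python
def dense_layer_flops(n_in, n_out):
    """Per output neuron: n_in multiplications, n_in - 1 additions, 1 bias addition."""
    return 2 * n_in * n_out


def mlp_flops(n_features, depth, width, n_outputs=1):
    """FLOPs for one forward pass through `depth` hidden layers of `width` neurons."""
    sizes = [n_features] + [width] * depth + [n_outputs]
    return sum(dense_layer_flops(a, b) for a, b in zip(sizes, sizes[1:]))


# Doubling the width roughly quadruples the cost of the hidden-to-hidden layers,
# whereas adding a layer adds a fixed cost of 2 * width * width.
print(mlp_flops(n_features=8, depth=4, width=32))  # 6720
print(mlp_flops(n_features=8, depth=4, width=64))  # 25728
print(mlp_flops(n_features=8, depth=5, width=32))  # 8768
```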

5.3. Inference Latency

The third performance evaluation metric of our benchmark framework is inference latency. Inference latency for DL models is defined as the total amount of time required by the DNN variant, after receiving a new data point as input, to perform the necessary (mathematical) operations imposed throughout the computational path of the layered neural network and output a prediction. A customized monitoring probe has been created to extract the inference latency from open and popular ML backends, including TensorFlow and PyTorch, as used in the DL use cases.

5.4. Model Complexity

To ease the depiction of the results and to easily identify models among a large corpus of variants per use case, we introduce a new metric definition. Model complexity, denoted as γ ∈ (0, 1], is defined as the proportional difference of a model variant relative to the most complex model (m_N) under consideration, for which γ = 1.0.
For instance, assuming a set of N models denoted as M, where all variants differ only by network depth (layers), then if m_N has a network depth of 10 layers, a model m_i ∈ M with γ = 0.3 has a depth of three layers. For the introduced use cases where multiple DNN dimensions can be scaled, a normalized γ is computed and used as a reference for each model variant. For this, first, each scaling dimension is normalized; next, all normalized dimensions are multiplied; and finally, the intermediate product is divided by the corresponding product of m_N.
As an example, the normalized model complexity vector for the EfficientNet variants of use case 2 is the following: γ = [0.070, 0.082, 0.160, 0.235, 0.597, 1.0]. With this vector, we can place variants in order and denote proportional differences to estimate the computational overhead and latency of inference, as the experimentation shows. Finally, we note that in all introduced plots, we employ the model complexity as the independent variable and show how accuracy, computational overhead, and latency are affected when scaling the model’s complexity.
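The computation of γ can be sketched as follows. Two assumptions are made here: each dimension is normalized by its maximum value across the variants (the text does not state which normalization is applied), and the per-variant depth, width, resolution, and dropout values are those of the publicly documented EfficientNet B0 to B5 configurations rather than values copied from Table 1. Under these assumptions, the sketch reproduces the complexity vector quoted above. The function name `model_complexity` is hypothetical.

```python
from math import prod


def model_complexity(variants):
    """Normalize each dimension, multiply, then divide by the most complex model."""
    maxima = [max(column) for column in zip(*variants)]
    products = [
        prod(value / maximum for value, maximum in zip(variant, maxima))
        for variant in variants
    ]
    reference = max(products)  # product of the most complex model m_N
    return [p / reference for p in products]


# Single-dimension example: depths 1..10, so a 3-layer model has gamma = 0.3.
depth_only = [(d,) for d in range(1, 11)]
print(round(model_complexity(depth_only)[2], 3))

# Assumed (depth, width, resolution, dropout) values for EfficientNet B0..B5.
efficientnet = [
    (1.0, 1.0, 224, 0.2),  # B0
    (1.1, 1.0, 240, 0.2),  # B1
    (1.2, 1.1, 260, 0.3),  # B2
    (1.4, 1.2, 300, 0.3),  # B3
    (1.8, 1.4, 380, 0.4),  # B4
    (2.2, 1.6, 456, 0.4),  # B5
]
gamma = model_complexity(efficientnet)
print([round(g, 3) for g in gamma])
```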

6. Evaluation

In this section, we use our benchmark framework to conduct an empirical analysis to understand various trade-offs during runtime inference and assess their impacts on the introduced DL use cases.

6.1. Testbeds

For experimentation, we opted for two testbeds: a public and a private cloud infrastructure. Our aim is to enhance the replication of the empirical study so that the results from DL use-case benchmarking can be reproduced and verified by anyone interested in this study.
To this end, all experiment runs for the BERT (UC1) and MLP-Regression (UC3) use cases were executed on a testbed supported by Google Colaboratory (https://colab.google/ accessed on 7 November 2024). Specifically, this testbed provides the benchmark framework with a dedicated DL execution environment featuring 16 GB of memory for model storage and performance of the computational experimentation, as well as an Nvidia T4 GPU featuring 2560 CUDA cores and 320 Turing tensor cores that are specialized for neural network training and inference. In turn, the experimental runs for the EfficientNet use case (UC2), which requires significant memory for both the model variants and the ImageNet workload, were run on an HP ProLiant DL380 G9 server with an Intel Xeon E5-2680 processor embedding 44 cores clocked at 2.50 GHz and 176 GB memory. Finally, we note that all model variants embraced for UC1 and UC2 are the pre-trained versions of these models provided by Google Research and made publicly available in the HuggingFace and Keras repositories, respectively (see Section 4.1 and Section 4.2). This enables the verification of the study findings (e.g., classification accuracy) with minimal human effort and minimal computational overhead needed to reproduce the experimental settings.

6.2. Experimental Analysis

Figure 3, Figure 4 and Figure 5 depict the results of the empirical study, in which several model variants per DL use case were run and evaluated during runtime inference. It should be noted that all plots are presented with the x-axis (independent variable) capturing the normalized complexity of each model variant as introduced in Section 5.4.

6.3. Inference Quality

Let us start the discussion of the empirical study with the plots presented in Figure 3. These plots depict the classification accuracy (UC1 and UC2) and MSE (UC3) for the introduced model variants for each use case. First, it is evident that in all use cases, given enough data during training, quality can be improved by increasing the complexity of the model. In the case of the BERT (Figure 3a) and EfficientNet (Figure 3b) models, the classification accuracy increases, while in the case of the MLP regression models (Figure 3c), the mean square error decreases. Hence, scaling the DNN structure’s dimensionality (e.g., network depth and width) by creating more computationally complex variants is beneficial for improving the inference quality during runtime inference.
The second insight from the empirical study is that, for all use cases, the quality gains obtained from scaling plateau at some point. An illustrative example is provided by the plots in Figure 3a,b. Scaling the EfficientNet-B0 structure by the first compound scaling coefficient increases the model’s classification accuracy by more than 2.2%, but scaling from the B4 variant to the B5 variant yields accuracy gains of less than 0.5%. This can also be confirmed in the MLP-Regression plot (Figure 3c), where a further analysis with a Ramsey RESET test (p-value = 0.000826) establishes the non-linear nature of the error and the diminishing error reduction.
Third, as is evident in the MLP-Regression plot (Figure 3c), the global minimum is achieved for γ = 0.6, and afterwards, more complex models present a slight increase in their reported MSE values. This occurs because, as the DNN model structure scales and becomes too complex, it overfits to the randomized effects that are present only in the specific (California Housing) dataset used for training. In this regard, although the expressive power of a model depends to some extent on model complexity, a mathematical expression for estimating inference quality irrespective of the DL use-case domain does not exist [31]. This is inherently due to the phenomena introduced in the second (learning plateau) and third (overfitting) insights.

6.4. Computational Overhead of Inference

The plots in Figure 4 depict the effect of model scaling on the computational overhead imposed on the underlying model-serving infrastructure during runtime inference for each of the referenced DL use cases.
From these plots, it is immediately evident that when the model complexity increases, so does the computational overhead. Moreover, the model complexity and the computational overhead present an almost linear relationship. For example, in the BERT case (Figure 4a), the coefficient of determination (R squared) is R² = 0.913, and for the EfficientNet use case (Figure 4b), R² = 0.984. These values indicate a well-fit linear relationship. Furthermore, for the MLP-Regression case (Figure 4c), we observe two distinct phases of linearity, the second of which presents a higher linear coefficient than the first phase.
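To make the reported coefficients of determination concrete, the sketch below fits a least-squares line and computes R² for a set of (complexity, overhead) pairs. It is an illustration under stated assumptions: the data points are hypothetical, and the function names are introduced here and are not part of the benchmark framework.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(x, y):
    """Coefficient of determination of the least-squares line through (x, y)."""
    slope, intercept = linear_fit(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical complexity/overhead pairs (arbitrary units), for illustration only.
complexity = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
gflops = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]

print(f"R^2 = {r_squared(complexity, gflops):.3f}")
```

For a curve with two linear phases, such as the MLP-Regression case, one option is to call `linear_fit` on each segment separately and compare the two slopes.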

6.5. Inference Latency

The plots in Figure 5 depict the effect of model scaling on the latency observed by users submitting inference tasks to the DL model serving the application during runtime inference for each of the referenced DL use cases. In these plots, we observe that the linear relationship between model complexity and inference latency still holds, especially for the BERT (Figure 5a) and EfficientNet (Figure 5b) use cases, although the fit is weaker. The coefficients of determination for these two use cases are R² = 0.871 and R² = 0.972, respectively. The reduction can be attributed to the fact that although latency is highly correlated with the computational overhead (FLOPs) of the DNN required to output a result, other factors contribute to increased latency. Specifically, latency also depends on the data pre-processing needed to shape the input data points to meet the processing requirements of the DNN, and on DNN optimizations that can affect memory access patterns. While these factors can (slightly) influence the overall latency, the inference delay observed in the MLP-Regression use case (Figure 5c) is highly affected by overfitting after the global minimum is achieved (γ = 0.6), distorting the depicted linear relationship between model complexity and latency (R² = 0.597).
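Since latency includes pre-processing in addition to the DNN forward pass, it can help to time the two stages separately. The sketch below shows one way to do this with the standard library; `preprocess` and `predict` are placeholder functions standing in for a real pipeline and model, and the repetition count is an arbitrary assumption.

```python
import time


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def preprocess(raw):
    """Placeholder pre-processing: scale raw values into [0, 1]."""
    peak = max(raw) or 1
    return [v / peak for v in raw]


def predict(features):
    """Placeholder model: a weighted sum standing in for a DNN forward pass."""
    return sum(w * f for w, f in enumerate(features, start=1))


def measure(raw, repeats=20):
    """Average pre-processing and inference time over several repetitions."""
    prep_total = infer_total = 0.0
    for _ in range(repeats):
        features, t_prep = timed(preprocess, raw)
        _, t_infer = timed(predict, features)
        prep_total += t_prep
        infer_total += t_infer
    return {"preprocess_s": prep_total / repeats, "inference_s": infer_total / repeats}


timings = measure(list(range(1, 10_001)))
timings["total_s"] = timings["preprocess_s"] + timings["inference_s"]
```

Reporting the two components side by side makes it easier to see how much of the deviation from the FLOPs-based linear trend comes from work done outside the model.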
Finally, a key observation among the three performance axes must be made. A specific model complexity (e.g., γ = 0.2) and computational budget (in FLOPs) can be achieved by utilizing different scaling policies (i.e., depth only, compound coefficient on depth/width, etc.). For example, in the case of the BERT models, in which two dimensions are available for scaling (network layers (d) and hidden embedding sizes (h)), γ = 0.2 can be achieved with (d, h) pairs equal to (2, 768) and (6, 256). In the first case, the classification accuracy is 74.2% and the latency is 2.2 s, while in the second case, the accuracy is 77.8% and the latency is 2.79 s. Therefore, model scaling can have a significantly different effect on the classification accuracy and inference latency. Hence, benchmarking various model scaling strategies to evaluate the performance outcome during DL model inference is of utmost importance when striving to achieve cost savings in geo-distributed deployments and under resource constraints.
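The trade-off between the two equally complex BERT variants can be expressed as a simple selection rule: pick the most accurate variant that still meets a latency budget. The sketch below uses the accuracy and latency figures quoted above; the `Variant` class, the `best_within_budget` helper, and the two budgets are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    depth: int        # number of network layers (d)
    hidden: int       # hidden embedding size (h)
    accuracy: float   # classification accuracy in percent
    latency_s: float  # inference latency in seconds


def best_within_budget(variants, max_latency_s):
    """Return the most accurate variant whose latency fits the budget, or None."""
    feasible = [v for v in variants if v.latency_s <= max_latency_s]
    return max(feasible, key=lambda v: v.accuracy, default=None)


# The two BERT variants with gamma = 0.2 reported in the text above.
same_complexity = [
    Variant(depth=2, hidden=768, accuracy=74.2, latency_s=2.2),
    Variant(depth=6, hidden=256, accuracy=77.8, latency_s=2.79),
]

tight = best_within_budget(same_complexity, max_latency_s=2.5)  # hypothetical budget
loose = best_within_budget(same_complexity, max_latency_s=3.0)  # hypothetical budget
```

Under the tighter budget, only the shallow and wide variant qualifies, whereas the looser budget admits the deeper and narrower variant with the higher accuracy.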

7. Related Work

The following introduces relevant scientific works split into two subsections, the first of which briefly discusses DL model scaling and the second of which introduces empirical evaluations in the field of DL model scaling.

7.1. DL Model Scaling

Several DL model scaling techniques have been proposed throughout the last decade, according to which the underlying structure of the DNN can be expanded in terms of either depth [11] or width [26], or can adopt a higher resolution for the provided input [32]. However, as the DNN structure grows along only a single dimension (e.g., depth), the computational costs become prohibitive, and model quality can be impacted as well due to parameter redundancy [33]. To overcome these challenges, Tan et al. [13] and Han et al. [34] proposed the use of compound scaling with the introduction of the EfficientNet and TinyNet models, respectively. Compound scaling, as previously described, suggests that the DNN structure can expand uniformly along multiple dimensions by embracing a suitable scaling coefficient. With this, DNNs can achieve accuracy similar to that of other (competitive) DNNs while utilizing fewer computational resources during model training and, thus, extracting meaningful cost savings.
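As a rough illustration of why compound scaling has to balance its dimensions, the sketch below estimates the FLOPs of each EfficientNet variant relative to the baseline using the rule of thumb that convolutional FLOPs grow linearly with depth and quadratically with width and input resolution. The coefficients are the ones listed in Table 1; the `relative_flops` helper is an assumption introduced here, and real FLOP counts will differ because layer and channel counts are rounded in practice.

```python
# Coefficients listed in Table 1 for the six EfficientNet variants.
DEPTH = [1.0, 1.1, 1.2, 1.4, 1.8, 2.2]
WIDTH = [1.0, 1.0, 1.1, 1.2, 1.4, 1.6]
RESOLUTION = [224, 240, 260, 300, 380, 456]


def relative_flops(depth, width, resolution, base_resolution=224):
    """Approximate FLOPs relative to the baseline: d * w^2 * r^2 (rule of thumb)."""
    r = resolution / base_resolution
    return depth * width ** 2 * r ** 2


estimates = [relative_flops(d, w, res) for d, w, res in zip(DEPTH, WIDTH, RESOLUTION)]
for i, value in enumerate(estimates):
    print(f"variant {i}: ~{value:.2f}x baseline FLOPs")
```

The quadratic terms explain why scaling width or resolution alone quickly becomes expensive compared with a balanced increase across all dimensions.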

7.2. Empirical Evaluations of DL Model Scaling

The following presents both recent and notable scientific works introducing empirical evaluations for DL model scaling.
Lin et al. [20] investigated the pitfalls of one-dimensional model scaling for convolutional neural networks (CNNs) and proposed a scaling method for CNNs that utilizes dimensional relationship and runtime proxy constraints to improve accuracy and inference latency. In turn, Hestness et al. [35] introduced a study where they empirically projected the computational requirements for future DL applications based on the representational growth achieved (and targeted) by the AI community.
Dollár et al. [14] presented an evaluation study to demonstrate how different compound scaling strategies affect the model parameters; activations; and, consequently, training time for different CNN model structures, focusing primarily on EfficientNet and RegNet models. With a focus on language models that adopt a transformer-based architecture, Kaplan et al. [21] empirically showed that performance (measured by test loss) improves and follows a power-law relationship as the model size, dataset size, and availability of computational power are scaled during model training. When only one factor is scaled to achieve the referenced performance, the other two must not become bottlenecks. For optimal performance, the authors showed that all three factors must be scaled up in tandem.
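As a hedged illustration of the power-law form referred to above, the single-factor relationship can be written as follows, where L is the test loss, X stands for whichever factor is being scaled (model size, dataset size, or compute) while the other two are not bottlenecks, and X_c and α_X are fitted constants. The symbols are introduced here for illustration only, and no numeric values are implied.

```latex
L(X) \approx \left(\frac{X_c}{X}\right)^{\alpha_X}, \qquad \alpha_X > 0
\quad\Longrightarrow\quad
\log L(X) \approx \alpha_X \left(\log X_c - \log X\right)
```

On a log-log plot, such a relationship appears as a straight line with slope −α_X, which is how power-law behaviour is usually checked empirically.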
In a recent study, Bahri et al. [22] capitalized on the previously mentioned work to show how scaling the model and dataset size can be explained under variance-limited and/or resolution-limited scaling regimes. In the end, the authors provided a theoretical framework that can be used to estimate upper and lower bounds for model loss depending on the aforementioned scaling regimes. Furthermore, Aach et al. [36] presented an empirical analysis of three open and popular distributed deep learning frameworks to assess their performance and scalability during model training, with experiments using ResNet models and the ImageNet dataset as the examined workload. Finally, Wang et al. [37] introduced an empirical study that examined a large corpus of DL model variants, investigating unit test adoption and if/how metrics beyond accuracy are used to improve model robustness and reliability.
Based on the provided list of existing works empirically evaluating model scaling, it is obvious that the majority of the aforementioned studies either focus on a single model structure (e.g., ResNet, RegNet, or CNN models) and/or introduce findings that are tailored to solely evaluating accuracy and overheads during the model training stage. We deviate from the norm and focus on the impacts different model scaling strategies have during runtime inference. By offering a detailed examination of trade-offs during model inference, our work can contribute to guiding AI practitioners in selecting and tuning DL models for efficient and effective deployment of intelligent IoT services in resource-constrained environments. Moreover, we note that although a direct comparison against other empirical studies cannot be performed, as the body of related work focuses on performance and accuracy during the model training phase, several of our findings can be confirmed by the aforementioned studies. For example, Lin et al. [20] showcased that compound model scaling positively impacts classification accuracy for DNNs, while Bahri et al. [22] confirmed this but also noted (as we do) that the gains in terms of accuracy diminish and can plateau as a model attempts to scale using large compound coefficients. In turn, Dollár et al. [14] showed that although two model variants can present the same computational complexity, the output classification accuracy can (significantly) differ.

8. Conclusions and Future Work

This article presents an empirical benchmarking study focused on understanding the correlation between DL model scaling and three performance axes during runtime inference. These axes cover accuracy, computational effort, and latency. This distinguishes our work from other studies that solely studied model accuracy and training duration during the training phase for different model variants. To this end, we designed and presented a prototype of a DL model scaling benchmark framework, along with details about its execution and functionalities. Moreover, we introduced three diverse DL use cases from the domains of natural language understanding, object detection, and regression analysis. All three use cases embrace open DNN model architectures, multiple and different scaling techniques, and the use of popular datasets for workload generation.
With the three developed DL use cases, we then performed an empirical analysis with the use of our benchmark framework to discover various insights and trade-offs introduced across the three performance axes during runtime inference when examining multiple model variants per use case. Some key observations from the analysis are the following. Across all three DNN use cases, increasing model complexity generally led to improved inference quality (higher accuracy for classification tasks and lower MSE for regression tasks), but only up to a point. As model complexity continued to increase, the improvements began to plateau. Moreover, scaling the model complexity beyond a certain point, which is both use-case- and dataset-dependent, may lead to the model overfitting to the training data, capturing noise rather than meaningful patterns, which can degrade performance on new data. We also observed an almost linear relationship between model complexity and computational overhead (measured in FLOPs). In addition, the inference latency increased with model complexity, correlating with computational overhead, although the relationship was not perfectly linear due to additional factors like data pre-processing effort and memory access patterns influenced by model optimizations. Finally, our study highlights that models with the same complexity but different configurations can have significantly different impacts on accuracy and latency.
Our future work includes several key areas of development to enhance the capabilities and applicability of our benchmarking framework. First, we plan to introduce new use cases covering more real-world problems that will include more complex DNN models, such as LLMs or generative AI models in general. To that end, we plan to integrate our system with online model repositories (e.g., HuggingFace) in order to further automate our analysis. Second, we envision an extension of our monitoring subsystem to capture not only DNN performance metrics but also underlying utilization metrics, e.g., compute utilization (GPU/CPU), memory usage, disk I/O, etc. This will reveal hidden correlations between the structures of specific DNN models and their effects on the execution environment. Furthermore, our plans include the introduction of an intuitive user interface, along with a visualization library that will include advanced post-experimentation plotting and analysis functions. This new module will alleviate the difficulties faced by end users in extracting correlations between different parameters, finding the sweet spot for their deployment, and extracting useful insights. Finally, we would like to create a recommendation service that suggests (near-)optimal configurations for DNN-enabled applications without requiring every possible permutation to be tested. This service will be especially beneficial for beginners, providing a strong starting point to reduce evaluation time and help them choose the best-fit DNN model for their requirements.

Author Contributions

Conceptualization, D.T., M.S. and P.M.; methodology, D.T., P.M. and M.S.; software, P.M. and M.S.; validation, P.M.; writing—review and editing, M.S., D.T. and P.M.; project administration, D.T. All authors have read and agreed to the published version of the manuscript.

Funding

This work is part of AdaptoFlow, which is indirectly funded by the European Union’s Horizon Europe research and innovation action programme via the TRIALSNET Open Call issued and executed under the TrialsNet project (Grant Agreement no. 101017141).

Data Availability Statement

The study used open and publicly available pre-trained model variants for the DL use cases, and the workloads feature open and popular ML/DL datasets. Links to the model and data repositories are provided throughout the article. The data generated by the benchmark framework are publicly accessible in a GitHub repository available at the following link: https://github.com/unic-ailab/ModelScalingBench (accessed on 7 November 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Elbes, M.; AlZu’bi, S.; Kanan, T. Deep Learning-Based Earthquake Prediction Technique Using Seismic Data. In Proceedings of the 2023 International Conference on Multimedia Computing, Networking and Applications (MCNA), Valencia, Spain, 19–22 June 2023; pp. 103–108.
  2. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
  3. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2019, arXiv:1810.04805.
  4. Teichmann, M.; Weber, M.; Zöllner, M.; Cipolla, R.; Urtasun, R. MultiNet: Real-time Joint Semantic Reasoning for Autonomous Driving. In Proceedings of the 2018 IEEE Intelligent Vehicles Symposium (IV), Changshu, China, 26–30 June 2018; pp. 1013–1020.
  5. Ren, J.; Xia, F. Brain-inspired Artificial Intelligence: A Comprehensive Review. arXiv 2024, arXiv:2408.14811.
  6. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Commun. ACM 2020, 63, 139–144.
  7. Vaswani, A. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
  8. Radford, A. Improving language understanding by generative pre-training. 2018, Preprint.
  9. Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. Palm: Scaling language modeling with pathways. J. Mach. Learn. Res. 2023, 24, 1–113.
  10. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
  11. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  12. Touvron, H.; Bojanowski, P.; Caron, M.; Cord, M.; El-Nouby, A.; Grave, E.; Izacard, G.; Joulin, A.; Synnaeve, G.; Verbeek, J.; et al. ResMLP: Feedforward Networks for Image Classification With Data-Efficient Training. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 5314–5321.
  13. Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv 2020, arXiv:1905.11946.
  14. Dollár, P.; Singh, M.; Girshick, R. Fast and accurate model scaling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 924–932.
  15. Amodei, D.; Hernandez, D. AI and Compute. 2018. Available online: https://openai.com/research/ai-and-compute (accessed on 7 November 2024).
  16. Gujarati, A.; Elnikety, S.; He, Y.; McKinley, K.S.; Brandenburg, B.B. Swayam: Distributed autoscaling to meet SLAs of machine learning inference services with resource efficiency. In Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference (Middleware ’17), New York, NY, USA, 9 December 2017; pp. 109–120.
  17. Trihinas, D.; Symeonides, M.; Georgiou, J.; Pallis, G.; Dikaiakos, M.D. Energy-Aware Streaming Analytics Job Scheduling for Edge Computing. In Proceedings of the 2023 IEEE International Conference on Cloud Computing Technology and Science (CloudCom), Naples, Italy, 4–6 December 2023; pp. 161–168.
  18. Wu, C.J.; Raghavendra, R.; Gupta, U.; Acun, B.; Ardalani, N.; Maeng, K.; Chang, G.; Aga, F.; Huang, J.; Bai, C.; et al. Sustainable AI: Environmental Implications, Challenges and Opportunities. In Proceedings of the Machine Learning and Systems, Santa Clara, CA, USA, 29 August–1 September 2022; Marculescu, D., Chi, Y., Wu, C., Eds.; Volume 4, pp. 795–813.
  19. Trihinas, D.; Michael, P.; Symeonides, M. Towards Low-Cost and Energy-Aware Inference for EdgeAI Services via Model Swapping. In Proceedings of the 2024 IEEE International Conference on Cloud Engineering (IC2E), Paphos, Cyprus, 24–27 September 2024.
  20. Lin, C.; Yang, P.; Wang, Q.; Qiu, Z.; Lv, W.; Wang, Z. Efficient and accurate compound scaling for convolutional neural networks. Neural Netw. 2023, 167, 787–797.
  21. Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T.B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; Amodei, D. Scaling laws for neural language models. arXiv 2020, arXiv:2001.08361.
  22. Bahri, Y.; Dyer, E.; Kaplan, J.; Lee, J.; Sharma, U. Explaining neural scaling laws. Proc. Natl. Acad. Sci. USA 2024, 121, e2311878121.
  23. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016; Available online: http://www.deeplearningbook.org (accessed on 7 November 2024).
  24. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
  25. Zhai, X.; Kolesnikov, A.; Houlsby, N.; Beyer, L. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12104–12113.
  26. Zagoruyko, S. Wide residual networks. arXiv 2016, arXiv:1605.07146.
  27. MLcommons MLPerf Benchmarks. 2024. Available online: https://mlcommons.org/benchmarks/ (accessed on 7 November 2024).
  28. Devlin, J.; Chang, M.W. Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing. Google Res. 2019. Available online: https://research.google/blog/open-sourcing-bert-state-of-the-art-pre-training-for-natural-language-processing/ (accessed on 7 November 2024).
  29. Tan, M.; Chen, B.; Pang, R.; Vasudevan, V.; Sandler, M.; Howard, A.; Le, Q.V. Mnasnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2820–2828.
  30. Patterson, D.A.; Gonzalez, J.; Le, Q.V.; Liang, C.; Munguia, L.; Rothchild, D.; So, D.R.; Texier, M.; Dean, J. Carbon Emissions and Large Neural Network Training. arXiv 2021, arXiv:2104.10350.
  31. Raghu, M.; Poole, B.; Kleinberg, J.; Ganguli, S.; Sohl-Dickstein, J. On the expressive power of deep neural networks. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 2847–2854.
  32. Howard, A.G. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
  33. Bello, I.; Fedus, W.; Du, X.; Cubuk, E.D.; Srinivas, A.; Lin, T.Y.; Shlens, J.; Zoph, B. Revisiting ResNets: Improved Training and Scaling Strategies. Adv. Neural Inf. Process. Syst. 2021, 27, 22614–22627.
  34. Han, K.; Wang, Y.; Zhang, Q.; Zhang, W.; Xu, C.; Zhang, T. Model rubik’s cube: Twisting resolution, depth and width for TinyNets. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS ’20), Red Hook, NY, USA, 6–12 December 2020.
  35. Hestness, J.; Ardalani, N.; Diamos, G. Beyond Human-Level Accuracy: Computational Challenges in Deep Learning. In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, Washington, DC, USA, 16–20 February 2019.
  36. Aach, M.; Inanc, E.; Sarma, R.; Riedel, M.; Lintermann, A. Large scale performance analysis of distributed deep learning frameworks for convolutional neural networks. J. Big Data 2023, 10, 96.
  37. Wang, H.; Yu, S.; Chen, C.; Turhan, B.; Zhu, X. Beyond Accuracy: An Empirical Study on Unit Testing in Open-source Deep Learning Projects. ACM Trans. Softw. Eng. Methodol. 2024, 33, 1–22.
Figure 1. High-level overview of a deep neural network.
Figure 2. Pipeline of performance evaluation trade-offs.
Figure 3. Inference quality (classification accuracy and MSE) with respect to model complexity. The presented plots include: (a) BERT model variants, (b) EfficientNet model variants, and (c) MLP-Regression model variants.
Figure 4. Computational overhead of inference with respect to model complexity. The presented plots include: (a) BERT model variants, (b) EfficientNet model variants, and (c) MLP-Regression model variants.
Figure 5. Inference latency with respect to model complexity. The presented plots include: (a) BERT model variants, (b) EfficientNet model variants, and (c) MLP-Regression model variants.
Table 1. DNN model architectures and scaling dimensionalities for the DL use cases.

| DNN Model Architecture | No. of Variants | Scaling Dimensions | Experimental Value Sets |
|---|---|---|---|
| BERT | 20 | Network depth (D); Hidden embeddings (H) | D = {2, 4, 6, 8, 10}; H = {128, 256, 512, 768} |
| EfficientNet | 6 | Network depth (D); Network width (W); Input resolution (R); Dropout rate (O) | D = {1.0, 1.1, 1.2, 1.4, 1.8, 2.2}; W = {1.0, 1.0, 1.1, 1.2, 1.4, 1.6}; R = {224, 240, 260, 300, 380, 456}; O = {0.2, 0.2, 0.4, 0.4, 0.6, 0.6} |
| MLP | 120 | Network depth (D); Network width (W); Batch size (B); Epochs (E) | D = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}; W = {32, 64, 128}; B = {32, 64, 128, 256}; E = {1, 3, 5} |

For EfficientNet, values are expressed as coefficients relative to the baseline B0 model, and the i-th value of each dimension was paired together to produce the 6 variants.
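The difference between crossing and pairing the value sets in Table 1 can be sketched as follows. This is an illustration rather than the framework's own configuration code; the variable names and the B0 to B5 labels are assumptions introduced here.

```python
from itertools import product

# BERT: every (depth, hidden size) combination is a separate variant.
bert_depths = [2, 4, 6, 8, 10]
bert_hidden = [128, 256, 512, 768]
bert_variants = list(product(bert_depths, bert_hidden))

# EfficientNet: the i-th values of each dimension are paired, not crossed.
eff_depth = [1.0, 1.1, 1.2, 1.4, 1.8, 2.2]
eff_width = [1.0, 1.0, 1.1, 1.2, 1.4, 1.6]
eff_resolution = [224, 240, 260, 300, 380, 456]
eff_dropout = [0.2, 0.2, 0.4, 0.4, 0.6, 0.6]
efficientnet_variants = {
    f"B{i}": cfg
    for i, cfg in enumerate(zip(eff_depth, eff_width, eff_resolution, eff_dropout))
}

print(len(bert_variants), "BERT variants")
print(len(efficientnet_variants), "EfficientNet variants")
```

Using `product` for BERT yields the full grid of 20 variants, while `zip` for EfficientNet yields the 6 index-aligned variants described in the table note.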
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Trihinas, D.; Michael, P.; Symeonides, M. Evaluating DL Model Scaling Trade-Offs During Inference via an Empirical Benchmark Analysis. Future Internet 2024, 16, 468. https://doi.org/10.3390/fi16120468
