
DASS: Differentiable Architecture Search for Sparse Neural Networks

Published: 09 September 2023

Abstract

The deployment of Deep Neural Networks (DNNs) on edge devices is hindered by the substantial gap between performance requirements and available computational power. While recent research has made significant strides in developing pruning methods that build sparse networks to reduce the computing overhead of DNNs, considerable accuracy loss remains, especially at high pruning ratios. We find that the architectures designed for dense networks by differentiable architecture search methods are ineffective when pruning mechanisms are applied to them. The main reason is that current methods do not include sparse operations in their search space and rely on a search objective that is tailored to dense networks and does not account for sparsity.
This paper proposes a new method to search for sparsity-friendly neural architectures by adding two new sparse operations to the search space and modifying the search objective. We propose two novel parametric SparseConv and SparseLinear operations that expand the search space with sparse operations. In particular, these operations yield a flexible search space because they use sparse parametric versions of the linear and convolution operations. The proposed search objective lets us train the architecture based on the sparsity of the search-space operations. Quantitative analyses demonstrate that architectures found by DASS outperform those used in state-of-the-art sparse networks on the CIFAR-10 and ImageNet datasets. In terms of performance and hardware effectiveness, DASS increases the accuracy of the sparse version of MobileNet-v2 from 73.44% to 81.35% (+7.91% improvement) with 3.87× faster inference time.

1 Introduction

Deep Neural Networks (DNNs) provide an excellent avenue for obtaining the maximum feature extraction capacities required to resolve highly complex computer vision tasks [1, 2, 3, 4]. There is an increasing demand for DNNs to become more efficient in order to be deployed on extremely resource-constrained edge devices. However, DNNs are not intrinsically designed for the limited computing and memory capacities of tiny edge devices, prohibiting their deployment in such devices [5, 6, 7, 8, 9].
To democratize DNN acceleration, a variety of optimization approaches have been proposed, including network pruning [10, 11, 12, 13], efficient architecture design [6, 8], network quantization [14, 15, 16], knowledge distillation [17, 18], and low-rank decomposition [19]. In particular, network pruning is known to provide remarkable computational and memory savings by removing redundant weight parameters in the unstructured scenario [10, 11, 12, 20, 21], and the entire filter in the structured scenario [22, 23, 24, 25, 26]. Recently, unstructured pruning methods have been reported to provide extreme reductions in network size. The state-of-the-art unstructured pruning methods [10] provide up to 99% pruning ratio, which is an excellent scenario for tiny edge devices.
However, these methods suffer from a substantial accuracy drop, preventing them from being applied in practice (\(\approx\)19% accuracy drop for MobileNet-v2 compared to its dense counterpart [10]). Current pruning methods use handcrafted architectures designed without sparsity in mind [10, 11, 12, 20, 25]. We hypothesize that the backbone architecture may not be optimal for scenarios with extreme pruning ratios. Instead, we can learn more efficient backbone architectures that are adaptable to pruning techniques by exploring the space of sparse networks.
Neural Architecture Search (NAS) has achieved great success in the automated design of high-performance DNN architectures. Differentiable architecture search (DARTS) methods [27, 28, 29] are popular NAS methods that use a gradient-based search algorithm to accelerate the search. Motivated by the promising results of NAS, we set out to design custom backbone architectures compatible with pruning methods. Nevertheless, the search space of current DARTS algorithms comprises dense convolution and linear operations that are incapable of exploring the right backbone for pruning. To demonstrate this issue, we first prune 99% of the weights from the best architecture designed by the NAS method [27] with the base search space, which does not take sparsity into account. Disappointingly, after applying the pruning method to the final architecture, it performs poorly, with up to \(\approx\)21% accuracy loss in comparison to DASS (our proposed method), which extends the search space with sparse operations (Section 4). This failure is due to a lack of support for the specific characteristics of sparse networks, leading to low generalization performance. Based on the above hypothesis and empirical observations, we formulate a search space that includes both sparse and dense operations. Therefore, the original convolution and linear operations in the NAS search space are extended with parametric SparseConv and SparseLinear operations, respectively. Moreover, to keep the search objective consistent with the proposed search space, we modify the bi-level optimization problem to take sparsity into account. In this way, the search process tries to find the best sparse operation by optimizing both architecture and pruning parameters. This modification creates a complex bi-level optimization problem. To tackle this difficulty, we split the complex bi-level optimization into two simpler bi-level optimization problems and solve them.
We show that explicitly integrating pruning into the search procedure can lead to sparse network architectures with significantly improved accuracy. In Figure 1, we compare the CIFAR-10 Top-1 accuracy and the number of parameters of the architectures found by DASS with state-of-the-art sparse (unstructured pruning) and dense networks. The results show that the architecture designed by DASS outperforms all competing architectures that employ the pruning method. DASS-Small demonstrates its consistent effectiveness by achieving 15%, 10%, and 8% accuracy improvements over MobileNet-v2\(_{sparse}\) [30], EfficientNet-v2\(_{sparse}\) [31], and DARTS\(_{sparse}\) [27], respectively. In addition, compared to networks with similar accuracy, DASS-Large significantly reduces network complexity (#Params) by 3.5\(\times\), 30.0\(\times\), and 105.2\(\times\) over PDO-eConv [32], CCT-6/3\(\times\)1 [33], and MomentumNet [34], respectively. Section 6 provides a comprehensive experimental study to evaluate different aspects of DASS. Our main contributions are summarized as follows:
Fig. 1.
Fig. 1. Top-1 accuracy (%) vs. number of network parameters (#Params) trained on CIFAR-10 for various sparse and dense architectures.
(1)
We perform extensive experiments to identify the limitations of applying pruning methods with extreme pruning ratios to dense architectures as a post-processing step.
(2)
We define a new search space by extending the base search space in DARTS with a new set of parametric operations (SparseConv and SparseLinear) to consider the sparse operations in the search space.
(3)
We modify the bi-level optimization problem to be consistent with the new search space and propose a three-step gradient-based algorithm to split the complex bi-level problem and learn architecture parameters, network weights, and pruning parameters.

2 Related Work

2.1 Neural Architecture Search and DARTS Variants

Neural Architecture Search (NAS) has recently attracted remarkable attention by relieving human experts of the laborious effort of designing neural networks. Early NAS methods mainly utilized evolutionary [9, 36, 37, 38] or reinforcement-learning-based approaches [39, 40, 41]. Despite the quality of the architectures they produce, these methods require tremendous computing resources. For example, the method proposed in [39] evaluates 20,000 neural candidates on 500 NVIDIA® P100 GPUs over four days. One-shot architecture search methods [42, 43, 44] have been proposed to identify optimal neural architectures in a few GPU days (\(\gt\)1 GPU day [45]). In particular, Differentiable Architecture Search (DARTS) [27, 28, 29] is a family of one-shot NAS methods that relaxes the search space to be continuous and differentiable. A detailed description of DARTS can be found in Section 3.1. Despite the broad success of DARTS in advancing the applicability of NAS, achieving optimal results remains a challenge for real-world problems. Many subsequent works address some of these challenges by focusing on (i) increasing search speed [46, 47], (ii) improving generalization performance [35, 48], (iii) addressing robustness issues [49, 50, 51], (iv) reducing quantization error [14, 16], and (v) designing hardware-aware architectures [52, 53, 54]. On the other hand, a few works attempt to prune the search space by removing inferior network operations [55, 56, 57, 58, 59, 60]. These works use pruning to progressively remove operations from the search space. Unlike them, our method extends the search space to improve the performance of the sparse network by searching for the best operations with sparse weight structures. Technically, our method adds parametric sparse versions of the convolution and linear operations to the search space to find the best sparse architecture. Overall, there is a lack of research on sparse weight parameters when designing neural architectures; our proposed method (DASS) searches for the operations that are most effective with sparse weight parameters in order to achieve higher generalization performance.

2.2 Network Pruning

Network pruning is an effective method for reducing the size of DNNs, enabling them to be effectively deployed on devices with limited resource capacity. Prior works on network pruning can be classified into two categories: structured and unstructured pruning methods. The purpose of structured pruning is to remove redundant channels or filters to preserve the entire structure of weight tensors with dimension reduction [22, 23, 24, 25, 61, 62]. While structured pruning is famous for hardware acceleration, it sacrifices a certain degree of flexibility as well as weight sparsity [63].
On the other hand, unstructured pruning methods offer superior flexibility and compression rates by removing the parameters of the weight tensors with the least impact on network accuracy [10, 11, 12, 20, 22, 63, 64, 65, 66]. In general, unstructured pruning entails three stages to produce a sparse network: (i) pre-training, (ii) pruning, and (iii) fine-tuning. Prior unstructured pruning methods used various criteria to select the least important weight parameters for pruning. [67, 68] pruned weight parameters based on the second-derivative values of the loss function. Several studies proposed to remove the weight parameters below a fixed pruning threshold, regardless of the training objective [64, 65, 66, 69, 70, 71]. To address the limitation of fixed thresholding, [20, 72] proposed layer-wise trainable thresholds that determine the optimal value for each layer separately. The lottery-ticket hypothesis [66, 73, 74] is a different line of work that identifies the pruning mask for an initialized CNN and trains the resulting sparse model from scratch without changing the pruning mask. HYDRA [10] formulates the pruning objective as empirical risk minimization and integrates it with the training objective. Such optimization-based pruning criteria improve the performance of sparse networks compared to other metrics. Despite the success of optimization-based pruning in achieving a significant compression rate, classification accuracy is compromised, notably when the pruning ratio is extremely high (up to 99%). We show that the main reason for this issue is the non-optimal backbone architecture. We extend the search space of DASS with parametric sparse operations, formulate pruning as an empirical risk minimization problem, and integrate it into the bi-level optimization problem to find the best sparse network.

3 Preliminaries

3.1 Differentiable Architecture Search

Differentiable Architecture Search (DARTS) [27] is a NAS method that significantly reduces the search cost by relaxing the search space to be continuous and differentiable. DARTS cell template is represented by a Directed Acyclic Graph (DAG) containing \(N\) intra-nodes. The edge \((i,j)\) between two nodes is associated with an operation \(o^{(i,j)}\) (e.g., skip connection or \(3\times 3\) max-pooling) within \(\mathcal {O}\) search space. Equation (1) computes the output of intermediate nodes.
\begin{equation} \bar{o}^{(i,j)} (x^{(i)}) = \sum _{o\in \mathcal {O}}\frac{exp\big (\alpha _o^{(i,j)}\big)}{\sum _{o^\prime \in \mathcal {O}}exp \big (\alpha _{o^\prime }^{(i,j)}\big)}\cdot o (x^{(i)}) \end{equation}
(1)
where \(\mathcal {O}\) and \(\alpha _{o}^{(i,j)}\) denote the set of all candidate operations and the selection probability of \(o\), respectively. The output node in the cell is the concatenation of all intermediate nodes. DARTS optimizes architecture parameters (\(\alpha\)) and network weights (\(\theta\)) with the following bi-level objective function:
\begin{equation} \mathop {\mathrm{min}}_{\alpha } \mathcal {L}_{val}(\theta ^{\star },\alpha) \;\; s.t. \; \theta ^{\star } = \mathop {\mathrm{argmin}}_{\theta } \mathcal {L}_{train}(\theta ,\alpha) \end{equation}
(2)
where
\begin{equation*} \mathcal {L}_{train} = \frac{\sum _{(\boldsymbol {x},y) \in (X_{train},Y_{train})} l(\theta ,\boldsymbol {x},y)}{|X_{train}|} \end{equation*}
and
\begin{equation*} \mathcal {L}_{val} = \frac{\sum _{(\boldsymbol {x},y) \in (X_{val},Y_{val})} l(\theta ,\boldsymbol {x},y)}{|X_{val}|} \end{equation*}
The operation with the largest \(\alpha _o\) is selected for each edge. \(X_{train}\) and \(Y_{train}\) represent the training dataset and corresponding labels, respectively. Similarly, the validation dataset and labels are indicated by \(X_{val}\) and \(Y_{val}\), respectively. After the search process has been completed, the final architecture is re-trained from scratch to obtain maximum accuracy.
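To make Equation (1) concrete, the following PyTorch sketch implements a softmax-weighted mixed operation for a single edge; the candidate operation set, channel count, and class name are illustrative placeholders rather than the exact DARTS search space.

```python
# A minimal sketch of the DARTS mixed operation in Equation (1): each edge
# combines all candidate operations, weighted by a softmax over the
# architecture parameters alpha.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MixedOp(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Illustrative candidate operations O for one edge (i, j).
        self.ops = nn.ModuleList([
            nn.Identity(),                                            # skip connection
            nn.MaxPool2d(3, stride=1, padding=1),                     # 3x3 max pooling
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),  # 3x3 convolution
        ])
        # One architecture parameter alpha_o per candidate operation.
        self.alpha = nn.Parameter(1e-3 * torch.randn(len(self.ops)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Continuous relaxation: softmax(alpha) gives the selection probabilities.
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))


if __name__ == "__main__":
    edge = MixedOp(channels=16)
    out = edge(torch.randn(1, 16, 32, 32))
    print(out.shape)  # torch.Size([1, 16, 32, 32])
```

After the search, the operation with the largest \(\alpha_o\) on each edge replaces this weighted mixture in the final discrete architecture.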

3.2 Unstructured Pruning

Pruning is considered unstructured if it removes low-importance parameters from weight tensors and makes them sparse [63]. This paper uses an optimization-based unstructured pruning method, which provides greater flexibility and an extreme compression rate compared to structured pruning methods. The pruning pipeline includes three main optimization stages: (i) pre-training: training the network on the target dataset; (ii) pruning: removing unimportant weights from the pre-trained network; and (iii) fine-tuning: re-training the sparse network to recover its original accuracy. For the pruning stage, we consider an optimization-based method with the following steps. First, we define the pruning parameters that indicate the importance of each weight of the network (\(s^0\)) and initialize them according to Equation (3).
\begin{equation} s_i^0 \propto \frac{1}{\max (|\theta _{pre,i}|)} \times \theta _{pre,i} \end{equation}
(3)
where \(\theta _{pre,i}\) denotes the weights of the \(i_{th}\) layer in the pre-trained network. Next, to learn the pruning parameters \((\hat{s})\), we formulate the optimization problem as Equation (4), which is then solved by stochastic gradient descent (SGD) [75].
\begin{equation} \hat{s} = \mathop {\mathrm{argmin}}_{s} \mathbb {E}_{(x, y)\sim D} \big [\mathcal {L}_{prune}(\theta _{pre},s, x, y)\big ] \end{equation}
(4)
\(\theta _{pre}\) and \(\mathbb {E}\) refer to the pre-trained network parameters and mathematical expectation, respectively. By solving this optimization problem, we are able to determine the effect of each weight parameter on the loss function and, consequently, the accuracy of the network. Finally, we convert the floating values of the pruning parameters to a binary mask based on selecting top-\(k\) weights with the highest magnitude of pruning parameters.
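The following sketch illustrates the pruning stage described above: scores are initialized from the pre-trained weights as in Equation (3) and binarized by keeping the top-\(k\) magnitudes. Tensor shapes and helper names are assumptions for illustration, not the exact implementation of [10].

```python
# A minimal sketch of optimization-based pruning: score initialization (Eq. 3)
# followed by top-k binarization of the scores into a pruning mask.
import torch


def init_scores(theta_pre: torch.Tensor) -> torch.Tensor:
    # s^0 proportional to theta_pre / max(|theta_pre|) for each layer (Eq. 3).
    return theta_pre / theta_pre.abs().max()


def topk_binary_mask(scores: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    # Keep the top-k scores by magnitude; everything else is pruned (set to 0).
    k = max(1, int(keep_ratio * scores.numel()))
    threshold = scores.abs().flatten().kthvalue(scores.numel() - k + 1).values
    return (scores.abs() >= threshold).float()


if __name__ == "__main__":
    theta = torch.randn(64, 64, 3, 3)            # pre-trained layer weights
    s = init_scores(theta)                       # trainable pruning scores
    mask = topk_binary_mask(s, keep_ratio=0.01)  # 99% pruning ratio
    sparse_theta = theta * mask
    print(mask.mean().item())                    # ~0.01 of the weights are kept
```

In the actual pipeline the scores are first refined by solving Equation (4) with SGD before the top-\(k\) binarization is applied.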

4 Research Motivation

Dense network architectures originally designed by conventional NAS methods become inaccurate when integrated with pruning methods, particularly at high pruning ratios. To demonstrate this assertion, we first apply the unstructured pruning method explained in Section 3.2 to the best architecture designed by DARTS [27] for CIFAR-10 and generate a sparse network. We call this solution DARTS\(_{sparse}\). Then, we compare the performance of the sparse architecture designed by DASS with DARTS\(_{sparse}\). Figure 2 illustrates the train and test accuracy curves for the DASS and DARTS\(_{sparse}\) architectures trained on the CIFAR-10 dataset. Disappointingly, the network designed by DARTS\(_{sparse}\) results in reduced test accuracy. This implies that dense backbone architectures designed by NAS methods without considering sparsity are ineffective (DASS delivers 8% higher test accuracy than DARTS\(_{sparse}\)). According to our investigations, two issues are involved in the training failure of DARTS\(_{sparse}\): (i) DARTS does not support sparse operations in its search space, and (ii) DARTS optimizes the search objective without taking sparsity into account. Section 5.2 addresses the first issue, while the second issue is addressed in Section 5.3.
Fig. 2.
Fig. 2. Comparison of DASS-Small and DARTS\(_{sparse}\) on CIFAR-10 for (a) train and (b) test learning curves.
We investigate DASS in two modes to demonstrate the significance of including sparse operations and of reformulating the objective function based on sparsity. The first mode extends the search space with sparse operations only (DASS\(_{Op}\)) and does not optimize the pruning parameters, while the second mode adds sparsity to the optimization process and optimizes the architecture and pruning parameters in a bi-level optimization problem (DASS\(_{Op+Ob}\)). Figure 3 shows the test accuracy of the DASS\(_{Op}\), DARTS\(_{sparse}\), and DASS\(_{Op+Ob}\) architectures at various pruning ratios. As the results show, DASS\(_{Op}\) has \(\approx\)3.4% lower accuracy than DASS\(_{Op+Ob}\) and \(\approx\)4.47% higher accuracy than DARTS\(_{sparse}\). In conclusion, extending the search space with the proposed sparse operations (our first contribution) produces a better architecture than DARTS\(_{sparse}\), but combining it with the sparsity-based optimization objective (our second contribution) further enhances performance.
Fig. 3.
Fig. 3. DASS\(_{Op+Ob}\) vs. DARTS\(_{sparse}\) and DASS with only adding sparse operations to the search space (DASS\(_{Op}\)).

5 Differentiable Architecture Search for Sparse Neural Networks (DASS) Method

5.1 DASS: Overview

We propose DASS, a differentiable architecture search method for sparse neural networks. DASS first extends the NAS search space with parametric sparse operations. Then it modifies the bi-level optimization problem to learn the architecture, weight, and pruning parameters. DASS employs a three-step approach to solve the complicated bi-level optimization problem: (1) Pre-training: find the best dense architecture (pruning parameters equal to zero) in the search space and pre-train it; (2) Pruning and sparse architecture design: find the best pruning mask (by optimizing the pruning parameters) and update the architecture parameters based on the sparse weights; and finally (3) Fine-tuning: re-train the sparse architecture to achieve the maximum classification performance.

5.2 DASS Search Space

To support sparse operations, DASS proposes the parametric sparse version of convolution and linear operations called SparseConv and SparseLinear, respectively. These operations have a sparsity mask (\(m\)) to remove redundant weight parameters from the network. Figure 4 illustrates the functionality of these two operations. In addition, Table 1 summarizes the operations of the DASS search space.
Table 1.
Operation   | Separable Sparse Convolution | Dilated Sparse Convolution | Max Pooling | Average Pooling | Skip Connect
Kernel Size | \(3\times 3\), \(5\times 5\) | \(3\times 3\), \(5\times 5\) | \(3\times 3\) | \(3\times 3\) | N/A
Table 1. Operations of the DASS Search Space
Fig. 4.
Fig. 4. Illustrating the (a) SparseLinear and (b) SparseConv operations.
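As a rough illustration of the SparseConv operation in Figure 4(b), the sketch below implements a convolution whose weights are element-wise multiplied by a binary mask derived from trainable pruning scores. The class name, keep ratio, and straight-through estimator are assumptions for illustration, not the exact DASS implementation.

```python
# A SparseConv-style operation: standard convolution with a learned top-k mask
# applied to its weights; SparseLinear would follow the same pattern.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size, keep_ratio=0.01, **kw):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, bias=False, **kw)
        # Trainable pruning scores s, initialized from the weights (cf. Eq. 3).
        self.scores = nn.Parameter(self.conv.weight.detach().clone()
                                   / self.conv.weight.detach().abs().max())
        self.keep_ratio = keep_ratio

    def mask(self) -> torch.Tensor:
        # Binary mask m = 1 for the top-k scores by magnitude.
        k = max(1, int(self.keep_ratio * self.scores.numel()))
        flat = self.scores.abs().flatten()
        threshold = flat.kthvalue(flat.numel() - k + 1).values
        return (self.scores.abs() >= threshold).float()

    def forward(self, x):
        m = self.mask()
        # Straight-through estimator so gradients can reach the scores (assumption).
        m = m + self.scores - self.scores.detach()
        return F.conv2d(x, self.conv.weight * m,
                        stride=self.conv.stride, padding=self.conv.padding)


if __name__ == "__main__":
    op = SparseConv2d(16, 16, 3, padding=1)
    print(op(torch.randn(1, 16, 32, 32)).shape)  # torch.Size([1, 16, 32, 32])
```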
To empirically investigate the efficiency of the proposed sparse search space, we compare the similarity of the feature maps of a high-performance dense architecture (with a large number of parameters) with those of the sparse architecture discovered by DASS and of the architecture designed from the original search space (DARTS\(_{sparse}\)). We use Kendall’s \(\tau\) [76] metric to measure the similarity between output feature maps. The \(\tau\) correlation coefficient returns a value between \(-\)1 and 1. To present the outcome more clearly, we scale these values to lie between \(-\)100 and 100; values closer to 100 indicate a stronger positive similarity between the feature maps. Figure 5 summarizes the results. Our observations reveal a similarity between DASS feature maps and the dense architecture (up to 16%), while the correlation between DARTS\(_{sparse}\) and the dense architecture is insignificant. This shows that the architecture designed by DASS on the new search space extracts features more similar to the high-performance dense architecture, while DARTS\(_{sparse}\), which uses the dense search space, loses important features after pruning. The level of similarity is not very high because DASS is a sparse network with a pruning ratio of 99%; nevertheless, it demonstrates that DASS retrieves useful features.
Fig. 5.
Fig. 5. Comparing the Kendall’s \(\tau\) similarity metric of architectures designed by both DARTS\(_{sparse}\) and DASS methods with high-performance dense architecture.
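The similarity measurement above can be reproduced along the following lines, where two stand-in networks replace the dense and DASS models and Kendall’s \(\tau\) is computed with SciPy over flattened feature maps; everything except the metric itself is a placeholder.

```python
# Kendall's tau similarity between the feature maps of two networks for the
# same input, scaled to [-100, 100] as in the text.
import torch
import torch.nn as nn
from scipy.stats import kendalltau


def feature_similarity(feat_a: torch.Tensor, feat_b: torch.Tensor) -> float:
    tau, _ = kendalltau(feat_a.flatten().detach().numpy(),
                        feat_b.flatten().detach().numpy())
    return 100.0 * tau  # scale from [-1, 1] to [-100, 100]


if __name__ == "__main__":
    dense_net = nn.Conv2d(3, 8, 3, padding=1)   # stand-in for the dense network
    sparse_net = nn.Conv2d(3, 8, 3, padding=1)  # stand-in for the DASS network
    x = torch.randn(1, 3, 32, 32)
    print(feature_similarity(dense_net(x), sparse_net(x)))
```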

5.3 DASS Search Objective

DASS aims to search for optimal architecture parameters (\(\alpha ^\star\)) that minimize the validation loss of the sparse network. Thus, to achieve a search objective consistent with the proposed sparse search space, we formulate the entire search objective as a complex bi-level optimization problem:
\begin{equation} \begin{split} \alpha ^\star =&\mathop {\mathrm{min}}_{\alpha }(\mathcal {L}_{val}(\hat{\theta }(\alpha),\alpha))\\ s.t. \;\;\;\; & {\left\lbrace \begin{array}{ll} \theta ^{\star }(\alpha) = \mathop {\mathrm{argmin}}_{\theta } \mathcal {L}_{train}(\theta ,\alpha)\\ \hat{m} =\mathop {\mathrm{argmin}}_{m\in \lbrace 0,1\rbrace ^N} \big [\mathcal {L}_{prune}(\theta ^\star (\alpha) \odot m,\alpha)\big ] \\ \hat{\theta }(\alpha) = \theta ^\star (\alpha)\odot \hat{m}. \end{array}\right.} \end{split} \end{equation}
(5)
Here \(m\) denotes the parameters of the binary pruning mask. This formulation learns the architecture parameters based on the sparse weight parameters. However, Equation (5) is not a straightforward bi-level optimization problem because the lower-level problem consists of two optimization problems. To overcome this challenge, we break the search objective into three distinct steps. Thus, the problem is transformed into two bi-level optimization problems to determine the optimal architecture parameters for dense and sparse weights and an optimization problem to fine-tune the weight parameters. In addition, the lower-level optimization problem consists of a discrete optimization problem for pruning masks.
Section 5.4 proposes a multi-step optimization algorithm to solve the optimization problem and handle the discrete optimization problem by converting it to a continuous optimization problem.

5.4 Optimization Algorithm

As shown in Figure 6, the optimization algorithm consists of three main steps, which are described as follows.
Fig. 6.
Fig. 6. The overview of the proposed optimization algorithm to find architecture parameters based on the sparse weight parameters. It consists of three main steps: (1) pre-training: search dense architecture (2) pruning: search sparse architecture (3) fine-tuning: re-train best sparse architecture.

Step 1: pre-train (learn \(\theta ^\star _{pre}\) and \(\alpha ^*_{pre}\)).

In this step, we break Equation (5) into a bi-level optimization problem that finds the best dense architecture. This pre-training is necessary for the next step, which learns the pruning mask parameters and modifies the sparse architecture.
\begin{equation} \begin{split} \alpha ^\star _{pre}=&\mathop {\mathrm{min}}_{\alpha _{pre}}(\mathcal {L}_{val}(\theta ^*_{pre}(\alpha _{pre}),\alpha _{pre}))\\ s.t. \;\;\;\; & \theta ^{\star }_{pre}(\alpha _{pre}) = \mathop {\mathrm{argmin}}_{\theta _{pre}} \mathcal {L}_{train}(\theta _{pre},\alpha _{pre}) \end{split} \end{equation}
(6)
The first-order approximation technique is used to update \(\theta ^\star _{pre}\) and \(\alpha _{pre}\) alternately via gradient descent [27].
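A minimal sketch of this alternating first-order update follows, using a toy two-operation supernet and random data in place of the real search; the weight learning rate mirrors the value reported in Section 6.1, while the rest is illustrative.

```python
# Step 1 sketch (Eq. 6): update theta on the training split with alpha fixed,
# then update alpha on the validation split with theta fixed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinySupernet(nn.Module):
    def __init__(self):
        super().__init__()
        self.ops = nn.ModuleList([nn.Linear(8, 2), nn.Linear(8, 2)])  # two candidate ops
        self.alpha = nn.Parameter(torch.zeros(2))                     # architecture parameters

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))


model = TinySupernet()
theta_opt = torch.optim.SGD([p for n, p in model.named_parameters() if n != "alpha"],
                            lr=0.025, momentum=0.9, weight_decay=3e-4)
alpha_opt = torch.optim.Adam([model.alpha], lr=3e-4)

for _ in range(50):                                   # pre-training iterations
    x_tr, y_tr = torch.randn(16, 8), torch.randint(0, 2, (16,))
    x_va, y_va = torch.randn(16, 8), torch.randint(0, 2, (16,))

    theta_opt.zero_grad()                             # update theta on L_train
    F.cross_entropy(model(x_tr), y_tr).backward()
    theta_opt.step()

    alpha_opt.zero_grad()                             # update alpha on L_val
    F.cross_entropy(model(x_va), y_va).backward()
    alpha_opt.step()
```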

Step 2: prune (learn \(\hat{m}\) and \(\alpha ^*_{prune}\)).

To make the search process aware of the sparsity mechanism, we solve another bi-level optimization problem that alternately updates the pruning mask and the architecture parameters. The pruning mask parameters are binary values, so learning the mask parameters (\(m\)) is a challenging binary optimization problem. We address it by introducing floating-point pruning parameters \(s\) and initializing them; we then use SGD to find the best floating-point pruning parameters. Finally, based on the values of the pruning parameters, we select the top-\(k\) weight parameters with the highest values and set their mask entries to one. This step jointly learns the architecture parameters \(\alpha _{prune}\) and the mask parameters \(\hat{m}\) so that sparsity is taken into account when learning the architecture parameters. Therefore, we use another bi-level optimization problem:
\begin{equation} \begin{split} \alpha ^\star _{prune}=&\mathop {\mathrm{min}}_{\alpha _{prune}}(\mathcal {L}_{val}(\theta ^*_{pre}\odot \hat{m}(\alpha _{prune}),\alpha _{prune}))\\ s.t. \;\;\;\; & \hat{s}(\alpha _{prune}) = \mathop {\mathrm{argmin}}_{s} \mathcal {L}_{prune}(\theta ^*_{pre},\alpha _{prune},s), \\ & \hat{m}(\alpha _{prune}) = {1\!\!1}(|\hat{s}(\alpha _{prune})|\gt |\hat{s}(\alpha _{prune})|_k) \end{split} \end{equation}
(7)
Similar to Step 1, the first-order approximation method is used to alternately update \(\hat{m}\) and \(\alpha _{prune}\) by gradient descent.
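The sketch below mirrors the alternation in Equation (7) on a toy problem: the scores \(s\) are updated on a pruning loss with \(\alpha\) fixed, the top-\(k\) mask is recomputed, and \(\alpha\) is updated on the validation loss of the masked weights. The shapes, losses, and straight-through trick are illustrative assumptions, not the exact DASS implementation.

```python
# Step 2 sketch (Eq. 7): alternate updates of pruning scores s and alpha,
# with the binary mask taken as the top-k scores of the frozen weights.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
theta_pre = torch.randn(2, 2, 8)                                    # frozen pre-trained weights (2 candidate ops)
scores = torch.nn.Parameter(theta_pre / theta_pre.abs().max())      # pruning scores s (cf. Eq. 3)
alpha = torch.nn.Parameter(torch.zeros(2))                          # architecture parameters
s_opt = torch.optim.SGD([scores], lr=0.1)
a_opt = torch.optim.Adam([alpha], lr=3e-4)
KEEP = 0.05                                                         # illustrative 95% pruning ratio


def mask(s):
    k = max(1, int(KEEP * s.numel()))
    thr = s.abs().flatten().kthvalue(s.numel() - k + 1).values
    m = (s.abs() >= thr).float()
    return m + s - s.detach()                                       # straight-through for gradients to s


def forward(x):
    w = F.softmax(alpha, dim=0)                                     # operation selection probabilities
    m = mask(scores)
    return sum(w[i] * (x @ (theta_pre[i] * m[i]).t()) for i in range(2))


for _ in range(20):
    x_tr, y_tr = torch.randn(16, 8), torch.randint(0, 2, (16,))
    x_va, y_va = torch.randn(16, 8), torch.randint(0, 2, (16,))

    s_opt.zero_grad()                                               # update s on L_prune (alpha fixed)
    F.cross_entropy(forward(x_tr), y_tr).backward()
    s_opt.step()

    a_opt.zero_grad()                                               # update alpha on L_val (s fixed)
    F.cross_entropy(forward(x_va), y_va).backward()
    a_opt.step()
```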

Step 3: fine-tune (learn \(\hat{\theta }\)).

In the fine-tuning step, we update the non-zero weight parameters of the best sparse architecture using SGD to improve network accuracy (Equation (8)).
\begin{equation} \hat{\theta }_{t+1} = \hat{\theta }_{t} - \eta _{\hat{\theta }}\nabla _{\hat{\theta }} \mathcal {L}_{fine-tune}(\hat{\theta }_{t} \odot \hat{m},\alpha ^\star _{prune}) \end{equation}
(8)
where \(\eta _{\hat{\theta }}\) and \(\mathcal {L}_{fine-tune}\) denote the learning rate and the loss function for the fine-tuning step.
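A minimal sketch of the update in Equation (8) follows: SGD fine-tunes only the surviving weights while the binary mask and the searched architecture stay fixed. The toy tensors and the explicit re-masking step are assumptions for illustration.

```python
# Step 3 sketch (Eq. 8): fine-tune the masked weights with SGD.
import torch
import torch.nn.functional as F

theta = torch.nn.Parameter(torch.randn(2, 8))           # sparse weights theta_hat
m = (torch.rand_like(theta) < 0.01).float()             # fixed binary mask (99% pruned)
opt = torch.optim.SGD([theta], lr=0.01, momentum=0.9, weight_decay=3e-4)

for _ in range(200):                                     # fine-tuning iterations (illustrative)
    x, y = torch.randn(16, 8), torch.randint(0, 2, (16,))
    opt.zero_grad()
    loss = F.cross_entropy(x @ (theta * m).t(), y)       # forward pass uses theta ⊙ m
    loss.backward()
    opt.step()
    theta.data.mul_(m)                                   # keep pruned weights at exactly zero
```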
We show that the proposed three-step optimization algorithm solves the complex bi-level problem in Equation (5) and finds optimal architecture parameters with higher generalization performance for sparse networks. Figure 7 compares the learning curves of DASS and DARTS\(_{sparse}\) on the CIFAR-10 dataset. As shown, the DASS optimization algorithm significantly reduces the validation loss of the sparse network. Figure 8 compares the behavior of the generalization gap (train minus test accuracy) for DASS and DARTS\(_{sparse}\). DASS has a lower generalization gap (up to 22%), indicating that DASS better regularizes the validation loss across all epochs compared to DARTS\(_{sparse}\). Algorithm 1 outlines DASS, our differentiable neural architecture search for sparse neural networks.
We analyze the time complexity of the proposed algorithm as follows. The algorithm consists of three main steps (pre-train, prune, and fine-tune), each of which runs a loop of \(T\) iterations. At each step, we fix one set of variables and run forward and backward propagation to train the architecture, weight, and pruning parameters. The time complexity of forward and backward propagation depends on the complexity of the architecture (the number of layers and neurons in the network). For a network with \(n\) layers and \(m\) nodes per layer, forward propagation over \(T\) iterations costs \(O (T \cdot n \cdot m ^3)\) and backward propagation costs \(O (T \cdot n \cdot m ^4)\). Therefore, Algorithm 1 with its three steps has a total time complexity of \(3 \cdot [(T \cdot n \cdot m ^3)+ (T \cdot n \cdot m ^4)] =O (T \cdot n \cdot m ^4)\). If we assume that the number of layers equals the number of neurons per layer, the total time complexity of Algorithm 1 is \(O (T \cdot n^5)\).
Fig. 7.
Fig. 7. Comparing learning curves (validation loss) of DASS and DARTS\(_{sparse}\) on the searched architectures trained with the CIFAR-10 dataset.
Fig. 8.
Fig. 8. Comparing the generalization gap of DASS and DARTS\(_{sparse}\) over the CIFAR-10 dataset. The lower values for the generalization gap are better.

6 Experiments

6.1 Experimental Setup

(1) Dataset: to evaluate DASS, we use the CIFAR-10 [77] and ImageNet [78] public classification datasets. For the search process, we split the CIFAR-10 dataset into 30k data points for training and 30k for validation. We transfer the best cells learned on CIFAR-10 to ImageNet [27] and re-train the final sparse network from scratch.
(2) Details on Searching Networks: we create a network with 16 initial channels and eight cells. Each cell consists of seven nodes equipped with a depth-wise concatenation operation as the output node. The SparseConv operations follow the ReLU+SparseConv+Batch Normalization order. We train the network using SGD for 50 epochs with a batch size of 64 in the DASS pre-training step. Then, we update the values of the pruning and architecture parameters for 20 epochs in the DASS pruning step. Finally, we fine-tune the network for 200 epochs. The initial learning rates for the DASS pre-training, pruning, and fine-tuning steps are 0.025, 0.1, and 0.01, respectively. In our experiments, we use the cosine annealing learning rate schedule [79], with weight decay = 3 \(\times\) 10\(^{-4}\) and momentum = 0.9 in all steps (these settings are collected in the sketch after this list). The search process for CIFAR-10 takes \(\approx\)3 GPU-days on a single NVIDIA® RTX A4000, producing 4.35 Kg of \(CO_2\). We compare the sparse architecture designed by our method, DASS, with other dense and sparse networks. NAS-Bench-101 [80] and NAS-Bench-201 [81] are examples of NAS evaluation benchmarks; they consist of numerous dense designs and their respective performance. Because they do not support sparse architectures, we cannot evaluate DASS on these benchmarks. Creating sparse benchmarks for evaluating NAS algorithms is a suggestion for future work.
(3) DASS Variants and Hardware Configuration: Table 2 provides the configuration details of the DASS variants. Each variant is built by stacking a different number of DASS cells and varying the number of output channels of the first layer to generate networks for various resource budgets. Table 3 presents the specifications of the hardware devices used to evaluate the performance of DASS at inference time.
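For reference, the hyperparameters listed in item (2) can be collected into a single configuration; the dictionary layout below is an assumption, and only the values come from the text.

```python
# Search hyperparameters from Section 6.1, item (2), gathered in one place.
SEARCH_CONFIG = {
    "pre_train": {"epochs": 50, "batch_size": 64, "lr": 0.025},
    "prune":     {"epochs": 20, "lr": 0.1},
    "fine_tune": {"epochs": 200, "lr": 0.01},
    "common": {
        "optimizer": "SGD",
        "lr_schedule": "cosine annealing",
        "weight_decay": 3e-4,
        "momentum": 0.9,
        "init_channels": 16,
        "cells": 8,
        "nodes_per_cell": 7,
    },
}
```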
Table 2.
DASS      | CIFAR-10 Tiny | CIFAR-10 Small | CIFAR-10 Medium | CIFAR-10 Large | ImageNet Small | ImageNet Medium | ImageNet Large
#Cells    | 16            | 20             | 12              | 14             | 14             | 15              | 16
#Channels | 30            | 36             | 86              | 108            | 48             | 86              | 128
Table 2. Configuration of the DASS Variants
#Cells: the number of stacked cells. #Channels: the number of output channels for the first SparseConv operation.
Table 3.
Platform       | Specification          | Value
Search & Train | GPU                    | NVIDIA® RTX A4000 (735 MHz)
               | GPU Memory             | 16 GB GDDR6
               | GPU Compiler           | cuDNN version 11.1
               | System Memory          | 64 GB
               | Operating System       | Ubuntu 18.04
               | \(CO_2\) Emission/Day† | 1.45 Kg
Real Hardware  | Embedded GPU           | NVIDIA® Jetson TX2 (735 MHz), 256 CUDA Cores
               |                        | NVIDIA® Quadro M1200 (735 MHz), 640 CUDA Cores
               | Embedded CPU           | ARM® Cortex™-A7 (1.2 GHz), 4/4 (Cores/Total Thread)
               |                        | Intel® i5-3210M Mobile CPU, 5/4 (Cores/Total Thread)
Estimation‡    | Xiaomi Mi9 GPU         | Adreno 640 GPU (750 MHz), 986 GFLOPs FP32 (Single Precision)
               | Myriad VPU             | Intel® Movidius NCS2 (700 MHz), 28-nm Co-processor
Table 3. Hardware Specification
\(\dagger\) Calculated using the ML \(CO_2\) impact framework: https://mlco2.github.io/impact/ [82].
\(\ddagger\) Performance Estimation using the nn-Meter framework [83].

6.2 DASS Compared to Dense Networks

Table 4 compares the performance of DASS against state-of-the-art and state-of-the-practice DNNs. We select the architecture with the highest accuracy, DrNAS [35], as the baseline for comparing compression rates. In comparison with DrNAS [35], DASS-Large provides 37.73\(\times\) and 29.23\(\times\) higher network compression rates while delivering comparable accuracy (less than 2.5% accuracy loss) on the CIFAR-10 and ImageNet datasets, respectively. Compared to the best handcrafted network [33] on CIFAR-10 (CCT-6/3x1), DASS-Large significantly decreases the number of network parameters by 29.9\(\times\) while providing slightly higher accuracy.
Table 4.
Architecture | Year | Search Method | CIFAR-10 Top-1 Acc.(%) | CIFAR-10 #Params (\(\times 10^6\)) | CIFAR-10 #Params Compression | ImageNet Top-1 Acc.(%) | ImageNet Top-5 Acc.(%) | ImageNet #Params (\(\times 10^6\)) | ImageNet #Params Compression
ResNet-18‡ [2] | 2016 | - | 91.0 | 11.1 | \(-2.77\times\) | 72.33 | 91.80 | 11.7 | \(-2.05\times\)
PDO-eConv [32] | 2020 | - | 94.62 | 0.37 | +10.81\(\times\) | - | - | - | -
FlexTCN-7 [33] | 2021 | - | 92.2 | 0.67 | +5.97\(\times\) | - | - | - | -
CCT-6/3x1 [33] | 2021 | - | 95.29 | 3.17 | +1.26\(\times\) | - | - | - | -
MomentumNet [34] | 2021 | - | 95.18 | 11.1 | \(-2.77\times\) | - | - | - | -
DARTS (1st order) [27] | 2018 | gradient | 96.86 | 3.3 | +1.21\(\times\) | - | - | - | -
DARTS (2nd order) [27] | 2018 | gradient | 97.24 | 3.3 | +1.21\(\times\) | 74.3 | 91.3 | 4.7 | +1.21\(\times\)
SGAS (Cri 1. avg) [84] | 2020 | gradient | 97.34 | 3.7 | +1.08\(\times\) | 75.9 | 92.7 | 5.4 | +1.05\(\times\)
SDARTS-RS [85] | 2020 | gradient | 97.39 | 3.4 | +1.17\(\times\) | 75.8 | 92.8 | 3.4 | +1.67\(\times\)
DrNAS [35] | 2020 | gradient | 97.46 | 4.0 | 1.0\(\times\) | 76.3 | 92.9 | 5.7 | 1.0\(\times\)
DASS-Small | 2023 | gradient | 89.06 | 0.017 | +235.29\(\times\) | 46.48 | 68.36 | 0.029 | +196.55\(\times\)
DASS-Medium | 2023 | gradient | 92.18 | 0.054 | +74.07\(\times\) | 68.34 | 82.24 | 0.082 | +69.51\(\times\)
DASS-Large | 2023 | gradient | 95.31 | 0.106 | +37.73\(\times\) | 73.83 | 85.94 | 0.195 | +29.23\(\times\)
DASS-Huge | 2023 | gradient | 97.78 | 2.6 | +1.54\(\times\) | - | - | - | -
Table 4. Comparing the DASS Method with the State-of-the-art Dense Networks on the CIFAR-10 and ImageNet Datasets
\(\dagger\) The baseline for comparing the #params compressing rate is DrNAS [35] as the most accurate architecture.
\(\ddagger\) ResNet-18 results are trained with https://github.com/facebook/fb.resnet.torch (Torch, July 10, 2018).
We highlight the best results in blue color.

6.3 DASS Compared to Sparse Networks

As we focus on improving the accuracy of sparse networks at extremely high pruning ratios, we compare DASS with other sparse networks that use the unstructured pruning method at a 99% pruning ratio (Table 5). In comparison with DARTS\(_{sparse}\), DASS-Small yields 7.81% and 7.81% higher top-1 accuracy with 1.23\(\times\) and 1.05\(\times\) reductions in network size on the CIFAR-10 and ImageNet datasets, respectively. This indicates that designing the network with the new search space and the sparse objective function finds a better sparse architecture. In comparison with ResNet-18\(_{sparse}\) on the CIFAR-10 dataset, we provide 1.56% and 4.7% higher accuracy with 2.08\(\times\) and 1.05\(\times\) network size reductions for DASS-Medium and DASS-Large, respectively. Compared to ResNet-18\(_{sparse}\) on the ImageNet dataset, DASS-Medium provides 0.76% higher accuracy with a 1.42\(\times\) network size reduction. MCUNET [8] is a lightweight neural network for microcontrollers designed by a tiny neural architecture search mechanism. Compared to MCUNET on the ImageNet dataset, DASS-Large provides 1% higher accuracy with a 2.89\(\times\) network size reduction. This result shows that optimizing only the size of the filters without considering sparsity cannot generate the best architecture; DASS directly searches for the best sparse operations to design a high-performance lightweight network. We conclude that DASS increases the accuracy of sparse networks at high pruning ratios compared to NAS-based and handcrafted networks.
Table 5.
Architecture | CIFAR-10 Top-1 Acc. (%) | CIFAR-10 #Params (\(\times 10^3\)) | CIFAR-10 Compression Rate† | CIFAR-10 NID‡ | ImageNet Top-1 Acc. (%) | ImageNet Top-5 Acc. (%) | ImageNet #Params (\(\times 10^3\)) | ImageNet Compression Rate† | ImageNet NID‡
DARTS\(_{sparse}\) [27] | 81.25 | 21.0 | 100.47\(\times\) | 3.86 | 38.67 | 61.33 | 33.0 | 100\(\times\) | 1.11
MobileNet-v2\(_{sparse}\) [30] | 73.44 | 22.2 | 95.04\(\times\) | 3.30 | 17.97 | 36.72 | 34.87 | 94.63\(\times\) | 0.515
ResNet-18\(_{sparse}\) [2] | 90.62 | 111.6 | 18.90\(\times\) | 0.81 | 67.58 | 80.86 | 116.84 | 28.24\(\times\) | 0.578
EfficientNet\(_{sparse}\) [31] | 79.69 | 202.3 | 10.43\(\times\) | 0.39 | - | - | - | - | -
MCUNET [8] | 89.7 | 210.1 | 15.70 | 0.42 | 72.34 | 84.86 | 562.64 | 5.86\(\times\) | 0.128
DASS-Small | 89.06 | 17.0 | 124.11\(\times\) | 5.23 | 46.48 | 68.36 | 28.94 | 114.02\(\times\) | 1.606
DASS-Medium | 92.18 | 53.65 | 39.32\(\times\) | 1.71 | 68.34 | 82.24 | 81.95 | 40.26\(\times\) | 0.841
DASS-Large | 95.31 | 105.5 | 20\(\times\) | 0.90 | 73.83 | 85.94 | 194.6 | 16.95\(\times\) | 0.38
Table 5. Comparing the DASS Method with Sparse Networks on the CIFAR-10 and ImageNet Datasets
\(\dagger\) The baseline for comparing the compressing rate is full-precision and dense DARTS architecture.
\(\ddagger\) NID = Accuracy/#Parameters [86]. NID measures how efficiently each network uses its parameters.
We highlight the best results in blue color.

6.4 Evaluation of DASS with Various Pruning Ratios

Table 6 compares DASS and the DARTS\(_{sparse}\) method at three different pruning ratios, namely 90%, 95%, and 99%, on the CIFAR-10 dataset. DASS achieves 1.57%, 1.04%, and 7.8% higher accuracy with 7%, 6.9%, and 23% network size reductions compared to DARTS\(_{sparse}\) at the 90%, 95%, and 99% pruning ratios, respectively. Thus, DASS is significantly more effective at extremely high pruning ratios (99%) than at lower pruning ratios (90%).
Table 6.
Architecture | 90% Accuracy | 90% #Params \((\times 10^3)\) | 95% Accuracy | 95% #Params \((\times 10^3)\) | 99% Accuracy | 99% #Params \((\times 10^3)\)
DARTS\(_{sparse}\) | 95.31% | 421 | 93.75% | 210.5 | 81.25% | 21.0
DASS-Small | 96.88% | 391 | 94.79% | 196.75 | 89.06% | 17.0
Table 6. Evaluating the Effectiveness of DASS at Various Pruning Ratios
We highlight the best results in blue color.

6.5 DASS Compared to Other Pruning Methods

Table 7 compares DASS with state-of-the-art pruning algorithms. The results indicate that DASS outperforms other pruning algorithms with different backbone architectures on the CIFAR-10 and ImageNet datasets. On CIFAR-10, DASS-Large shows a 1.6% higher accuracy and a 3.8\(\times\) reduction in network size compared to the most accurate result provided by TAS\(_{Pruning}\) [87]. DASS-Large also provides a 4.68% accuracy improvement with a 38.14\(\times\) reduction in network size over TAS\(_{Pruning}\) [87] on ImageNet. In light of DASS’s higher efficiency compared to other pruning methods, we conclude that the pruning method is not the only reason for DASS’s effectiveness and that the gains are independent of the pruning algorithm.
Table 7.
Pruning Method | CIFAR-10 Backbone Arch. | CIFAR-10 Top-1 Acc.(%) | CIFAR-10 #Params (\(\times 10^6\)) | ImageNet Backbone Arch. | ImageNet Top-1 Acc.(%) | ImageNet Top-5 Acc.(%) | ImageNet #Params (\(\times 10^6\))
SFP [22] | ResNet-20 | 92.08 | 0.269 | ResNet-18 | 67.10 | 87.78 | 6.46
FPGM [23] | ResNet-20 | 92.31 | 0.269 | ResNet-18 | 68.41 | 88.48 | 6.46
TAS\(_{Pruning}\) [87] | ResNet-20 | 93.16 | 0.232 | ResNet-18 | 69.15 | 88.48 | 7.40
Sparse [88, 89] | MobileNet-v2 | 91.53 | 0.671 | MobileNet-v2 | 53.6 | 78.9 | 0.25
Sparse [89] | ShuffleNet | 93.05 | 0.879 | - | - | - | -
DASS-Small | - | 89.06 | 0.017 | - | 46.48 | 68.36 | 0.029
DASS-Medium | - | 92.18 | 0.054 | - | 68.34 | 82.24 | 0.082
DASS-Large | - | 95.31 | 0.106 | - | 73.83 | 85.94 | 0.194
Table 7. Comparing DASS with other Pruning Algorithms
We highlight the best results in blue color.

6.6 DASS Compared to Quantized Networks

Network quantization has emerged as a promising research direction to reduce the computation of neural networks. Recently, [14, 15, 16] proposed to integrate the quantization mechanism into the differentiable NAS procedure to improve the performance of quantized networks. Table 8 compares DASS with the best results of NAS-based quantized networks. The compression rate is calculated as \(\frac{\sum _{l=1}^{L} \#W_l \times 32}{\sum _{l=1}^{L} \#W_l^t \times q}\), where \(\#W_l\) and \(\#W_l^t\) are the number of weights in layer \(l\) for the full-precision (32-bit) and the quantized network with \(q\)-bit resolution, respectively [14]. DASS-Medium yields 0.24% and 3.24% higher accuracy and a significantly higher compression rate, by 2.7\(\times\) and 4.24\(\times\), compared to TAS [14], the most accurate quantized network, on the CIFAR-10 and ImageNet datasets, respectively.
Table 8.
Architecture | #bits (W/A)‡ | CIFAR-10 Top-1 Acc.(%) | CIFAR-10 #Params (\(\times 10^6\)) | CIFAR-10 Compression Rate† | ImageNet Top-1 Acc.(%) | ImageNet Top-5 Acc.(%) | ImageNet #Params (\(\times 10^6\)) | ImageNet Compression Rate†
Binary NAS (A) [16] | 1/1 | 90.66 | 2.4 | 44.0\(\times\) | 57.69 | 79.89 | 5.57 | 32.74\(\times\)
TAS [14] | 2/2 | 91.94 | 2.4 | 22.0\(\times\) | 65.1 | 86.3 | 5.57 | 16.37\(\times\)
DASS-Small | 32/32 | 89.06 | 0.017 | 194.11\(\times\) | 46.48 | 68.36 | 0.029 | 196.55\(\times\)
DASS-Medium | 32/32 | 92.18 | 0.054 | 61.11\(\times\) | 68.34 | 82.24 | 0.082 | 69.51\(\times\)
DASS-Large | 32/32 | 95.31 | 0.106 | 31.13\(\times\) | 73.83 | 85.94 | 0.194 | 29.38\(\times\)
Table 8. Comparing the DASS Method with Quantized Networks on CIFAR-10
\(\dagger\) The baseline for comparison is full-precision DARTS with 3.3M and 5.7M parameters for CIFAR-10 and ImageNet.
\(\ddagger\) (Weights/Activation Function).
We highlight the best results in blue color.
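As a quick check of the compression-rate formula above, the sketch below evaluates it for the 3.3M-parameter full-precision baseline and a 54k-parameter full-precision sparse network, which reproduces the \(\approx\)61\(\times\) figure reported for DASS-Medium on CIFAR-10; the function name and layer grouping are illustrative.

```python
# Compression rate = sum(#W_l * 32) / sum(#W_l^t * q) over all layers l.
def compression_rate(baseline_weights, compressed_weights, q_bits):
    """baseline_weights / compressed_weights: #weights per layer; q_bits: target precision."""
    dense_bits = sum(w * 32 for w in baseline_weights)
    compressed_bits = sum(w * q_bits for w in compressed_weights)
    return dense_bits / compressed_bits


if __name__ == "__main__":
    # 3.3M-parameter full-precision baseline vs. a 54k-parameter
    # full-precision (32-bit) sparse network, treated as single layers.
    print(round(compression_rate([3_300_000], [54_000], q_bits=32), 2))  # ~61.11
```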

6.7 Hardware Performance Results of DASS

We extensively study the effectiveness of DASS in terms of hardware efficiency by measuring the inference time (latency) of various state-of-the-art sparse networks on a wide range of resource-constrained edge devices for the CIFAR-10 dataset (Figure 9). The batch size is 1 for all experiments. Note that we did not utilize any simplification techniques, such as [90], to compact the sparse filters by fusing weight parameters. Our results reveal that the Pareto frontier of DASS consistently outperforms all counterparts by a significant margin, especially on CPUs, which have very limited parallelism. DASS-Tiny, the fastest network, improves accuracy from MobileNet-v2's 73.44% to 81.35% (+7.91% improvement) and accelerates inference by up to 3.87\(\times\). More importantly, DASS-Tiny runs 1.67–4.74\(\times\) faster than DARTS\(_{sparse}\) with slightly better accuracy. Compared to ResNet-18\(_{sparse}\), the closest network to DASS in terms of accuracy, DASS-Medium provides a 1.46% accuracy improvement and up to 1.94\(\times\) acceleration on hardware.
Fig. 9.
Fig. 9. Trade-off: accuracy vs. measured latency. DASS-Tiny, DASS-Small, DASS-Medium, and DASS-Large are variants of DASS designed for different computational budgets (Table 2). DASS-Tiny consistently achieves higher accuracy at similar latency compared to MobileNet-v2\(_{sparse}\), and it provides lower latency with better accuracy than DARTS\(_{sparse}\).

6.8 Analyzing the Discrimination Power of DASS

We use the t-distributed stochastic neighbor embedding (t-SNE) method [91] to visualize the decision boundaries of the dense high-performance architecture designed by DARTS, of DARTS\(_{sparse}\) (the dense DARTS architecture with pruning applied), and of DASS (our sparse architecture) on the CIFAR-10 dataset. Figure 10 illustrates the classification decision boundaries of each network. According to the results, DASS has higher discrimination power than DARTS\(_{sparse}\), and DASS with a 99% pruning ratio behaves very similarly to the dense, high-performance DARTS architecture.
Fig. 10.
Fig. 10. Decision boundaries of (a) DARTS, (b) DARTS\(_{sparse}\), and (c) DASS-Large, visualized with the t-SNE embedding method.

6.9 Qualitative Analysis of the Searched Cell

Figure 11 shows the best cells searched by DASS-Small. An interesting finding is that, for the normal cell, DASS-Small tends to select SparseConv operations with larger kernel sizes (\(5 \times 5\)), providing more pruning candidates for optimizing the pruning mask. In the reduction cell, DASS-Small tends to use max-pooling operations instead of avg-pooling operations, because max-pooling has a higher feature extraction capability with sparse filters [92].
Fig. 11.
Fig. 11. The illustration of (a) normal cell and (b) reduction cell.

6.10 Reproducibility Analysis

To verify the reproducibility of the results, the DASS-Small search procedure was run five times with different random seeds. Figure 12 plots the average accuracy and loss curves, with shaded regions indicating confidence intervals. The results show that, while the confidence interval is wide at first, the runs converge to neural architectures with similar performance, with an average standard deviation (STDEV) of 2.22%.
Fig. 12.
Fig. 12. Demonstrating the reproducibility of DASS results.

7 Conclusion and Future Work

We propose DASS, a differentiable architecture search method, to design high-performance sparse architectures for DNNs. DASS significantly improves the performance of sparse architectures by proposing: (i) a new search space that contains sparse parametric operations; and (ii) a new search objective that is consistent with sparsity and pruning mechanisms. Our experimental results reveal that the learned sparse architectures outperform the architectures used in the state-of-the-art on both the CIFAR-10 and ImageNet datasets. In the long term, we foresee that our designed networks can effectively contribute to the goal of green artificial intelligence by efficiently utilizing resource-constrained devices as edge-accelerating solutions. We propose several promising directions for future research:
(1)
Developing an indicator to evaluate sparse networks to enable training-free sparse NAS.
(2)
Integration of quantization into the DASS framework to identify the optimal combination of sparsity and low-precision architecture.
(3)
Decreasing the complexity of the optimization algorithm by combining pre-training and pruning steps.
(4)
Improving latency and energy efficiency by employing structured pruning techniques.

Acknowledgments

The computations were enabled by the supercomputing resource Berzelius provided by National Supercomputer Centre at Linköping University and the Knut and Alice Wallenberg foundation.

References

[1]
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems 28 (2015).
[2]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[3]
Athanasios Voulodimos, Nikolaos Doulamis, Anastasios Doulamis, and Eftychios Protopapadakis. 2018. Deep learning for computer vision: A brief review. Computational Intelligence and Neuroscience 2018 (2018).
[4]
Zhuwei Qin, Fuxun Yu, Chenchen Liu, and Xiang Chen. 2018. How convolutional neural network see the world-A survey of convolutional neural network visualization methods. arXiv preprint arXiv:1804.11191 (2018).
[5]
Amir Gholami, Zhewei Yao, Sehoon Kim, Michael Mahoney, and Kurt Keutzer. 2021. AI and memory wall. RiseLab Medium Post (2021).
[6]
Mohammad Loni, Ali Zoljodi, Amin Majd, Byung Hoon Ahn, Masoud Daneshtalab, Mikael Sjödin, and Hadi Esmaeilzadeh. 2021. FastStereoNet: A fast neural architecture search for improving the inference of disparity estimation on resource-limited platforms. IEEE Transactions on Systems, Man, and Cybernetics: Systems (2021).
[7]
Ji Lin, Ligeng Zhu, Wei-Ming Chen, Wei-Chen Wang, Chuang Gan, and Song Han. 2022. On-device training under 256kb memory. arXiv preprint arXiv:2206.15472 (2022).
[8]
Ji Lin, Wei-Ming Chen, Yujun Lin, John Cohn, Chuang Gan, and Song Han. 2020. Mcunet: Tiny deep learning on iot devices. arXiv preprint arXiv:2007.10319 (2020).
[9]
Mohammad Loni, Sima Sinaei, Ali Zoljodi, Masoud Daneshtalab, and Mikael Sjödin. 2020. DeepMaker: A multi-objective optimization framework for deep neural networks in embedded systems. Microprocessors and Microsystems 73 (2020), 102989.
[10]
Vikash Sehwag, Shiqi Wang, Prateek Mittal, and Suman Jana. 2020. Hydra: Pruning adversarially robust neural networks. Advances in Neural Information Processing Systems 33 (2020), 19655–19666.
[11]
Tailin Liang, John Glossner, Lei Wang, Shaobo Shi, and Xiaotong Zhang. 2021. Pruning and quantization for deep neural network acceleration: A survey. Neurocomputing 461 (2021), 370–403.
[12]
Xinyu Zhang, Ian Colbert, Ken Kreutz-Delgado, and Srinjoy Das. 2021. Training deep neural networks with joint quantization and pruning of weights and activations. arXiv preprint arXiv:2110.08271 (2021).
[13]
Enmao Diao, Ganghua Wang, Jiawei Zhan, Yuhong Yang, Jie Ding, and Vahid Tarokh. 2023. Pruning deep neural networks from a sparsity perspective. arXiv preprint arXiv:2302.05601 (2023).
[14]
Mohammad Loni, Hamid Mousavi, Mohammad Riazati, Masoud Daneshtalab, and Mikael Sjödin. 2022. TAS: Ternarized neural architecture search for resource-constrained edge devices. In Design, Automation & Test in Europe Conference & Exhibition DATE’22, 14 March 2022, Antwerp, Belgium. IEEE. http://www.es.mdh.se/publications/6351-
[15]
Adrian Bulat, Brais Martinez, and Georgios Tzimiropoulos. 2020. Bats: Binary architecture search. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIII 16. Springer, 309–325.
[16]
Dahyun Kim, Kunal Pratap Singh, and Jonghyun Choi. 2020. Learning architectures for binary networks. In European Conference on Computer Vision. Springer, 575–591.
[17]
G. Hinton, O. Vinyals, and J. Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015). NIPS 2014 Deep Learning Workshop.
[18]
Jianping Gou, Baosheng Yu, Stephen J. Maybank, and Dacheng Tao. 2021. Knowledge distillation: A survey. International Journal of Computer Vision 129, 6 (2021), 1789–1819.
[19]
Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. 2014. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866 (2014).
[20]
Kambiz Azarian, Yash Bhalgat, Jinwon Lee, and Tijmen Blankevoort. 2020. Learned threshold pruning. arXiv preprint arXiv:2003.00075 (2020).
[21]
Marwa El Halabi, Suraj Srinivas, and Simon Lacoste-Julien. 2022. Data-efficient structured pruning via submodular optimization. arXiv preprint arXiv:2203.04940 (2022).
[22]
Yang He, Guoliang Kang, Xuanyi Dong, Yanwei Fu, and Yi Yang. 2018. Soft filter pruning for accelerating deep convolutional neural networks. arXiv preprint arXiv:1808.06866 (2018).
[23]
Yang He, Ping Liu, Ziwei Wang, Zhilan Hu, and Yi Yang. 2019. Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4340–4349.
[24]
Yang He, Yuhang Ding, Ping Liu, Linchao Zhu, Hanwang Zhang, and Yi Yang. 2020. Learning filter pruning criteria for deep convolutional neural networks acceleration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2009–2018.
[25]
Tao Zhuang, Zhixuan Zhang, Yuheng Huang, Xiaoyi Zeng, Kai Shuang, and Xiang Li. 2020. Neuron-level structured pruning using polarization regularizer. In NeurIPS.
[26]
Yihua Zhang, Yuguang Yao, Parikshit Ram, Pu Zhao, Tianlong Chen, Mingyi Hong, Yanzhi Wang, and Sijia Liu. 2022. Advancing model pruning via Bi-level optimization. arXiv preprint arXiv:2210.04092 (2022).
[27]
Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2018. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055 (2018).
[28]
Peng Ye, Baopu Li, Yikang Li, Tao Chen, Jiayuan Fan, and Wanli Ouyang. 2022. b-darts: Beta-decay regularization for differentiable architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10874–10883.
[29]
Peng Ye, Baopu Li, Yikang Li, Tao Chen, Jiayuan Fan, and Wanli Ouyang. 2022. beta-DARTS: Beta-decay regularization for differentiable architecture search. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’22). IEEE, 10864–10873.
[30]
Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4510–4520.
[31]
Mingxing Tan and Quoc Le. 2019. Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning. PMLR, 6105–6114.
[32]
Zhengyang Shen, Lingshen He, Zhouchen Lin, and Jinwen Ma. 2020. Pdo-econvs: Partial differential operator based equivariant convolutions. In International Conference on Machine Learning. PMLR, 8697–8706.
[33]
David W. Romero, Robert-Jan Bruintjes, Jakub M. Tomczak, Erik J. Bekkers, Mark Hoogendoorn, and Jan C. van Gemert. 2021. Flexconv: Continuous kernel convolutions with differentiable kernel sizes. arXiv preprint arXiv:2110.08059 (2021).
[34]
Michael E. Sander, Pierre Ablin, Mathieu Blondel, and Gabriel Peyré. 2021. Momentum residual neural networks. arXiv preprint arXiv:2102.07870 (2021).
[35] Xiangning Chen, Ruochen Wang, Minhao Cheng, Xiaocheng Tang, and Cho-Jui Hsieh. 2020. Drnas: Dirichlet neural architecture search. arXiv preprint arXiv:2006.10355 (2020).
[36] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. 2019. Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 4780–4789.
[37] Mohammad Loni, Ali Zoljodi, Sima Sinaei, Masoud Daneshtalab, and Mikael Sjödin. 2019. Neuropower: Designing energy efficient convolutional neural network architecture for embedded systems. In International Conference on Artificial Neural Networks. Springer, 208–222.
[38] Yuqiao Liu, Yanan Sun, Bing Xue, Mengjie Zhang, Gary G. Yen, and Kay Chen Tan. 2021. A survey on evolutionary neural architecture search. IEEE Transactions on Neural Networks and Learning Systems (2021).
[39] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. 2018. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 8697–8710.
[40] Barret Zoph and Quoc V. Le. 2016. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578 (2016).
[41] Yesmina Jaafra, Jean Luc Laurent, Aline Deruyver, and Mohamed Saber Naceur. 2019. Reinforcement learning for neural architecture search: A review. Image and Vision Computing 89 (2019), 57–66.
[42] Andrew Brock, Theodore Lim, James M. Ritchie, and Nick Weston. 2017. Smash: One-shot model architecture search through hypernetworks. arXiv preprint arXiv:1708.05344 (2017).
[43] Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. 2020. Single path one-shot neural architecture search with uniform sampling. In European Conference on Computer Vision. Springer, 544–560.
[44] Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc Le. 2018. Understanding and simplifying one-shot architecture search. In International Conference on Machine Learning. PMLR, 550–559.
[45] Pengzhen Ren, Yun Xiao, Xiaojun Chang, Po-Yao Huang, Zhihui Li, Xiaojiang Chen, and Xin Wang. 2021. A comprehensive survey of neural architecture search: Challenges and solutions. ACM Computing Surveys (CSUR) 54, 4 (2021), 1–34.
[46] Andrew Hundt, Varun Jain, and Gregory D. Hager. 2019. sharpdarts: Faster and more accurate differentiable architecture search. arXiv preprint arXiv:1903.09900 (2019).
[47] Shahid Siddiqui, Christos Kyrkou, and Theocharis Theocharides. 2021. Operation and topology aware fast differentiable architecture search. In 2020 25th International Conference on Pattern Recognition (ICPR’21). IEEE, 9666–9673.
[48] Ruochen Wang, Minhao Cheng, Xiangning Chen, Xiaocheng Tang, and Cho-Jui Hsieh. 2021. Rethinking architecture selection in differentiable NAS. arXiv preprint arXiv:2108.04392 (2021).
[49] Arber Zela, Thomas Elsken, Tonmoy Saikia, Yassine Marrakchi, Thomas Brox, and Frank Hutter. 2019. Understanding and robustifying differentiable architecture search. arXiv preprint arXiv:1909.09656 (2019).
[50] Zhixiong Yue, Baijiong Lin, Xiaonan Huang, and Yu Zhang. 2020. Effective, efficient and robust neural architecture search. arXiv preprint arXiv:2011.09820 (2020).
[51] Ramtin Hosseini, Xingyi Yang, and Pengtao Xie. 2021. DSRNA: Differentiable search of robust neural architectures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6196–6205.
[52] Xiaojie Jin, Jiang Wang, Joshua Slocum, Ming-Hsuan Yang, Shengyang Dai, Shuicheng Yan, and Jiashi Feng. 2019. Rc-darts: Resource constrained differentiable architecture search. arXiv preprint arXiv:1912.12814 (2019).
[53] Hayeon Lee, Sewoong Lee, Song Chong, and Sung Ju Hwang. 2021. HELP: Hardware-adaptive efficient latency predictor for NAS via meta-learning. arXiv preprint arXiv:2106.08630 (2021).
[54] Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. 2019. Once-for-all: Train one network and specialize it for efficient deployment. arXiv preprint arXiv:1908.09791 (2019).
[55] Kevin Alexander Laube and Andreas Zell. 2019. Prune and replace nas. In 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA’19). IEEE, 915–921.
[56] Asaf Noy, Niv Nayman, Tal Ridnik, Nadav Zamir, Sivan Doveh, Itamar Friedman, Raja Giryes, and Lihi Zelnik. 2020. Asap: Architecture search, anneal and prune. In International Conference on Artificial Intelligence and Statistics. PMLR, 493–503.
[57] Weijun Hong, Guilin Li, Weinan Zhang, Ruiming Tang, Yunhe Wang, Zhenguo Li, and Yong Yu. 2021. Dropnas: Grouped operation dropout for differentiable architecture search. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence. 2326–2332.
[58] Yadong Ding, Yu Wu, Chengyue Huang, Siliang Tang, Fei Wu, Yi Yang, Wenwu Zhu, and Yueting Zhuang. 2022. NAP: Neural architecture search with pruning. Neurocomputing (2022).
[59] Yanqing Hu, Qing Ye, Huan Fu, and Jiancheng Lv. 2022. Sparse DARTS with various recovery algorithms. In Proceedings of the 8th International Conference on Computing and Artificial Intelligence. 76–82.
[60] Ting Zhang, Muhammad Waqas, Hao Shen, Zhaoying Liu, Xiangyu Zhang, Yujian Li, Zahid Halim, and Sheng Chen. 2021. A neural network architecture optimizer based on DARTS and generative adversarial learning. Information Sciences 581 (2021), 448–468.
[61] Tuanhui Li, Baoyuan Wu, Yujiu Yang, Yanbo Fan, Yong Zhang, and Wei Liu. 2019. Compressing convolutional neural networks via factorized convolutional filters. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3977–3986.
[62] Yushuo Guan, Ning Liu, Pengyu Zhao, Zhengping Che, Kaigui Bian, Yanzhi Wang, and Jian Tang. 2022. Dais: Automatic channel pruning via differentiable annealing indicator search. IEEE Transactions on Neural Networks and Learning Systems (2022).
[63] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. 2018. Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270 (2018).
[64] Song Han, Jeff Pool, John Tran, and William J. Dally. 2015. Learning both weights and connections for efficient neural networks. arXiv preprint arXiv:1506.02626 (2015).
[65] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. 2016. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710 (2016).
[66] Jonathan Frankle and Michael Carbin. 2018. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635 (2018).
[67] Yann LeCun, John S. Denker, and Sara A. Solla. 1990. Optimal brain damage. In Advances in Neural Information Processing Systems. 598–605.
[68] Babak Hassibi and David G. Stork. 1993. Second Order Derivatives for Network Pruning: Optimal Brain Surgeon. Morgan Kaufmann.
[69] Hattie Zhou, Janice Lan, Rosanne Liu, and Jason Yosinski. 2019. Deconstructing lottery tickets: Zeros, signs, and the supermask. arXiv preprint arXiv:1905.01067 (2019).
[70] Shaokai Ye, Kaidi Xu, Sijia Liu, Hao Cheng, Jan-Henrik Lambrechts, Huan Zhang, Aojun Zhou, Kaisheng Ma, Yanzhi Wang, and Xue Lin. 2019. Adversarial robustness vs. model compression, or both?. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 111–120.
[71] Shupeng Gui, Haotao N. Wang, Haichuan Yang, Chen Yu, Zhangyang Wang, and Ji Liu. 2019. Model compression with adversarial robustness: A unified optimization framework. Advances in Neural Information Processing Systems 32 (2019), 1285–1296.
[72] Aditya Kusupati, Vivek Ramanujan, Raghav Somani, Mitchell Wortsman, Prateek Jain, Sham Kakade, and Ali Farhadi. 2020. Soft threshold weight reparameterization for learnable sparsity. In Proceedings of the International Conference on Machine Learning.
[73] Rebekka Burkholz, Nilanjana Laha, Rajarshi Mukherjee, and Alkis Gotovos. 2021. On the existence of universal lottery tickets. arXiv preprint arXiv:2111.11146 (2021).
[74] Tianlong Chen, Zhenyu Zhang, Sijia Liu, Yang Zhang, Shiyu Chang, and Zhangyang Wang. 2022. Data-efficient double-win lottery tickets from robust pre-training. In International Conference on Machine Learning. PMLR, 3747–3759.
[75] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org
[76] Hervé Abdi. 2007. The Kendall rank correlation coefficient. Encyclopedia of Measurement and Statistics. Sage, Thousand Oaks, CA (2007), 508–510.
[77] Alex Krizhevsky and Geoffrey Hinton. 2009. Learning multiple layers of features from tiny images. (2009), 7.
[78] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25 (2012), 1097–1105.
[79] Ilya Loshchilov and Frank Hutter. 2016. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016).
[80] Chris Ying, Aaron Klein, Eric Christiansen, Esteban Real, Kevin Murphy, and Frank Hutter. 2019. Nas-bench-101: Towards reproducible neural architecture search. In International Conference on Machine Learning. PMLR, 7105–7114.
[81] Xuanyi Dong and Yi Yang. 2020. Nas-bench-201: Extending the scope of reproducible neural architecture search. arXiv preprint arXiv:2001.00326 (2020).
[82] Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. 2019. Quantifying the carbon emissions of machine learning. arXiv preprint arXiv:1910.09700 (2019).
[83] Li Lyna Zhang, Shihao Han, Jianyu Wei, Ningxin Zheng, Ting Cao, Yuqing Yang, and Yunxin Liu. 2021. nn-Meter: Towards accurate latency prediction of deep-learning model inference on diverse edge devices. In Proceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services. 81–93.
[84] Guohao Li, Guocheng Qian, Itzel C. Delgadillo, Matthias Muller, Ali Thabet, and Bernard Ghanem. 2020. Sgas: Sequential greedy architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1620–1630.
[85] Xiangning Chen and Cho-Jui Hsieh. 2020. Stabilizing differentiable architecture search via perturbation-based regularization. In International Conference on Machine Learning. PMLR, 1554–1565.
[86] Simone Bianco, Remi Cadene, Luigi Celona, and Paolo Napoletano. 2018. Benchmark analysis of representative deep neural network architectures. IEEE Access 6 (2018), 64270–64277.
[87] Xuanyi Dong and Yi Yang. 2019. Network pruning via transformable architecture search. Advances in Neural Information Processing Systems 32 (2019).
[88] Michael Zhu and Suyog Gupta. 2017. To prune, or not to prune: Exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878 (2017).
[89] Kimessha Paupamah, Steven James, and Richard Klein. 2020. Quantisation and pruning for neural network compression and regularisation. In 2020 International SAUPEC/RobMech/PRASA Conference. IEEE, 1–6.
[90] Andrea Bragagnolo and Carlo Alberto Barbano. 2022. Simplify: A Python library for optimizing pruned neural networks. SoftwareX 17 (2022), 100907.
[91] Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, 11 (2008).
[92] Dingjun Yu, Hanli Wang, Peiqiu Chen, and Zhihua Wei. 2014. Mixed pooling for convolutional neural networks. In International Conference on Rough Sets and Knowledge Technology. Springer, 364–375.

Published In

ACM Transactions on Embedded Computing Systems, Volume 22, Issue 5s
Special Issue ESWEEK 2023, October 2023, 1394 pages
ISSN: 1539-9087; EISSN: 1558-3465
DOI: 10.1145/3614235
Editor: Tulika Mitra

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 09 September 2023
Accepted: 10 July 2023
Revised: 02 June 2023
Received: 23 March 2023
Published in TECS Volume 22, Issue 5s

Author Tags

  1. Neural architecture search
  2. network sparsification
  3. optimization
  4. image classification

Funding Sources

  • European Union through European Social Fund in the frames of the “Information and Communication Technologies (ICT) program”
  • Swedish Innovation Agency VINNOVA projects “AutoDeep”, “SafeDeep”, and “KKS DPAC”
