Article

Efficient and Flexible Method for Reducing Moderate-Size Deep Neural Networks with Condensation

School of Mathematical Sciences, Institute of Natural Sciences, MOE-LSC, Shanghai Jiao Tong University, Shanghai 200240, China
*
Author to whom correspondence should be addressed.
Entropy 2024, 26(7), 567; https://doi.org/10.3390/e26070567
Submission received: 28 April 2024 / Revised: 20 June 2024 / Accepted: 24 June 2024 / Published: 30 June 2024
(This article belongs to the Special Issue An Information-Theoretical Perspective on Complex Dynamical Systems)

Abstract

Neural networks have been extensively applied to a variety of tasks, achieving astounding results. Applying neural networks in the scientific field is an important research direction that is gaining increasing attention. In scientific applications, neural networks are generally of moderate size, mainly to ensure the speed of inference during application. Additionally, comparison with traditional algorithms is inevitable in scientific applications. These applications often require rapid computations, making the reduction of neural network size increasingly important. Existing work has found that the powerful capabilities of neural networks are primarily due to their nonlinearity. Theoretical work has discovered that under strong nonlinearity, neurons in the same layer tend to behave similarly, a phenomenon known as condensation. Condensation offers an opportunity to reduce the scale of a neural network to a smaller subnetwork with a similar performance. In this article, we propose a condensation reduction method to verify the feasibility of this idea in practical problems, thereby validating existing theories. Our reduction method can currently be applied to both fully connected networks and convolutional networks, achieving positive results. In complex combustion acceleration tasks, we reduced the size of the neural network to 41.7% of its original scale while maintaining prediction accuracy. In the CIFAR10 image classification task, we reduced the network size to 11.5% of the original scale, still maintaining a satisfactory validation accuracy. Our method can be applied to most trained neural networks, reducing computational pressure and improving inference speed.

1. Introduction

Neural networks have achieved globally astounding results, demonstrating an exceptional performance across a range of tasks in scientific fields including biology, medicine, astronomy, environmental science, physics, chemistry, etc. [1,2,3,4,5,6]. Serving as a novel numerical solution tool, neural networks have been effectively applied in solving partial differential equations within various domains [7,8,9,10]. Neural networks can learn multiscale models from data by designing appropriate network structures and sampling methods, offering insights into these intricate systems [11].
Currently, the scale of neural networks applied in the scientific field is generally moderate [7]. Compared to the industrial and internet sectors, obtaining experimental data in the scientific field often incurs high costs, resulting in significantly fewer data samples available for neural network training [7]. This limitation in sample size restricts the scale of networks that can be trained. In addition, training large-scale neural networks requires expensive computational resources. The computing conditions in academia generally lag behind those in the industrial sector, making it challenging to support the training and inference of overly large neural network models.
Reducing the scale of neural networks applied in the scientific field holds significant importance [12,13]. Many traditional numerical algorithms, such as multigrid methods and fast Fourier transforms, exhibit quasi-linear or even linear time complexity, boasting an extremely high computational efficiency [14]. When applying neural networks to time-sensitive scientific computing tasks, it is crucial to reduce the network size to enhance its inference speed [11]. Additionally, many scientific applications require deploying models in resource-constrained environments, such as embedded systems, mobile devices, and sensors [15,16]. The computational power of these deployment platforms often falls far short of the platforms used for the original training of the tasks. Therefore, it is essential to reduce the network size as much as possible while maintaining the performance of the original network model, enabling efficient deployment in these contexts.
Nonlinearity is the fundamental characteristic that endows neural networks with their powerful expressive, learning, and generative capacities [17]. Nonlinear activation functions grant neural networks the ability to learn complex patterns, enabling neurons to model complex nonlinear relationships and approximate any continuous function with arbitrary precision. The superposition of multiple layers of nonlinear transformations allows neural networks to construct highly nonlinear decision boundaries and learn complex nonlinear mappings between inputs and outputs. Theoretical findings indicate that in the presence of strong nonlinearity within neural networks, neurons in the same layer tend to align with one another, a phenomenon known as condensation [17]. Condensation refers to the gradual alignment of the direction of neurons' parameter vectors during the training process, implying a uniform preference for the inputs from the previous layer. This phenomenon is prevalent across various neural network architectures, suggesting that condensed neurons function similarly, or are nearly equivalent to a single neuron. Condensation implies the presence of a large number of similar neurons in neural networks, indicating that the structural complexity of neural networks may not be as high as it appears. When the network reaches an extreme point, the condensation phenomenon becomes more pronounced. In this case, the embedding principle suggests that the extreme point reached by the network is actually an extreme point of one of its subnetworks [18].
Based on the above idea, we propose the condensation reduction method to validate existing theories. The main contribution of this paper is the successful validation of the universality of the condensation phenomenon and the applicability of the embedding principle in practical applications through the condensation reduction method. The use of the condensation reduction method requires a pretrained network, which could be considered a drawback. However, this limitation might be quite common, as seen in the work related to the “Lottery Ticket Hypothesis” [19].
From the perspective of reduction algorithms, our method can be approximated as a pruning method. The key difference is that our approach merges branches instead of merely removing redundant ones. The work by Liu et al. [20] proposed a network condensation method, but their definition of condensation differs from ours. In their work, the term “condense” is primarily used as a verb to denote operations such as simplifying, refining, and compressing the network. However, in this study, “condensation” is defined as a significant phenomenon commonly observed during neural network training and can be directly observed using cosine similarity. Most pruning methods measure the importance of parameters based on certain metrics and remove the less important ones [21], which can be either structured or unstructured. These metrics usually include parameter norms, impact on loss, and sensitivity [22,23,24].
In contrast, our condensation reduction method is a structured network reduction approach. Unlike structured pruning, we do not delete neurons or parameters based on metrics. Instead, we merge neurons based on their similarity (degree of condensation). Merging neurons involves deleting redundant neurons and modifying the parameters of the retained neurons, ensuring that the merged neurons have expressive capabilities similar to those before the merge. In the following sections, we prove that, under strict conditions, the neural network's output is consistent before and after the merger.
By leveraging the condensation phenomenon for neural network reduction, we aim to strike a balance between model performance and model size to identify an appropriate subnet. Utilizing the embedding principle, we understand that when a neural network reaches an extremum, it is likely at an extremum of a smaller subnet, making the neural network approximately equivalent to this subnet [18]. At this point, we merge neurons that have condensed into a single neuron, resulting in a subnet of the original network and thereby achieving a reduction in the neural network’s scale. The reduction based on the condensation phenomenon requires only the calculation of angles between neurons, making the time required for each reduction negligible compared to the training time. Owing to the universality of the condensation phenomenon, our reduction algorithm can be broadly applied to various types of models.
We are currently capable of reducing two major categories of neural networks: fully connected neural networks (FNNs) and convolutional neural networks (CNNs), achieving promising results. We selected two representative tasks for model reduction: fitting and classification. The fitting task involved neural network acceleration for combustion simulation, while the classification task focused on image classification on the CIFAR10 dataset, corresponding to fully connected neural networks and convolutional neural networks, respectively. In the acceleration of the combustion simulation, we reduced the original model to 40.9% of its parameter count, maintaining consistency in both zero-dimensional and one-dimensional combustion simulations with the original model, and observed virtually no difference in complex turbulent flame simulations. For the CIFAR10 image classification task, we reduced the model to 11.5% of its original parameter count, with the classification accuracy dropping only to 94% of the original accuracy.
The rest of the article is organized as follows: The second part, Materials and Methods, introduces the concept of condensation and the details of condensation reduction. The third part, Results, presents the outcomes of condensation reduction in combustion tasks and CIFAR10 image classification tasks. The fourth part, Discussion, provides a discussion on the results and methods.

2. Materials and Methods

2.1. Condensation Reduction in FNN

2.1.1. FNN

Consider an $(L+1)$-layer fully connected neural network with a structure of
$$d_{\mathrm{in}} \to m_1 \to \cdots \to m_i \to \cdots \to m_L \to d_{\mathrm{out}},$$
where $m_i$ is the width of the i-th hidden layer. The neural network can be defined as
$$x^{[0]} = (x, 1), \qquad x^{[i]} = \left(\sigma\!\left(W^{[i]} x^{[i-1]}\right), 1\right) \ \ \text{for } i \in \{1, 2, \ldots, L\}, \qquad f(\theta, x) = a \cdot x^{[L]} \triangleq f_{\theta}(x),$$
where $W^{[i]} = (W^{[i]}_0, b^{[i]})$ consists of both the weight matrix and the bias vector, and $\sigma(\cdot)$ is the activation function. The parameter matrix of the i-th layer of the neural network is
$$\underset{m_i \times m_{i-1}}{W^{[i]}} = \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_{m_i} \end{pmatrix},$$
wherein $v_t$ is the t-th neuron in the i-th layer.

2.1.2. Cosine Similarity

We define the cosine similarity between the i-th layer neurons as
$$S(v_i, v_j) = \frac{v_i^{\top} v_j}{\left(v_i^{\top} v_i\right)^{1/2} \left(v_j^{\top} v_j\right)^{1/2}}.$$
When considering neurons as vectors, if the cosine similarity between them is very close to 1, their directions are very similar, indicating that these two neurons are highly similar. The calculation of the cosine similarity is straightforward. Furthermore, in the following section, we will demonstrate that when the cosine similarity is exactly 1, the functions expressed by the original network and the reduced network are consistent.

2.1.3. Condensation of FNN

Consider a neural network with a single hidden layer whose activation function satisfies homogeneity, i.e.,
$$\sigma(k \cdot \mathrm{input}) = k\,\sigma(\mathrm{input}), \quad k \in \mathbb{R}^{+}, \qquad f_{\theta}(x) = \sum_{i=1}^{m} a_i\, \sigma\!\left(v_i \cdot x\right).$$
The ReLU activation function satisfies homogeneity.
If the directions of the p-th and q-th neurons are the same, i.e., when the cosine similarity is 1, we have
$$v_q = \lambda v_p, \quad \lambda \in \mathbb{R}^{+}.$$
The neural network at this point can be represented as
$$f_{\theta}(x) = \sum_{i=1}^{m} a_i \sigma\!\left(v_i \cdot x\right) = \sum_{i \neq p, q}^{m} a_i \sigma\!\left(v_i \cdot x\right) + a_p \sigma\!\left(v_p \cdot x\right) + a_q \sigma\!\left(v_q \cdot x\right) = \sum_{i \neq p, q}^{m} a_i \sigma\!\left(v_i \cdot x\right) + \left(a_p + \lambda a_q\right) \sigma\!\left(v_p \cdot x\right) \triangleq f_{\theta}^{[\mathrm{new}]}(x).$$
At this point, the function expression of the neural network is completely identical to the new neural network formed by merging neurons p and q. This means that neuron p and neuron q are equivalent to a single neuron at this moment. The above derivation can be generalized to any fully connected neural network of arbitrary depth that meets the criteria of homogeneity.
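To make this identity concrete, the following minimal NumPy check (an illustration added for this presentation, using a toy one-hidden-layer ReLU network rather than the models studied later) verifies numerically that merging two parallel neurons leaves the network output unchanged.

```python
import numpy as np

# Toy check of the merging identity: a one-hidden-layer ReLU network in which
# one neuron is an exact positive multiple of another.
rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

m, d = 5, 3
V = rng.normal(size=(m, d))      # hidden neurons v_1, ..., v_m (one per row)
a = rng.normal(size=m)           # output weights a_1, ..., a_m
lam = 2.5
V[4] = lam * V[3]                # make the 5th neuron (index 4) parallel to the 4th (index 3)

x = rng.normal(size=d)
f_original = a @ relu(V @ x)

# Merge q (index 4) into p (index 3): drop v_q and set a_p <- a_p + lam * a_q.
V_new = V[:4]
a_new = a[:4].copy()
a_new[3] = a[3] + lam * a[4]
f_merged = a_new @ relu(V_new @ x)

assert np.allclose(f_original, f_merged)   # outputs coincide
```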

2.1.4. Condensation Reduction

Leveraging the phenomenon of condensation for reduction involves merging neurons that have condensed into a single neuron. Determining whether two neurons have condensed is a matter of judging whether their directions are similar. Thus, we only need to artificially set a threshold for the cosine similarity. When the cosine similarity between two neurons exceeds this threshold, they are considered to have condensed. We will henceforth refer to the cosine similarity threshold as the condensation threshold.
To perform a condensation-based reduction, we first need to obtain the cosine similarity among neurons in the l-th layer targeted for reduction. By utilizing the formula for cosine similarity, we can easily calculate it using the parameter matrix of that layer of the neural network, specifically:
$$C^{[l]} = D^{[l]} A^{[l]} D^{[l]},$$
wherein
$$A^{[l]} = (a_{ij}) = W^{[l]} \left(W^{[l]}\right)^{\top}, \qquad D^{[l]} = \mathrm{diag}\!\left(\frac{1}{\sqrt{a_{11}}}, \ldots, \frac{1}{\sqrt{a_{m_l m_l}}}\right).$$
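A minimal NumPy sketch of this computation is given below; the function name and the convention of one neuron per row are our own choices rather than the original implementation.

```python
import numpy as np

def cosine_similarity_matrix(W: np.ndarray) -> np.ndarray:
    """Cosine similarity C = D A D between the rows (neurons) of a layer's
    parameter matrix W of shape (m_l, m_{l-1})."""
    A = W @ W.T                           # Gram matrix, a_ij = v_i . v_j
    d = 1.0 / np.sqrt(np.diag(A))         # 1 / sqrt(a_ii) = 1 / ||v_i||_2
    return (d[:, None] * A) * d[None, :]  # equivalent to D A D
```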
After obtaining the cosine similarities among neurons, we need to group neurons that have condensed together. In each grouping session, given the condensation threshold, we calculate the number of neurons that have condensed with each ungrouped neuron. Subsequently, we select the neuron with the highest number of condensed neurons as the main neuron for that group.
After grouping all neurons, the next step is to merge the condensed neurons. We merge all neurons in each group toward the main neuron of that group. Let us assume, for simplicity, that the first N neurons are classified into one group. Then, we have
$$u_{\mathrm{main}}^{[\mathrm{new}]} = \sum_{k=1}^{N} \frac{\|v_k\|_2}{\|v_{\mathrm{main}}\|_2}\, u_k, \qquad v_{\mathrm{main}}^{[\mathrm{new}]} = v_{\mathrm{main}},$$
wherein
$$\underset{m_i \times m_{i-1}}{W^{[l]}} = \begin{pmatrix} v_1 \\ \vdots \\ v_N \\ v_{N+1} \\ \vdots \\ v_{m_i} \end{pmatrix}, \qquad \underset{m_{i+1} \times m_i}{W^{[l+1]}} = \left[\, u_1 \ \cdots \ u_N \ u_{N+1} \ \cdots \ u_{m_i} \,\right].$$
The new neural network parameter matrix obtained is
$$\underset{(m_i - N + 1) \times m_{i-1}}{W^{[l]}_{\mathrm{new}}} = \begin{pmatrix} v_{\mathrm{main}}^{[\mathrm{new}]} \\ v_{N+1} \\ \vdots \\ v_{m_i} \end{pmatrix}, \qquad \underset{m_{i+1} \times (m_i - N + 1)}{W^{[l+1]}_{\mathrm{new}}} = \left[\, u_{\mathrm{main}}^{[\mathrm{new}]} \ u_{N+1} \ \cdots \ u_{m_i} \,\right].$$
By using the new neural network parameter matrix, we can create a new neural network, thus completing the condensation-based reduction process.
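The grouping and merging steps described above can be sketched as follows. This is a hedged illustration, not the original code: it assumes NumPy parameter matrices with one neuron per row (bias folded in), takes the main neuron of each group to be the ungrouped neuron with the most condensed partners, and applies the merging rule $u_{\mathrm{main}}^{[\mathrm{new}]} = \sum_k \left(\|v_k\|_2 / \|v_{\mathrm{main}}\|_2\right) u_k$.

```python
import numpy as np

def group_condensed(C: np.ndarray, threshold: float):
    """Greedily group neurons: the ungrouped neuron with the most condensed
    (similarity > threshold) ungrouped partners becomes the group's main neuron."""
    ungrouped = set(range(C.shape[0]))
    groups = []                                   # list of (main_index, member_indices)
    while ungrouped:
        idx = list(ungrouped)
        _, main = max((np.sum(C[i, idx] > threshold), i) for i in idx)
        members = [j for j in idx if C[main, j] > threshold]  # always contains main
        groups.append((main, members))
        ungrouped -= set(members)
    return groups

def merge_layer(W_l: np.ndarray, W_next: np.ndarray, groups):
    """Keep one main neuron per group in layer l and accumulate the scaled
    outgoing weights (columns of the next layer) onto it."""
    keep_rows, new_cols = [], []
    for main, members in groups:
        keep_rows.append(main)
        scale = np.linalg.norm(W_l[members], axis=1) / np.linalg.norm(W_l[main])
        new_cols.append(W_next[:, members] @ scale)  # u_main_new = sum_k (||v_k||/||v_main||) u_k
    return W_l[keep_rows], np.stack(new_cols, axis=1)

# Example usage (hypothetical matrices W_l and W_next):
# C = cosine_similarity_matrix(W_l)
# groups = group_condensed(C, threshold=0.9)
# W_l_new, W_next_new = merge_layer(W_l, W_next, groups)
```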

2.2. Condensation Reduction in CNN

2.2.1. CNN

Convolutional neural networks can be understood as fully connected neural networks with shared weights. Therefore, all the definitions related to fully connected networks mentioned above can naturally be extended to apply to CNNs as well.
Consider a neural network that contains only convolutional layers and fully connected layers; its structure is as follows:
$$d_{\mathrm{in}} \to m_C^{[1]} \to \cdots \to m_C^{[i]} \to \cdots \to m_C^{[L]} \to m_{F_1} \to m_{F_2} \to d_{\mathrm{out}},$$
wherein
$$m_C^{[i]} = \left(C_{\mathrm{inchannel}}^{[i]},\ C_{\mathrm{outchannel}}^{[i]},\ \mathrm{Size}^{[i]}\right); \qquad \mathrm{Size}^{[i]} = s^{[i]} \times s^{[i]},$$
wherein $C_{\mathrm{inchannel}}^{[i]}$ and $C_{\mathrm{outchannel}}^{[i]}$ are the dimensions of the input channel and output channel of the i-th layer, respectively, and $\mathrm{Size}^{[i]}$ is the size of the convolution kernel. We view the i-th convolutional layer as containing $C_{\mathrm{outchannel}}^{[i]}$ neurons, each neuron being a convolutional kernel, represented by a vector as follows:
$$\underset{C_{\mathrm{inchannel}}^{[l]} \times \mathrm{Size}^{[l]}}{v_i} = \left(k_1^1, \ldots, k_{\mathrm{Size}^{[l]}}^1, \ldots, k_1^{C_{\mathrm{inchannel}}^{[l]}}, \ldots, k_{\mathrm{Size}^{[l]}}^{C_{\mathrm{inchannel}}^{[l]}}\right).$$
The parameter matrix of this layer can be represented as
$$\underset{C_{\mathrm{outchannel}}^{[l]} \times \left(C_{\mathrm{inchannel}}^{[l]} \times \mathrm{Size}^{[l]}\right)}{W^{[l]}} = \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_{C_{\mathrm{outchannel}}^{[l]}} \end{pmatrix}.$$
For the patch of input data processed at each position of the convolution kernel, one layer of convolution operations can be represented as
$$\underset{C_{\mathrm{outchannel}}^{[l]}}{y} = \sigma\!\left( \underset{C_{\mathrm{outchannel}}^{[l]} \times \left(C_{\mathrm{inchannel}}^{[l]} \times \mathrm{Size}^{[l]}\right)}{W^{[l]}} \cdot \underset{C_{\mathrm{inchannel}}^{[l]} \times \mathrm{Size}^{[l]}}{x} \right).$$
Then, we calculate the cosine similarity between the neurons $v_i$ according to the previous definition.

2.2.2. Condensation of CNN

Similar to fully connected neural networks, when there are two neurons in a convolutional layer with the same orientation, the function represented by the network is completely identical to the function represented by the new network after merging the two neurons. Refer to Appendix A.1 for proof details. Therefore, for convolutional neural networks, the phenomenon of condensation will be consistent with that of fully connected networks, thereby reducing the complexity of the model.

2.2.3. Condensation Reduction

Following the above derivation, condensation reduction in traditional convolutional neural networks is completely identical to that in fully connected networks. It simply requires unfolding each convolutional layer into a matrix composed of the convolutional kernel parameters, specifically:
$$\underset{C_{\mathrm{outchannel}}^{[l]} \times \left(C_{\mathrm{inchannel}}^{[l]} \times \mathrm{Size}^{[l]}\right)}{W^{[l]}} = \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_{C_{\mathrm{outchannel}}^{[l]}} \end{pmatrix}.$$
Thus, the corresponding cosine similarity matrix can be calculated according to the previous definition. Afterward, the neurons that have condensed are grouped in the exact same manner as during the condensation reduction of fully connected networks. Subsequently, neurons belonging to the same group are merged into a single neuron.
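For illustration, unfolding a convolutional layer into such a matrix can be sketched as follows, assuming the weight tensor is laid out as (C_outchannel, C_inchannel, s, s) as in common deep learning frameworks; the similarity and grouping routines sketched for fully connected layers can then be reused unchanged.

```python
import numpy as np

def conv_to_matrix(kernel: np.ndarray) -> np.ndarray:
    """Unfold a (C_out, C_in, s, s) kernel tensor into a matrix with one
    flattened convolutional kernel (neuron) per row."""
    return kernel.reshape(kernel.shape[0], -1)

# e.g., C = cosine_similarity_matrix(conv_to_matrix(conv_weight))
```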
When the layer following the condensation-reduced layer is a fully connected layer, the process of obtaining new network parameters is entirely the same as during the condensation reduction of fully connected networks. When the layer following the condensation-reduced layer is a convolutional layer, the situation is, in fact, consistent with that of fully connected networks; details are shown in Appendix A.2.
The concept of condensation reduction can indeed be extended to various types of convolutional layers, such as depthwise separable convolutions, achieving satisfactory application results. Taking the structure of MobileNetV2 as an example, a complete module within MobileNetV2 consists of a sequence that includes a pointwise convolution followed by batch normalization and a ReLU6 activation function; a depthwise convolution followed by batch normalization and a ReLU6 activation function; and another pointwise convolution followed by batch normalization and a linear activation function [25]. In such a complete module, the depthwise convolution layer plays a primary role, while the pointwise convolution layers mainly serve to expand and reduce dimensions. Therefore, condensation reduction is applied only to the depthwise layer. Due to the characteristics of the complete module, reducing the depthwise layer accordingly reduces the size of the adjacent pointwise layers. The details of derivation for the depthwise separable convolutions are shown in Appendix A.3.

2.3. Manual Condensation Reduction and Automatic Condensation Reduction

2.3.1. Manual Condensation Reduction

In complex application scenarios, evaluating a neural network's performance metrics can be intricate, time-consuming, or dependent on the subjective judgment of supervisors. In such cases, it is necessary to pause and conduct model testing each time a reduced model is obtained. Manual condensation reduction is primarily employed in these situations.
Manual condensation reduction involves reducing a neural network layer by layer after achieving a neural network that meets performance standards. This approach ensures that the neural network does not deviate significantly due to a single reduction. Each manual condensation reduction requires setting a condensation threshold. The appropriate threshold is determined by observing changes in the loss before and after reduction. If the loss increases significantly after reduction, the condensation threshold is raised; if the loss is too small, the threshold is lowered. If the neural network’s performance meets the requirements after a certain amount of training post-reduction, further reduction is performed; otherwise, the condensation threshold is increased, and the process is repeated. This cycle continues until an appropriately sized subnet is obtained. An appropriate subnet must meet two criteria: first, it must achieve the set performance metrics, such as accuracy in combustion simulation predictions. Secondly, the neural network parameters should be difficult to reduce further, indicating that the number of neurons condensing together is scarce or that even slight reductions significantly impact the network’s performance.

2.3.2. Automatic Condensation Reduction

The main steps of automatic condensation reduction are consistent with manual condensation reduction, except that the reduction process does not require manual control. Automatic condensation reduction is mainly used in scenarios with clear neural network performance metrics and where testing is convenient and rapid, such as using validation accuracy to assess performance in image classification problems. The process of automatic condensation reduction is shown in Figure 1.
The main flow of automatic condensation reduction begins with training the original model and periodically testing its performance. Once the test meets the main reduction criteria, main reduction is carried out. After the main reduction, the reduced model continues training until it meets the main reduction criteria again, and this cycle repeats until the model reduction reaches the anticipated extent. After each main reduction, the criteria for the next main reduction are slightly lowered because reduction inevitably affects the network’s performance. Keeping the same criteria would make the reduction cycle very slow or even impossible, wasting a lot of time on model training.
Main reduction involves sequentially reducing each layer of the model. Before each layer-by-layer reduction, the model’s relevant data are saved, creating a save point, followed by the reduction. For each layer to be reduced, a specific condensation threshold is applied during its reduction. After reducing a layer, a specific number of training steps are performed, the same for each layer. Upon completion, the model is automatically tested to see if it meets the layer-by-layer reduction criteria, which are generally lower than the main reduction criteria to expedite the overall main reduction process. If the model meets the layer-by-layer criteria and the current layer is not the last to be reduced, reduction proceeds to the next layer. Simultaneously, the layer-by-layer criteria are lowered, as is the initial learning rate for the next reduction. Specifically, if the model’s parameters have hardly decreased after a reduction, the specific layer’s condensation threshold is lowered for a more aggressive reduction next time. If the model does not meet the layer-by-layer criteria, it reverts to the previous save point to restart reduction. Simultaneously, the specific training steps for layer-by-layer reduction increase, and the specific condensation threshold for that layer is raised. This aims to improve model performance by extending the training time for the reduced model to recover performance and by performing stricter condensation reduction, as a higher condensation threshold means neurons deemed for condensation are more directionally aligned, thus reducing error upon merging. Since the condensation threshold can be adjusted up to 1, where the model remains unchanged, such layer-by-layer reduction can always be completed. When the last layer’s reduction is finished, the main reduction concludes.
Automatic condensation reduction merely requires setting initial values for the algorithm beforehand. Aside from the main and layer-by-layer reduction criteria, the quality of other initial values does not sensitively affect the final reduction outcome, as the key condensation threshold can be automatically adjusted. However, excessively deviated initial values will waste considerable time on model training. Before using the automatic reduction algorithm, it is advisable to make reasonable estimates based on the training process of the model to be reduced. The main reduction criteria are mainly based on the highest performance of the model to be reduced and the minimum performance requirements of the reduced model. The automatic reduction algorithm ensures these criteria are met, obtaining a subnetwork that is appropriately sized between the minimal subnetwork that retains structure and the original network.
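The control flow described above can be summarized by the hedged Python sketch below. The callbacks (training, validation, layer reduction, checkpointing) and the constants used to relax the criteria are placeholders rather than the original implementation; the exact schedules used in our experiments are given in Section 3.2.2, and the rule that lowers a layer's threshold when its parameter count barely decreases is omitted for brevity.

```python
def automatic_condensation_reduction(
    train_steps,         # train_steps(n): train the current model for n steps
    validate,            # validate() -> float: current validation metric
    reduce_layer,        # reduce_layer(layer, threshold): condense-reduce one layer
    save_checkpoint,     # save_checkpoint() -> state
    load_checkpoint,     # load_checkpoint(state): restore a saved state
    layers,              # indices of the layers to reduce, in order
    main_criterion,      # metric required to start a main reduction
    layer_criterion,     # metric required to accept a layer-by-layer reduction
    thresholds,          # dict: per-layer condensation threshold
    recovery_steps=20,   # training-step limit after each layer reduction
    n_main_reductions=3,
):
    for _ in range(n_main_reductions):
        while validate() < main_criterion:        # train until the main criterion is met
            train_steps(1)
        for layer in layers:                      # main reduction: layer by layer
            while True:
                state = save_checkpoint()
                reduce_layer(layer, thresholds[layer])
                train_steps(recovery_steps)
                if validate() >= layer_criterion:
                    break                         # accept and move to the next layer
                load_checkpoint(state)            # reject: roll back, train longer,
                recovery_steps += 10              # and reduce more strictly next time
                thresholds[layer] = min(1.0, thresholds[layer] + 0.01)
            layer_criterion -= 0.005              # criteria are slightly relaxed over time
        main_criterion -= 0.005
```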

3. Results

3.1. Acceleration for Combustion Simulation

3.1.1. Background on Task

Numerical simulation is indispensable in both scientific research and industrial production. In problems involving various multiscale dynamic systems, such as combustion, numerical simulation requires solving high-dimensional stiff ordinary differential equations, whereas traditional numerical solution methods often require very small time steps, as in [11]. Moreover, in numerical simulations of combustion, the direct integration of chemical reactions required by traditional methods is time-consuming [11]. It has been proven reasonable and efficient to use neural networks to replace traditional numerical solution methods. Neural networks, by fitting state mapping functions, can map the current state of a system to the state of the system at the next time step. This method can be applied with larger time steps and replaces the time-consuming chemical reaction calculation process with the rapid inference process of neural networks [11].
The simulation of combustion requires integrating computational fluid dynamics (CFD) with chemical reactions. In this task, a fully connected neural network computes the rate of change of chemical substances as source terms for the CFD. The CFD code used is EBI [26], and the chemical mechanism employed is the drm19 mechanism for methane, which involves 21 components and 84 reactions. In the depicted turbulent ignition test, as shown in Figure 2, a computational domain of 1.5 cm × 1.5 cm is set with 512 × 512 cells. The velocity field is generated using the Passot–Pouquet isotropic kinetic energy spectrum. The initial conditions are set at $T = 300\ \mathrm{K}$, $P = 1\ \mathrm{atm}$, and $\phi = 1$. A circular ignition region with a radius of 0.4 mm is placed at the center of the domain. Figure 2 compares the results at 1 ms simulated using the fully connected neural network and CVODE, both utilizing EBI. For more details, please refer to the original text [11].
This task applies condensation reduction to a classic neural network fitting problem solved with fully connected neural networks in a real-world scenario. It aims to validate the occurrence of the traditionally defined condensation phenomenon in practical applications. In particular, the complexity of combustion simulation highlights the ability of condensation reduction to maintain the performance of neural networks.

3.1.2. Training Setup

The original model in this experiment is a fully connected neural network with the architecture 23-3200-1600-800-400-23. The parameters of the neural network are initialized using the following Gaussian distribution:
$$N(0, \mathrm{var}), \qquad \mathrm{var} = \left(\frac{m_{\mathrm{in}} + m_{\mathrm{out}}}{2}\right)^{-2},$$
wherein $m_{\mathrm{in}}$ and $m_{\mathrm{out}}$ are the input and output dimensions of the specific layer of the model, respectively. The initial learning rate is set to 0.0001 and is reduced to 50% of its previous value every 1000 steps so that the learning rate does not remain too large. The entire training process uses the Adam optimizer with $\mathrm{betas} = (0.9, 0.999)$, $\mathrm{eps} = 1 \times 10^{-8}$, and $\mathrm{weight\_decay} = 0$. The starting batch size is set to 1024 and is increased by a factor of 3.07 every 1000 steps to gradually accelerate training and enhance the model's generalization performance. The initial model undergoes training for 5000 steps, resulting in a training loss of 0.02172. Post-training, the model's accuracy in zero-, one-, and two-dimensional simulations is evaluated. The training and validation loss of the original model is shown in Figure 3a.
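A hedged PyTorch sketch of this setup is shown below. The layer sizes and hyperparameters are taken from the text; the ReLU activation and the exponent in the initialization variance follow our reading of the formula above and may differ from the original implementation.

```python
import torch

def gaussian_init(layer: torch.nn.Linear):
    # Assumed variance ((m_in + m_out) / 2)^(-2); see the initialization formula above.
    var = ((layer.in_features + layer.out_features) / 2.0) ** -2
    torch.nn.init.normal_(layer.weight, mean=0.0, std=var ** 0.5)
    torch.nn.init.zeros_(layer.bias)

sizes = [23, 3200, 1600, 800, 400, 23]
modules = []
for d_in, d_out in zip(sizes[:-1], sizes[1:]):
    lin = torch.nn.Linear(d_in, d_out)
    gaussian_init(lin)
    modules += [lin, torch.nn.ReLU()]
model = torch.nn.Sequential(*modules[:-1])     # no activation after the output layer

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.999), eps=1e-8, weight_decay=0)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.5)
```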

3.1.3. Reduction Setup and Result

Due to the complex and time-consuming nature of evaluating neural network metrics in the combustion simulation task, a manual reduction approach is adopted. Please refer to Table 1 for a comparison of the reduction processes. In the first reduction, the first layer of the original network is reduced, with a set condensation threshold of 0.9, resulting in a modified model architecture of 23-2205-1600-800-400-23. The initial learning rate is set at 0.0001, decreasing to 10% of its previous value every 2500 steps, with Adam as the optimizer. The starting batch size is 1024, increasing to 128 times its original size every 2500 steps. After training for 5000 steps, the training loss is 0.01592. This reduction in loss is considered indicative of the model’s performance being between that of the original model trained for 5000 steps and the original model trained for 10,000 steps. This suggests that consolidating condensed neurons has a minimal impact on the neural network’s performance. Furthermore, it is observed from the loss graph that the loss of the reduced neural network rapidly decreases to a satisfactory level within a few training steps, similar to the case of perturbing the parameters of the original model. This indicates that the condensed neurons are approximately equivalent to a single neuron, a finding that is validated by subsequent reductions. The training and validation loss is shown in Figure 3b.
The second reduction further reduces the first layer of the neural network, with a set condensation threshold of 0.8, resulting in a model architecture of 23-1105-1600-800-400-23. This reduction uses the same training settings as the first reduction. After training for 5000 steps, the training loss is 0.01358, showing continued decline. At this point, the first layer of the original model has been reduced to 34.5% of its initial size, yet it still maintains accuracy in the 0th and 1st dimensions and performs exceptionally well in two-dimensional turbulent flame simulations. This reflects that a significant number of neurons in the neural network are condensed, and the condensed model is approximately equivalent to the subnet resulting from the merged condensed neurons, further proving the feasibility of using condensation for network reduction. These two reductions indicate that the extreme points reached after training the original model are likely equivalent to those of a significantly smaller subnet in the first layer. The training and validation loss is shown in Figure 3c.
The third reduction targeted the second layer of the network, setting a condensation threshold of 0.999, resulting in a model architecture of 23-1105-1309-800-400-23. The condensation threshold was set so high because, after the first two reductions, the second layer of the neural network became highly condensed. A slightly lower condensation threshold would have led to a significant reduction in the network, potentially reducing the second layer to fewer than ten neurons and greatly impacting model performance. The phenomenon where reducing one layer of the neural network causes a higher degree of condensation in the subsequent layer is interesting. In practice, the network parameter matrices for the first and second layers, which we use for reduction, can be understood as the input and output parameter matrices of the first layer of the neural network, respectively. They are inherently related, and reducing the first layer of the network decreases the number of rows in the input parameter matrix while also decreasing the number of columns in the output parameter matrix. The relationship between them will be explored further in future research. As shown in Figure 4, it can be clearly seen that the reduction in the first layer of the neural network can immediately change the condensation state of the second layer of the neural network. After the first and second reductions, when examining the cosine similarity matrix of the second layer of the neural network in the untrained state, we can see that the previously inconspicuous condensation in the second layer appeared after the first reduction, and a strong condensation phenomenon emerged after the second reduction, making the second layer capable of extremely strong condensation even at a high condensation threshold of 0.9. Since the previous model had already been trained for 15,000 steps and reached the vicinity of a good extreme point, a smaller learning rate was used for this training phase. The initial learning rate was adjusted to $4 \times 10^{-5}$, with all other settings remaining the same as in the first reduction. After training for 5000 steps, the training loss was recorded at 0.01360, showing almost no change from the second reduction. The training and validation loss is shown in Figure 3d. The results of the two-dimensional turbulence simulation of the reduced model are shown in Figure 2, which is almost identical to the original model. According to existing theories, we know that training a large network and gradually reducing its size can yield better results compared to directly training a small network. We retrained a small network with the same size as the reduced network, using the same training settings as previously described. The directly trained small network even failed to converge when predicting two-dimensional turbulent flames.
From the performance and training process of the model after three rounds of reduction, it is evident that each reduction preserved the performance capabilities of the previous model, demonstrating that the phenomenon of condensation ensures the quasi-equivalence between the original network and the subnetworks. The final reduced model has an architecture of 23-1105-1309-800-400-23, with a total of 2,848,260 parameters. In contrast, the original model, with an architecture of 23-3200-1600-800-400-23, had a total of 6,802,800 parameters, making the reduced model’s size only 41.7% of the original model. However, observations from various aspects show that the performance of the original and reduced models is nearly identical, proving the effectiveness of the reduction.

3.2. Classification of CIFAR10

3.2.1. Background on Task

Image classification is a critical problem in neural network applications, and convolutional neural networks (CNNs) hold a prominent position among all neural network architectures. Choosing the image classification task can validate the universality of the condensation phenomenon in practical problems while demonstrating the wide applicability of condensation reduction.
The CIFAR10 dataset is a common dataset for image classification tasks, consisting of 60,000 32 × 32 color images categorized into 10 classes. Each class contains 6000 images, totaling 50,000 training images and 10,000 testing images. Choosing the CIFAR10 dataset is due to the need for a reasonably sized network to achieve good performance and moderate training time, making it convenient for applying the automatic condensation reduction method.
MobileNetV2 is a convolutional neural network model that adopts depthwise separable convolutions and is dedicated to mobile device deployment [25]. Its main network structure is shown in Table 2. For detailed information about MobileNetV2, please refer to the original text [25]. It concatenates 16 modules with the same structural characteristics, and the structure is shown in Figure 5. The reasons for using MobileNetV2 for condensation reduction are as follows. First, MobileNetV2 targets mobile deployment scenarios, which is also an important application scenario for model reduction. MobileNetV2 itself contains a small number of parameters, and further reduction in MobileNetV2 can reflect the reduction capability and practical application value of the condensation reduction method. Second, MobileNetV2 has 16 modules with the same structural characteristics, which is very suitable for using automatic condensation reduction methods for layer-by-layer reduction. Finally, MobileNetV2 mainly adopts depthwise separable convolutions, which will further extend the definition of condensation and enrich the application fields of condensation reduction.

3.2.2. Training and Reduction Setup

For convolutional networks, we adopt an initialization similar to that of fully connected networks, utilizing the following Gaussian distribution:
$$N(0, \mathrm{var}), \qquad \mathrm{var} = \left(\frac{m_{\mathrm{in}} + m_{\mathrm{out}}}{2}\right)^{-2},$$
wherein $m_{\mathrm{in}}$ and $m_{\mathrm{out}}$ are the input and output dimensions of the specific layer of the model.
The initial learning rate is set to 0.01, and a cosine annealing strategy with a period length of 200 and a minimum learning rate of 0 is used for the first 400 steps. The optimizer during this phase is SGD with $\mathrm{momentum} = 0.9$. This part of the training can be seen as the process of quickly finding a good critical point for the model. After 400 steps, the learning rate changes to 0.001, and the optimizer switches to Adam with $\mathrm{betas} = (0.9, 0.999)$, $\mathrm{eps} = 1 \times 10^{-8}$, and $\mathrm{weight\_decay} = 0$. Subsequently, every 1000 steps, the learning rate alternates between 0.001 and 0.0001. The main idea behind this learning rate strategy is to first use the cosine annealing strategy to find a suitable parameter region for the neural network and then use the Adam optimizer to refine the network parameters.
In our multiple experiments, we observed a noticeable improvement in test accuracy after switching to Adam. The accuracy obtained using this mixed strategy was higher compared to using Adam throughout the entire training process. Throughout the entire training process, the batch size remained at 128.
After each reduction, the above learning rate strategy will be reapplied for model training, but with a change in the initial learning rate. The method for changing the initial learning rate is
$$lr^{[\mathrm{new}]} = lr_{\mathrm{min}} + 0.5 \times (lr_{\mathrm{max}} - lr_{\mathrm{min}}) \times \left(1 + \cos\!\left(\pi \times \frac{t_{\mathrm{reduction}}}{T_{lr}}\right)\right),$$
wherein $t_{\mathrm{reduction}}$ represents the current number of reductions, $T_{lr}$ is set to 200, $lr_{\mathrm{min}}$ is set to 0.0001, and $lr_{\mathrm{max}}$ is set to 0.01. The reason for this approach is that each reduction can be considered as the model reaching a critical point that meets our requirements. At this point, the learning rate should be appropriately lowered to further enhance the model's performance. The model initiates the main reduction process when the accuracy on the validation set reaches 0.88. Once the model achieves this main reduction criterion, it automatically begins the layer-by-layer reduction. The criterion for the layer-by-layer reduction is set at an accuracy of 0.84 on the validation set. After initiating the layer-by-layer reduction, whenever the model meets this criterion, it automatically reduces the corresponding neural network layer. After each reduction, both the main reduction and layer-by-layer reduction criteria are automatically adjusted in the following manner:
$$Acc^{[\mathrm{new}]} = \begin{cases} Acc_{\mathrm{min}} + 0.5 \times (Acc_{\mathrm{max}} - Acc_{\mathrm{min}}) \times \left(1 + \cos\!\left(\pi \times \frac{t_{\mathrm{reduction}}}{T_{Acc}}\right)\right), & t_{\mathrm{reduction}} < T_{Acc}, \\ Acc_{\mathrm{min}}, & t_{\mathrm{reduction}} \geq T_{Acc}, \end{cases}$$
wherein $T_{Acc}$ is set to 100. $Acc_{\mathrm{max}}$ and $Acc_{\mathrm{min}}$ are set to 0.88 and 0.85, respectively, for the main reduction. $Acc_{\mathrm{max}}$ and $Acc_{\mathrm{min}}$ are set to 0.84 and 0.81, respectively, for the layer-by-layer reduction. When the reduction is not the final layer in the layer-by-layer reduction, the initial limit for the number of training steps after reduction is set to 20 steps. If the layer-by-layer reduction criterion is not met within this limit, the model will revert to a previously saved checkpoint to reattempt the reduction of that layer, with the limit on training steps permanently increasing by 10 steps. This limit on training steps is consistent for every layer except the final one in the layer-by-layer reduction. This is because encountering difficulties in reducing one layer often implies that reducing each layer will pose a challenge.
The initial condensation threshold for each layer of the neural network is set at
$$\mathrm{Cut} = \frac{1}{1 + \exp(-2)}.$$
If a layer of the neural network fails to meet the layer-by-layer reduction criterion after reaching the limit of restricted training steps following reduction, the condensation threshold for that layer is set at
$$\mathrm{Cut}^{[\mathrm{new}]} = \frac{1}{1 + \exp(-2 - 0.1 \times t_{\mathrm{fail}})},$$
wherein t f a i l represents the number of times reduction has failed. If, after a reduction, the amount of parameters reduced in a certain layer of the neural network is too small, and the layer-by-layer reduction criterion can be met after reaching the limit of restricted training steps, then the condensation threshold for that layer is set at
$$\mathrm{Cut}^{[\mathrm{new}]} = \frac{1}{1 + \exp(-2 - 0.1 \times t_{\mathrm{fail}}^{[\mathrm{new}]})}, \qquad t_{\mathrm{fail}}^{[\mathrm{new}]} = t_{\mathrm{fail}} - 1.$$
A parameter reduction is considered too small if the proportion of the model’s parameters after reduction, compared to before reduction, is greater than 0.999.
For the final layer in a layer-by-layer reduction, there are specific, larger limits for training steps and validation set accuracy criterion, as the reduction in the last layer often results in the largest decrease in parameters within the MobileNetV2 model, leading to significant changes in model performance. Here, the limit for training steps is set to 200, and the validation accuracy criterion is set at 0.8.
We have also established a criterion for significant model deviation following model reduction. If, after model reduction, the accuracy on the validation set is below 0.5 after 10 training steps, it is considered that the model has undergone a significant deviation, and it will be reverted to a previously saved checkpoint to undergo reduction again. This approach saves time that would have otherwise been spent training up to the limit of training steps.
Finally, upon achieving an appropriate parameter count through the main reduction cycle, the final classification layer is reduced using a condensation threshold of 0.4 for this last layer.
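For clarity, the three adjustment schedules above can be written as small Python helpers. The constants follow the text; the sign convention inside the threshold formula is reconstructed so that the initial threshold equals $1/(1+\exp(-2)) \approx 0.88080$, the value reported in Section 3.2.3.

```python
import math

def new_lr(t_reduction, lr_min=1e-4, lr_max=1e-2, T_lr=200):
    # Cosine decay of the post-reduction initial learning rate.
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t_reduction / T_lr))

def new_criterion(t_reduction, acc_min, acc_max, T_acc=100):
    # Main criterion: acc_min=0.85, acc_max=0.88; layer-by-layer: acc_min=0.81, acc_max=0.84.
    if t_reduction >= T_acc:
        return acc_min
    return acc_min + 0.5 * (acc_max - acc_min) * (1 + math.cos(math.pi * t_reduction / T_acc))

def new_cut(t_fail):
    # t_fail rises by one after a failed layer reduction (stricter threshold) and is
    # decremented by one when a reduction removes too few parameters (looser threshold).
    return 1.0 / (1.0 + math.exp(-2 - 0.1 * t_fail))
```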

3.2.3. Reduction Result

In the CIFAR10 image classification task, we can quickly measure the performance of neural networks using the accuracy of the validation set; thus, we adopt automatic condensation reduction. Please refer to Table 3 for a comparison of the reduction processes. The total condensation reduction training process is shown in Figure 6a,b. From the start of training to the beginning of the first major reduction, it took 293 steps, at which point the model reached a validation set accuracy of 88.16%. The major reduction was completed at step 338, reducing from the 2nd layer to the 17th layer, totaling 16 reductions. Initially, the main reduction criterion was set at 0.88 and the layer-by-layer reduction criterion at 0.84; by the end, these were adjusted to 0.87837 and 0.83837, respectively. During this major reduction, the condensation threshold for each layer was maintained at the original value of 0.88080. The original model's parameter count was reduced from 2,236,682 to 1,143,421, meaning the reduced model's parameters accounted for only 51.1% of the original model. As shown in Figure 7, throughout the major reduction process, most layers underwent significant reductions, with reduction ratios around 50%. The greatest reduction was in the 17th layer, with a reduction ratio of 19.69% and a single-layer parameter reduction of 380,103. Generally, the model exhibited minor performance deviations after each layer's reduction, with post-reduction, untrained validation set accuracies mostly above 70%. After significant reductions, accuracy could reach over 80% with just one step of training per layer. The observed reduction in parameter count validates the prevalence of condensation phenomena in convolutional neural networks, with a significant number of neurons condensing around a condensation threshold of approximately 0.88. The minor shifts resulting from reductions confirm that the condensation reduction method extended from fully connected to convolutional neural networks is reasonable and effective. The ability to rapidly restore model performance post-reduction suggests that the critical points achieved by the original model likely coincide with those of the reduced subnetworks. At step 385, 47 steps post-reduction, the reduced model's validation set accuracy exceeded 87%. Considering model deployment objectives, it is only necessary to proceed with an additional 45 steps of reduction training after meeting deployment requirements to reduce nearly half of the parameter count. An additional 47 steps of training can yield a reduced model with negligible performance differences from the original, offering substantial practical value.
The second main reduction began at step 942, with the model’s validation set accuracy reaching 0.8801, and was completed by step 975. Initially, the main reduction criterion was set at 0.87837 and the layer-by-layer reduction criterion at 0.83837; by the end, these were adjusted to 0.87343 and 0.83343, respectively. Throughout this major reduction process, the condensation threshold for each layer remained at the original value of 0.88080, with no instances of reduction failure necessitating a rollback. The model’s parameter count was further reduced from 1,143,421 (after the first reduction) to 910,370, with the new reduced model’s parameters representing only 40.7% of the original model. After the first reduction, the condensation level of the neural network’s layers decreased, which was expected. According to the embedding principle, reduction starts from a specific critical point of the original model, which should correspond to a critical point of a certain fixed subnetwork. The reduction process progressively approximates this subnetwork by merging condensed neurons. As reduction proceeds, the reduced model becomes increasingly similar to the subnetwork, resulting in fewer condensed neurons and a decrease in the neural network’s redundant complexity, making further reductions increasingly challenging. When applying condensation reduction to practical problems, the reduction process can be prematurely concluded based on specific needs. In this study, to explore the limits of condensation reduction, we continued the reduction process as far as practically possible. Similar to the first reduction, each layer’s reduction resulted in minimal performance deviation of the neural network, with a rapid recovery of performance after just one step of training.
Our experiments continued until the completion of the sixth major reduction, with the third to sixth major reductions starting at steps 1601, 2316, 3286, 4867, and 11,919, and ending at steps 1633, 2362, 3351, 4921, and 12,127, respectively. The initial main reduction criteria at the start were 0.87343, 0.86641, 0.85904, 0.85315, and 0.85018, while the layer-by-layer reduction criteria were 0.83343, 0.82641, 0.81904, 0.81315, and 0.81018, respectively. The parameter counts of the reduced models were 827,032, 776,011, 735,527, 704,004, and 669,228, respectively; after the sixth major reduction, the reduced model’s parameters accounted for only 29.9% of the original model. It can be observed that each subsequent reduction became increasingly difficult, with longer training steps required, aligning with the expectation of approximating the ideal subnetwork. Each layer’s reduction still kept the model’s performance relatively unchanged. By the time of the sixth reduction’s completion, there had been a total of 96 successful reductions, with a parameter decrease of 1,567,454, but only a slight decrease in the model’s validation set accuracy, achieving 83.13%, which is not significantly lower than the historical highest validation set accuracy of 88.16%.
The final layer reduction, starting at step 15,200, utilized a condensation threshold set at 0.4, resulting in the model’s parameter count being reduced to 258,212. This is merely 11.5% of the original model’s parameters, and it is considered a very small model for tasks on CIFAR10. The reduction in the final layer alone eliminated 411,016 parameters, cutting over 60% of the model’s parameters following the sixth major reduction. Despite this significant reduction, the accuracy rate on the validation set without further training still reached 75.45%. After training, the final reduced model achieved a peak accuracy of 83.21%, virtually unchanged from before the reduction. This serves as yet another powerful validation of the effectiveness of condensation reduction.

4. Discussion

The successful application of condensation reduction in both fully connected and convolutional networks for practical problems demonstrates the universality of the condensation phenomenon. In the complex task of combustion simulation, condensation reduction effectively reduces network size while maintaining the accuracy of turbulence ignition predictions, confirming that a condensed network and its subnetworks are approximately equivalent. In image classification tasks, particularly in the broader application within depthwise separable convolutional neural networks, the definition and scope of condensation have been expanded. This extension enables the broader application of condensation reduction across various network types and tasks.

Author Contributions

Conceptualization, T.C. and Z.-Q.J.X.; methodology, T.C. and Z.-Q.J.X.; software, T.C. and Z.-Q.J.X.; validation, T.C. and Z.-Q.J.X.; formal analysis, T.C. and Z.-Q.J.X.; investigation, T.C. and Z.-Q.J.X.; resources, T.C. and Z.-Q.J.X.; data curation, T.C. and Z.-Q.J.X.; writing—original draft preparation, T.C. and Z.-Q.J.X.; writing—review and editing, T.C. and Z.-Q.J.X.; visualization, T.C. and Z.-Q.J.X.; supervision, T.C. and Z.-Q.J.X.; project administration, Z.-Q.J.X.; funding acquisition, Z.-Q.J.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was sponsored by the National Key R&D Program of China, grant no. 2022YFA1008200; the Shanghai Sailing Program; the Natural Science Foundation of Shanghai, grant no. 20ZR1429000; the National Natural Science Foundation of China, grant no. 62002221; Shanghai Municipal of Science and Technology Major Project no. 2021SHZDZX0102; the HPC of School of Mathematical Sciences and the Student Innovation Center; and the Siyuan-1 cluster supported by the Center for High Performance Computing at Shanghai Jiao Tong University.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data are available by request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
FNN: Fully connected neural network.
CNN: Convolutional neural network.
CFD: Computational fluid dynamics.

Appendix A

Appendix A.1. Detail of Condensation in CNN

Consider a neural network consisting of two convolutional layers and one fully connected layer, with a structure of
$$d_{\mathrm{in}} \to m_C^{[1]} \to m_C^{[2]} \to m_{F_1} \to d_{\mathrm{out}}.$$
The input is
$$\underset{N_1 \times \left(C_{\mathrm{inchannel}}^{[0]} \times \mathrm{Size}^{[1]}\right)}{\mathrm{Input}_1} = (x_n), \quad n \in \{1, 2, \ldots, N_1\}.$$
Here, $N_1$ represents the number of moves of the first convolutional layer over the input. After passing through the first convolutional layer, the intermediate output obtained is
$$\underset{N_1 \times C_{\mathrm{outchannel}}^{[1]}}{\mathrm{Output}_1} = \begin{pmatrix} \underset{C_{\mathrm{outchannel}}^{[1]}}{y_1} \\ \underset{C_{\mathrm{outchannel}}^{[1]}}{y_2} \\ \vdots \\ \underset{C_{\mathrm{outchannel}}^{[1]}}{y_{N_1}} \end{pmatrix}.$$
The input data processed by each movement of the second convolutional kernel are
$$\underset{C_{\mathrm{outchannel}}^{[1]} \times \mathrm{Size}^{[2]}}{\mathrm{input}_2} = \begin{pmatrix} \underset{C_{\mathrm{outchannel}}^{[1]}}{y_1} & \underset{C_{\mathrm{outchannel}}^{[1]}}{y_2} & \cdots & \underset{C_{\mathrm{outchannel}}^{[1]}}{y_{\mathrm{Size}^{[2]}}} \end{pmatrix}.$$
The output obtained from each movement is
$$\underset{C_{\mathrm{outchannel}}^{[2]}}{\mathrm{output}_2} = \sigma\!\left( \underset{C_{\mathrm{outchannel}}^{[2]} \times \left(C_{\mathrm{outchannel}}^{[1]} \times \mathrm{Size}^{[2]}\right)}{W^{[2]}} \cdot \underset{C_{\mathrm{outchannel}}^{[1]} \times \mathrm{Size}^{[2]}}{\mathrm{input}_2} \right).$$
If the directions of the p-th neuron and q-th neuron in the first convolutional layer are consistent, i.e., when the cosine similarity is one, we have
$$\underset{\mathrm{Size}^{[1]} \times C_{\mathrm{inchannel}}^{[1]}}{v_q} = \lambda \underset{\mathrm{Size}^{[1]} \times C_{\mathrm{inchannel}}^{[1]}}{v_p}, \quad \lambda \in \mathbb{R}^{+}.$$
By utilizing the homogeneity of the activation function, we have
$$\underset{C_{\mathrm{outchannel}}^{[1]}}{y} = \sigma\!\left( \underset{C_{\mathrm{outchannel}}^{[1]} \times \left(C_{\mathrm{inchannel}}^{[1]} \times \mathrm{Size}^{[1]}\right)}{W^{[1]}} \cdot \underset{C_{\mathrm{inchannel}}^{[1]} \times \mathrm{Size}^{[1]}}{x} \right) = \begin{pmatrix} \vdots \\ \sigma(v_p \cdot x) \\ \vdots \\ \sigma(v_q \cdot x) \\ \vdots \end{pmatrix} = \begin{pmatrix} \vdots \\ \sigma(v_p \cdot x) \\ \vdots \\ \lambda\,\sigma(v_p \cdot x) \\ \vdots \end{pmatrix}.$$
This holds for each movement of the convolutional kernel, so the total output of the first convolutional layer can be represented as
$$\underset{N_1 \times C_{\mathrm{outchannel}}^{[1]}}{\mathrm{Output}_1} = \begin{pmatrix} \underset{C_{\mathrm{outchannel}}^{[1]}}{y_1} \\ \vdots \\ \underset{C_{\mathrm{outchannel}}^{[1]}}{y_{N_1}} \end{pmatrix} = \begin{pmatrix} \cdots & y_1^{p} & \cdots & \lambda y_1^{p} & \cdots \\ \cdots & y_2^{p} & \cdots & \lambda y_2^{p} & \cdots \\ & \vdots & & \vdots & \\ \cdots & y_{N_1}^{p} & \cdots & \lambda y_{N_1}^{p} & \cdots \end{pmatrix}.$$
So the output obtained from each movement of the second convolutional layer is
$$\underset{C_{\mathrm{outchannel}}^{[2]}}{\mathrm{output}_2} = \sigma\!\left( \underset{C_{\mathrm{outchannel}}^{[2]} \times \left(C_{\mathrm{outchannel}}^{[1]} \times \mathrm{Size}^{[2]}\right)}{W^{[2]}} \cdot \underset{C_{\mathrm{outchannel}}^{[1]} \times \mathrm{Size}^{[2]}}{\mathrm{input}_2} \right),$$
where the specific output of each output dimension is
$$\mathrm{out} = \sigma\!\left( v \cdot \underset{C_{\mathrm{outchannel}}^{[1]} \times \mathrm{Size}^{[2]}}{\mathrm{input}_2} \right) = \sigma\!\left( \sum_{i}^{C_{\mathrm{outchannel}}^{[1]}} \sum_{j}^{\mathrm{Size}^{[2]}} v_{ij} \cdot \mathrm{input}_{ij} \right) = \sigma\!\left( \sum_{i \neq p,q}^{C_{\mathrm{outchannel}}^{[1]}} \sum_{j}^{\mathrm{Size}^{[2]}} v_{ij} \cdot \mathrm{input}_{ij} + \sum_{j}^{\mathrm{Size}^{[2]}} v_{pj} \cdot \mathrm{input}_{pj} + \lambda \sum_{j}^{\mathrm{Size}^{[2]}} v_{qj} \cdot \mathrm{input}_{pj} \right) = \sigma\!\left( \sum_{i \neq p,q}^{C_{\mathrm{outchannel}}^{[1]}} \sum_{j}^{\mathrm{Size}^{[2]}} v_{ij} \cdot \mathrm{input}_{ij} + \sum_{j}^{\mathrm{Size}^{[2]}} \left(v_{pj} + \lambda v_{qj}\right) \cdot \mathrm{input}_{pj} \right).$$
Thus, for the overall output of the second convolutional layer, merging the p-th neuron and q-th neuron from the first convolutional layer requires only that the corresponding input dimensions and parameters of the second convolutional layer be appropriately adjusted. Then, the output of the neural network before and after the merge will be completely identical. If two neurons with consistent directions appear in the second convolutional layer, then, similar to the previous derivation, it can be proven that merging these two directionally consistent neurons will not change the output of the neural network before and after the merger.

Appendix A.2. Details of Condensation Reduction in Traditional CNN

Let us assume that the first N neurons are grouped together; then, the new parameters of the model after reduction are as follows:
$$u_{\mathrm{main}}^{[\mathrm{new}]} = \sum_{k=1}^{N} \frac{\|v_k\|_2}{\|v_{\mathrm{main}}\|_2}\, u_k, \qquad v_{\mathrm{main}}^{[\mathrm{new}]} = v_{\mathrm{main}},$$
wherein
$$\underset{C_{\mathrm{outchannel}}^{[l]} \times \left(C_{\mathrm{inchannel}}^{[l]} \times \mathrm{Size}^{[l]}\right)}{W^{[l]}} = \begin{pmatrix} v_1 \\ \vdots \\ v_N \\ v_{N+1} \\ \vdots \\ v_{C_{\mathrm{outchannel}}^{[l]}} \end{pmatrix}, \qquad \underset{C_{\mathrm{outchannel}}^{[l]} \times \left(C_{\mathrm{outchannel}}^{[l+1]} \times \mathrm{Size}^{[l+1]}\right)}{R} = \mathrm{Reshape}\!\left( \underset{C_{\mathrm{outchannel}}^{[l+1]} \times \left(C_{\mathrm{outchannel}}^{[l]} \times \mathrm{Size}^{[l+1]}\right)}{W^{[l+1]}} \right) = \begin{pmatrix} u_1 \\ \vdots \\ u_N \\ u_{N+1} \\ \vdots \\ u_{C_{\mathrm{outchannel}}^{[l]}} \end{pmatrix}.$$
The new parameter matrix of the neural network obtained is
$$\underset{\left(C_{\mathrm{outchannel}}^{[l]} - N + 1\right) \times \left(C_{\mathrm{inchannel}}^{[l]} \times \mathrm{Size}^{[l]}\right)}{W^{[l]}_{\mathrm{new}}} = \begin{pmatrix} v_{\mathrm{main}}^{[\mathrm{new}]} \\ v_{N+1} \\ \vdots \\ v_{C_{\mathrm{outchannel}}^{[l]}} \end{pmatrix}, \qquad \underset{\left(C_{\mathrm{outchannel}}^{[l]} - N + 1\right) \times \left(C_{\mathrm{outchannel}}^{[l+1]} \times \mathrm{Size}^{[l+1]}\right)}{R_{\mathrm{new}}} = \mathrm{Reshape}\!\left( \underset{C_{\mathrm{outchannel}}^{[l+1]} \times \left(\left(C_{\mathrm{outchannel}}^{[l]} - N + 1\right) \times \mathrm{Size}^{[l+1]}\right)}{W^{[l+1]}_{\mathrm{new}}} \right) = \begin{pmatrix} u_{\mathrm{main}}^{[\mathrm{new}]} \\ u_{N+1} \\ \vdots \\ u_{C_{\mathrm{outchannel}}^{[l]}} \end{pmatrix}.$$
By using the new neural network parameter matrix to create a new neural network, the condensation-based reduction can be completed. The above generalization still applies completely even after incorporating pooling layers.

Appendix A.3. Details of Condensation Reduction in MobileNetV2

We derive the result for a single application of the convolutional kernel, initially without batch normalization. Assume the input is
$$\mathrm{Input}=\begin{pmatrix}\boldsymbol{x}_1\\ \vdots\\ \boldsymbol{x}_{C_{\mathrm{in\,channel}}}\end{pmatrix},\qquad \boldsymbol{x}_i=\left(x_{i1}\;\cdots\;x_{i\,\mathrm{Size}}\right).$$
Assume that two neurons of the depthwise layer have condensed in the same direction,
$$\boldsymbol{v}_q^{\,\mathrm{Size}}=\lambda\,\boldsymbol{v}_p^{\,\mathrm{Size}},\qquad \lambda\in\mathbb{R}^{+}.$$
Their corresponding neurons in the preceding pointwise layer are
$$\boldsymbol{\alpha}_p^{\,C_{\mathrm{in\,channel}}}=\begin{pmatrix}a_p[1]\\ \vdots\\ a_p[C_{\mathrm{in\,channel}}]\end{pmatrix},\qquad \boldsymbol{\alpha}_q^{\,C_{\mathrm{in\,channel}}}=\begin{pmatrix}a_q[1]\\ \vdots\\ a_q[C_{\mathrm{in\,channel}}]\end{pmatrix},$$
and an arbitrary neuron of the subsequent pointwise layer is
$$\boldsymbol{\beta}=\begin{pmatrix}b_1 & \cdots & b_p & \cdots & b_q & \cdots & b_{C_{\mathrm{in\,channel}}}\end{pmatrix}.$$
The part of the output of this complete module that involves the two condensed neurons is (shape superscripts are omitted below where they are clear from context)
$$\begin{aligned}
\mathrm{Output}&=b_p y_p+b_q y_q,\\
y_p&=\boldsymbol{v}_p^{\,\mathrm{Size}}\cdot\left(\boldsymbol{\alpha}_p^{\,C_{\mathrm{in\,channel}}}\cdot\mathrm{Input}^{\,C_{\mathrm{in\,channel}}\times\mathrm{Size}}\right),\\
y_q&=\boldsymbol{v}_q^{\,\mathrm{Size}}\cdot\left(\boldsymbol{\alpha}_q^{\,C_{\mathrm{in\,channel}}}\cdot\mathrm{Input}^{\,C_{\mathrm{in\,channel}}\times\mathrm{Size}}\right)=\lambda\,\boldsymbol{v}_p^{\,\mathrm{Size}}\cdot\left(\boldsymbol{\alpha}_q^{\,C_{\mathrm{in\,channel}}}\cdot\mathrm{Input}^{\,C_{\mathrm{in\,channel}}\times\mathrm{Size}}\right).
\end{aligned}$$
When the two depthwise neurons are merged, the output of the module becomes
$$\mathrm{Output}^{[new]}=b_p^{[new]}\,y_p^{[new]},\qquad y_p^{[new]}=\boldsymbol{v}_p\cdot\left(\boldsymbol{\alpha}_p^{[new]}\cdot\mathrm{Input}\right).$$
Following the condensation-reduction rule used above, we should have
$$b_p^{[new]}=b_p+\lambda b_q.$$
Therefore, to keep the module output unchanged before and after the merge, we require
$$\begin{aligned}
\mathrm{Output}&=b_p\,\boldsymbol{v}_p\cdot\left(\boldsymbol{\alpha}_p\cdot\mathrm{Input}\right)+b_q\,\boldsymbol{v}_q\cdot\left(\boldsymbol{\alpha}_q\cdot\mathrm{Input}\right)\\
&=b_p\,\boldsymbol{v}_p\cdot\left(\boldsymbol{\alpha}_p\cdot\mathrm{Input}\right)+\lambda b_q\,\boldsymbol{v}_p\cdot\left(\boldsymbol{\alpha}_q\cdot\mathrm{Input}\right)\\
&=b_p^{[new]}\,\boldsymbol{v}_p\cdot\left(\boldsymbol{\alpha}_p^{[new]}\cdot\mathrm{Input}\right).
\end{aligned}$$
This leads to the system of linear equations
$$b_p^{[new]}=b_p+\lambda b_q,\qquad b_p^{[new]}\boldsymbol{\alpha}_p^{[new]}=b_p\boldsymbol{\alpha}_p+\lambda b_q\boldsymbol{\alpha}_q.$$
Constructing a new module with these parameters completes the condensation reduction. The derivation extends readily to the case in which multiple neurons have condensed, giving
$$b_{main}^{[new]}=\sum_{k=1}^{N}\frac{\|\boldsymbol{v}_k\|_2}{\|\boldsymbol{v}_{main}\|_2}\,b_k,\qquad b_{main}^{[new]}\boldsymbol{\alpha}_{main}^{[new]}=\sum_{k=1}^{N}\frac{\|\boldsymbol{v}_k\|_2}{\|\boldsymbol{v}_{main}\|_2}\,b_k\boldsymbol{\alpha}_k.$$
Note that these equations must hold for every neuron b of the subsequent pointwise layer, which makes the system overdetermined; in practice, it is solved by least squares.
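A small NumPy sketch of this least-squares step (our own illustration; the shapes and the names `lam`, `alpha`, and `b` are assumptions): the first block of equations fixes $b^{[new]}$ for every neuron of the subsequent pointwise layer, and the second block is then solved for a single shared $\boldsymbol{\alpha}^{[new]}$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, C_in, M = 4, 8, 16          # N condensed depthwise kernels, C_in input channels,
                               # M neurons in the subsequent pointwise layer
lam = rng.uniform(0.5, 2.0, N)           # ||v_k||_2 / ||v_main||_2
alpha = rng.standard_normal((N, C_in))   # preceding pointwise neurons alpha_k
b = rng.standard_normal((M, N))          # weights b_{m,k} of the subsequent pointwise layer

# First block: b_main^{new} for every subsequent-layer neuron m.
b_new = b @ lam                          # shape (M,)

# Second block: b_m^{new} * alpha^{new} = sum_k lam_k b_{m,k} alpha_k, for all m.
rhs = (b * lam) @ alpha                  # shape (M, C_in)
alpha_new = np.linalg.lstsq(b_new[:, None], rhs, rcond=None)[0].ravel()

# Residual of the overdetermined system (small, but generally nonzero).
print(np.linalg.norm(b_new[:, None] * alpha_new - rhs))
```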
We now extend the result to modules that include batch normalization, still considering the case in which two neurons of the depthwise layer have condensed; all derivations remain valid after the addition of the ReLU6 activation function. Continue to assume
$$\boldsymbol{v}_q^{\,\mathrm{Size}}=\lambda\,\boldsymbol{v}_p^{\,\mathrm{Size}},\qquad \lambda\in\mathbb{R}^{+}.$$
Batch normalization takes the form
$$\mathrm{BN}(\mathrm{tmp})=\frac{\mathrm{tmp}-\mathrm{Mean}}{\sqrt{\mathrm{Var}}}\times\gamma+\eta,$$
which can be folded into an affine map. For the batch normalization following the preceding pointwise layer, set
$$f_p(\mathrm{tmp})=\varepsilon_p\cdot\mathrm{tmp}+\zeta_p,\quad f_q(\mathrm{tmp})=\varepsilon_q\cdot\mathrm{tmp}+\zeta_q,\qquad \varepsilon_p=\frac{\gamma_p}{\sqrt{\mathrm{Var}_p}},\quad \zeta_p=-\frac{\gamma_p\,\mathrm{Mean}_p}{\sqrt{\mathrm{Var}_p}}+\eta_p.$$
For the batch normalization following the depthwise layer, similarly set
$$g_p(\mathrm{tmp})=\mu_p\cdot\mathrm{tmp}+\nu_p,\quad g_q(\mathrm{tmp})=\mu_q\cdot\mathrm{tmp}+\nu_q,\qquad \mu_p=\frac{\gamma_p}{\sqrt{\mathrm{Var}_p}},\quad \nu_p=-\frac{\gamma_p\,\mathrm{Mean}_p}{\sqrt{\mathrm{Var}_p}}+\eta_p.$$
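The folding of batch normalization into an affine map can be checked with a short PyTorch snippet (our own illustration using `nn.BatchNorm1d` in evaluation mode; note that PyTorch additionally adds a small eps inside the square root, which the formulas above omit).

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(5).eval()
with torch.no_grad():
    bn.running_mean.uniform_(-1, 1)
    bn.running_var.uniform_(0.5, 2.0)
    bn.weight.uniform_(0.5, 1.5)   # gamma
    bn.bias.uniform_(-0.5, 0.5)    # eta

# Fold BN(x) = (x - Mean) / sqrt(Var + eps) * gamma + eta into f(x) = eps_coef * x + zeta.
std = torch.sqrt(bn.running_var + bn.eps)
eps_coef = bn.weight / std                        # epsilon in the notation above
zeta = bn.bias - bn.weight * bn.running_mean / std

x = torch.randn(3, 5)
print(torch.allclose(bn(x), eps_coef * x + zeta, atol=1e-6))   # True
```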
After merging the two depthwise neurons, we first, following the condensation-reduction rule for fully connected networks, require that the intermediate output entering the batch normalization layer after the depthwise layer remain unchanged:
$$\begin{aligned}
\mathrm{output}&=\varepsilon_p\,\boldsymbol{v}_p\cdot\left(\boldsymbol{\alpha}_p\cdot\mathrm{Input}\right)+\boldsymbol{v}_p\cdot\zeta_p+\varepsilon_q\lambda\,\boldsymbol{v}_p\cdot\left(\boldsymbol{\alpha}_q\cdot\mathrm{Input}\right)+\lambda\,\boldsymbol{v}_p\cdot\zeta_q\\
&=\varepsilon_p^{[new]}\,\boldsymbol{v}_p\cdot\left(\boldsymbol{\alpha}_p^{[new]}\cdot\mathrm{Input}\right)+\boldsymbol{v}_p\cdot\zeta_p^{[new]},
\end{aligned}$$
which yields
$$\varepsilon_p^{[new]}\boldsymbol{\alpha}_p^{[new]}=\varepsilon_p\boldsymbol{\alpha}_p+\lambda\varepsilon_q\boldsymbol{\alpha}_q,\qquad \zeta_p^{[new]}=\zeta_p+\lambda\zeta_q.$$
The output of the complete module related to the two condensed neurons is
$$\begin{aligned}
\mathrm{Output}&=b_p\,g_p(y_p)+b_q\,g_q(y_q),\\
y_p&=\boldsymbol{v}_p\cdot f_p\!\left(\boldsymbol{\alpha}_p\cdot\mathrm{Input}\right),\\
y_q&=\boldsymbol{v}_q\cdot f_q\!\left(\boldsymbol{\alpha}_q\cdot\mathrm{Input}\right)=\lambda\,\boldsymbol{v}_p\cdot f_q\!\left(\boldsymbol{\alpha}_q\cdot\mathrm{Input}\right).
\end{aligned}$$
After merging the two neurons, the output of the complete module is
$$\begin{aligned}
\mathrm{Output}^{[new]}&=b_p^{[new]}\,g_p^{[new]}\!\left(y_p^{[new]}\right),\\
y_p^{[new]}&=\boldsymbol{v}_p\cdot f_p^{[new]}\!\left(\boldsymbol{\alpha}_p^{[new]}\cdot\mathrm{Input}\right)
=\boldsymbol{v}_p\cdot\left(\varepsilon_p^{[new]}\left(\boldsymbol{\alpha}_p^{[new]}\cdot\mathrm{Input}\right)+\zeta_p^{[new]}\right)
=\varepsilon_p^{[new]}\,\boldsymbol{v}_p\cdot\left(\boldsymbol{\alpha}_p^{[new]}\cdot\mathrm{Input}\right)+\boldsymbol{v}_p\cdot\zeta_p^{[new]}.
\end{aligned}$$
Therefore, to ensure that the module output is consistent before and after the merge, we need
$$\begin{aligned}
\mathrm{Output}&=b_p\,g_p\!\left(\varepsilon_p\,\boldsymbol{v}_p\cdot\left(\boldsymbol{\alpha}_p\cdot\mathrm{Input}\right)+\boldsymbol{v}_p\cdot\zeta_p\right)+b_q\,g_q\!\left(\varepsilon_q\lambda\,\boldsymbol{v}_p\cdot\left(\boldsymbol{\alpha}_q\cdot\mathrm{Input}\right)+\lambda\,\boldsymbol{v}_p\cdot\zeta_q\right)\\
&=b_p^{[new]}\,g_p^{[new]}\!\left(\varepsilon_p^{[new]}\,\boldsymbol{v}_p\cdot\left(\boldsymbol{\alpha}_p^{[new]}\cdot\mathrm{Input}\right)+\boldsymbol{v}_p\cdot\zeta_p^{[new]}\right).
\end{aligned}$$
Set
$$x_p=\boldsymbol{\alpha}_p\cdot\mathrm{Input},\qquad x_q=\boldsymbol{\alpha}_q\cdot\mathrm{Input},\qquad x^{[new]}=\boldsymbol{\alpha}_p^{[new]}\cdot\mathrm{Input}.$$
Consistency of the module output before and after the merge then requires
$$\begin{aligned}
\mathrm{Output}&=b_p\,g_p\!\left(\varepsilon_p\,\boldsymbol{v}_p\cdot x_p+\boldsymbol{v}_p\cdot\zeta_p\right)+b_q\,g_q\!\left(\varepsilon_q\lambda\,\boldsymbol{v}_p\cdot x_q+\lambda\,\boldsymbol{v}_p\cdot\zeta_q\right)\\
&=b_p\mu_p\!\left(\varepsilon_p\,\boldsymbol{v}_p\cdot x_p+\boldsymbol{v}_p\cdot\zeta_p\right)+b_q\mu_q\!\left(\varepsilon_q\lambda\,\boldsymbol{v}_p\cdot x_q+\lambda\,\boldsymbol{v}_p\cdot\zeta_q\right)+b_p\nu_p+b_q\nu_q\\
&=\left(b_p\mu_p\varepsilon_p\,\boldsymbol{v}_p\cdot x_p+\lambda b_q\mu_q\varepsilon_q\,\boldsymbol{v}_p\cdot x_q\right)+\left(b_p\mu_p\,\boldsymbol{v}_p\cdot\zeta_p+\lambda b_q\mu_q\,\boldsymbol{v}_p\cdot\zeta_q\right)+b_p\nu_p+b_q\nu_q\\
&=b_p^{[new]}\,g_p^{[new]}\!\left(\varepsilon_p^{[new]}\,\boldsymbol{v}_p\cdot x^{[new]}+\boldsymbol{v}_p\cdot\zeta_p^{[new]}\right)\\
&=b_p^{[new]}\mu_p^{[new]}\varepsilon_p^{[new]}\,\boldsymbol{v}_p\cdot x^{[new]}+b_p^{[new]}\mu_p^{[new]}\,\boldsymbol{v}_p\cdot\zeta_p^{[new]}+b_p^{[new]}\nu_p^{[new]}.
\end{aligned}$$
Combining this with the equations obtained from the intermediate output gives the system of linear equations
$$\begin{aligned}
b_p^{[new]}\mu_p^{[new]}\varepsilon_p^{[new]}\,\boldsymbol{v}_p\cdot x^{[new]}&=b_p\mu_p\varepsilon_p\,\boldsymbol{v}_p\cdot x_p+\lambda b_q\mu_q\varepsilon_q\,\boldsymbol{v}_p\cdot x_q,\\
b_p^{[new]}\nu_p^{[new]}&=b_p\nu_p+b_q\nu_q,\\
\varepsilon_p^{[new]}\boldsymbol{\alpha}_p^{[new]}&=\varepsilon_p\boldsymbol{\alpha}_p+\lambda\varepsilon_q\boldsymbol{\alpha}_q,\\
\zeta_p^{[new]}&=\zeta_p+\lambda\zeta_q,
\end{aligned}$$
so that
$$\begin{aligned}
b_p^{[new]}\mu_p^{[new]}\varepsilon_p^{[new]}\boldsymbol{\alpha}_p^{[new]}&=b_p\mu_p\varepsilon_p\boldsymbol{\alpha}_p+\lambda b_q\mu_q\varepsilon_q\boldsymbol{\alpha}_q,\\
b_p^{[new]}\nu_p^{[new]}&=b_p\nu_p+b_q\nu_q,\\
\varepsilon_p^{[new]}\boldsymbol{\alpha}_p^{[new]}&=\varepsilon_p\boldsymbol{\alpha}_p+\lambda\varepsilon_q\boldsymbol{\alpha}_q,\\
\zeta_p^{[new]}&=\zeta_p+\lambda\zeta_q.
\end{aligned}$$
The b and α terms are obtained from the system derived above for the case without batch normalization. The total system of equations is
$$\begin{aligned}
b_p^{[new]}\mu_p^{[new]}\varepsilon_p^{[new]}\boldsymbol{\alpha}_p^{[new]}&=b_p\mu_p\varepsilon_p\boldsymbol{\alpha}_p+\lambda b_q\mu_q\varepsilon_q\boldsymbol{\alpha}_q,\\
b_p^{[new]}\nu_p^{[new]}&=b_p\nu_p+b_q\nu_q,\\
\varepsilon_p^{[new]}\boldsymbol{\alpha}_p^{[new]}&=\varepsilon_p\boldsymbol{\alpha}_p+\lambda\varepsilon_q\boldsymbol{\alpha}_q,\\
\zeta_p^{[new]}&=\zeta_p+\lambda\zeta_q,\\
b_p^{[new]}&=b_p+\lambda b_q,\\
b_p^{[new]}\boldsymbol{\alpha}_p^{[new]}&=b_p\boldsymbol{\alpha}_p+\lambda b_q\boldsymbol{\alpha}_q.
\end{aligned}$$
Extended to the condensation of multiple neurons, the system becomes
$$\begin{aligned}
b_{main}^{[new]}&=\sum_{k=1}^{N}\frac{\|\boldsymbol{v}_k\|_2}{\|\boldsymbol{v}_{main}\|_2}\,b_k,\\
b_{main}^{[new]}\boldsymbol{\alpha}_{main}^{[new]}&=\sum_{k=1}^{N}\frac{\|\boldsymbol{v}_k\|_2}{\|\boldsymbol{v}_{main}\|_2}\,b_k\boldsymbol{\alpha}_k,\\
\varepsilon_{main}^{[new]}\boldsymbol{\alpha}_{main}^{[new]}&=\sum_{k=1}^{N}\frac{\|\boldsymbol{v}_k\|_2}{\|\boldsymbol{v}_{main}\|_2}\,\varepsilon_k\boldsymbol{\alpha}_k,\\
\zeta_{main}^{[new]}&=\sum_{k=1}^{N}\frac{\|\boldsymbol{v}_k\|_2}{\|\boldsymbol{v}_{main}\|_2}\,\zeta_k,\\
b_{main}^{[new]}\mu_{main}^{[new]}\varepsilon_{main}^{[new]}\boldsymbol{\alpha}_{main}^{[new]}&=\sum_{k=1}^{N}\frac{\|\boldsymbol{v}_k\|_2}{\|\boldsymbol{v}_{main}\|_2}\,b_k\mu_k\varepsilon_k\boldsymbol{\alpha}_k,\\
b_{main}^{[new]}\nu_{main}^{[new]}&=\sum_{k=1}^{N}\frac{\|\boldsymbol{v}_k\|_2}{\|\boldsymbol{v}_{main}\|_2}\,b_k\nu_k.
\end{aligned}$$
Each neuron of the subsequent pointwise layer contributes one copy of these equations, so the system is overdetermined; we again solve it by least squares. The mean and variance of the merged batch normalization are then computed as
$$\mathrm{Mean}=\sum_{k=1}^{N}\frac{\|\boldsymbol{v}_k\|_2}{\|\boldsymbol{v}_{main}\|_2}\,\mathrm{Mean}_k,\qquad \mathrm{Var}=\sum_{k=1}^{N}\left(\frac{\|\boldsymbol{v}_k\|_2}{\|\boldsymbol{v}_{main}\|_2}\right)^{2}\mathrm{Var}_k.$$
The remaining parameters then follow directly, and the reduced model is constructed from the new parameter matrices.
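As a tiny numerical illustration of the last formula (the norm ratios and batch-normalization statistics below are made-up values, not taken from the paper):

```python
# Combine the BN statistics of N condensed channels into the merged channel.
lam = [1.0, 1.7, 0.6]            # ||v_k||_2 / ||v_main||_2 for each condensed kernel
means = [0.20, -0.10, 0.40]      # running means of the condensed channels
variances = [1.00, 0.80, 1.30]   # running variances of the condensed channels

mean_merged = sum(l * m for l, m in zip(lam, means))          # 0.27
var_merged = sum(l ** 2 * v for l, v in zip(lam, variances))  # 3.78 (approximately)
print(mean_merged, var_merged)
```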

References

1. Reiser, P.; Neubert, M.; Eberhard, A.; Torresi, L.; Zhou, C.; Shao, C.; Metni, H.; van Hoesel, C.; Schopmans, H.; Sommer, T.; et al. Graph neural networks for materials science and chemistry. Commun. Mater. 2022, 3, 93.
2. Sarvamangala, D.; Kulkarni, R.V. Convolutional neural networks in medical image understanding: A survey. Evol. Intell. 2022, 15, 1–22.
3. Shlomi, J.; Battaglia, P.; Vlimant, J.R. Graph neural networks in particle physics. Mach. Learn. Sci. Technol. 2020, 2, 021001.
4. Smith, M.J.; Geach, J.E. Astronomia ex machina: A history, primer and outlook on neural networks in astronomy. R. Soc. Open Sci. 2023, 10, 221454.
5. Zhang, X.M.; Liang, L.; Liu, L.; Tang, M.J. Graph neural networks and their current applications in bioinformatics. Front. Genet. 2021, 12, 690049.
6. Zhong, S.; Zhang, K.; Bagheri, M.; Burken, J.G.; Gu, A.; Li, B.; Ma, X.; Marrone, B.L.; Ren, Z.J.; Schrier, J.; et al. Machine learning: New ideas and tools in environmental science and engineering. Environ. Sci. Technol. 2021, 55, 12741–12754.
7. Raissi, M.; Perdikaris, P.; Karniadakis, G.E. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 2019, 378, 686–707.
8. Blechschmidt, J.; Ernst, O.G. Three ways to solve partial differential equations with neural networks—A review. GAMM-Mitteilungen 2021, 44, e202100006.
9. Li, Z.; Kovachki, N.; Azizzadenesheli, K.; Liu, B.; Bhattacharya, K.; Stuart, A.; Anandkumar, A. Neural operator: Graph kernel network for partial differential equations. arXiv 2020, arXiv:2003.03485.
10. Michoski, C.; Milosavljević, M.; Oliver, T.; Hatch, D.R. Solving differential equations using deep neural networks. Neurocomputing 2020, 399, 193–212.
11. Xu, Z.Q.J.; Yao, J.; Yi, Y.; Hang, L.; Zhang, Y.; Zhang, T. Solving multiscale dynamical systems by deep learning. arXiv 2024, arXiv:2401.01220.
12. Liu, Z.; Sun, M.; Zhou, T.; Huang, G.; Darrell, T. Rethinking the value of network pruning. arXiv 2018, arXiv:1810.05270.
13. Wang, Z.; Li, C.; Wang, X. Convolutional neural network pruning with structural redundancy reduction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14913–14922.
14. Wesseling, P. Introduction to Multigrid Methods; Technical Report; NASA: Washington, DC, USA, 1995.
15. Erichson, N.B.; Mathelin, L.; Yao, Z.; Brunton, S.L.; Mahoney, M.W.; Kutz, J.N. Shallow neural networks for fluid flow reconstruction with limited sensors. Proc. R. Soc. A 2020, 476, 20200097.
16. Roth, W.; Schindler, G.; Klein, B.; Peharz, R.; Tschiatschek, S.; Fröning, H.; Pernkopf, F.; Ghahramani, Z. Resource-efficient neural networks for embedded systems. J. Mach. Learn. Res. 2024, 25, 1–51.
17. Luo, T.; Xu, Z.Q.J.; Ma, Z.; Zhang, Y. Phase diagram for two-layer ReLU neural networks at infinite-width limit. J. Mach. Learn. Res. 2021, 22, 1–47.
18. Zhang, Y.; Zhang, Z.; Luo, T.; Xu, Z.J. Embedding principle of loss landscape of deep neural networks. Adv. Neural Inf. Process. Syst. 2021, 34, 14848–14859.
19. Frankle, J.; Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv 2018, arXiv:1803.03635.
20. Liu, D.; Zheng, M.; Sepulveda, N.A. Using Artificial Neural Network Condensation to Facilitate Adaptation of Machine Learning in Medical Settings by Reducing Computational Burden: Model Design and Evaluation Study. JMIR Form. Res. 2021, 5, e20767.
21. Cheng, H.; Zhang, M.; Shi, J.Q. A survey on deep neural network pruning-taxonomy, comparison, analysis, and recommendations. arXiv 2023, arXiv:2308.06767.
22. Hanson, S.; Pratt, L. Comparing biases for minimal network construction with back-propagation. Adv. Neural Inf. Process. Syst. 1988, 1.
23. LeCun, Y.; Denker, J.; Solla, S. Optimal brain damage. Adv. Neural Inf. Process. Syst. 1989, 2. Available online: https://proceedings.neurips.cc/paper_files/paper/1989/file/6c9882bbac1c7093bd25041881277658-Paper.pdf (accessed on 27 April 2024).
24. You, Z.; Yan, K.; Ye, J.; Ma, M.; Wang, P. Gate decorator: Global filter pruning method for accelerating deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2019, 32. Available online: https://proceedings.neurips.cc/paper_files/paper/2019/file/b51a15f382ac914391a58850ab343b00-Paper.pdf (accessed on 27 April 2024).
25. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520.
26. Zirwes, T.; Zhang, F.; Habisreuther, P.; Hansinger, M.; Bockhorn, H.; Pfitzner, M.; Trimis, D. Quasi-DNS dataset of a piloted flame with inhomogeneous inlet conditions. Flow Turbul. Combust. 2020, 104, 997–1027.
Figure 1. Flow chart of automatic condensation reduction. The left side shows the main reduction process of automatic condensation reduction, while the right side details the layer-by-layer reduction process carried out in each main reduction.
Figure 2. Turbulent ignition. The image displays the temperature distributions obtained in combination with EBI for drm19, showing results from left to right for CVODE, the original fully connected neural network, and the neural network after the third reduction.
Figure 3. Training loss and validation loss of the FNN models. The subfigures (a–d) illustrate the training and validation losses from the original fully connected neural network model through to its third reduction.
Figure 4. Cosine similarity matrices of the 2nd layer at different stages. In the matrix heatmap, the element in the i-th row and j-th column represents the cosine similarity between the i-th neuron and the j-th neuron. The more distinct the blocks in the cosine matrix, the stronger the condensation of the neural network. From left to right, the images correspond to the original fully connected neural network model through to the model after the third reduction. “Epoch 0” indicates that the model has just been reduced and has not yet been trained. “Epoch 5000” represents the fully connected neural network model upon completion of training.
Figure 5. Convolutional block of MobileNetV2. In the figure, ‘Conv 1 × 1’ refers to layers where the convolutional kernels are 1 × 1, with the number of channels and the number of kernels varying per layer. ‘Conv 3 × 3’ indicates layers where the convolutional kernels are 3 × 3, with one channel, and the number of kernels may vary per layer. ‘BN’ stands for batch normalization, while ‘ReLU6’ and ‘Linear’ refer to activation functions.
Figure 6. Training loss and accuracy versus the number of parameters of the CNN model. CPN stands for Current Parameter Number, OPN stands for Original Parameter Number, and CPN/OPN represents their ratio. Each red dot in the figure represents a single-layer reduction. Subfigures (a,b), respectively, show how accuracy and training error vary with the number of model parameters. As the number of parameters decreases, accuracy tends to decrease and training error tends to increase; however, both metrics quickly return to their original levels.
Figure 7. The first main reduction. The red reduction ratio indicates the proportion of the layer’s parameters after reduction relative to the original number of parameters; a smaller ratio indicates a greater reduction. The green “Acc 0 epoch” represents the accuracy of the network after reduction without any further training, while the blue “Acc 1 epoch” represents the accuracy of the network after reduction with only one step of training.
Table 1. Comparison table of reduction process of FNN. “Parameter” represents the number of parameters in the neural network. Reduction rate represents the ratio of the current network parameter count to the original network parameter count. A lower reduction rate indicates a smaller network relative to the original, demonstrating a more effective reduction.
Model          Structure                  Validation Loss    Parameter    Reduction Ratio
Original       23-3200-1600-800-400-23    0.02276            6,802,800    100%
1st reduced    23-2205-1600-800-400-23    0.01694            5,187,915    76.26%
2nd reduced    23-1105-1600-800-400-23    0.01454            3,402,615    50.01%
3rd reduced    23-1105-1309-800-400-23    0.01434            2,848,260    41.87%
Table 2. Main structure of MobileNetV2 [25]. ‘ 1 × 1 ’ represents a convolutional kernel with a shape of 1 × 1 , where the number of channels is determined by the input dimensions. ‘ 3 × 3 ’ refers to a convolutional kernel that is single-channel and has a shape of 3 × 3 . ReLU6 and linear are the activation functions used after the convolutional layers. The numbers in the table represent the count of convolutional kernels in each corresponding layer of the convolutional layer.
Layer Index      2     3     4     5–6    7     8–10    11    12–13    14    15–16    17
1 × 1, ReLU6     96    144   144   192    192   384     384   576      576   960      960
3 × 3, ReLU6     96    144   144   192    192   384     384   576      576   960      960
1 × 1, linear    24    24    32    32     64    64      96    96       160   160      320
Table 3. Comparison table of the reduction process of the CNN. “Parameter” represents the number of parameters in the neural network. Reduction ratio represents the ratio of the current network parameter count to the original network parameter count. A lower reduction ratio indicates a smaller network relative to the original, demonstrating a more effective reduction. Accuracy refers to the accuracy on the test set. From the table, it is clear that even with significant changes in the reduction ratio, the accuracy only fluctuates slightly, showcasing the condensation reduction algorithm’s effectiveness in reducing the network size while maintaining model performance.
Model           Parameter     Reduction Ratio    Accuracy
Original        2,236,682     100%               88.16%
1st reduced     1,143,421     51.12%             88.01%
2nd reduced     910,370       40.70%             88.00%
3rd reduced     827,032       36.98%             87.24%
4th reduced     776,011       34.69%             86.36%
5th reduced     735,527       32.88%             85.88%
6th reduced     704,004       31.48%             85.80%
7th reduced     669,228       29.92%             83.88%
Final reduced   258,212       11.54%             83.21%