Small Contributions, Small Networks: Efficient Neural Network Pruning Based on Relative Importance

Mostafa Hussien
ÉTS, University of Quebec, Canada
mostafa.hussien@etsmtl.ca
Mahmoud Afifi
Google
Kim Khoa Nguyen
ÉTS, University of Quebec, Canada
Mohamed Cheriet
ÉTS, University of Quebec, Canada
Abstract

Recent advancements have scaled neural networks to unprecedented sizes, achieving remarkable performance across a wide range of tasks. However, deploying these large-scale models on resource-constrained devices poses significant challenges due to substantial storage and computational requirements. Neural network pruning has emerged as an effective technique to mitigate these limitations by reducing model size and complexity. In this paper, we introduce an intuitive and interpretable pruning method based on activation statistics, rooted in information theory and statistical analysis. Our approach leverages the statistical properties of neuron activations to identify and remove weights with minimal contributions to neuron outputs. Specifically, we build a distribution of weight contributions across the dataset and utilize its parameters to guide the pruning process. Furthermore, we propose a Pruning-aware Training strategy that incorporates an additional regularization term to enhance the effectiveness of our pruning method. Extensive experiments on multiple datasets and network architectures demonstrate that our method consistently outperforms several baseline and state-of-the-art pruning techniques.

1 Introduction

Deep learning has achieved remarkable results across various fields, from computer vision to natural language processing, by producing highly effective models such as large language models (LLMs) (e.g., Brown et al. (2020); Touvron et al. (2023); Gemini Team Google (2023)). These models have shown significant improvements in a wide range of applications, including machine translation (Lewis (2019)), question answering (Raffel et al. (2020)), and image classification (Abdelhamed et al. (2024)). However, as deep neural networks (DNNs) grow in size to handle increasingly complex problems, they require immense computational resources, both in terms of memory and processing power.

Network pruning, also referred to as network or model compression, aims to reduce the size of these networks, thereby decreasing their computational costs. This is achieved by removing specific weights from the model, setting them to zero based on certain pruning criteria. DNN pruning methods can be categorized into different groups based on the nature of the approach (e.g., data-free versus data-driven, or based on the pruning criteria used). We refer the reader to Cheng et al. (2024) for a thorough discussion of these categories. From a high-level perspective, we can categorize pruning methods into structural pruning (e.g., Wang et al. (2020a); Huang and Wang (2018); Liu et al. (2017); Theus et al. (2024); Ganjdanesh et al. (2024); Shi et al. (2024); Gadhikar and Burkholz (2024); Wu et al. (2024); Guo et al. (2023); Fang et al. (2023); He et al. (2017)), where entire filters or channels are removed, and unstructured pruning (e.g., Tanaka et al. (2020); Mason-Williams and Dahlqvist (2024); Choi et al. (2023); Lee et al. (2018); Su et al. (2020); Wang et al. (2020b); Bai et al. (2022); Mocanu et al. (2018); Han et al. (2015); Sun et al. (2024)), which performs weight-wise pruning. In the latter case, the network is typically retrained after pruning and it is common for pruning to be performed iteratively, where a smaller set of weights is selected for removal (i.e., set to zero), followed by retraining or fine-tuning the pruned model. This process is repeated until the target final pruning ratio is reached.

While data can provide valuable insights into how each neuron (or node) contributes to the final result, the majority of unstructured pruning methods rely solely on neuron weights, focusing on defining criteria to measure the significance of individual weight values. For instance, the magnitude-based pruning metric (Han et al. (2015)) removes weights whose magnitudes fall below a certain threshold.

Recent work, such as Wanda (Sun et al. (2024)), enhances the traditional weight magnitude pruning metric by incorporating input activations. Designed specifically for LLMs, Wanda is based on the observation that, at a certain scale, a small subset of hidden state features exhibits significantly larger magnitudes than others (Dettmers et al. (2022)). The pruning score in Wanda is computed as the product of the weight magnitude and the norm of the corresponding input activations, recognizing that input features can differ considerably in scale. While Wanda demonstrates promising results, it does not fully capture the true contribution of each neuron weight to the output feature, given the input features.

In this paper, we introduce a data-driven, unstructured pruning method that utilizes training data, or a subset thereof, to approximate the distribution of each weight’s importance in the network based on its contribution to the output of its corresponding node. By applying the Central Limit Theorem, we model the aggregated importance of weights as a normal distribution, which enables us to estimate the mutual information between a weight and the output of its associated node. This mutual information quantifies how much knowing the weight reduces uncertainty about the node’s output. Consequently, the more sensitive the node’s output is to changes in a weight, the more important that weight is and the less likely it is to be pruned. The gradient of the activation function has a clear impact on the performance of the pruning method, as it affects the distribution of the node’s output. Our proposed method is firmly grounded in both statistical analysis and information theory, drawing connections to the Central Limit Theorem and mutual information. Preliminary experiments demonstrate that our method consistently yields more accurate models, even at high compression rates, compared to alternative approaches.

2 Method

2.1 Activation Blind Range

Figure 1: The blind range of various activation functions, defined as the interval in which the gradient of the activation function is zero. In this range, the function’s output remains constant, providing a “safe zone” for pruning, where changes to the weights do not affect the model’s output. A wider blind range offers greater flexibility for pruning algorithms, allowing for more aggressive weight reduction without impacting performance. The blind range is highlighted by a yellow color in this figure.

The role of nonlinear activation functions has been widely studied in various aspects of neural network architectures, including their impact on training convergence, weight initialization, and stability. For example, activation functions play a crucial role in gradient propagation, influencing issues such as the vanishing and exploding gradient problems, which are critical for training deep networks. However, less focus has been given to the impact of activation functions on the susceptibility of neural networks to pruning. This study explores how the choice of activation function impacts the extent to which an architecture can be pruned without causing significant degradation in accuracy.

We introduce the concept of the “Blind Range” for a typical activation function, which refers to the interval where the derivative of the activation function is zero, see Fig. 1. In other words, the blind range represents the input range over which the activation function’s output remains constant. For instance, in the case of the ReLU activation, this range spans from negative infinity to zero.

We propose that the blind range of activation functions provides a safe zone for pruning: if pruning a weight causes the activation input to fall within this range, the output of the corresponding node remains unchanged and, as a result, the overall model performance is preserved. Additionally, small deviations from this blind range can be efficiently corrected during the fine-tuning phase. However, the effect of pruning may vary across different data points. To address this variability, it is necessary to adopt a statistical approach. Specifically, we suggest empirically constructing a distribution to quantify the impact of each weight across different subsets of the dataset, enabling more informed and robust pruning decisions. This is explained in more detail in the next sections.
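
As a minimal illustration of the blind range for ReLU, the sketch below zeroes one weight of a single node at a time and checks whether the node output changes. The inputs and weights are made-up values chosen for this example, not values from the paper, and `node_output` is a hypothetical helper.

```python
def relu(z):
    """ReLU activation: zero for z <= 0, identity otherwise."""
    return max(0.0, z)


def node_output(x, w, skip=None):
    """Output of a single ReLU node; `skip` is the index of a pruned weight."""
    z = sum(xi * wi for k, (xi, wi) in enumerate(zip(x, w)) if k != skip)
    return relu(z)


# Hypothetical inputs and weights: the full pre-activation is negative,
# so the unpruned node already sits inside the ReLU blind range.
x = [1.0, 2.0, 0.5]
w = [-1.5, 0.25, 1.0]

full = node_output(x, w)
for k in range(len(w)):
    pruned = node_output(x, w, skip=k)
    status = "unchanged" if pruned == full else "changed"
    print(f"prune w[{k}]: output {full} -> {pruned} ({status})")
```

In this toy case, removing the second or third weight keeps the pre-activation negative, so the output is unaffected, while removing the first weight pushes the node out of the blind range. This dependence on the particular input is why a per-sample, statistical view is needed.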

2.2 Relative Weight Contributions

Figure 2: A simplified illustrative example of the proposed pruning method applied to a single-node architecture. The left panel depicts a single node receiving three inputs, each connected by a corresponding weight. The center panel shows the calculation of the node output, $a_n$, prior to any pruning. The right panel demonstrates the effect of pruning the second weight, $w_{1,0} = 1.0$, and its subsequent impact on the final node output.

Figure 2 illustrates how a weight contributes to the activation of its associated neuron within the neural network architecture. Specifically, we analyze the contribution of a weight $w_{i,j}$ in a layer that receives an input vector of size $I$ and adopts an activation function $f$. We define the contribution function $\varsigma(\cdot)$ of a weight $w_{i,j}$ as:

$$
\begin{split}
a_j &= f\Big(\sum_{n=1}^{I} x_n \times w_{n,j}\Big), \\
\bar{a_j} &= f\Big(\sum_{n=1,\, n \neq i}^{I} x_n \times w_{n,j}\Big), \\
\varsigma(w_{i,j}) &= \left| \left( a_j - \bar{a_j} \right) / a_j \right|,
\end{split}
\tag{1}
$$

where $x_n$ is the $n^{th}$ input of the layer, $w_{n,j}$ is the weight connecting the $n^{th}$ input to the $j^{th}$ node in a typical layer, $a_j$ is the output of the $j^{th}$ node, and $\bar{a_j}$ is the same output computed with the weight $w_{i,j}$ removed. The magnitude of this contribution determines the actual importance of the corresponding weight in the final node activations and, consequently, indicates the extent to which the weight can be pruned. Given that the contributions of each weight vary with different data points, and considering the large number of data points, the distribution of these contributions over the epochs approaches a Gaussian distribution according to the Central Limit Theorem (Sirignano and Spiliopoulos (2020)). Utilizing the mean and standard deviation of the contributions’ distribution, we define a weight function that assigns a scalar value to each weight, representing its importance, as shown in Equation 2.

$$
\mathbb{I}(w_{i,j}) = s \times \left[ \alpha \times \mathbb{E}(\varsigma(w_{i,j})) + \beta \times \frac{1}{\epsilon + \sigma(\varsigma(w_{i,j}))} \right],
\tag{2}
$$

where $s = 2^{i}$ is a decaying factor that controls the contribution of each layer, and $\alpha$ and $\beta$ are weighting parameters that control the importance of the mean and the standard deviation, respectively. The term $\epsilon$ is a small number that avoids division by zero. After calculating the importance value of each weight based on its contribution, pruning becomes a straightforward process of applying the iterative weight pruning given in Algorithm 1.
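
The two equations above can be sketched for a single node in plain Python. This is an illustrative sketch only: the weights, samples, the value of $\epsilon$, and the `tiny` guard against $a_j = 0$ are assumptions of the example, and `contribution` and `importance` are hypothetical helper names rather than the authors' implementation.

```python
import statistics


def relu(z):
    """ReLU activation."""
    return max(0.0, z)


def contribution(x, w_col, i, f=relu, tiny=1e-8):
    """Relative contribution of weight i to one node for one sample (Eq. 1).

    `tiny` guards against a_j = 0; it is an assumption of this sketch,
    not part of Eq. 1.
    """
    a = f(sum(xn * wn for xn, wn in zip(x, w_col)))
    a_bar = f(sum(xn * wn for n, (xn, wn) in enumerate(zip(x, w_col)) if n != i))
    return abs(a - a_bar) / (abs(a) + tiny)


def importance(samples, w_col, i, alpha=1.0, beta=1e-7, s=1.0, eps=1e-8):
    """Importance of weight i (Eq. 2) from the mean and std of its contributions."""
    c = [contribution(x, w_col, i) for x in samples]
    mean = statistics.mean(c)
    std = statistics.pstdev(c)
    return s * (alpha * mean + beta / (eps + std))


# Hypothetical weights of one node and three hypothetical input samples.
w_col = [0.5, 1.0, -0.2]
samples = [[1.0, 1.0, 1.0], [2.0, 0.5, 1.0], [0.5, 2.0, 0.0]]

scores = [importance(samples, w_col, i) for i in range(len(w_col))]
least_important = min(range(len(scores)), key=scores.__getitem__)
print(scores, "-> first pruning candidate: weight", least_important)
```

With these toy values the third weight has the smallest mean contribution and would be the first pruning candidate. The layer factor `s` is passed in as a plain number here because the sketch covers a single layer.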

Figure 3: We propose a pruning metric based on the relative weight contribution of each neuron. The contribution function is computed by measuring the relative contribution of each neuron in every network layer. The illustration shows the $i$-th column in the fully connected weight matrix. We feed training samples into this layer to compute the output features (in blue). Then, we mask the $i$-th column (set to zero) and compute the output without its influence. The relative contribution of the $i$-th weights (in red) is then computed for each training example, representing the distribution of neuron contributions of this column.
Algorithm 1 Iterative Weight Pruning Algorithm
Input: Trained model $\text{Model}$ consisting of $L$ layers; dataset $\mathcal{D}$; target pruning percentage $P_t$; pruning per iteration $P_i$
Output: Pruned model $\text{Model}_p$
$\text{Model}_p \leftarrow \text{Model}$
while $P_t > 0$ do
    if $P_t \leq P_i$ then
        $P_{\text{current}} \leftarrow P_t$
        $P_t \leftarrow 0$
    else
        $P_{\text{current}} \leftarrow P_i$
        $P_t \leftarrow P_t - P_i$
    end if
    for all weights $w$ in $\text{Model}_p$ do
        Compute $\mathbb{I}(w)$
    end for
    threshold $= F^{-1}_{\mathbb{I}}(P_{\text{current}})$   // The $P_{\text{current}}$-th percentile of the weight importance distribution
    for all weights $w$ in $\text{Model}_p$ do
        if $\mathbb{I}(w) < \text{threshold}$ then
            Set $w \leftarrow 0$
        end if
    end for
    Retrain $\text{Model}_p$ on dataset $\mathcal{D}$
end while
return $\text{Model}_p$
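
The control flow of Algorithm 1 can be sketched in plain Python on a flat list of weights. Several things here are assumptions of the sketch rather than details given in the paper: pruning fractions are taken relative to the original weight count, the percentile threshold is realized by ranking the still-active weights, and `score_fn` and `retrain_fn` are hypothetical hooks standing in for $\mathbb{I}(w)$ and the retraining step. The usage example uses weight magnitude as a stand-in score only to keep the example self-contained.

```python
def iterative_prune(weights, score_fn, target, per_iter, retrain_fn=None):
    """Sketch of Algorithm 1 on a flat list of weights.

    target, per_iter: fractions of the original weight count (e.g. 0.5, 0.25).
    score_fn(weights, k): importance of weight k (hypothetical hook).
    retrain_fn(weights, mask): fine-tuning stand-in (hypothetical hook).
    """
    weights = list(weights)
    mask = [True] * len(weights)  # True = weight is still active
    remaining = target
    while remaining > 1e-12:
        current = min(remaining, per_iter)
        remaining -= current
        n_prune = round(current * len(weights))
        # Rank the active weights by importance; the lowest-scoring ones
        # fall below the percentile threshold and are set to zero.
        active = [k for k, keep in enumerate(mask) if keep]
        active.sort(key=lambda k: score_fn(weights, k))
        for k in active[:n_prune]:
            mask[k] = False
            weights[k] = 0.0
        if retrain_fn is not None:
            weights = retrain_fn(weights, mask)
    return weights, mask


# Toy usage: magnitude is only a stand-in for the importance score.
toy = [0.9, -0.1, 0.4, -0.7, 0.05, 0.3, -0.2, 0.6]
pruned, mask = iterative_prune(toy, lambda w, k: abs(w[k]), target=0.5, per_iter=0.25)
print(pruned)
```

A real `retrain_fn` would need to keep the masked weights at zero during fine-tuning, as the experimental section describes.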

2.3 Pruning-aware Training

In this section, we introduce a novel regularization term for training that enhances the effectiveness of our pruning algorithm. As previously discussed, the probability of pruning a weight increases as its mutual information with the node’s output decreases. In other words, if a node’s output frequently falls within the blind range, the weights associated with that node are more likely to be pruned. Therefore, increasing the probability of a neuron’s output being in the blind range improves the pruning process.

To achieve this, we propose a loss function that incorporates an additional term penalizing the model based on the magnitude of the nodes’ outputs. The modified loss function is defined as:

$$
\mathcal{L}_{n} = \mathcal{L}_{\text{o}} + \lambda_{rL1} \sum_{i} \left| a_i \right|
\tag{3}
$$

where $\mathcal{L}_{n}$ is the new loss function used for training the model, $\mathcal{L}_{\text{o}}$ is the original loss function, $\lambda_{rL1}$ is a regularization hyperparameter that controls the contribution of the regularization term to the final loss, and $a_i$ is the output of the $i$-th neuron. Note that the regularization component, which is the $L1$ norm of the nodes’ outputs, encourages sparse neuron outputs. As demonstrated in the results section, incorporating this regularization term significantly enhances the performance of our proposed pruning algorithm. The following section presents the theoretical foundations of the proposed pruning technique from the perspective of information theory.
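
A minimal sketch of the regularized loss in Equation 3, assuming the node outputs have already been collected into a flat list. The loss value, activations, and the value of $\lambda_{rL1}$ below are hypothetical, and `pruning_aware_loss` is an illustrative name.

```python
def pruning_aware_loss(original_loss, activations, lam):
    """Eq. 3: original loss plus lam times the L1 norm of the node outputs."""
    return original_loss + lam * sum(abs(a) for a in activations)


# Hypothetical values: a task loss of 0.40 and four node outputs.
activations = [0.0, 1.5, -0.5, 2.0]
loss = pruning_aware_loss(0.40, activations, lam=0.01)
print(loss)  # roughly 0.44: the penalty adds 0.01 * 4.0
```

Only nonzero outputs are penalized, so the extra term shrinks as more nodes output zero.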

Speedup: To enhance the efficiency of the proposed algorithm, we replaced the weight-by-weight computation of weight contributions with an optimized matrix multiplication algorithm that computes the contributions of entire columns of weights simultaneously. This modification resulted in a significant improvement in computational speed compared to the one-by-one method (see Fig. 3).
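
The paper does not spell out the optimized computation, so the following is only a sketch of one identity that a batched implementation could exploit: the masked pre-activation equals the full pre-activation minus the single term $x_i \times w_{i,j}$, so each node's sum needs to be computed only once. It is written in scalar Python for clarity; the function names, toy values, and the `tiny` guard in the denominator are assumptions of the example.

```python
def relu(z):
    """ReLU activation."""
    return max(0.0, z)


def contributions_naive(x, W, tiny=1e-8):
    """Recompute the masked sum from scratch for every weight W[i][j]."""
    I, J = len(W), len(W[0])
    out = [[0.0] * J for _ in range(I)]
    for j in range(J):
        a = relu(sum(x[n] * W[n][j] for n in range(I)))
        for i in range(I):
            a_bar = relu(sum(x[n] * W[n][j] for n in range(I) if n != i))
            out[i][j] = abs(a - a_bar) / (abs(a) + tiny)
    return out


def contributions_fast(x, W, tiny=1e-8):
    """Compute each pre-activation once, then subtract the masked term."""
    I, J = len(W), len(W[0])
    z = [sum(x[n] * W[n][j] for n in range(I)) for j in range(J)]
    a = [relu(zj) for zj in z]
    return [[abs(a[j] - relu(z[j] - x[i] * W[i][j])) / (abs(a[j]) + tiny)
             for j in range(J)] for i in range(I)]


# Hypothetical layer with 3 inputs and 2 nodes (W[i][j]: input i -> node j).
x = [1.0, 2.0, 0.5]
W = [[0.5, -1.0], [0.25, 0.5], [1.0, -2.0]]
slow, fast = contributions_naive(x, W), contributions_fast(x, W)
print(fast)
```

The naive version costs on the order of $I^2$ multiplications per node, while the second version costs on the order of $I$; with array libraries, the same subtraction can be applied to whole columns and batches at once.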

3 Mutual Information

In this section, we analyze the mutual information between the activation output of the $i^{th}$ node in the $l^{th}$ layer, $a^{l}_{i}$, and the weight $w^{l}_{ij}$ connecting neuron $i$ to the $j^{th}$ input (Czyż et al. (2024)). Recall that the activation output is given by:

$$
a^{l}_{i} = \phi(z_i), \quad \text{where} \; z_i = \sum_{j} w_{ij} x_j,
\tag{4}
$$

with $\phi(\cdot)$ representing the activation function and $x_j$ the $j^{th}$ input from the preceding layer. The mutual information $I(a_i; w_{ij})$ quantifies the reduction in uncertainty of $a_i$ due to knowledge of $w_{ij}$, calculated as:

$$
I(a_i; w_{ij}) = H(a_i) - H(a_i \mid w_{ij}).
\tag{5}
$$

Here, $H(a_i)$ is the entropy of the activation output $a_i$, defined as $H(a_i) = -\sum_{a_i} P(a_i) \log P(a_i)$, and $H(a_i \mid w_{ij})$ is the conditional entropy given by:

$$
H(a_i \mid w_{ij}) = -\int_{w_{ij}} P(w_{ij}) \sum_{a_i} P(a_i \mid w_{ij}) \log P(a_i \mid w_{ij}) \, dw_{ij}.
\tag{6}
$$

A high mutual information indicates a significant dependency between $a_i$ and $w_{ij}$, implying that pruning $w_{ij}$ would substantially alter $a_i$ and potentially degrade network performance. Conversely, a low mutual information suggests that $w_{ij}$ has little influence on $a_i$, making it a suitable candidate for pruning without affecting the model’s accuracy. Incorporating the concept of the Blind Range, where the activation function’s derivative $\phi^{\prime}(z_i) = 0$ and $a_i$ remains constant, we observe that the mutual information $I(a_i; w_{ij})$ is minimal when $a_i$ operates predominantly within this range. This reinforces our strategy of pruning weights that are highly likely to be associated with activations in the Blind Range.
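
As a worked limiting case under the discrete entropy used in Equation 5 (an idealization added here for illustration): suppose $a_i$ takes a single constant value $c$ for every input, as happens when a ReLU node always stays in its blind range. Then $P(a_i = c) = 1$, and because conditional entropy is non-negative,

$$
H(a_i) = -1 \cdot \log 1 = 0, \qquad 0 \le I(a_i; w_{ij}) = H(a_i) - H(a_i \mid w_{ij}) \le H(a_i) = 0 \;\Longrightarrow\; I(a_i; w_{ij}) = 0 .
$$

In this idealized case the weight carries no information about the node output, which matches the intuition that it can be removed without changing that output.
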
To compute $I(a_i; w_{ij})$ in practice, we estimate the probability distributions $P(a_i)$ and $P(a_i \mid w_{ij})$ by collecting samples of activation outputs and corresponding weights across the dataset. For continuous variables, we compute differential entropy:

$$
h(a_i) = -\int P(a_i) \log P(a_i) \, da_i,
\tag{7}
$$

$$
h(a_i \mid w_{ij}) = -\int P(w_{ij}) \int P(a_i \mid w_{ij}) \log P(a_i \mid w_{ij}) \, da_i \, dw_{ij},
\tag{8}
$$

By calculating these entropy values and the resulting mutual information, we can rank the weights based on their $I(a_i; w_{ij})$ values. We then define a threshold $\tau$ below which weights are considered for pruning. This method allows us to systematically identify and remove redundant weights, enhancing model efficiency while maintaining performance. This analysis of mutual information provides a robust, mathematically grounded framework for neural network pruning based on the concept of the activation’s Blind Range, contributing to the development of more efficient and compact models.
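
The estimation step can be illustrated with a simple histogram (plug-in) estimator. This is a sketch under strong assumptions and not the authors' estimator: the weight is treated as a random variable drawn uniformly from a hypothetical range, the node has a single input plus a hypothetical bias, and equal-width binning stands in for a proper density estimate.

```python
import math
import random
from collections import Counter


def mutual_information(pairs, bins=8):
    """Plug-in estimate of I(A; W) in nats from (a, w) samples, equal-width bins."""
    def binner(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0  # constant variable -> single bin
        return lambda v: min(int((v - lo) / width), bins - 1)

    bin_a = binner([a for a, _ in pairs])
    bin_w = binner([w for _, w in pairs])
    joint = Counter((bin_a(a), bin_w(w)) for a, w in pairs)
    pa, pw = Counter(), Counter()
    for (ia, iw), c in joint.items():
        pa[ia] += c
        pw[iw] += c
    n = len(pairs)
    # Sum of p(a, w) * log(p(a, w) / (p(a) * p(w))) over occupied bins.
    return sum((c / n) * math.log(c * n / (pa[ia] * pw[iw]))
               for (ia, iw), c in joint.items())


def relu(z):
    return max(0.0, z)


random.seed(0)
ws = [random.uniform(-1.0, 1.0) for _ in range(5000)]  # hypothetical weight samples
xs = [random.uniform(0.5, 1.5) for _ in range(5000)]   # hypothetical inputs

active = [(relu(w * x + 2.0), w) for w, x in zip(ws, xs)]  # node always active
blind = [(relu(w * x - 2.0), w) for w, x in zip(ws, xs)]   # node always in blind range

mi_active = mutual_information(active)
mi_blind = mutual_information(blind)
print(mi_active, mi_blind)
```

With the bias of $-2.0$ the pre-activation is always negative, the output is constant, and the estimate is zero; with the bias of $+2.0$ the output tracks the weight and the estimate is clearly positive. Plug-in estimates of this kind are biased upward for small samples, so they are better suited to ranking weights than to reporting absolute values.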

4 Experimental Results

In this section, we validate our method on the MNIST dataset (LeCun (1998)) for the image classification task. We fixed the network architecture to a simple fully connected network consisting of 3 layers. The first layer has 784 $\times$ 392 weights, accepting flattened images of size 28 $\times$ 28 and outputting 392 features, followed by a ReLU activation function. The second layer processes the latent representation and produces 196 output features, again followed by a ReLU activation function. The final layer outputs a 10-dimensional vector representing the logits for the MNIST classes. We trained the network for 10 epochs with cross-entropy loss, using the Adam optimizer with a learning rate of 0.001 and betas = (0.9, 0.999).

To perform pruning, we adopted an iterative pruning strategy. After training the original model, we iteratively prune the network as follows: First, we compute a score for each weight neuron based on the pruning metric, either using our method’s contribution metric or the criteria of other methods for comparison. Next, we determine a threshold value from these scores, based on a fixed pruning ratio per iteration, and create a mask that zeros out neurons with scores below the threshold. This process is repeated iteratively, pruning neurons at each step until the final target pruning ratio is reached. After each pruning iteration, we fine-tune the model by applying the current pruning mask and training for one epoch using a learning rate of 0.0001. The mask is updated in each iteration to include the newly pruned neurons, representing all pruned neurons by the end of the process.

Note that since our method relies on input data to compute the contribution of each neuron, we investigate the impact of using different subsets of the training dataset for this purpose in the pruning process. The final mean and standard deviation of the computed contributions are calculated across the training examples used in our pruning procedure to derive the final contribution score, as described in Equation 2. We apply the same approach for examining different subsets of training data for Wanda Sun et al. (2024), as it also depends on input data to compute the pruning metric for weights.

Table 1 shows the results of using different values of $\beta$ in Equation 2 and the impact of the layer decay factor, $s$. The best results were achieved with $\beta$ = 1e-7 and incorporating the layer decay factor, $s$, which intuitively makes sense, as the importance of neurons should be scaled based on the layer depth in the network.

We consider the following pruning metrics for comparisons: (1) random pruning, where neurons are randomly selected for pruning based on the target ratio; (2) magnitude-based pruning (Han et al. (2015)), where neurons with the smallest weight magnitudes are pruned; and (3) Wanda (Sun et al. (2024)), which extends magnitude-based pruning by additionally considering the input feature norm statistics. The results of these comparisons are shown in Table 2, where we report performance with different pruning ratios per iteration (and thus different numbers of pruning iterations), as well as using varying portions of the training dataset for data-driven methods (i.e., Wanda (Sun et al. (2024)) and ours). As shown, with only 2% of the training data used to compute the contribution score, our method achieves the best accuracy at every per-iteration pruning ratio, tying Wanda (10%) at the 5% ratio. With 0.5% of the training data, our method leads the baselines only at the 25% ratio.

Table 1: Ablation studies on the impact of $\beta$ and the layer decay factor, $s$, on our results. The experiments were conducted using the ReLU activation function with a simple fully connected network, which achieved 97.02% accuracy without pruning. In these experiments, the target pruning ratio was set to 50% (i.e., eliminating 50% of the network weights). All reported results are presented as percentages. The best results are highlighted in yellow.

| Configuration / Pruning ratio per iteration | 25% | 15% | 10% | 5% |
|---|---|---|---|---|
| $\alpha = 1$, $\beta = 10^{-7}$, w/o $s$ | 97.61 | 97.45 | 97.91 | 97.93 |
| $\alpha = 1$, $\beta = 0$, w/ $s$ | 97.62 | 97.49 | 97.90 | 98.0 |
| $\alpha = 1$, $\beta = 10^{-5}$, w/ $s$ | 96.14 | 96.63 | 97.28 | 97.59 |
| $\alpha = 1$, $\beta = 10^{-7}$, w/ $s$ | 97.84 | 97.52 | 97.92 | 98.03 |

Table 2: Comparisons with other methods. The original model achieved an accuracy of 97.02% without pruning, and the target pruning ratio was set to 50% (i.e., eliminating 50% of the network weights). We compare our method with random pruning, magnitude pruning Han et al. (2015), and Wanda Sun et al. (2024). For the data-driven methods (Wanda and ours), we evaluated performance with varying percentages of training data, noted alongside each method’s name. All reported results are presented as percentages. The best results are highlighted in yellow, while the second-best results are highlighted in green.

| Method / Pruning ratio per iteration | 25% | 15% | 10% | 5% |
|---|---|---|---|---|
| Random | 93.19 | 91.24 | 79.22 | 9.80 |
| Magnitude Han et al. (2015) | 97.29 | 97.77 | 97.99 | 98.11 |
| Wanda (0.5%) Sun et al. (2024) | 97.47 | 97.72 | 98.16 | 98.3 |
| Wanda (2%) Sun et al. (2024) | 97.44 | 97.82 | 98.18 | 98.32 |
| Wanda (10%) Sun et al. (2024) | 97.4 | 97.87 | 98.21 | 98.33 |
| Wanda (20%) Sun et al. (2024) | 97.38 | 97.86 | 98.2 | 98.28 |
| Wanda (50%) Sun et al. (2024) | 97.41 | 97.91 | 97.92 | 98.26 |
| Wanda (100%) Sun et al. (2024) | 97.42 | 97.96 | 97.98 | 98.31 |
| Ours (0.5%) | 97.84 | 97.52 | 97.92 | 98.03 |
| Ours (2%) | 98.04 | 98.11 | 98.37 | 98.33 |
| Ours (10%) | 98.26 | 98.36 | 98.41 | 98.34 |
| Ours (20%) | 98.24 | 98.35 | 98.47 | 98.41 |
| Ours (50%) | 98.36 | 98.34 | 98.38 | 98.40 |
| Ours (100%) | 98.29 | 98.36 | 98.30 | 98.44 |

Table 3 shows the results of experiments using activation functions other than ReLU, which was used in the previous experiments. To evaluate the generalizability of our approach, we report results using the Leaky ReLU, Sigmoid, and Tanh activation functions. As shown, when 20% or 100% of the training data is used, our method outperforms the other pruning techniques for every tested activation function and pruning ratio per iteration. With only 0.5% of the data, it remains competitive but does not lead in every setting.

Table 3: Results using different activation functions, with the accuracy without pruning indicated alongside each activation function: Leaky ReLU (97.35%), Sigmoid (97.64%), and Tanh (96.12%). Columns give the pruning ratio per iteration for each activation function. In these experiments, the final pruning ratio was set to 50% (i.e., eliminating 50% of the network weights). We compare our method with random pruning, magnitude pruning Han et al. (2015), and Wanda Sun et al. (2024). For data-driven methods (Wanda and ours), we evaluated the performance with varying percentages of training data, indicated alongside each method’s name. All reported results are presented as percentages. The best results are highlighted in yellow, while the second-best results are highlighted in green.

| Method | Leaky ReLU, 25% | Leaky ReLU, 15% | Leaky ReLU, 10% | Leaky ReLU, 5% | Sigmoid, 25% | Sigmoid, 15% | Sigmoid, 10% | Sigmoid, 5% | Tanh, 25% | Tanh, 15% | Tanh, 10% | Tanh, 5% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Random | 93.19 | 92.69 | 88.09 | 9.80 | 93.94 | 83.55 | 84.70 | 9.80 | 93.36 | 92.21 | 39.73 | 9.80 |
| Magnitude Han et al. (2015) | 97.15 | 97.58 | 97.89 | 97.87 | 89.45 | 92.28 | 94.26 | 96.09 | 92.74 | 94.14 | 95.66 | 97.20 |
| Wanda (0.5%) Sun et al. (2024) | 96.77 | 97.21 | 97.78 | 97.88 | 91.19 | 94.66 | 96.22 | 96.91 | 93.11 | 94.9 | 96.47 | 97.35 |
| Wanda (20%) Sun et al. (2024) | 96.68 | 97.15 | 97.72 | 97.86 | 91.37 | 94.44 | 95.97 | 96.89 | 93.3 | 94.68 | 96.39 | 97.20 |
| Wanda (100%) Sun et al. (2024) | 96.63 | 97.20 | 97.79 | 97.89 | 91.40 | 94.42 | 96.18 | 96.83 | 93.21 | 94.67 | 96.33 | 97.26 |
| Ours (0.5%) | 97.03 | 97.5 | 97.81 | 98.01 | 94.64 | 95.44 | 96.09 | 96.62 | 96.17 | 96.36 | 96.86 | 97.1 |
| Ours (20%) | 97.97 | 98.18 | 98.12 | 98.22 | 95.79 | 96.47 | 97.14 | 97.56 | 97.09 | 97.35 | 97.51 | 97.76 |
| Ours (100%) | 98.01 | 98.12 | 98.21 | 98.18 | 96.08 | 96.66 | 97.15 | 97.70 | 97.08 | 97.50 | 97.54 | 97.82 |

So far, we have fixed the final target pruning ratio at 50% of the original weights. Table 4 shows the results for other final pruning ratios, with the pruning ratio per iteration set to 5%. Even at an aggressive target pruning ratio of 75%, our method achieves an accuracy of 97.63%, which exceeds the 97.02% accuracy of the original, unpruned model. The other methods lose considerably more accuracy at this ratio, with magnitude-based pruning Han et al. (2015) yielding 92.52% and Wanda Sun et al. (2024) achieving 90.32%.

Table 4: Results using different target pruning ratios (10%, 50%, 75%). Here, we used a fixed pruning ratio per iteration (5%) and compare our method with magnitude pruning Han et al. (2015) and Wanda Sun et al. (2024). For data-driven methods (Wanda and ours), we evaluated the performance by using the full training data for pruning. All reported results are presented as percentages. The best results are highlighted in yellow, while the second-best results are highlighted in green.

| Method / Target pruning ratio | 10% | 50% | 75% |
|---|---|---|---|
| Magnitude Han et al. (2015) | 98.58 | 98.11 | 92.52 |
| Wanda Sun et al. (2024) | 98.60 | 98.31 | 90.32 |
| Ours | 98.66 | 98.44 | 97.63 |

Table 5: Results with $\lambda_{r_{L1}}$ using different activation functions. For each activation function, the accuracies of the original model (without pruning) without and with $\lambda_{r_{L1}}$ are as follows: ReLU (w/o $\lambda_{r_{L1}}$: 97.02%, w/ $\lambda_{r_{L1}}$: 97.4%), Tanh (w/o $\lambda_{r_{L1}}$: 96.12%, w/ $\lambda_{r_{L1}}$: 95.78%), and Sigmoid (w/o $\lambda_{r_{L1}}$: 97.64%, w/ $\lambda_{r_{L1}}$: ). The target pruning ratios were set to 50% and 75%, with a 25% pruning ratio per iteration. We compare our method with random pruning, magnitude pruning Han et al. (2015), and Wanda Sun et al. (2024). For data-driven methods (Wanda and ours), we evaluated the performance with varying percentages of training data, indicated alongside each method’s name. All reported results are presented as percentages. The best results are highlighted in yellow, while the second-best results are highlighted in green.

| Method | Target 50%, ReLU | Target 50%, Tanh | Target 50%, Sigmoid | Target 75%, ReLU | Target 75%, Tanh | Target 75%, Sigmoid |
|---|---|---|---|---|---|---|
| Random | 93.34 | 92.92 | 81.94 | 84.71 | 83.97 | 68.75 |
| Magnitude Han et al. (2015) | 96.43 | 94.87 | 93.54 | 68.1 | 74.05 | 60.23 |
| Wanda (0.5%) Sun et al. (2024) | 97.16 | 95.34 | 96.02 | 90.83 | 91.07 | 85.46 |
| Wanda (100%) Sun et al. (2024) | 97.21 | 95.48 | 96.07 | 90.74 | 91.07 | 84.84 |
| Ours (0.5%) | 97.80 | 97.10 | 97.39 | 96.52 | 95.02 | 92.52 |
| Ours (100%) | 97.87 | 96.98 | 97.51 | 98.15 | 94.79 | 92.62 |

Table 5 reports the results of applying the proposed loss function in Equation 3 with the indicated regularization term. As discussed in Section 2, the hyperparameter $\lambda_{r_{L1}}$ controls the contribution of the regularization to the overall loss function. With this regularization term, our method reduces the model size by 75% while retaining 98.15% accuracy with ReLU when the full training set is used to compute the contribution score. The drop in accuracy is larger for Tanh and Sigmoid, although our method remains well ahead of the baselines in both cases.

5 Conclusion

In this paper, we introduced an interpretable, simple, yet effective pruning method. By defining the concept of the activation blind range, we investigated the underexplored question of how activation functions affect an architecture’s susceptibility to pruning. We presented a statistical framework for the proposed pruning method, grounded in the Central Limit Theorem and the concept of mutual information. Our findings indicate that, for unstructured pruning, considering the mutual information between each weight and its associated node leads to a simple and powerful pruning strategy. Moreover, we obtained considerable improvements from what we call ’Pruning-aware Training’, which incorporates an extra loss term that encourages the model to push node outputs toward the blind range of the activations. The experimental results confirm the effectiveness of the proposed method across various experimental settings.

References

  • Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In NeurIPS, 2020.
  • Touvron et al. [2023] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • Gemini Team Google [2023] Gemini Team Google. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  • Lewis [2019] M Lewis. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.
  • Raffel et al. [2020] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
  • Abdelhamed et al. [2024] Abdelrahman Abdelhamed, Mahmoud Afifi, and Alec Go. What do you see? Enhancing zero-shot image classification with multimodal large language models. arXiv preprint arXiv:2405.15668, 2024.
  • Cheng et al. [2024] Hongrong Cheng, Miao Zhang, and Javen Qinfeng Shi. A survey on deep neural network pruning: Taxonomy, comparison, analysis, and recommendations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
  • Wang et al. [2020a] Yulong Wang, Xiaolu Zhang, Lingxi Xie, Jun Zhou, Hang Su, Bo Zhang, and Xiaolin Hu. Pruning from scratch. In AAAI, 2020a.
  • Huang and Wang [2018] Zehao Huang and Naiyan Wang. Data-driven sparse structure selection for deep neural networks. In ECCV, 2018.
  • Liu et al. [2017] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In ICCV, 2017.
  • Theus et al. [2024] Alexander Theus, Olin Geimer, Friedrich Wicke, Thomas Hofmann, Sotiris Anagnostidis, and Sidak Pal Singh. Towards meta-pruning via optimal transport. arXiv preprint arXiv:2402.07839, 2024.
  • Ganjdanesh et al. [2024] Alireza Ganjdanesh, Shangqian Gao, and Heng Huang. Jointly training and pruning CNNs via learnable agent guidance and alignment. In CVPR, pages 16058–16069, 2024.
  • Shi et al. [2024] Xinyu Shi, Jianhao Ding, Zecheng Hao, and Zhaofei Yu. Towards energy efficient spiking neural networks: An unstructured pruning framework. In ICLR, 2024.
  • Gadhikar and Burkholz [2024] Advait Gadhikar and Rebekka Burkholz. Masks, signs, and learning rate rewinding. arXiv preprint arXiv:2402.19262, 2024.
  • Wu et al. [2024] Xidong Wu, Shangqian Gao, Zeyu Zhang, Zhenzhen Li, Runxue Bao, Yanfu Zhang, Xiaoqian Wang, and Heng Huang. Auto-train-once: Controller network guided automatic network pruning from scratch. In CVPR, 2024.
  • Guo et al. [2023] Song Guo, Lei Zhang, Xiawu Zheng, Yan Wang, Yuchao Li, Fei Chao, Chenglin Wu, Shengchuan Zhang, and Rongrong Ji. Automatic network pruning via Hilbert-Schmidt independence criterion lasso under information bottleneck principle. In ICCV, 2023.
  • Fang et al. [2023] Gongfan Fang, Xinyin Ma, Mingli Song, Michael Bi Mi, and Xinchao Wang. DepGraph: Towards any structural pruning. In CVPR, 2023.
  • He et al. [2017] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In ICCV, 2017.
  • Tanaka et al. [2020] Hidenori Tanaka, Daniel Kunin, Daniel L Yamins, and Surya Ganguli. Pruning neural networks without any data by iteratively conserving synaptic flow. In NeurIPS, 2020.
  • Mason-Williams and Dahlqvist [2024] Gabryel Mason-Williams and Fredrik Dahlqvist. What makes a good prune? Maximal unstructured pruning for maximal cosine similarity. In ICLR, 2024.
  • Choi et al. [2023] Moonseok Choi, Hyungi Lee, Giung Nam, and Juho Lee. Sparse weight averaging with multiple particles for iterative magnitude pruning. arXiv preprint arXiv:2305.14852, 2023.
  • Lee et al. [2018] Namhoon Lee, Thalaiyasingam Ajanthan, and Philip HS Torr. SNIP: Single-shot network pruning based on connection sensitivity. arXiv preprint arXiv:1810.02340, 2018.
  • Su et al. [2020] Jingtong Su, Yihang Chen, Tianle Cai, Tianhao Wu, Ruiqi Gao, Liwei Wang, and Jason D Lee. Sanity-checking pruning methods: Random tickets can win the jackpot. In NeurIPS, 2020.
  • Wang et al. [2020b] Chaoqi Wang, Guodong Zhang, and Roger Grosse. Picking winning tickets before training by preserving gradient flow. arXiv preprint arXiv:2002.07376, 2020b.
  • Bai et al. [2022] Yue Bai, Huan Wang, Zhiqiang Tao, Kunpeng Li, and Yun Fu. Dual lottery ticket hypothesis. arXiv preprint arXiv:2203.04248, 2022.
  • Mocanu et al. [2018] Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H Nguyen, Madeleine Gibescu, and Antonio Liotta. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature communications, 9(1):2383, 2018.
  • Han et al. [2015] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In NeurIPS, 2015.
  • Sun et al. [2024] Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. A simple and effective pruning approach for large language models. In ICLR, 2024.
  • Dettmers et al. [2022] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. GPT3.int8(): 8-bit matrix multiplication for transformers at scale. In NeurIPS, 2022.
  • Sirignano and Spiliopoulos [2020] Justin Sirignano and Konstantinos Spiliopoulos. Mean field analysis of neural networks: A central limit theorem. Stochastic Processes and their Applications, 130(3):1820–1852, 2020.
  • Czyż et al. [2024] Paweł Czyż, Frederic Grabowski, Julia Vogt, Niko Beerenwinkel, and Alexander Marx. Beyond normal: On the evaluation of mutual information estimators. Advances in Neural Information Processing Systems, 36, 2024.
  • LeCun [1998] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.