Small Contributions, Small Networks: Efficient Neural Network Pruning Based on Relative Importance

Mostafa Hussien
ÉTS, University of Quebec, Canada
mostafa.hussien@etsmtl.ca
Mahmoud Afifi
Google
Kim Khoa Nguyen
ÉTS, University of Quebec, Canada
Mohamed Cheriet
ÉTS, University of Quebec, Canada

Abstract

Recent advancements have scaled neural networks to unprecedented sizes, achieving remarkable performance across a wide range of tasks. However, deploying these large-scale models on resource-constrained devices poses significant challenges due to substantial storage and computational requirements. Neural network pruning has emerged as an effective technique to mitigate these limitations by reducing model size and complexity. In this paper, we introduce an intuitive and interpretable pruning method based on activation statistics, rooted in information theory and statistical analysis. Our approach leverages the statistical properties of neuron activations to identify and remove weights with minimal contributions to neuron outputs. Specifically, we build a distribution of weight contributions across the dataset and utilize its parameters to guide the pruning process. Furthermore, we propose a Pruning-aware Training strategy that incorporates an additional regularization term to enhance the effectiveness of our pruning method. Extensive experiments on multiple datasets and network architectures demonstrate that our method consistently outperforms several baseline and state-of-the-art pruning techniques.

1 Introduction

Deep learning has achieved remarkable results across various fields, from computer vision to natural language processing, by producing highly effective models such as large language models (LLMs) (e.g., Brown et al. (2020); Touvron et al. (2023); Gemini Team Google (2023)). These models have shown significant improvements in a wide range of applications, including machine translation (Lewis (2019)), question answering (Raffel et al. (2020)), and image classification (Abdelhamed et al. (2024)). However, as deep neural networks (DNNs) grow in size to handle increasingly complex problems, they require immense computational resources, both in terms of memory and processing power.

Network pruning, also referred to as network or model compression, aims to reduce the size of these networks, thereby decreasing their computational costs. This is achieved by removing specific weights from the model, setting them to zero based on certain pruning criteria. DNN pruning methods can be categorized into different groups based on the nature of the approach (e.g., data-free versus data-driven, or based on the pruning criteria used). We refer the reader to Cheng et al. (2024) for a thorough discussion of these categories. From a high-level perspective, we can categorize pruning methods into structural pruning (e.g., Wang et al. (2020a); Huang and Wang (2018); Liu et al. (2017); Theus et al. (2024); Ganjdanesh et al. (2024); Shi et al. (2024); Gadhikar and Burkholz (2024); Wu et al. (2024); Guo et al. (2023); Fang et al. (2023); He et al. (2017)), where entire filters or channels are removed, and unstructured pruning (e.g., Tanaka et al. (2020); Mason-Williams and Dahlqvist (2024); Choi et al. (2023); Lee et al. (2018); Su et al. (2020); Wang et al. (2020b); Bai et al. (2022); Mocanu et al. (2018); Han et al. (2015); Sun et al. (2024)), which performs weight-wise pruning. In the latter case, the network is typically retrained after pruning and it is common for pruning to be performed iteratively, where a smaller set of weights is selected for removal (i.e., set to zero), followed by retraining or fine-tuning the pruned model. This process is repeated until the target final pruning ratio is reached.

While data can provide valuable insights into how each neuron (or node) contributes to the final result, the majority of unstructured pruning methods rely solely on the weights themselves, focusing on defining criteria to measure the significance of individual weight values. For instance, the magnitude-based pruning metric Han et al. (2015) removes weights whose magnitudes fall below a certain threshold.

Recent work, such as Wanda Sun et al. (2024), enhances the traditional weight magnitude pruning metric by incorporating input activations. Designed specifically for LLMs, Wanda is based on the observation that, at a certain scale, a small subset of hidden state features exhibits significantly larger magnitudes than others Dettmers et al. (2022). The pruning score in Wanda is computed as the product of the weight magnitude and the norm of the corresponding input activations, recognizing that input features can differ considerably in scale. While Wanda demonstrates promising results, it does not fully capture the true contribution of each weight to the output feature, given the input features.
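The difference between the two metrics can be illustrated with a small sketch. It assumes the score described above, i.e., the weight magnitude multiplied by the L2 norm of the corresponding input feature over a set of calibration samples; the function names and the toy numbers are hypothetical and are not taken from the Wanda implementation.

```python
import math

def magnitude_scores(weights):
    """|w| for every weight; weights[i][j] connects input i to node j."""
    return [[abs(w) for w in row] for row in weights]

def wanda_style_scores(weights, inputs):
    """|w_ij| * ||x_i||_2, with the norm taken over the calibration samples."""
    n_in = len(weights)
    norms = [math.sqrt(sum(sample[i] ** 2 for sample in inputs))
             for i in range(n_in)]
    return [[abs(w) * norms[i] for w in row] for i, row in enumerate(weights)]

# Toy setup (assumed values): two inputs feeding a single node.
weights = [[0.5], [0.1]]
inputs = [[0.1, 3.0], [0.2, 4.0]]  # two calibration samples

mag = magnitude_scores(weights)
wanda = wanda_style_scores(weights, inputs)
print(mag, wanda)
```

In this toy case the small weight attached to the large-scale input receives the higher activation-aware score, so the two metrics rank the weights in opposite order.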

In this paper, we introduce a data-driven, unstructured pruning method that utilizes training data, or a subset thereof, to approximate the distribution of each weight’s importance in the network based on its contribution to the output of its corresponding node. By applying the Central Limit Theorem, we model the aggregated importance of weights as a normal distribution, which enables us to estimate the mutual information between a weight and the output of its associated node. This mutual information quantifies how much knowing the weight reduces uncertainty about the node’s output. Consequently, the more sensitive the node’s output is to changes in a weight, the more important that weight is and the less likely it is to be pruned. The gradient of the activation function has a clear impact on the performance of the pruning method, as it affects the distribution of the node’s output. Our proposed method is firmly grounded in both statistical analysis and information theory, drawing connections to the Central Limit Theorem and mutual information. Preliminary experiments demonstrate that our method consistently yields more accurate models, even at high compression rates, compared to alternative approaches.

2 Method

2.1 Activation Blind Range

Figure 1: The blind range of various activation functions, defined as the interval in which the gradient of the activation function is zero. In this range, the function’s output remains constant, providing a “safe zone” for pruning, where changes to the weights do not affect the model’s output. A wider blind range offers greater flexibility for pruning algorithms, allowing for more aggressive weight reduction without impacting performance. The blind range is highlighted by a yellow color in this figure.

The role of nonlinear activation functions has been widely studied in various aspects of neural network architectures, including their impact on training convergence, weight initialization, and stability. For example, activation functions play a crucial role in gradient propagation, influencing issues such as the vanishing and exploding gradient problems, which are critical for training deep networks. However, less focus has been given to the impact of activation functions on the susceptibility of neural networks to pruning. This study explores how the choice of activation function impacts the extent to which an architecture can be pruned without causing significant degradation in accuracy.

We introduce the concept of the “Blind Range” for a typical activation function, which refers to the interval where the derivative of the activation function is zero, see Fig. 1. In other words, the blind range represents the input range over which the activation function’s output remains constant. For instance, in the case of the ReLU activation, this range spans from negative infinity to zero.
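As a minimal sketch of this definition, the following checks whether a pre-activation value lies in the blind range by testing the derivative of the activation function. The helper `in_blind_range` and its tolerance `tol` are assumptions made for illustration: the definition above requires a derivative of exactly zero, which holds for ReLU on negative inputs, whereas a saturating function such as tanh only comes numerically close to zero.

```python
import math

def relu_grad(z):
    """Derivative of ReLU (taken as 0 at z = 0)."""
    return 1.0 if z > 0 else 0.0

def tanh_grad(z):
    """Derivative of tanh."""
    return 1.0 - math.tanh(z) ** 2

def in_blind_range(grad_fn, z, tol=0.0):
    """True if the activation's derivative at z is zero (within tol)."""
    return abs(grad_fn(z)) <= tol

# ReLU: exactly flat for negative pre-activations.
relu_flags = [in_blind_range(relu_grad, z) for z in (-2.0, -0.5, 0.5, 2.0)]

# tanh: only approximately flat in saturation, hence the assumed tolerance.
tanh_flags = [in_blind_range(tanh_grad, z, tol=1e-3) for z in (-5.0, 0.0, 5.0)]

print(relu_flags, tanh_flags)
```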

We propose that the blind range of activation functions provides a safe zone for pruning, where if pruning a weight causes the activation output to fall within this range, the output of the corresponding node remains unchanged, and as a result, the overall model performance is preserved. Additionally, small deviations from this blind range can be efficiently corrected during the fine-tuning phase. However, the effect of pruning may vary across different data points. To address this variability, it is necessary to adopt a statistical approach. Specifically, we suggest empirically constructing a distribution to quantify the impact of each weight across different subsets of the dataset, enabling more informed and robust pruning decisions. This is explained in more detail in the following sections.

2.2 Relative Weight Contributions

Figure 2 illustrates how a weight contributes to the activation of its associated neuron within the neural network architecture. Specifically, we analyze the contribution of a weight $w_{i,j}$ in a layer that receives an input vector of size $I$ and adopts an activation function $f$. We define the contribution function $\varsigma(\cdot)$ of a weight $w_{i,j}$ as:

\begin{split}a_{j}&=f\Big(\sum_{n=1}^{I}x_{n}\times w_{n,j}\Big),\\ \bar{a_{j}}&=f\Big(\sum_{n=1,\,n\neq i}^{I}x_{n}\times w_{n,j}\Big),\\ \varsigma(w_{i,j})&=\left|\left(a_{j}-\bar{a_{j}}\right)/a_{j}\right|,\end{split}

(1)

where $x_{n}$ is the $n^{th}$ input of the layer, $w_{n,j}$ is the weight connecting the $n^{th}$ input to the $j^{th}$ node in a typical layer, and $\bar{a_{j}}$ is the output of node $j$ when the term involving $w_{i,j}$ is excluded from the sum. The magnitude of this contribution determines the actual importance of the corresponding weight in the final node activations and, consequently, indicates the extent to which the weight can be pruned. Given that the contributions of each weight vary with different data points, and considering the large number of data points, the distribution of these contributions over the epochs approaches a Gaussian distribution according to the Central Limit Theorem Sirignano and Spiliopoulos (2020). Utilizing the mean and standard deviation of the contributions’ distribution, we define a weight function that assigns a scalar value to each weight, representing its importance, as shown in Equation 2.

\mathbb{I}(w_{i,j})=s\times\left[\alpha\times\mathbb{E}(\varsigma(w_{i,j}))+\beta\times\frac{1}{\epsilon+\sigma(\varsigma(w_{i,j}))}\right],

(2)

where $s=2^{i}$ is a decaying factor that controls the contribution of each layer, and $\alpha$ and $\beta$ are weighting parameters that control the influence of the mean and the standard deviation, respectively. The term $\epsilon$ is a small constant that avoids division by zero. After calculating the importance value of each weight based on its contribution, pruning becomes a straightforward process of applying the iterative weight pruning given in Algorithm 1.
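Equations 1 and 2 can be sketched for a single node as follows. This is an illustrative sketch under several assumptions that the text does not specify: a zero node output is treated as a zero contribution (Equation 1 is undefined there), the population standard deviation is used, the layer factor $s$ is passed in directly as a number, and $\epsilon=10^{-6}$ is an arbitrary choice. All function names and the toy data are hypothetical.

```python
import statistics

def relu(z):
    return max(0.0, z)

def contribution(x, w_col, i, f=relu):
    """Eq. 1 for one sample: relative output change when weight i is removed."""
    z = sum(x_n * w_n for x_n, w_n in zip(x, w_col))
    a = f(z)
    a_bar = f(z - x[i] * w_col[i])  # same node without the term x_i * w_ij
    if a == 0.0:
        return 0.0  # assumption: Eq. 1 is undefined here; treated as no contribution
    return abs((a - a_bar) / a)

def importance(samples, w_col, i, s=1.0, alpha=1.0, beta=1e-7, eps=1e-6):
    """Eq. 2: s * (alpha * mean + beta / (eps + std)) over the samples."""
    values = [contribution(x, w_col, i) for x in samples]
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    return s * (alpha * mean + beta / (eps + std))

# Toy data (assumed): three weights feeding one ReLU node, two input vectors.
w_col = [0.5, 0.01, -0.2]
samples = [[1.0, 1.0, 0.5], [2.0, 0.5, 1.0]]

scores = [importance(samples, w_col, i) for i in range(len(w_col))]
print(scores)
```

In this toy case, removing the first weight pushes the node into the ReLU blind range for both samples, so it receives the highest score, while the near-zero second weight receives the lowest.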

Algorithm 1 Iterative Weight Pruning Algorithm

Input: Trained model Model consisting of $L$ layers; dataset $\mathcal{D}$; target pruning percentage $P_{t}$; pruning percentage per iteration $P_{i}$
Output: Pruned model $\text{Model}_{p}$
1: $\text{Model}_{p}\leftarrow\text{Model}$
2: while $P_{t}>0$ do
3:   if $P_{t}\leq P_{i}$ then
4:     $P_{\text{current}}\leftarrow P_{t}$
5:     $P_{t}\leftarrow 0$
6:   else
7:     $P_{\text{current}}\leftarrow P_{i}$
8:     $P_{t}\leftarrow P_{t}-P_{i}$
9:   end if
10:   for all weights $w\in\text{Model}_{p}$ do
11:     Compute $\mathbb{I}(w)$
12:   end for
13:   threshold = $F^{-1}_{\mathbb{I}}(P_{\text{current}})$ // the $P_{\text{current}}$-th percentile of the weight importance distribution
14:   for all weights $w\in\text{Model}_{p}$ do
15:     if $\mathbb{I}(w)<\text{threshold}$ then
16:       Set $w\leftarrow 0$
17:     end if
18:   end for
19:   Retrain $\text{Model}_{p}$ on dataset $\mathcal{D}$
20: end while
21: return $\text{Model}_{p}$
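The loop of Algorithm 1 can be sketched on a flat list of weights as shown below. Several details are assumptions made for this sketch rather than statements from the paper: percentages are interpreted relative to the original number of weights, already-pruned weights are excluded from later rankings, the percentile threshold is realized by removing the lowest-ranked active weights (equivalent when there are no ties), and `importance_fn` and `retrain_fn` are hypothetical callables. The usage example substitutes weight magnitude for the importance score purely to keep it self-contained.

```python
def iterative_prune(weights, importance_fn, target_pct, step_pct, retrain_fn=None):
    """Sketch of Algorithm 1 on a flat list of weights."""
    pruned = list(weights)
    mask = [True] * len(pruned)  # True = weight is still active
    remaining_pct = target_pct
    while remaining_pct > 0:
        current_pct = min(step_pct, remaining_pct)
        remaining_pct -= current_pct

        # Rank the active weights by importance, lowest first.
        scores = importance_fn(pruned)
        active = sorted((i for i in range(len(pruned)) if mask[i]),
                        key=lambda i: scores[i])

        # Assumption: percentages are relative to the original weight count.
        n_remove = round(len(pruned) * current_pct / 100)
        for i in active[:n_remove]:
            mask[i] = False
            pruned[i] = 0.0

        # Placeholder for the retraining step (hypothetical callable).
        if retrain_fn is not None:
            pruned = retrain_fn(pruned, mask)
    return pruned, mask

# Usage with a stand-in importance (weight magnitude), for illustration only.
toy_weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2, -0.03, 0.6]
pruned, mask = iterative_prune(
    toy_weights, lambda ws: [abs(w) for w in ws], target_pct=50, step_pct=25)
print(pruned)
```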

2.3 Pruning-aware Training

In this section, we introduce a novel regularization term for training that enhances the effectiveness of our pruning algorithm. As previously discussed, the probability of pruning a weight increases as its mutual information with the node’s output decreases. In other words, if a node’s output frequently falls within the blind range, the weights associated with that node are more likely to be pruned. Therefore, increasing the probability of a neuron’s output being in the blind range improves the pruning process.

To achieve this, we propose a loss function that incorporates an additional term penalizing the model based on the magnitude of the nodes’ outputs. The modified loss function is defined as:

\mathcal{L}_{n}=\mathcal{L}_{\text{o}}+\lambda_{rL1}\sum_{i}\left|a_{i}\right|

(3)

where $\mathcal{L}_{\text{n}}$ is the new loss function used for training the model, $\mathcal{L}_{\text{o}}$ is the original loss function, $\lambda_{rL1}$ is a regularization hyperparameter that controls the contribution of the regularization term to the final loss, and $a_{i}$ is the output of the $i$-th neuron. Note that the regularization component, which is the $L1$ norm of the nodes’ outputs, encourages sparse neuron outputs by pushing them toward zero. As demonstrated in the results section, incorporating this regularization term significantly enhances the performance of our proposed pruning algorithm. The following section presents the theoretical foundations of the proposed pruning technique from the perspective of information theory.
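Equation 3 amounts to adding an L1 penalty on the node outputs to the task loss, which can be sketched in a few lines. The function name, the value of the regularization weight, and the toy activations are assumptions chosen only for illustration.

```python
def pruning_aware_loss(original_loss, activations, lam):
    """Eq. 3: L_n = L_o + lambda * sum_i |a_i|."""
    return original_loss + lam * sum(abs(a) for a in activations)

# Toy activations (assumed): the same task loss with dense vs. sparse outputs.
dense = [0.8, 1.5, 0.5, 1.2]
sparse = [0.0, 1.5, 0.0, 0.5]

loss_dense = pruning_aware_loss(0.30, dense, lam=0.01)
loss_sparse = pruning_aware_loss(0.30, sparse, lam=0.01)
print(loss_dense, loss_sparse)
```

For an equal task loss, the configuration with more zero outputs incurs the smaller total loss, which is the pressure the regularizer is meant to apply.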

Speedup: To enhance the efficiency of the proposed algorithm, we replaced the weight-by-weight computation of weight contributions with an optimized matrix multiplication algorithm that computes the contributions of entire columns of weights simultaneously. This modification resulted in a significant improvement in computational speed compared to the one-by-one method (see Fig. 3).
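The identity that makes such a batched computation possible can be sketched as follows: the pre-activation of each node is computed once, and the value without a given weight is obtained by subtracting a single term instead of recomputing the whole sum. This is not the authors' matrix implementation; it is a plain-Python illustration of the idea, with hypothetical function names and toy values, and with a zero node output treated as a zero contribution by assumption.

```python
def relu(z):
    return max(0.0, z)

def rel_change(a, a_bar):
    # Assumption: a zero output is treated as "no contribution".
    return abs((a - a_bar) / a) if a != 0.0 else 0.0

def contributions_one_by_one(x, W, f=relu):
    """Recompute the full weighted sum for every removed weight."""
    n_in, n_out = len(W), len(W[0])
    out = [[0.0] * n_out for _ in range(n_in)]
    for j in range(n_out):
        a = f(sum(x[n] * W[n][j] for n in range(n_in)))
        for i in range(n_in):
            a_bar = f(sum(x[n] * W[n][j] for n in range(n_in) if n != i))
            out[i][j] = rel_change(a, a_bar)
    return out

def contributions_shared(x, W, f=relu):
    """Compute each pre-activation once, then subtract single terms."""
    n_in, n_out = len(W), len(W[0])
    z = [sum(x[n] * W[n][j] for n in range(n_in)) for j in range(n_out)]
    return [[rel_change(f(z[j]), f(z[j] - x[i] * W[i][j]))
             for j in range(n_out)] for i in range(n_in)]

# Toy layer (assumed): W[i][j] connects input i to node j.
x = [1.0, 2.0, 0.5]
W = [[0.2, -0.1], [0.3, 0.4], [-0.5, 0.1]]

slow = contributions_one_by_one(x, W)
fast = contributions_shared(x, W)
print(fast)
```

With an array library, the second function reduces to one matrix product for the pre-activations followed by a broadcast subtraction, which is where a speedup over the one-by-one loop would come from.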

3 Mutual Information

In this section, we analyze the mutual information Czyż et al. (2024) between the activation output of the $i^{th}$ node in the $l^{th}$ layer, $a^{l}_{i}$, and the weight $w^{l}_{ij}$ connecting neuron $i$ to the $j^{th}$ input. The activation output is given by:

a^{l}_{i}=\phi(z_{i}),\text{where}\;z_{i}=\sum_{j}w_{ij}x_{j},

(4)

with $\phi(\cdot)$ representing the activation function and $x_{j}$ denoting the $j^{th}$ input from the preceding layer. The mutual information $I(a_{i};w_{ij})$ quantifies the reduction in uncertainty of $a_{i}$ due to knowledge of $w_{ij}$, calculated as:

I(a_{i};w_{ij})=H(a_{i})-H(a_{i}\mid w_{ij}).

(5)

Here, $H(a_{i})$ is the entropy of the activation output $a_{i}$ , defined as $H(a_{i})=-\sum_{a_{i}}P(a_{i})\log P(a_{i})$ , and $H(a_{i}\mid w_{ij})$ is the conditional entropy given by:

H(a_{i}\mid w_{ij})=-\int_{w_{ij}}P(w_{ij})\sum_{a_{i}}P(a_{i}\mid w_{ij})\log P(a_{i}\mid w_{ij})\,dw_{ij}.

(6)

A high mutual information indicates a significant dependency between $a_{i}$ and $w_{ij}$, implying that pruning $w_{ij}$ would substantially alter $a_{i}$ and potentially degrade network performance. Conversely, a low mutual information suggests that $w_{ij}$ has little influence on $a_{i}$, making it a suitable candidate for pruning without affecting the model’s accuracy. Incorporating the concept of the Blind Range, where the activation function’s derivative $\phi^{\prime}(z_{i})=0$ and $a_{i}$ remains constant, we observe that mutual information $I(a_{i};w_{ij})$ is minimal when $a_{i}$ operates predominantly within this range. This reinforces our strategy of pruning weights that have a high likelihood of being associated with activations in the Blind Range. To compute $I(a_{i};w_{ij})$ in practice, we estimate the probability distributions $P(a_{i})$ and $P(a_{i}\mid w_{ij})$ by collecting samples of activation outputs and corresponding weights across the dataset. For continuous variables, we compute differential entropy:

h(a_{i})=-\int P(a_{i})\log P(a_{i})\,da_{i},

(7)

h(a_{i}\mid w_{ij})=-\int P(w_{ij})\int P(a_{i}\mid w_{ij})\log P(a_{i}\mid w_{ij})\,da_{i}\,dw_{ij},

(8)

By calculating these entropy values and the resulting mutual information, we can rank the weights based on their $I(a_{i};w_{ij})$ values. We then define a threshold $\tau$ below which weights are considered for pruning. This method allows us to systematically identify and remove redundant weights, enhancing model efficiency while maintaining performance. This analysis of mutual information provides a mathematically grounded framework for neural network pruning based on the concept of the activation’s Blind Range, contributing to the development of more efficient and compact models.
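The link between the Blind Range and low mutual information can be illustrated with a simple histogram (plug-in) entropy estimate. For a discretized activation, $I(a_{i};w_{ij})\leq H(a_{i})$, so a node whose outputs are constant has zero entropy and therefore zero mutual information with any of its weights. The estimator, the bin width, and the sample values below are assumptions for illustration; the text does not specify which estimator is used.

```python
import math
from collections import Counter

def plugin_entropy(samples, bin_width=0.1):
    """Histogram (plug-in) estimate of H(a) in nats after binning the samples."""
    bins = Counter(math.floor(s / bin_width) for s in samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in bins.values())

def relu(z):
    return max(0.0, z)

# Pre-activations that all fall in the ReLU blind range give a constant output.
blind = [relu(z) for z in (-3.0, -1.2, -0.4, -2.5)]
# Pre-activations outside the blind range give varying outputs.
active = [relu(z) for z in (0.35, 1.2, 2.45, 0.05)]

h_blind = plugin_entropy(blind)    # one occupied bin
h_active = plugin_entropy(active)  # four occupied bins
print(h_blind, h_active)
```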

4 Experimental Results

In this section, we validate our method on the MNIST dataset LeCun (1998) for the image classification task. We fixed the network architecture to a simple fully connected network consisting of 3 layers. The first layer has 784 $\times$ 392 weights; it accepts flattened images of size 28 $\times$ 28 and outputs 392 features, followed by a ReLU activation function. The second layer processes the latent representation and produces 196 output features, again followed by a ReLU activation function. The final layer outputs a 10-dimensional vector representing the logits for the MNIST classes. We trained the network for 10 epochs using the cross-entropy loss and the Adam optimizer with a learning rate of 0.001 and betas = (0.9, 0.999).
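To give a sense of scale, the number of weights in this architecture, and how many remain at a given pruning ratio, can be worked out directly from the layer sizes. The sketch assumes that only the weight matrices are counted; whether bias terms are present or pruned is not stated, so they are ignored here.

```python
# (inputs, outputs) of the three fully connected layers described above
layer_shapes = [(784, 392), (392, 196), (196, 10)]

# Assumption: only weight matrices are counted; biases are ignored here.
weights_per_layer = [n_in * n_out for n_in, n_out in layer_shapes]
total_weights = sum(weights_per_layer)

def remaining_after(pruning_pct, total=total_weights):
    """Number of nonzero weights left after pruning pruning_pct percent."""
    return total - (total * pruning_pct) // 100

print(weights_per_layer, total_weights)
print(remaining_after(50), remaining_after(75))
```

Under these assumptions the network has 386,120 weights, of which 193,060 remain at a 50% pruning ratio and 96,530 at 75%.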

To perform pruning, we adopted an iterative pruning strategy. After training the original model, we iteratively prune the network as follows: First, we compute a score for each weight based on the pruning metric, either using our method’s contribution metric or the criteria of other methods for comparison. Next, we determine a threshold value from these scores, based on a fixed pruning ratio per iteration, and create a mask that zeros out weights with scores below the threshold. This process is repeated iteratively, pruning weights at each step until the final target pruning ratio is reached. After each pruning iteration, we fine-tune the model by applying the current pruning mask and training for one epoch using a learning rate of 0.0001. The mask is updated in each iteration to include the newly pruned weights, so that it represents all pruned weights by the end of the process.
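The per-iteration ratio determines how many prune-and-fine-tune rounds are needed to reach the target. A small sketch of the schedule implied by Algorithm 1 is given below; the function name is hypothetical.

```python
def pruning_schedule(target_pct, step_pct):
    """Per-iteration pruning percentages following the loop of Algorithm 1."""
    steps = []
    remaining = target_pct
    while remaining > 0:
        current = min(step_pct, remaining)  # last step may be smaller
        steps.append(current)
        remaining -= current
    return steps

for step in (25, 15, 10, 5):
    schedule = pruning_schedule(50, step)
    print(step, schedule, len(schedule))
```

For a 50% target, per-iteration ratios of 25%, 15%, 10%, and 5% correspond to 2, 4, 5, and 10 iterations, respectively. Since each iteration is followed by one fine-tuning epoch, smaller per-iteration ratios also imply more fine-tuning epochs overall.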

Note that since our method relies on input data to compute the contribution of each neuron, we investigate the impact of using different subsets of the training dataset for this purpose in the pruning process. The final mean and standard deviation of the computed contributions are calculated across the training examples used in our pruning procedure to derive the final contribution score, as described in Equation 2. We apply the same approach for examining different subsets of training data for Wanda Sun et al. (2024), as it also depends on input data to compute the pruning metric for weights.

Table 1 shows the results of using different values of $\beta$ in Equation 2 and the impact of the layer decay factor, $s$ . The best results were achieved with $\beta=1e-7$ and incorporating the layer decay factor, $s$ , which intuitively makes sense, as the importance of neurons should be scaled based on the layer depth in the network.

We consider the following pruning metrics for comparisons: (1) random pruning, where weights are randomly selected for pruning based on the target ratio; (2) magnitude-based pruning Han et al. (2015), where the weights with the smallest magnitudes are pruned; and (3) Wanda Sun et al. (2024), which extends magnitude-based pruning by additionally considering the input feature norm statistics. The results of these comparisons are shown in Table 2, where we report performance with different pruning ratios per iteration (and thus different numbers of pruning iterations), as well as using varying portions of the training dataset for data-driven methods (i.e., Wanda Sun et al. (2024) and ours). As shown, our method achieves the best results, even when using only 2% of the training data to compute the contribution score. With 2% or more of the training data, it matches or outperforms the other methods in every setting; with only 0.5% of the data, it leads only at the 25% per-iteration ratio.

Table 1: Ablation studies on the impact of $\beta$ and the layer decay factor, $s$, on our results. The experiments were conducted using the ReLU activation function with a simple fully connected network, which achieved 97.02% accuracy without pruning. In these experiments, the target pruning ratio was set to 50% (i.e., eliminating 50% of the network weights). All reported results are presented as percentages. The best results are highlighted in yellow.

	Pruning ratio per iteration
Configuration	25%	15%	10%	5%
$\alpha$ = 1, $\beta$ = 1e-7, w/o $s$	97.61	97.45	97.91	97.93
$\alpha$ = 1, $\beta$ = 0, w/ $s$	97.62	97.49	97.90	98.0
$\alpha$ = 1, $\beta$ = 1e-5, w/ $s$	96.14	96.63	97.28	97.59
$\alpha$ = 1, $\beta$ = 1e-7, w/ $s$	97.84	97.52	97.92	98.03

Table 2: Comparisons with other methods. The original model achieved an accuracy of 97.02% without pruning, and the target pruning ratio was set to 50% (i.e., eliminating 50% of the network weights). We compare our method with random pruning, magnitude pruning Han et al. (2015), and Wanda Sun et al. (2024). For the data-driven methods (Wanda and ours), we evaluated performance with varying percentages of training data, noted alongside each method’s name. All reported results are presented as percentages. The best results are highlighted in yellow, while the second-best results are highlighted in green.

	Pruning ratio per iteration
Method	25%	15%	10%	5%
Random	93.19	91.24	79.22	9.80
Magnitude Han et al. (2015)	97.29	97.77	97.99	98.11
Wanda (0.5%) Sun et al. (2024)	97.47	97.72	98.16	98.3
Wanda (2%) Sun et al. (2024)	97.44	97.82	98.18	98.32
Wanda (10%) Sun et al. (2024)	97.4	97.87	98.21	98.33
Wanda (20%) Sun et al. (2024)	97.38	97.86	98.2	98.28
Wanda (50%) Sun et al. (2024)	97.41	97.91	97.92	98.26
Wanda (100%) Sun et al. (2024)	97.42	97.96	97.98	98.31
Ours (0.5%)	97.84	97.52	97.92	98.03
Ours (2%)	98.04	98.11	98.37	98.33
Ours (10%)	98.26	98.36	98.41	98.34
Ours (20%)	98.24	98.35	98.47	98.41
Ours (50%)	98.36	98.34	98.38	98.40
Ours (100%)	98.29	98.36	98.30	98.44

Table 3 shows the results of experiments using activation functions other than ReLU, which was used in previous experiments. To evaluate the generalizability of our approach, we report results using Leaky ReLU, Sigmoid, and Tanh activation functions. As shown, when 20% or more of the training data is used, our method outperforms the other pruning techniques across all tested activation functions.

Table 3: Results using different activation functions with accuracy without pruning indicated alongside each activation function. In these experiments, the final pruning ratio was set to 50% (i.e., eliminating 50% of the network weights). We compare our method with random pruning, magnitude pruning Han et al. (2015), and Wanda Sun et al. (2024). For data-driven methods (Wanda and ours), we evaluated the performance with varying percentages of training data, indicated alongside each method’s name. All reported results are presented as percentages. The best results are highlighted in yellow, while the second-best results are highlighted in green.

	Pruning ratio per iteration
	Leaky ReLU (97.35%)				Sigmoid (97.64%)				Tanh (96.12%)
Method	25%	15%	10%	5%	25%	15%	10%	5%	25%	15%	10%	5%
Random	93.19	92.69	88.09	9.80	93.94	83.55	84.70	9.80	93.36	92.21	39.73	9.80
Magnitude Han et al. (2015)	97.15	97.58	97.89	97.87	89.45	92.28	94.26	96.09	92.74	94.14	95.66	97.20
Wanda (0.5%) Sun et al. (2024)	96.77	97.21	97.78	97.88	91.19	94.66	96.22	96.91	93.11	94.9	96.47	97.35
Wanda (20%) Sun et al. (2024)	96.68	97.15	97.72	97.86	91.37	94.44	95.97	96.89	93.3	94.68	96.39	97.20
Wanda (100%) Sun et al. (2024)	96.63	97.20	97.79	97.89	91.40	94.42	96.18	96.83	93.21	94.67	96.33	97.26
Ours (0.5%)	97.03	97.5	97.81	98.01	94.64	95.44	96.09	96.62	96.17	96.36	96.86	97.1
Ours (20%)	97.97	98.18	98.12	98.22	95.79	96.47	97.14	97.56	97.09	97.35	97.51	97.76
Ours (100%)	98.01	98.12	98.21	98.18	96.08	96.66	97.15	97.70	97.08	97.50	97.54	97.82

So far, we have fixed the final target pruning ratio at 50% of the original weights. In Table 4, we show the results for various final pruning ratios. In this experiment, the pruning ratio per iteration was set to 5%. The results indicate that even with an aggressive pruning ratio of 75%, our method achieves an accuracy of 97.63%, which is higher than the original model’s accuracy of 97.02%. In contrast, other methods experience larger reductions in accuracy, with magnitude-based pruning Han et al. (2015) yielding 92.52% and Wanda Sun et al. (2024) achieving 90.32%.

Table 4: Results using different target pruning ratios (10%, 50%, 75%). Here, we used a fixed pruning ratio per iteration (5%) and compare our method with magnitude pruning Han et al. (2015) and Wanda Sun et al. (2024). For data-driven methods (Wanda and ours), we evaluated the performance by using the full training data for pruning. All reported results are presented as percentages. The best results are highlighted in yellow, while the second-best results are highlighted in green.

	Target pruning ratio
Method	10%	50%	75%
Magnitude Han et al. (2015)	98.58	98.11	92.52
Wanda Sun et al. (2024)	98.60	98.31	90.32
Ours	98.66	98.44	97.63

Table 5: Results with $\lambda_{r_{L1}}$ using different activation functions. For each activation function, the accuracy of the original model (without pruning) with and without $\lambda_{r_{L1}}$ is as follows: ReLU (w/o $\lambda_{r_{L1}}$: 97.02%, w/ $\lambda_{r_{L1}}$: 97.4%), Tanh (w/o $\lambda_{r_{L1}}$: 96.12%, w/ $\lambda_{r_{L1}}$: 95.78%), and Sigmoid (w/o $\lambda_{r_{L1}}$: 97.64%, w/ $\lambda_{r_{L1}}$: ). The target pruning ratios were set to 50% and 75%, with a 25% pruning ratio per iteration. We compare our method with random pruning, magnitude pruning Han et al. (2015), and Wanda Sun et al. (2024). For data-driven methods (Wanda and ours), we evaluated the performance with varying percentages of training data, indicated alongside each method’s name. All reported results are presented as percentages. The best results are highlighted in yellow, while the second-best results are highlighted in green.

	Target pruning ratio (50%)			Target pruning ratio (75%)
Method	ReLU	Tanh	Sigmoid	ReLU	Tanh	Sigmoid
Random	93.34	92.92	81.94	84.71	83.97	68.75
Magnitude Han et al. (2015)	96.43	94.87	93.54	68.1	74.05	60.23
Wanda (0.5%) Sun et al. (2024)	97.16	95.34	96.02	90.83	91.07	85.46
Wanda (100%) Sun et al. (2024)	97.21	95.48	96.07	90.74	91.07	84.84
Ours (0.5%)	97.80	97.10	97.39	96.52	95.02	92.52
Ours (100%)	97.87	96.98	97.51	98.15	94.79	92.62

Table 5 demonstrates the results of applying the proposed loss function in Equation 3 with the indicated regularization term. As discussed in Section 2, the hyperparameter $\lambda_{r_{L1}}$ controls the contribution of the regularization to the overall loss function. By utilizing this regularization term, we achieved a 75% reduction in model size without significant loss in accuracy, as evidenced by the data presented in Table 5.

5 Conclusion

In this paper, we have introduced an interpretable, simple, yet effective pruning method. By defining the concept of the activation blind range, we investigated the underexplored aspect of how activation functions affect an architecture’s susceptibility to pruning. We presented a statistical framework for the proposed pruning method, grounded in the Central Limit Theorem and mutual information. Our findings indicate that, for unstructured pruning, considering the mutual information between each weight and its associated node leads to a simple and powerful pruning strategy. Moreover, considerable improvements were obtained with what we call “Pruning-aware Training”, which incorporates an extra loss term that encourages the model to push the nodes’ outputs toward the blind range of the activations. The experimental results confirm the effectiveness of the proposed method across various experimental settings.

References

Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In NeurIPS, 2020.
Touvron et al. [2023] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
Gemini Team Google [2023] Gemini Team Google. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
Lewis [2019] M Lewis. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.
Raffel et al. [2020] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020.
Abdelhamed et al. [2024] Abdelrahman Abdelhamed, Mahmoud Afifi, and Alec Go. What do you see? Enhancing zero-shot image classification with multimodal large language models. arXiv preprint arXiv:2405.15668, 2024.
Cheng et al. [2024] Hongrong Cheng, Miao Zhang, and Javen Qinfeng Shi. A survey on deep neural network pruning: Taxonomy, comparison, analysis, and recommendations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
Wang et al. [2020a] Yulong Wang, Xiaolu Zhang, Lingxi Xie, Jun Zhou, Hang Su, Bo Zhang, and Xiaolin Hu. Pruning from scratch. In AAAI, 2020a.
Huang and Wang [2018] Zehao Huang and Naiyan Wang. Data-driven sparse structure selection for deep neural networks. In ECCV, 2018.
Liu et al. [2017] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In ICCV, 2017.
Theus et al. [2024] Alexander Theus, Olin Geimer, Friedrich Wicke, Thomas Hofmann, Sotiris Anagnostidis, and Sidak Pal Singh. Towards meta-pruning via optimal transport. arXiv preprint arXiv:2402.07839, 2024.
Ganjdanesh et al. [2024] Alireza Ganjdanesh, Shangqian Gao, and Heng Huang. Jointly training and pruning CNNs via learnable agent guidance and alignment. In CVPR, pages 16058–16069, 2024.
Shi et al. [2024] Xinyu Shi, Jianhao Ding, Zecheng Hao, and Zhaofei Yu. Towards energy efficient spiking neural networks: An unstructured pruning framework. In ICLR, 2024.
Gadhikar and Burkholz [2024] Advait Gadhikar and Rebekka Burkholz. Masks, signs, and learning rate rewinding. arXiv preprint arXiv:2402.19262, 2024.
Wu et al. [2024] Xidong Wu, Shangqian Gao, Zeyu Zhang, Zhenzhen Li, Runxue Bao, Yanfu Zhang, Xiaoqian Wang, and Heng Huang. Auto-train-once: Controller network guided automatic network pruning from scratch. In CVPR, 2024.
Guo et al. [2023] Song Guo, Lei Zhang, Xiawu Zheng, Yan Wang, Yuchao Li, Fei Chao, Chenglin Wu, Shengchuan Zhang, and Rongrong Ji. Automatic network pruning via Hilbert-Schmidt independence criterion lasso under information bottleneck principle. In ICCV, 2023.
Fang et al. [2023] Gongfan Fang, Xinyin Ma, Mingli Song, Michael Bi Mi, and Xinchao Wang. DepGraph: Towards any structural pruning. In CVPR, 2023.
He et al. [2017] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In ICCV, 2017.
Tanaka et al. [2020] Hidenori Tanaka, Daniel Kunin, Daniel L Yamins, and Surya Ganguli. Pruning neural networks without any data by iteratively conserving synaptic flow. In NeurIPS, 2020.
Mason-Williams and Dahlqvist [2024] Gabryel Mason-Williams and Fredrik Dahlqvist. What makes a good prune? Maximal unstructured pruning for maximal cosine similarity. In ICLR, 2024.
Choi et al. [2023] Moonseok Choi, Hyungi Lee, Giung Nam, and Juho Lee. Sparse weight averaging with multiple particles for iterative magnitude pruning. arXiv preprint arXiv:2305.14852, 2023.
Lee et al. [2018] Namhoon Lee, Thalaiyasingam Ajanthan, and Philip HS Torr. SNIP: Single-shot network pruning based on connection sensitivity. arXiv preprint arXiv:1810.02340, 2018.
Su et al. [2020] Jingtong Su, Yihang Chen, Tianle Cai, Tianhao Wu, Ruiqi Gao, Liwei Wang, and Jason D Lee. Sanity-checking pruning methods: Random tickets can win the jackpot. In NeurIPS, 2020.
Wang et al. [2020b] Chaoqi Wang, Guodong Zhang, and Roger Grosse. Picking winning tickets before training by preserving gradient flow. arXiv preprint arXiv:2002.07376, 2020b.
Bai et al. [2022] Yue Bai, Huan Wang, Zhiqiang Tao, Kunpeng Li, and Yun Fu. Dual lottery ticket hypothesis. arXiv preprint arXiv:2203.04248, 2022.
Mocanu et al. [2018] Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H Nguyen, Madeleine Gibescu, and Antonio Liotta. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature communications, 9(1):2383, 2018.
Han et al. [2015] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In NeurIPS, 2015.
Sun et al. [2024] Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. A simple and effective pruning approach for large language models. In ICLR, 2024.
Dettmers et al. [2022] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. GPT3.int8(): 8-bit matrix multiplication for transformers at scale. In NeurIPS, 2022.
Sirignano and Spiliopoulos [2020] Justin Sirignano and Konstantinos Spiliopoulos. Mean field analysis of neural networks: A central limit theorem. Stochastic Processes and their Applications, 130(3):1820–1852, 2020.
Czyż et al. [2024] Paweł Czyż, Frederic Grabowski, Julia Vogt, Niko Beerenwinkel, and Alexander Marx. Beyond normal: On the evaluation of mutual information estimators. Advances in Neural Information Processing Systems, 36, 2024.
LeCun [1998] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.