Structured Pruning Method for Large Language Models with Adaptive Compression Ratios

https://doi.org/10.15407/jai2025.04.088

Shvets V.¹, Shapoval N.¹

¹ National Technical University of Ukraine «Igor Sikorsky Kyiv Polytechnic Institute»

shvets.vitaliy@lll.kpi.ua; shovgun@gmail.com

https://orcid.org/0009-0009-2998-158X

UDC: 004.8
Publication Language: English
Stuc. intelekt. 2025; 30(4):88-98

Abstract: The article addresses the challenge of deploying large language models (LLMs) on resource-constrained devices. We trace the evolution of neural network pruning from classical approaches (Optimal Brain Damage, Optimal Brain Surgeon) to modern one-shot techniques for LLMs (SparseGPT, Wanda, SliceGPT, 2SSP). The analysis shows that although unstructured pruning achieves high compression ratios with little quality loss, it does not reduce model size or accelerate inference on standard hardware, because the resulting sparsity patterns are irregular. Structured pruning, in contrast, is hardware-efficient because it removes entire structural blocks. We propose Adaptive 2SSP, a modification of the 2SSP method that combines adaptive compression ratio selection based on block redundancy with two-stage structured pruning: removal of attention blocks (depth pruning) followed by removal of neurons in FFN layers (width pruning). Experiments on Llama-3.2-3B, Llama-2-7B, and Qwen2.5-3B show that the method outperforms the compared alternatives (GLU Aware Pruning, Dynamic Slicing, original 2SSP). With 40% of the Llama-3.2-3B parameters removed, the pruned model reaches a perplexity of 26.35 and an average benchmark accuracy of 39.57%, the best results among the compared methods. In the hardware efficiency evaluation, pruning Llama-3.2-3B reduced VRAM consumption by 35.12% and accelerated token generation by 34.78%. For Llama-2-7B, 20% pruning yielded a 3.7-fold speedup by overcoming VRAM limitations. These results indicate that the proposed method offers a favorable balance between compression ratio, execution speed, and preservation of model quality, making it an effective tool for adapting modern LLMs to devices with limited computational resources.
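
The two ideas named in the abstract, adaptive per-block compression ratios and width pruning of FFN neurons, can be illustrated with a minimal Python sketch. The abstract does not state how block redundancy is measured or how ratios are allocated, so everything below is an assumption made for illustration and not the authors' implementation: redundancy is taken as the cosine similarity between a block's input and output, ratios are allocated proportionally to redundancy, and neuron importance is a simple weight-magnitude score. All function names (`block_redundancy`, `allocate_ratios`, `prune_ffn_neurons`) are hypothetical. The depth-pruning stage (removal of attention blocks) is not sketched.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a plain list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (l2_norm(u) * l2_norm(v))


def block_redundancy(block_inputs, block_outputs):
    """ASSUMPTION: a block counts as redundant when its output stays close
    to its input; negative similarities are clamped to 0."""
    return [max(0.0, cosine_similarity(x, y))
            for x, y in zip(block_inputs, block_outputs)]


def allocate_ratios(redundancy, target_ratio, max_ratio=0.9):
    """ASSUMPTION: per-block ratios are proportional to redundancy and
    average to target_ratio, unless a value is clipped at max_ratio."""
    n = len(redundancy)
    total = sum(redundancy)
    if total == 0.0:
        # No redundancy signal: fall back to uniform pruning.
        return [min(target_ratio, max_ratio)] * n
    return [min(max_ratio, target_ratio * n * r / total) for r in redundancy]


def prune_ffn_neurons(up_rows, down_cols, ratio):
    """Structured width pruning of one FFN layer.

    up_rows[i] and down_cols[i] hold the incoming and outgoing weights of
    hidden neuron i. ASSUMPTION: importance is the product of their norms.
    Whole neurons are dropped, so both matrices shrink consistently.
    """
    n = len(up_rows)
    n_keep = max(1, n - int(n * ratio))  # number removed is rounded down
    scores = [l2_norm(u) * l2_norm(d) for u, d in zip(up_rows, down_cols)]
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # preserve the original neuron order
    return [up_rows[i] for i in kept], [down_cols[i] for i in kept], kept


# Example: three blocks, a global 40% width-pruning target.
ratios = allocate_ratios([0.9, 0.6, 0.3], target_ratio=0.4)
up, down, kept = prune_ffn_neurons(
    [[1.0, 0.0], [0.1, 0.0], [2.0, 0.0], [0.2, 0.0]],
    [[1.0, 0.0], [0.1, 0.0], [2.0, 0.0], [0.2, 0.0]],
    ratio=0.5,
)
```

Because entire neurons are removed, the pruned weight matrices stay dense and simply become smaller, which is why structured pruning of this kind can translate into real memory and latency savings on standard hardware.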

Keywords: language models, pruning, model compression, structured pruning, adaptive pruning, LLM optimization, transformers, neural network compression.

References:

Wei, J., et al. (2022). Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.
Patterson, D., et al. (2021). Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350.
Gromov, A., Tirumala, K., Shapourian, H., Glorioso, P., & Roberts, D.A. (2024). The unreasonable ineffectiveness of the deeper layers. arXiv preprint arXiv:2403.17887.
LeCun, Y., Denker, J.S., & Solla, S.A. (1989). Optimal brain damage. Advances in neural information processing systems, 2, 598–605.
Hassibi, B., Stork, D.G., & Wolff, G.J. (1993). Optimal Brain Surgeon and general network pruning. IEEE International Conference on Neural Networks, 293–299.
Han, S., Pool, J., Tran, J., & Dally, W.J. (2015). Learning both Weights and Connections for Efficient Neural Networks. Advances in neural information processing systems, 28.
Frankle, J., & Carbin, M. (2018). The lottery ticket hypothesis: finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635.
Frantar, E., & Alistarh, D. (2023). SparseGPT: Massive language models can be accurately pruned in One-Shot. arXiv preprint arXiv:2301.00774.
Siddiqui, M. H. (2024, March 22). Difference Between Structured and Unstructured Pruning in Neural. Medium. https://medium.com/@mhammadsiddiqui/difference-between-structured-and-unstructured-pruning-in-neural-cca5603581fb
Vaswani, A., et al. (2017). Attention is all you need. Advances in neural information processing systems, 30.
Sun, M., Liu, Z., Bair, A., & Kolter, J.Z. (2023). A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695.
Ashkboos, S., et al. (2024). SliceGPT: Compress large language models by deleting rows and columns. arXiv preprint arXiv:2401.15024.
Dumitru, R.-G., Clotan, P.-I., Yadav, V., Peteleaza, D., & Surdeanu, M. (2024). Change Is the Only Constant: Dynamic LLM Slicing based on Layer Redundancy. arXiv preprint arXiv:2411.03513.
Sandri, F., Cunegatti, E., & Iacca, G. (2025). 2SSP: A Two-Stage Framework for Structured Pruning of LLMs. arXiv preprint arXiv:2501.17771.
He, S., Sun, G., Shen, Z., & Li, A. (2024). What Matters in Transformers? Not All Attention is Needed. arXiv preprint arXiv:2406.15786.
Making LLMs Smaller Without Breaking Them: A GLU-Aware Pruning Approach. (2024). Huggingface.co. https://huggingface.co/blog/oopere/making-llms-smaller-without-breaking-them
Jelinek, F., Mercer, R.L., Bahl, L.R., & Baker, J.K. (1977). Perplexity - a measure of the difficulty of speech recognition tasks. The Journal of the Acoustical Society of America, 62(S1), S63.
Hendrycks, D., et al. (2020). Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., & Choi, Y. (2019). HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830.
Clark, P., et al. (2018). Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.
Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., & Toutanova, K. (2019). BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. arXiv preprint arXiv:1905.10044.
Amini, A., Gabriel, S., Lin, P., Koncel-Kedziorski, R., Choi, Y., & Hajishirzi, H. (2019). MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms. arXiv preprint arXiv:1905.13319.

Artificial intelligence

Scientific journal
