Research on Two-Stage Knowledge Distillation and Hybrid Compression of Convolutional Neural Networks to Reduce their Computational Complexity

https://doi.org/10.15407/jai2026.01.059

Kozynets A.¹, Demkiv L.¹

¹ Ivan Franko Lviv National University

andrian.kozynets@gmail.com; lidia.demkiv@gmail.com

https://orcid.org/0009-0004-7994-3534 https://orcid.org/0009-0002-0185-6364

UDC: 004.032.26:004.89:004.93
Publication Language: English
Stuc. intelekt. 2026; 31(1):59-69

Abstract: The article addresses effective knowledge transfer between deep convolutional neural networks of different capacities, with a view to deploying them on hardware platforms with limited computing resources. The main problem with standard knowledge distillation protocols is the "gradient shock" that occurs when the student model is initialized on specific datasets; it destroys the feature space and lowers final accuracy. To overcome this limitation, a Two-Stage Distillation algorithm was developed and implemented. The proposed approach divides training into a classifier stabilization phase and a deep distillation phase. Experiments were conducted on the ResNet and VGG architecture families. The results confirm that the proposed algorithm increases the accuracy of compact models by 1.5–2.6% compared to standard training. The work also investigates and experimentally confirms the phenomenon of "distillation recovery": the ability of the algorithm to restore model accuracy after aggressive structural pruning. The use of the teacher's "soft targets" in a narrowed search space is shown to allow the sparse ResNet18 model to reach an accuracy of 77.8%, which exceeds that of the baseline full-size student model.
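The abstract does not spell out the loss or the schedule, so the following is only a minimal sketch of how a two-phase distillation objective could look, not the authors' implementation. It assumes the standard soft-target loss of Hinton et al. (KL divergence between temperature-softened teacher and student distributions, scaled by T²) and assumes that the classifier stabilization phase trains only the classifier head on hard labels. The names `two_stage_loss` and `trainable_groups`, and the parameters `warmup_epochs`, `alpha`, and `temperature`, are hypothetical and chosen for illustration.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Log of the temperature-scaled softmax, computed stably via log-sum-exp."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - lse for z in scaled]


def cross_entropy(student_logits, label):
    """Hard-label cross-entropy for a single sample."""
    return -log_softmax(student_logits)[label]


def kd_term(student_logits, teacher_logits, temperature):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    log_p = log_softmax(teacher_logits, temperature)  # teacher soft targets
    log_q = log_softmax(student_logits, temperature)  # student predictions
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return temperature ** 2 * kl


def trainable_groups(epoch, warmup_epochs=5):
    """ASSUMPTION: phase 1 updates only the classifier head, phase 2 updates everything."""
    if epoch < warmup_epochs:
        return ("classifier",)
    return ("backbone", "classifier")


def two_stage_loss(student_logits, teacher_logits, label, epoch,
                   warmup_epochs=5, alpha=0.7, temperature=4.0):
    """ASSUMPTION: hard labels only during phase 1, blended KD loss in phase 2."""
    ce = cross_entropy(student_logits, label)
    if epoch < warmup_epochs:
        return ce  # stabilization phase: teacher signal not used yet
    kd = kd_term(student_logits, teacher_logits, temperature)
    return alpha * kd + (1.0 - alpha) * ce
```

In this sketch the teacher's signal only enters after the warm-up, so the freshly initialized classifier does not push large gradients into the feature extractor at the start of training. Whether this matches the paper's mechanism for avoiding "gradient shock" has to be checked against the full text.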

Keywords: knowledge distillation, neural networks, gradient shock, model compression, ResNet, VGG

Artificial intelligence

Scientific journal
