Research on Two-Stage Knowledge Distillation and Hybrid Compression of Convolutional Neural Networks to Reduce their Computational Complexity
UDC: 004.032.26:004.89:004.93
Publication Language: English
Stuc. intelekt. 2026; 31(1):59-69
Abstract: The article addresses the problem of effective knowledge transfer between deep convolutional neural networks of different capacities for subsequent deployment on hardware platforms with limited computing resources. The main drawback of standard knowledge distillation protocols is the "gradient shock" that occurs when the student model is initialized on specific datasets, which destroys the feature space and reduces final accuracy. To overcome this limitation, a Two-Stage Distillation algorithm was developed and implemented. The proposed approach divides training into a classifier stabilization phase and a deep distillation phase. Experiments were conducted on the ResNet and VGG architecture families. The results confirm that the proposed algorithm increases the accuracy of compact models by 1.5–2.6% compared to standard training. In addition, the work investigates and experimentally confirms the phenomenon of "distillation recovery", the ability of the algorithm to restore model accuracy after aggressive structural pruning. It is shown that using the teacher's "soft targets" in a narrowed search space allows the sparse ResNet18 model to reach an accuracy of 77.8%, which exceeds that of the baseline full-size student model.
Keywords: knowledge distillation, neural networks, gradient shock, model compression, ResNet, VGG
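The two-phase idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the loss below is the commonly used temperature-softened distillation loss, and the schedule in `stage_settings` (a frozen backbone with hard labels only during stabilization, then full training with soft targets) is an assumption about what the two phases might look like. The names `distillation_loss` and `stage_settings`, and the values of `temperature`, `alpha`, and `stabilization_epochs`, are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives softer targets."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Hard-label cross-entropy for a single example."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.7):
    """Weighted sum of a soft-target term and a hard-label term.

    temperature and alpha are illustrative values, not taken from the article.
    """
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    # The T^2 factor keeps the soft-term gradients on a comparable scale.
    soft_term = temperature ** 2 * kl_divergence(soft_teacher, soft_student)
    hard_term = cross_entropy(student_logits, label)
    return alpha * soft_term + (1 - alpha) * hard_term


def stage_settings(epoch, stabilization_epochs=5):
    """Hypothetical two-stage schedule (an assumption, not the paper's recipe).

    Stage 1: train only the classifier head on hard labels.
    Stage 2: unfreeze the backbone and enable the soft-target term.
    """
    if epoch < stabilization_epochs:
        return {"stage": "classifier_stabilization",
                "train_backbone": False, "alpha": 0.0}
    return {"stage": "deep_distillation",
            "train_backbone": True, "alpha": 0.7}
```

In a training loop under these assumptions, `stage_settings(epoch)` would decide which parameters receive gradients and which `alpha` is passed to `distillation_loss`, so the student never sees the full distillation signal before its classifier has settled.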