Low-Rank Adaptation Based Flexibility-Aware Distillation Method

LI Jia-ming; BAO Zhi-qiang; HUANG Zhen-hua; SUN Sheng-li; CHEN Yun-wen

doi:10.12263/DZXB.20240894

Academic paper | Updated: 2025-07-24

- Title (Chinese): 基于低秩自适应的伸缩感知蒸馏方法
- Title (English): Low-Rank Adaptation Based Flexibility-Aware Distillation Method
- Journal: Acta Electronica Sinica, 2025, Vol. 53, No. 4, pp. 1337-1346
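- Author affiliations:
  1. School of Computer Science, South China Normal University, Guangzhou, Guangdong 510631
  2. Research Institute, Hundsun Technologies Inc., Hangzhou, Zhejiang 310053
  3. School of Software and Microelectronics, Peking University, Beijing 102600
  4. DataGrand Inc., Shanghai 201203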
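- About the authors:
  LI Jia-ming, male, born in October 2002 in Meizhou, Guangdong. Master's student. His main research interests are knowledge distillation and computer vision.
  BAO Zhi-qiang, male, born in January 1995 in Jiujiang, Jiangxi. Ph.D. His main research interests are knowledge distillation and model compression. E-mail: zhiqiangbao1995@163.com
  HUANG Zhen-hua, male, born in September 1980 in Putian, Fujian. Professor and doctoral supervisor. His main research interests are machine learning, data mining, and recommender systems. E-mail: jukiehuang@163.com
  SUN Sheng-li, male, born in December 1978 in Changde, Hunan. Ph.D., professor. His main research interests are machine learning, data mining, and databases. E-mail: slsun@ss.pku.edu.cn
  CHEN Yun-wen, male, born in July 1981 in Nanjing, Jiangsu. Ph.D., senior engineer. His main research interests are machine learning, data mining, and natural language processing. E-mail: chenyunwen@datagrand.com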
- Funding: National Natural Science Foundation of China (62172166)
- DOI: 10.12263/DZXB.20240894
  CLC number: TP391
- Received: 2024-10-07; Revised: 2025-01-24; Published in print: 2025-04-25

LI Jia-ming, BAO Zhi-qiang, HUANG Zhen-hua, et al. Low-Rank Adaptation Based Flexibility-Aware Distillation Method[J]. Acta Electronica Sinica, 2025, 53(04): 1337-1346. DOI: 10.12263/DZXB.20240894

Abstract

Knowledge distillation is a learning paradigm that transfers knowledge from a complex, deep teacher model to a lightweight student model to improve the student's performance. Two problems motivate this work: the distribution knowledge provided by the teacher model lacks diversity, and constructing a search space for the student model's architecture consumes substantial resources. To address them, we propose a low-rank adaptation based flexibility-aware distillation (LAFA) method. LAFA constructs low-rank transformation matrices that map teacher knowledge to the student model's knowledge and to the class labels, respectively, thereby increasing the diversity of the distribution knowledge. LAFA also introduces a decision support module that dynamically scales the student model's capacity, balancing distillation performance against model capacity. We further propose warm-up and relaxation strategies to optimize the decision variables. The warm-up strategy constrains the student model to increase its capacity slowly, which alleviates the convergence difficulties caused by capacity scaling. The relaxation strategy removes the constraint in the later stage of distillation, yielding significant performance gains at a small resource cost. On the CIFAR-100 dataset, integrating LAFA into 13 distillation methods improves performance by 0.28 percentage points on average. Ablation and analysis experiments further validate the effectiveness of LAFA.
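
The abstract gives no formulas for the low-rank transformation, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes the teacher knowledge is a matrix of logits, that the transformation is a residual map built from two small factors (`down` and `up`, hypothetical names), and that the transformed logits serve as targets in a standard temperature-scaled KL distillation loss. The rank, temperature, and initialization values are likewise assumptions.

```python
import numpy as np


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over the last axis."""
    scaled = logits / temperature
    shifted = scaled - scaled.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def low_rank_transform(teacher_logits, down, up):
    """Residual low-rank map (assumed form): z + z @ down @ up.

    down has shape (num_classes, rank) and up has shape (rank, num_classes),
    so the correction term has rank at most `rank`.
    """
    return teacher_logits + teacher_logits @ down @ up


def kd_loss(student_logits, target_logits, temperature=4.0):
    """Temperature-scaled KL(target || student), averaged over the batch."""
    log_p = log_softmax(target_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    kl = np.sum(np.exp(log_p) * (log_p - log_q), axis=-1)
    return float(temperature ** 2 * kl.mean())


# Hypothetical sizes: 100 classes (as in CIFAR-100), rank 4, batch of 8.
rng = np.random.default_rng(0)
num_classes, rank, batch = 100, 4, 8
down = rng.normal(scale=0.1, size=(num_classes, rank))
up = rng.normal(scale=0.1, size=(rank, num_classes))

teacher_logits = rng.normal(size=(batch, num_classes))
student_logits = rng.normal(size=(batch, num_classes))

# Transformed teacher logits act as an additional, diversified target.
transformed = low_rank_transform(teacher_logits, down, up)
loss = kd_loss(student_logits, transformed)
```

In this sketch the two factors hold 2 x 100 x 4 = 800 parameters, compared with 10,000 for a full 100 x 100 transformation, which is the usual motivation for a low-rank parameterization. Setting `up` to zeros makes the map the identity, so the transformed targets would start out equal to the teacher's own logits.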


