On-Device Intelligence Acceleration Technologies: A Survey

ZHANG Jin-rui; LONG Ting-ting; ZHANG De-yu; XU Yuan; REN Ju; ZHANG Yao-xue

doi:10.12263/DZXB.20240691

Special Column: Chinese Institute of Electronics Science and Technology Award | Updated: 2025-07-24

- Original title (Chinese): 端智能推理加速技术综述
- English title: On-Device Intelligence Acceleration Technologies: A Survey
- Acta Electronica Sinica, 2025, Vol. 53, No. 4, pp. 1063-1102
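- Author affiliations:
  
  1. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
  2. School of Computer Science, Central South University, Changsha, Hunan 410083, China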
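- Author biographies:
  
  ZHANG Jin-rui, male, was born in Xiangtan, Hunan Province, in December 1992. He holds a Ph.D. and is currently a postdoctoral researcher in the Department of Computer Science and Technology, Tsinghua University. His main research interests are edge intelligence, mobile computing acceleration, on-device heterogeneous computing, and computer vision. E-mail: zhangjinrui@tsinghua.edu.cn
  
  LONG Ting-ting, female, was born in Zaoyang, Hubei Province, in June 2001. She is currently a master's student in the School of Computer Science, Central South University. Her main research interests are edge intelligence, model compression, and text-video retrieval optimization. E-mail: TingtingLong@csu.edu.cn
  
  ZHANG De-yu, male, was born in Xinxiang, Henan Province, in June 1987. He holds a Ph.D. and is currently an associate professor in the School of Computer Science, Central South University. His main research interests are edge computing, the Internet of Things, and deep learning acceleration on mobile devices. Chinese Institute of Electronics member No.: E190085074M. E-mail: zdy876@csu.edu.cn
  
  XU Yuan, female, was born in Hangzhou, Zhejiang Province, in February 2003. She is currently an undergraduate student in the School of Computer Science, Central South University. Her main research interests are edge intelligence and human-computer interaction. E-mail: _xuan_@csu.edu.cn
  
  REN Ju, male, was born in Miluo, Hunan Province, in December 1987. He holds a Ph.D. and is currently a tenured associate professor in the Department of Computer Science and Technology, Tsinghua University, and a recipient of a national-level talent program. His main research interests are edge intelligent computing and intelligent collaboration, and security and privacy protection for edge intelligence. Chinese Institute of Electronics member No.: E190018924. E-mail: renju@tsinghua.edu.cn
  
  ZHANG Yao-xue, male, was born in Changde, Hunan Province, in January 1956. He holds a Ph.D. and is currently a tenured professor in the Department of Computer Science and Technology, Tsinghua University, and an Academician of the Chinese Academy of Engineering. His main research interests are computer networks, operating systems, and pervasive computing. Chinese Institute of Electronics member No.: E190004903F. E-mail: zhangyx@tsinghua.edu.cn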
- Funding:
  
  National Key Research and Development Program of China (2022YFF0604502); National Natural Science Foundation of China (62122095; 62341201)
- DOI: 10.12263/DZXB.20240691
  Chinese Library Classification (CLC) number: TP393.0
- Received: 2024-07-22; revised: 2025-01-15; published in print: 2025-04-25

ZHANG Jin-rui, LONG Ting-ting, ZHANG De-yu, et al. On-Device Intelligence Acceleration Technologies: A Survey[J]. Acta Electronica Sinica, 2025, 53(04): 1063-1102. DOI: 10.12263/DZXB.20240691

Abstract

Pushing intelligence down to end devices is an essential step toward the era of pervasive intelligence, and it has driven the rapid development of on-device intelligence. By deploying and running deep learning models directly on end devices, on-device intelligence has inherent advantages in real-time responsiveness, security, and personalization, and it has been widely applied in scenarios such as autonomous driving, satellite reconnaissance, and virtual reality/augmented reality (VR/AR). However, as the number of parameters in deep learning models keeps growing, the limited hardware resources of end devices can no longer sustain the rising computational cost. To improve the computational efficiency of model inference on end devices, researchers have carried out systematic optimization at multiple levels, including model algorithms, compilation software, and device hardware, which has effectively advanced on-device intelligence. This paper summarizes existing work on optimizing on-device deep learning inference from the perspectives of algorithms and combined software-hardware optimization, covering model compression techniques, model-software-hardware co-design, heterogeneous parallel deployment strategies for models, and on-device optimization techniques for large models. Finally, it outlines the challenges currently facing on-device inference acceleration technologies and discusses future development trends.

references

廉筱峪 . 复杂噪声环境下基于轻量化模型的车内交互语音增强和识别方法 [J ] . 电子学报 , 2024 , 52 ( 4 ): 1282 - 1287 .

LIAN X Y . In-vehicle interactive speech enhancement and recognition method based on lightweight model in complex noise environments [J ] . Acta Electronica Sinica , 2024 , 52 ( 4 ): 1282 - 1287 . (in Chinese)

周治国 , 马文浩 . 一种多层多模态融合3D目标检测方法 [J ] . 电子学报 , 2024 , 52 ( 3 ): 696 - 708 .

ZHOU Z G , MA W H . 3D object detection based on multilayer multimodal fusion [J ] . Acta Electronica Sinica , 2024 , 52 ( 3 ): 696 - 708 . (in Chinese)

惠康华 , 闫建青 , 高思华 , 等 . 基于特征融合的轻量级新残差人脸识别方法 [J ] . 电子学报 , 2024 , 52 ( 3 ): 937 - 944 .

HUI K H , YAN J Q , GAO S H , et al . Lightweight new fesidual face recognition method based on feature fusion [J ] . Acta Electronica Sinica , 2024 , 52 ( 3 ): 937 - 944 . (in Chinese)

BROWN T , MANN B , RYDER N , et al . Language models are few-shot learners [J ] . Advances in Neural Information Processing Systems , 2020 , 33 : 1877 - 1901 .

ZHANG Q Y , LI X , CHE X Y , et al . A comprehensive benchmark of deep learning libraries on mobile devices [C ] // Proceedings of the ACM Web Conference 2022 . New York : ACM , 2022 : 3298 - 3307 .

ZHOU Z , CHEN X , LI E , et al . Edge intelligence: Paving the last mile of artificial intelligence with edge computing [J ] . Proceedings of the IEEE , 2019 , 107 ( 8 ): 1738 - 1762 .

WANG X F , HAN Y W , LEUNG V C M , et al . Convergence of edge computing and deep learning: A comprehensive survey [J ] . IEEE Communications Surveys & Tutorials , 2020 , 22 ( 2 ): 869 - 904 .

DENG S G , ZHAO H L , FANG W J , et al . Edge intelligence: The confluence of edge computing and artificial intelligence [J ] . IEEE Internet of Things Journal , 2020 , 7 ( 8 ): 7457 - 7469 .

SARWAR MURSHED M G , MURPHY C , HOU D Q , et al . Machine learning at the network edge: A survey [J ] . ACM Computing Surveys , 2022 , 54 ( 8 ): 1 - 37 .

DHAR S , GUO J Y , LIU J J , et al . A survey of on-device machine learning [J ] . ACM Transactions on Internet of Things , 2021 , 2 ( 3 ): 1 - 49 .

高晗 , 田育龙 , 许封元 , 等 . 深度学习模型压缩与加速综述 [J ] . 软件学报 , 2021 , 32 ( 1 ): 68 - 92 .

GAO H , TIAN Y L , XU F Y , et al . Survey of deep learning model compression and acceleration [J ] . Journal of Software , 2021 , 32 ( 1 ): 68 - 92 . (in Chinese)

CAI H , LIN J , LIN Y J , et al . Enable deep learning on mobile devices: Methods, systems, and applications [J ] . ACM Transactions on Design Automation of Electronic Systems , 2022 , 27 ( 3 ): 1 - 50 .

SHUVO M M H , ISLAM S K , CHENG J L , et al . Efficient acceleration of deep learning inference on resource-constrained edge devices: A review [J ] . Proceedings of the IEEE , 2023 , 111 ( 1 ): 42 - 91 .

ZHAO T M , XIE Y C , WANG Y , et al . A survey of deep learning on mobile devices: Applications, optimizations, challenges, and research opportunities [J ] . Proceedings of the IEEE , 2022 , 110 ( 3 ): 334 - 354 .

HUA H C , LI Y T , WANG T H , et al . Edge computing with artificial intelligence: A machine learning perspective [J ] . ACM Computing Surveys , 2023 , 55 ( 9 ): 1 - 35 .

LECUN Y , DENKER J , SOLLA S . Optimal brain damage [J ] . Advances in Neural Information Processing Systems , 1989 , 2 : 1 - 1990 .

HASSIBI B , STORK D . Second order derivatives for network pruning: Optimal brain surgeon [J ] . Advances in Neural Information Processing Systems , 1992 , 5 : 1 .

DONG X , CHEN S Y , PAN S J . Learning to prune deep neural networks via layer-wise optimal brain surgeon [EB/OL ] . ( 2017-11-09 )[ 2024-07-22 ] . https://arxiv.org/abs/1705.07565v2 https://arxiv.org/abs/1705.07565v2 .

SINGH S P , ALISTARH D . WoodFisher: Efficient second-order approximation for neural network compression [EB/OL ] . ( 2020-01-04 )[ 2024-07-22 ] . https://arxiv.org/abs/2004.14340v5 https://arxiv.org/abs/2004.14340v5 .

YU X , SERRA T , RAMALINGAM S , et al . The combinatorial brain surgeon: Pruning weights that cancel one another in neural networks [C ] // International Conference on Machine Learning . New York : PMLR , 2022 : 25668 - 25683 .

HAN S , POOL J , TRAN J , et al . Learning both weights and connections for efficient neural network [J ] . Advances in Neural Information Processing Systems , 2015 , 28 : 1 .

LIAO Z , QUÉTU V , NGUYEN V T , et al . Can unstructured pruning reduce the depth in deep neural networks [C ] // 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW) . Piscataway : IEEE , 2023 : 1394 - 1398 .

XU K X , WANG Z , GENG X , et al . Efficient joint optimization of layer-adaptive weight pruning in deep neural networks [C ] // 2023 IEEE/CVF International Conference on Computer Vision (ICCV) . Piscataway : IEEE , 2023 : 17401 - 17411 .

SHARIFY S , LASCORZ A D , MAHMOUD M , et al . Laconic deep learning inference acceleration [C ] // Proceedings of the 46th International Symposium on Computer Architecture . New York : ACM , 2019 : 304 - 317 .

ZHANG K , LIU G Z . Layer pruning for obtaining shallower ResNets [J ] . IEEE Signal Processing Letters , 2022 , 29 : 1172 - 1176 .

XIA M Z , ZHONG Z X , CHEN D Q . Structured pruning learns compact and accurate models [EB/OL ] . ( 2022-05-02 )[ 2024-07-22 ] . https://arxiv.org/abs/2204.00408 https://arxiv.org/abs/2204.00408 .

YU L , XIANG W . X-pruner: Explainable pruning for vision transformers [C ] // 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) . Piscataway : IEEE , 2023 : 24355 - 24363 .

HE Z Q , QIAN Y G , WANG Y Q , et al . Filter pruning via feature discrimination in deep neural networks [M ] // Computer Vision - ECCV 2022 . Cham : Springer Nature Switzerland , 2022 : 245 - 261 .

MONDAL M , DAS B , ROY S D , et al . Adaptive CNN filter pruning using global importance metric [J ] . Computer Vision and Image Understanding , 2022 , 222 : 103511 .

LIU Y J , FAN K F , ZHOU W J . FPWT: Filter pruning via wavelet transform for CNNs [J ] . Neural Networks , 2024 , 179 : 106577 .

YANG W , XIAO Y C . Structured pruning via feature channels similarity and mutual learning for convolutional neural network compression [J ] . Applied Intelligence , 2022 , 52 ( 12 ): 14560 - 14570 .

ELKERDAWY S , ELHOUSHI M , SINGH A , et al . To filter prune, or to layer prune, that is the question [C ] // Computer Vision - ACCV 2020 . Cham : Springer International Publishing , 2021 , 1 : 737 - 753 .

WANG Z , LI C C , WANG X Y . Convolutional neural network pruning with structural redundancy reduction [C ] // 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) . Piscataway : IEEE , 2021 : 14913 - 14922 .

LI G L , MA X , WANG X Y , et al . Optimizing deep neural networks on intelligent edge accelerators via flexible-rate filter pruning [J ] . Journal of Systems Architecture , 2022 , 124 : 102431 .

FANG G F , MA X Y , SONG M L , et al . DepGraph: Towards any structural pruning [C ] // 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) . Piscataway : IEEE , 2023 : 16091 - 16101 .

PASZKE A , GROSS S , MASSA F , et al . PyTorch: An imperative style, high-performance deep learning library [EB/OL ] . ( 2019-12-03 )[ 2024-07-22 ] . https://arxiv.org/abs/1912.01703v1 https://arxiv.org/abs/1912.01703v1 .

WANG X , RACHWAN J , GÜNNEMANN S , et al . Structurally prune anything: Any architecture, any framework, any time [EB/OL ] .( 2024-03-03 )[ 2024-07-22 ] . https://arxiv.org/abs/2403.18955v1 https://arxiv.org/abs/2403.18955v1 .

GUPTA S , AGRAWAL A , GOPALAKRISHNAN K , et al . Deep learning with limited numerical precision [C ] // International conference on machine learning . New York : PMLR , 2015 : 1737 - 1746 .

KÖSTER U , WEBB T , WANG X , et al . Flexpoint: An adaptive numerical format for efficient training of deep neural networks [J ] . Advances in Neural Information Processing Systems , 2017 , 30 : 1 .

JACOB B , KLIGYS S , CHEN B , et al . Quantization and training of neural networks for efficient integer-arithmetic-only inference [C ] // 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Piscataway : IEEE , 2018 : 2704 - 2713 .

COURBARIAUX M , BENGIO Y , DAVID J P . BinaryConnect: Training deep neural networks with binary weights during propagations [EB/OL ] . ( 2016-04-18 )[ 2024-07-22 ] . https://arxiv.org/abs/1511.00363v3 https://arxiv.org/abs/1511.00363v3 .

RASTEGARI M , ORDONEZ V , REDMON J , et al . XNOR-Net: ImageNet classification using binary convolutional neural networks [M ] // Computer Vision-ECCV 2016 . Cham : Springer International Publishing , 2016 : 525 - 542 .

JUEFEI-XU F , BODDETI V N , SAVVIDES M . Local binary convolutional neural networks [C ] // 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) . Piscataway : IEEE , 2017 : 4284 - 4293 .

KIM H B , LEE J H , YOO S , et al . MetaMix: Meta-state precision searcher for mixed-precision activation quantization [J ] . Proceedings of the AAAI Conference on Artificial Intelligence , 2024 , 38 ( 12 ): 13132 - 13141 .

MICIKEVICIUS P , NARANG S R , ALBEN J , et al . Mixed precision training [EB/OL ] . ( 2018-02-15 )[ 2024-07-22 ] . https://arxiv.org/abs/1710.03740v3 https://arxiv.org/abs/1710.03740v3 .

WU B C , WANG Y H , ZHANG P Z , et al . Mixed precision quantization of ConvNets via differentiable neural architecture search [EB/OL ] . ( 2018-11-30 )[ 2024-07-22 ] . https://arxiv.org/abs/1812.00090v1 https://arxiv.org/abs/1812.00090v1 .

WANG K , LIU Z J , LIN Y J , et al . HAQ: Hardware-aware automated quantization with mixed precision [C ] // 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) . Piscataway : IEEE , 2019 : 8604 - 8612 .

HU Y M , WANG X G , LI L J , et al . Improving one-shot NAS with shrinking-and-expanding supernet [J ] . Pattern Recognition , 2021 , 118 : 108025 .

JIN Q , YANG L J , LIAO Z Y . AdaBits: Neural network quantization with adaptive bit-widths [C ] // 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) . Piscataway : IEEE , 2020 : 2146 - 2156 .

XU K , SHAO X Y , TIAN Y , et al . AutoMPQ: Automatic mixed-precision neural network search via few-shot quantization adapter [J/OL ] . IEEE Transactions on Emerging Topics in Computational Intelligence . ( 2024-05-08 )[ 2024-07-22 ] . https://ieeexplore.ieee.org/document/10523945 https://ieeexplore.ieee.org/document/10523945 .

WANG T Z , WANG K , CAI H , et al . APQ: Joint search for network architecture, pruning and quantization policy [C ] // 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) . Piscataway : IEEE , 2020 : 2075 - 2084 .

DONG P J , LI L J , WEI Z M , et al . EMQ: Evolving training-free proxies for automated mixed precision quantization [C ] // 2023 IEEE/CVF International Conference on Computer Vision (ICCV) . Piscataway : IEEE , 2023 : 17030 - 17040 .

KORYAKOVSKIY I , YAKOVLEVA A , BUCHNEV V , et al . One-shot model for mixed-precision quantization [C ] // 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) . Piscataway : IEEE , 2023 : 7939 - 7949 .

SUN Z , GE C , WANG J , et al . Entropy-driven mixed-precision quantization for deep network design [J ] . Advances in Neural Information Processing Systems , 2022 , 35 : 21508 - 21520 .

JADERBERG M , VEDALDI A , ZISSERMAN A . Speeding up convolutional neural networks with low rank expansions [EB/OL ] . ( 2014-05-15 )[ 2024-07-22 ] . https://arxiv.org/abs/1405.3866v1 https://arxiv.org/abs/1405.3866v1 .

LIU B Y , WANG M , FOROOSH H , et al . Sparse convolutional neural networks [C ] // 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) . Piscataway : IEEE , 2015 : 806 - 814 .

WANG P S , CHENG J . Fixed-point factorized networks [C ] // 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) . Piscataway : IEEE , 2017 : 3966 - 3974 .

PENG B , TAN W M , LI Z Y , et al . Extreme network compression via filter group approximation [M ] // Computer Vision - ECCV 2018 . Cham : Springer International Publishing , 2018 : 307 - 323 .

LIEBENWEIN L , MAALOUF A , GAL O , et al . Compressing neural networks: Towards determining the optimal layer-wise decomposition [EB/OL ] . ( 2021-11-18 )[ 2024-07-22 ] . https://arxiv.org/abs/2107.11442v2 https://arxiv.org/abs/2107.11442v2 .

HITCHCOCK F L . The expression of a tensor or a polyadic as a sum of products [J ] . Journal of Mathematics and Physics , 1927 , 6 ( 1 ): 164 - 189 .

HITCHCOCK F L . Multiple invariants and generalized rank of a P-way matrix or tensor [J ] . Journal of Mathematics and Physics , 1928 , 7 ( 1/2/3/4 ): 39 - 79 .

KIERS H A L . Towards a standardized notation and terminology in multiway analysis [J ] . Journal of Chemometrics , 2000 , 14 ( 3 ): 105 - 122 .

TUCKER L R . Some mathematical notes on three-mode factor analysis [J ] . Psychometrika , 1966 , 31 ( 3 ): 279 - 311 .

KIM Y D , PARK E , YOO S , et al . Compression of deep convolutional neural networks for fast and low power mobile applications [EB/OL ] . ( 2016-02-24 )[ 2024-07-22 ] . https://arxiv.org/abs/1511.06530v2 https://arxiv.org/abs/1511.06530v2 .

YIN M , PHAN H , ZANG X , et al . BATUDE: Budget-aware neural network compression based on tucker decomposition [J ] . Proceedings of the AAAI Conference on Artificial Intelligence , 2022 , 36 ( 8 ): 8874 - 8882 .

DENTON E , ZAREMBA W , BRUNA J , et al . Exploiting linear structure within convolutional networks for efficient evaluation [J ] . Advances in Neural Information Processing Systems , 2014 , 2(January): 1269- 1277 .

XUE J , LI J Y , GONG Y F . Restructuring of deep neural network acoustic models with singular value decomposition [C ] // Interspeech 2013 . Singapore : ISCA , 2013 : 2365 - 2369 .

ZHANG H N , LIU L J , ZHOU H Y , et al . CMD: Controllable matrix decomposition with global optimization for deep neural network compression [J ] . Machine Learning , 2022 , 111 ( 3 ): 831 - 851 .

HINTON G , VINYALS O , DEAN J . Distilling the knowledge in a neural network [EB/OL ] . ( 2015-03-09 )[ 2024-07-22 ] . https://arxiv.org/abs/1503.02531v1 https://arxiv.org/abs/1503.02531v1 .

LI Z , LI X , YANG L F , et al . Curriculum temperature for knowledge distillation [J ] . Proceedings of the AAAI Conference on Artificial Intelligence , 2023 , 37 ( 2 ): 1504 - 1512 .

SUN S Q , REN W Q , LI J Z , et al . Logit standardization in knowledge distillation [C ] // 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) . Piscataway : IEEE , 2024 : 15731 - 15740 .

HAO Z W , GUO J Y , HAN K , et al . One-for-all: Bridge the gap between heterogeneous architectures in knowledge distillation [EB/OL ] . ( 2023-10-30 )[ 2024-07-22 ] . https://arxiv.org/abs/2310.19444v1 https://arxiv.org/abs/2310.19444v1 .

IANDOLA F N , HAN S , MOSKEWICZ M W , et al . SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size [EB/OL ] . ( 2016-11-04 )[ 2024-07-22 ] . https://arxiv.org/abs/1602.07360v4 https://arxiv.org/abs/1602.07360v4 .

GHOLAMI A , KWON K , WU B C , et al . SqueezeNext: Hardware-aware neural network design [C ] // 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) . Piscataway : IEEE , 2018 : 1751 - 1760 .

HOWARD A G , ZHU M L , CHEN B , et al . MobileNets: Efficient convolutional neural networks for mobile vision applications [EB/OL ] . ( 2017-04-17 )[ 2024-07-22 ] . https://arxiv.org/abs/1704.04861v1 https://arxiv.org/abs/1704.04861v1 .

SANDLER M , HOWARD A , ZHU M L , et al . MobileNetV2: Inverted residuals and linear bottlenecks [C ] // 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Piscataway : IEEE , 2018 : 4510 - 4520 .

ZHANG X Y , ZHOU X Y , LIN M X , et al . ShuffleNet: An extremely efficient convolutional neural network for mobile devices [C ] // 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Piscataway : IEEE , 2018 : 6848 - 6856 .

MA N N , ZHANG X Y , ZHENG H T , et al . ShuffleNet V2: Practical guidelines for efficient CNN architecture design [C ] // Computer Vision-ECCV 2018 . Cham : Springer International Publishing , 2018 : 122 - 138 .

ZHANG T , QI G J , XIAO B , et al . Interleaved group convolutions [C ] // 2017 IEEE International Conference on Computer Vision (ICCV) . Piscataway : IEEE , 2017 : 4383 - 4392 .

XIE G T , WANG J D , ZHANG T , et al . Interleaved structured sparse convolutional neural networks [C ] // 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Piscataway : IEEE , 2018 : 8847 - 8856 .

SUN K , LI M J , LIU D , et al . IGCV3: Interleaved low-rank group convolutions for efficient deep neural networks [EB/OL ] . ( 2018-07-20 )[ 2024-07-22 ] . https://arxiv.org/abs/1806.00178v2 https://arxiv.org/abs/1806.00178v2 .

HAN K , WANG Y H , TIAN Q , et al . GhostNet: More features from cheap operations [C ] // 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) . Piscataway : IEEE , 2020 : 1577 - 1586 .

QIN D F , LEICHNER C , DELAKIS M , et al . MobileNetV4: Universal models for the mobile ecosystem [EB/OL ] . ( 2024-09-29 )[ 2025-05-06 ] . https://arxiv.org/abs/2404.10518v2 https://arxiv.org/abs/2404.10518v2 .

BENDER G , LIU H X , CHEN B , et al . Can weight sharing outperform random architecture search? An investigation with TuNAS [C ] // 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) . Piscataway : IEEE , 2020 : 14323 - 14332 .

TAN M , LE Q . Efficientnet: Rethinking model scaling for convolutional neural networks [C ] // International Conference on Machine Learning . New York : PMLR , 2019 : 6105 - 6114 .

TAN M X , PANG R M , LE Q V . EfficientDet: Scalable and efficient object detection [C ] // 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) . Piscataway : IEEE , 2020 : 10781 - 10790 .

CHEN J R , KAO S H , HE H , et al . Run, don’t walk: Chasing higher FLOPS for faster neural networks [C ] // 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) . Piscataway : IEEE , 2023 : 12021 - 12031 .

VASU P K A , GABRIEL J , ZHU J , et al . MobileOne: An improved one millisecond mobile backbone [C ] // 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) . Piscataway : IEEE , 2023 : 7907 - 7917 .

WANG A , CHEN H , LIN Z J , et al . Rep ViT: Revisiting mobile CNN from ViT perspective [C ] // 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) . Piscataway : IEEE , 2024 : 15909 - 15920 .

ZOPH B , LE Q V , MATHUR V , et al . Neural architecture search with reinforcement learning [EB/OL ] . ( 2017-02-15 )[ 2024-07-22 ] . https://arxiv.org/abs/1611.01578v2 https://arxiv.org/abs/1611.01578v2 .

ZOPH B , VASUDEVAN V , SHLENS J , et al . Learning transferable architectures for scalable image recognition [C ] // 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Piscataway : IEEE , 2018 : 8697 - 8710 .

TAN M X , CHEN B , PANG R M , et al . MnasNet: Platform-aware neural architecture search for mobile [C ] // 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) . Piscataway : IEEE , 2019 : 2820 - 2828 .

CAI H , ZHU L , HAN S . Proxylessnas: Direct neural architecture search on target task and hardware [EB/OL ] . ( 2019-02-23 )[ 2024-07-22 ] . https://arxiv.org/abs/1812.00332 https://arxiv.org/abs/1812.00332 .

HOWARD A , SANDLER M , CHEN B , et al . Searching for MobileNetV3 [C ] // 2019 IEEE/CVF International Conference on Computer Vision (ICCV) . Piscataway : IEEE , 2019 : 1314 - 1324 .

WU B C , KEUTZER K , DAI X L , et al . FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search [C ] // 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) . Piscataway : IEEE , 2019 : 10734 - 10742 .

WU B C , WAN A , YUE X Y , et al . Shift: A zero FLOP, zero parameter alternative to spatial convolutions [C ] // 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition . Piscataway : IEEE , 2018 : 9127 - 9135 .

STAMOULIS D , DING R Z , WANG D , et al . Single-path NAS: Designing hardware-efficient ConvNets in less than 4 hours [M ] // Machine Learning and Knowledge Discovery in Databases . Cham : Springer International Publishing , 2020 : 481 - 497 .

ANGELINE P J , SAUNDERS G M , POLLACK J B . An evolutionary algorithm that constructs recurrent neural networks [J ] . IEEE Transactions on Neural Networks , 1994 , 5 ( 1 ): 54 - 65 .

STANLEY K O , MIIKKULAINEN R . Efficient evolution of neural network topologies [C ] // Proceedings of the 2002 Congress on Evolutionary Computation (CEC’02) . Piscataway : IEEE , 2002 : 1757 - 1762 .

FLOREANO D , DÜRR P , MATTIUSSI C . Neuroevolution: From architectures to learning [J ] . Evolutionary Intelligence , 2008 , 1 ( 1 ): 47 - 62 .

JÓZEFOWICZ R , ZAREMBA W , SUTSKEVER I . An empirical exploration of recurrent network architectures [C ] // International Conference on Machine Learning . New York : PMLR , 2015 : 2342 - 2350 .

MIIKKULAINEN R , LIANG J , MEYERSON E , et al . Evolving deep neural networks [M ] // Artificial Intelligence in the Age of Neural Networks and Brain Computing . Pittsburg : Academic Press , 2019 : 293 - 312 .

LIU H X , SIMONYAN K , VINYALS O , et al . Hierarchical representations for efficient architecture search [EB/OL ] . ( 2018-02-22 )[ 2024-07-22 ] . https://arxiv.org/abs/1711.00436v2 https://arxiv.org/abs/1711.00436v2 .

LIU H X , SIMONYAN K , YANG Y M , et al . DARTS: Differentiable architecture search [EB/OL ] . ( 2019-04-23 )[ 2024-07-22 ] . https://arxiv.org/abs/1806.09055v2 https://arxiv.org/abs/1806.09055v2 .

ZHENG X W , JI R R , TANG L , et al . Multinomial distribution learning for effective neural architecture search [C ] // 2019 IEEE/CVF International Conference on Computer Vision (ICCV) . Piscataway : IEEE , 2019 : 1304 - 1313 .

LIU C X , ZOPH B , NEUMANN M , et al . Progressive neural architecture search [C ] // Computer Vision - ECCV 2018 . Cham : Springer International Publishing , 2018 : 19 - 35 .

REAL E , AGGARWAL A , HUANG Y P , et al . Regularized evolution for image classifier architecture search [EB/OL ] . ( 2019-02-16 )[ 2024-07-22 ] . https://arxiv.org/abs/1802.01548v7 https://arxiv.org/abs/1802.01548v7 .

LI L S , JAMIESON K G , DESALVO G , et al . Hyperband: Bandit-based configuration evaluation for hyperparameter optimization [C ] // Computer Science . Washington DC : ICLR , 2017 : 53 .

ZELA A , KLEIN A , FALKNER S , et al . Towards automated deep learning: Efficient joint neural architecture and hyperparameter search [EB/OL ] . ( 2018-07-18 )[ 2024-07-22 ] . https://arxiv.org/abs/1807.06906v1 https://arxiv.org/abs/1807.06906v1 .

CHRABĄSZCZ P , LOSHCHILOV I , HUTTER F . Back to basics: Benchmarking canonical evolution strategies for playing atari [C ] // Proceedings of the 27th International Joint Conference on Artificial Intelligence . Freiburg : International Joint Conferences on Artificial Intelligence Organization , 2018 : 1419 - 1426 .

RUNGE F , STOLL D , FALKNER S , et al . Learning to design RNA [EB/OL ] . ( 2019-04-12 )[ 2024-07-22 ] . https://arxiv.org/abs/1812.11951v2 https://arxiv.org/abs/1812.11951v2 .

SWERSKY K , SNOEK J , ADAMS R P . Freeze-thaw Bayesian optimization [EB/OL ] . ( 2014-06-16 )[ 2024-07-22 ] . https://arxiv.org/abs/1406.3896v1 https://arxiv.org/abs/1406.3896v1 .

DOMHAN T , SPRINGENBERG J T , HUTTER F . Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves [J ] . IJCAI International Joint Conference on Artificial Intelligence , 2015 , 1 : 3460 - 3468 .

KLEIN A , FALKNER S , SPRINGENBERG J T , et al . Learning curve prediction with bayesian neural networks [J ] . International Conference on Learning Representations , 2017 , 1 : 1 .

BAKER B , GUPTA O , RASKAR R , et al . Accelerating neural architecture search using performance prediction [EB/OL ] . ( 2017-11-08 )[ 2024-07-22 ] . https://arxiv.org/abs/1705.10823v2 https://arxiv.org/abs/1705.10823v2 .

ZHANG M Y , YU X Y , ZHAO H D , et al . ShiftNAS: Improving one-shot NAS via probability shift [C ] // 2023 IEEE/CVF International Conference on Computer Vision (ICCV) . Piscataway : IEEE , 2023 : 5896 - 5905 .

WANG H B , GE C , CHEN H S , et al . PreNAS: Preferred one-shot learning towards efficient neural architecture search [EB/OL ] . ( 2023-06-16 )[ 2024-07-22 ] . https://arxiv.org/abs/2304.14636v3 https://arxiv.org/abs/2304.14636v3 .

YUAN G L , XUE B , ZHANG M J . An effective one-shot neural architecture search method with supernet fine-tuning for image classification [C ] // Proceedings of the Genetic and Evolutionary Computation Conference . New York : ACM , 2023 : 615 - 623 .

ZHENG H , LIU K H , FEDOROV I , et al . SiGeo: Sub-one-shot NAS via geometry of loss landscape [C ] // Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining . New York : ACM , 2024 : 4536 - 4547 .

SESHADRI K , AKIN B , LAUDON J , et al . An evaluation of edge TPU accelerators for convolutional neural networks [C ] // IEEE International Symposium on Workload Characterization (IISWC) . Piscataway : IEEE , 2022 : 79 - 91 .

LIN Y , HAFDI D , WANG K , et al . Neural-hardware architecture search [J ] . NeurIPS WS , 2019 , 1 : 1 - 36 .

SHEN M Y , YIN H X , MOLCHANOV P , et al . HALP: Hardware-aware latency pruning [EB/OL ] . ( 2021-10-20 )[ 2024-07-22 ] . https://arxiv.org/abs/2110.10811v1 https://arxiv.org/abs/2110.10811v1 .

YANG T J , HOWARD A , CHEN B , et al . NetAdapt: Platform-aware neural network adaptation for mobile applications [C ] // Computer Vision - ECCV 2018 . Cham : Springer International Publishing , 2018 : 289 - 304 .

LI X , ZHOU Y M , PAN Z , et al . Partial order pruning: For best speed/accuracy trade-off in neural architecture search [C ] // 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) . Piscataway : IEEE , 2019 : 9145 - 9153.42 .

CHEN Y H , EMER J , SZE V . Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks [C ] // 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA) . Piscataway : IEEE , 2016 : 367 - 379 .

ZHANG Y H , JIANG H X , ZHU Y T , et al . LOCP: Latency-optimized channel pruning for CNN inference acceleration on GPUs [J ] . The Journal of Supercomputing , 2023 , 79 ( 13 ): 14313 - 14341 .

YANG T J , CHEN Y H , SZE V . Designing energy-efficient convolutional neural networks using energy-aware pruning [C ] // 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) . Piscataway : IEEE , 2017 : 6071 - 6079 .

HAN S , MAO H Z , DALLY W J . Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding [EB/OL ] . ( 2016-02-15 )[ 2024-07-22 ] . https://arxiv.org/abs/1510.00149v5 https://arxiv.org/abs/1510.00149v5 .

YU J C , LUKEFAHR A , PALFRAMAN D , et al . Scalpel: Customizing DNN pruning to the underlying hardware parallelism [C ] // 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA) . Piscataway : IEEE , 2017 : 548 - 560 .

SUN M S , ZHAO P , GUNGOR M , et al . 3D CNN acceleration on FPGA using hardware-aware pruning [C ] // Proceedings of the 57th ACM/IEEE Design Automation Conference (DAC) . Piscataway : IEEE , 2020 : 1 - 6 .

PLOCHAET J , GOEDEMÉ T . Hardware-aware pruning for FPGA deep learning accelerators [C ] // 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) . Piscataway : IEEE , 2023 : 4482 - 4490 .

SUI X F , LV Q B , ZHI L J , et al . A hardware-friendly high-precision CNN pruning method and its FPGA implementation [J ] . Sensors , 2023 , 23 ( 2 ): 824 .

GUO C , HSUEH B Y , LENG J W , et al . Accelerating sparse DNN models without hardware-support via tile-wise sparsity [C ] // SC20: International Conference for High Performance Computing, Networking, Storage and Analysis . Piscataway : IEEE , 2020 : 1 - 15 .

CHEN Y H , KRISHNA T , EMER J S , et al . Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks [J ] . IEEE Journal of Solid-State Circuits , 2017 , 52 ( 1 ): 127 - 138 .

CHEN Y H , YANG T J , EMER J , et al . Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices [J ] . IEEE Journal on Emerging and Selected Topics in Circuits and Systems , 2019 , 9 ( 2 ): 292 - 308 .

CAVIGELLI L , BENINI L . Origami: A 803-GOp/s/W convolutional network accelerator [J ] . IEEE Transactions on Circuits and Systems for Video Technology , 2017 , 27 ( 11 ): 2461 - 2475 .

WEI X C , LIANG Y , CONG J . Overcoming data transfer bottlenecks in FPGA-based DNN accelerators via layer conscious memory management [C ] // Proceedings of the 56th ACM/IEEE Design Automation Conference (DAC) . Piscataway : IEEE , 2019 : 1 - 6 .

PARASHAR A , RHU M , MUKKARA A , et al . SCNN: An accelerator for compressed-sparse convolutional neural networks [C ] // 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA) . Piscataway : IEEE , 2017 : 27 - 40 .

SCHULMAN J , WOLSKI F , DHARIWAL P , et al . Proximal policy optimization algorithms [EB/OL ] . ( 2017-08-28 )[ 2024-07-22 ] . https://arxiv.org/abs/1707.06347v2 .

DAI X L , JIA Y Q , VAJDA P , et al . ChamNet: Towards efficient network design through platform-aware model adaptation [C ] // 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) . Piscataway : IEEE , 2019 : 11398 - 11407 .

TULI S , JHA N K . EdgeTran: Device-aware co-search of transformers for efficient inference on mobile edge platforms [J ] . IEEE Transactions on Mobile Computing , 2024 , 23 ( 6 ): 7012 - 7029 .

ABDELFATTAH M S , DUDZIAK L , CHAU T , et al . Best of both worlds: AutoML codesign of a CNN and its hardware accelerator [C ] // Proceedings of the 57th ACM/IEEE Design Automation Conference (DAC) . Piscataway : IEEE , 2020 : 1 - 6 .

YANG L , YAN Z Y , LI M , et al . Co-exploration of neural architectures and heterogeneous ASIC accelerator designs targeting multiple tasks [C ] // Proceedings of the 57th ACM/IEEE Design Automation Conference (DAC) . Piscataway : IEEE , 2020 : 1 - 6 .

LI Y H , HAO C , ZHANG X F , et al . EDD: Efficient differentiable DNN architecture and implementation co-search for embedded AI solutions [C ] // Proceedings of the 57th ACM/IEEE Design Automation Conference (DAC) . Piscataway : IEEE , 2020 : 1 - 6 .

FU Y G , ZHANG Y A , ZHANG Y , et al . Auto-NBA: Efficient and effective search over the joint space of networks, bitwidths, and accelerators [EB/OL ] . ( 2025-01-04 )[ 2024-07-22 ] . https://arxiv.org/abs/2106.06575v3 .

SEKANINA L . Neural architecture search and hardware accelerator co-search: A survey [J ] . IEEE Access , 2021 , 9 : 151337 - 151362 .

TULI S , LI C H , SHARMA R , et al . CODEBench: A neural architecture and hardware accelerator co-design framework [J ] . ACM Transactions on Embedded Computing Systems , 2023 , 22 ( 3 ): 1 - 30 .

HONG C , HUANG Q J , DINH G , et al . DOSA: Differentiable model-based one-loop search for DNN accelerators [C ] // Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture . New York : ACM , 2023 : 209 - 224 .

SAKHUJA C , SHI Z , LIN C . Leveraging domain information for the efficient automated design of deep learning accelerators [C ] // 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA) . Piscataway : IEEE , 2023 : 287 - 301 .

LIN Y J , YANG M T , HAN S . NAAS: Neural accelerator architecture search [C ] // 2021 58th ACM/IEEE Design Automation Conference (DAC) . Piscataway : IEEE , 2021 : 1051 - 1056 .

DONG Z , YAO Z W , CAI Y H , et al . HAWQ-V2: Hessian aware trace-weighted quantization of neural networks [EB/OL ] . ( 2019-11-10 )[ 2024-07-22 ] . https://arxiv.org/abs/1911.03852v1 .

YAO Z , DONG Z , ZHENG Z , et al . HAWQ-V3: Dyadic neural network quantization [C ] // International Conference on Machine Learning . New York : PMLR , 2021 : 11875 - 11886 .

REN A , ZHANG T Y , YE S K , et al . ADMM-NN: An algorithm-hardware co-design framework of DNNs using alternating direction methods of multipliers [C ] // Proceedings of the 24th International Conference on Architectural Support for Programming Languages and Operating Systems . New York : ACM , 2019 : 925 - 938 .

BALASKAS K , KARATZAS A , SAD C , et al . Hardware-aware DNN compression via diverse pruning and mixed-precision quantization [J ] . IEEE Transactions on Emerging Topics in Computing , 2024 , 12 ( 4 ): 1079 - 1092 .

SONG Z R , FU B Q , WU F Y , et al . DRQ: Dynamic region-based quantization for deep neural network acceleration [C ] // 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA) . Piscataway : IEEE , 2020 : 1010 - 1021 .

HUANG W , QIN H T , LIU Y D , et al . On-chip hardware-aware quantization for mixed precision neural networks [EB/OL ] . ( 2024-05-23 )[ 2024-07-22 ] . https://arxiv.org/abs/2309.01945v5 .

DENG C H , SUN F X , QIAN X H , et al . TIE: Energy-efficient tensor train-based inference engine for deep neural network [C ] // 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA) . Piscataway : IEEE , 2019 : 264 - 277 .

XIAO J Q , ZHANG C M , GONG Y , et al . HALOC: Hardware-aware automatic low-rank compression for compact neural networks [J ] . Proceedings of the AAAI Conference on Artificial Intelligence , 2023 , 37 ( 9 ): 10464 - 10472 .

LI S , HANSON E , LI H , et al . PENNI: Pruned kernel sharing for efficient CNN inference [C ] // International Conference on Machine Learning . New York : PMLR , 2020 : 5863 - 5873 .

YU Z W , BOUGANIS C S . StreamSVD: Low-rank approximation and streaming accelerator co-design [C ] // 2021 International Conference on Field-Programmable Technology (ICFPT) . Piscataway : IEEE , 2021 : 1 - 9 .

TensorFlow Lite [EB/OL ] . ( 2017-11-15 )[ 2024-07-22 ] . https://www.tensorflow.org/lite/guide?hl=zh-cn .

Tencent . ncnn [CP/OL ] . ( 2022-11-13 )[ 2024-07-22 ] . https://github.com/Tencent/ncnn .

JIANG X T , WANG H , CHEN Y L , et al . MNN: A universal and efficient inference engine [J ] . Proceedings of Machine Learning and Systems , 2020 , 2 : 1 - 13 .

Xiaomi . MACE [CP/OL ] . ( 2022-11-12 )[ 2024-07-22 ] . https://github.com/XiaoMi/mace .

Arm Software . Arm NN [CP/OL ] . ( 2022-11-11 )[ 2024-07-22 ] . https://github.com/ARM-software/armnn .

PaddlePaddle . Paddle-Lite [CP/OL ] . ( 2022-11-12 )[ 2024-07-22 ] . https://github.com/PaddlePaddle/Paddle-Lite .

PyTorch Mobile [EB/OL ] . ( 2021-06-15 )[ 2024-07-22 ] . https://pytorch.org/mobile/home .

LI M Z , LIU Y , LIU X Y , et al . The deep learning compiler: A comprehensive survey [J ] . IEEE Transactions on Parallel and Distributed Systems , 2021 , 32 ( 3 ): 708 - 727 .

CHEN T Q , MOREAU T , JIANG Z H , et al . TVM: An automated end-to-end optimizing compiler for deep learning [EB/OL ] . ( 2018-10-05 )[ 2024-07-22 ] . https://arxiv.org/abs/1802.04799v3 .

ROTEM N , FIX J , ABDULRASOOL S , et al . Glow: Graph lowering compiler techniques for neural networks [EB/OL ] . ( 2019-04-03 )[ 2024-07-22 ] . https://arxiv.org/abs/1805.00907v3 .

CYPHERS S , BANSAL A K , BHIWANDIWALLA A , et al . Intel nGraph: An intermediate representation, compiler, and executor for deep learning [EB/OL ] . ( 2018-01-30 )[ 2024-07-22 ] . https://arxiv.org/abs/1801.08058v2 .

XLA - TensorFlow, compiled [EB/OL ] . ( 2017-03-01 )[ 2024-07-22 ] . https://developers.googleblog.com/2017/03/xla-tensorflow-compiled.html .

KIM T , KWON Y , LEE J , et al . CPrune: Compiler-informed model pruning for efficient target-aware DNN execution [M ] // Computer Vision - ECCV 2022 . Cham : Springer Nature Switzerland , 2022 : 651 - 667 .

MA X L , GUO F M , NIU W , et al . PCONV: The missing but desirable sparsity in DNN weight pruning for real-time execution on mobile devices [J ] . Proceedings of the AAAI Conference on Artificial Intelligence , 2020 , 34 ( 4 ): 5117 - 5124 .

NIU W , MA X L , LIN S , et al . PatDNN: Achieving real-time DNN execution on mobile devices with pattern-based weight pruning [C ] // Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems . New York : ACM , 2020 : 907 - 922 .

ZHANG C M , YUAN G , NIU W , et al . ClickTrain: Efficient and accurate end-to-end deep learning training via fine-grained architecture-preserving pruning [C ] // Proceedings of the ACM International Conference on Supercomputing . New York : ACM , 2021 : 266 - 278 .

NIU W , SUN M S , LI Z G , et al . RT3D: Achieving real-time execution of 3D convolutional neural networks on mobile devices [J ] . Proceedings of the AAAI Conference on Artificial Intelligence , 2021 , 35 ( 10 ): 9179 - 9187 .

GONG Y F , YUAN G , ZHAN Z , et al . Automatic mapping of the best-suited DNN pruning schemes for real-time mobile acceleration [J ] . ACM Transactions on Design Automation of Electronic Systems , 2022 , 27 ( 5 ): 1 - 26 .

LI Z G , YUAN G , NIU W , et al . NPAS: A compiler-aware framework of unified network pruning and architecture search for beyond real-time mobile acceleration [C ] // 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) . Piscataway : IEEE , 2021 : 14255 - 14266 .

NIU W , GUAN J X , WANG Y Z , et al . DNNFusion: Accelerating deep neural networks execution with advanced operator fusion [C ] // Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation . New York : ACM , 2021 : 883 - 898 .

CAI X Y , WANG Y , ZHANG L . Optimus: An operator fusion framework for deep neural networks [J ] . ACM Transactions on Embedded Computing Systems , 2023 , 22 ( 1 ): 1 - 26 .

JIA Z H , PADON O , THOMAS J , et al . TASO: Optimizing deep learning computation with automatic generation of graph substitutions [C ] // Proceedings of the 27th ACM Symposium on Operating Systems Principles . New York : ACM , 2019 : 47 - 62 .

TARG S , ALMEIDA D , LYMAN K . Resnet in resnet: Generalizing residual architectures [EB/OL ] . ( 2016-03-25 )[ 2024-07-22 ] . https://arxiv.org/abs/1603.08029v1 .

CAI H , CHEN T Y , ZHANG W N , et al . Efficient architecture search by network transformation [J ] . Proceedings of the AAAI Conference on Artificial Intelligence , 2018 , 32 ( 1 ): 1 .

JIA Y . Learning Semantic Image Representations at a Large Scale [M ] . Berkeley : University of California , 2014 .

童敢 , 黄立波 , 吕雅帅 . 面向现代GPU的Winograd卷积加速研究 [J ] . 电子学报 , 2024 , 52 ( 1 ): 244 - 257 .

TONG G , HUANG L B , LYU Y S . Research on winograd convolution acceleration for modern GPU [J ] . Acta Electronica Sinica , 2024 , 52 ( 1 ): 244 - 257 . (in Chinese)

DUKHAN M . The indirect convolution algorithm [EB/OL ] . ( 2019-07-03 )[ 2024-07-22 ] . https://arxiv.org/abs/1907.02129v1 .

LEE J , PISARCHYK Y . Optimizing TensorFlow Lite runtime memory [EB/OL ] . ( 2020-10-02 )[ 2024-07-22 ] . https://blog.tensorflow.org/2020/10/optimizing-tensorflow-lite-runtime.html .

SEKIYAMA T , IMAMICHI T , IMAI H , et al . Profile-guided memory optimization for deep neural networks [EB/OL ] . ( 2018-04-26 )[ 2024-07-22 ] . https://arxiv.org/abs/1804.10001v1 .

JAIN P , JAIN A , NRUSIMHA A , et al . Checkmate: Breaking the memory wall with optimal tensor rematerialization [J ] . Proceedings of Machine Learning and Systems , 2020 , 2 : 497 - 511 .

MAAS M , BEAUGNON U , CHAUHAN A , et al . TelaMalloc: Efficient on-chip memory allocation for production machine learning accelerators [C ] // Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems , Volume 1 . New York : ACM , 2022 : 123 - 137 .

WANG M N , DING S H , CAO T , et al . AsyMo: Scalable and efficient deep-learning inference on asymmetric mobile CPUs [C ] // Proceedings of the 27th Annual International Conference on Mobile Computing and Networking . New York : ACM , 2021 : 215 - 228 .

KIM Y , KIM J , CHAE D J , et al . μLayer: Low latency on-device inference using cooperative single-layer acceleration and processor-friendly quantization [C ] // Proceedings of the 14th EuroSys Conference 2019 . New York : ACM , 2019 : 1 - 15 .

ZHANG J R , ZHANG D Y , XU X H , et al . MobiPose: Real-time multi-person pose estimation on mobile devices [C ] // Proceedings of the 18th Conference on Embedded Networked Sensor Systems . New York : ACM , 2020 : 136 - 149 .

ZHANG J R , ZHANG D Y , YANG H , et al . MVPose: Realtime multi-person pose estimation using motion vector on mobile devices [J ] . IEEE Transactions on Mobile Computing , 2023 , 22 ( 6 ): 3508 - 3524 .

JEONG J S , LEE J Y , KIM D , et al . Band: Coordinated multi-DNN inference on heterogeneous mobile processors [C ] // Proceedings of the 20th Annual International Conference on Mobile Systems, Applications and Services . New York : ACM , 2022 : 235 - 247 .

TAN T X , CAO G H . Efficient execution of deep neural networks on mobile devices with NPU [C ] // Proceedings of the 20th International Conference on Information Processing in Sensor Networks (co-located with CPS-IoT Week 2021) . New York : ACM , 2021 : 283 - 298 .

WEI J Y , CAO T , CAO S J , et al . NN-stretch: Automatic neural network branching for parallel inference on heterogeneous multi-processors [C ] // Proceedings of the 21st Annual International Conference on Mobile Systems, Applications and Services . New York : ACM , 2023 : 70 - 83 .

HUYNH L N , LEE Y , BALAN R K . DeepMon: Mobile GPU-based deep learning framework for continuous vision applications [C ] // Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services . New York : ACM , 2017 : 82 - 95 .

XU M W , ZHU M Z , LIU Y X , et al . DeepCache: Principled cache for mobile deep vision [C ] // Proceedings of the 24th Annual International Conference on Mobile Computing and Networking . New York : ACM , 2018 : 129 - 144 .

HUYNH L N , BALAN R K , LEE Y . DeepSense: A GPU-based deep convolutional neural network framework on commodity mobile devices [C ] // Proceedings of the 2016 Workshop on Wearable Systems and Applications . New York : ACM , 2016 : 25 - 30 .

HEO S , CHO S , KIM Y , et al . Real-time object detection system with multi-path neural networks [C ] // 2020 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS) . Piscataway : IEEE , 2020 : 174 - 187 .

ZHANG J R , YANG H , REN J , et al . MobiDepth: Real-time depth estimation using on-device dual cameras [C ] // Proceedings of the 28th Annual International Conference on Mobile Computing and Networking . New York : ACM , 2022 : 528 - 541 .

LIN J , CHEN W M , LIN Y , et al . MCUNet: Tiny deep learning on IoT devices [J ] . Advances in Neural Information Processing Systems , 2020 , 33 : 11711 - 11722 .

WEI J , TAY Y , BOMMASANI R , et al . Emergent abilities of large language models [EB/OL ] . ( 2022-10-26 )[ 2024-07-22 ] . https://arxiv.org/abs/2206.07682v2 .

VASWANI A , SHAZEER N , PARMAR N , et al . Attention is all you need [J ] . Advances in Neural Information Processing Systems , 2017 , 30 : 5998 - 6008 .

DEVLIN J , CHANG M W , LEE K , et al . BERT: Pre-training of deep bidirectional transformers for language understanding [EB/OL ] . ( 2019-05-24 )[ 2024-07-22 ] . https://arxiv.org/abs/1810.04805v2 .

DOSOVITSKIY A , BEYER L , KOLESNIKOV A , et al . An image is worth 16x16 words: Transformers for image recognition at scale [EB/OL ] . ( 2021-06-03 )[ 2024-07-22 ] . https://arxiv.org/abs/2010.11929 .

TOUVRON H , CORD M , DOUZE M , et al . Training data-efficient image transformers & distillation through attention [C ] // International Conference on Machine Learning . New York : PMLR , 2021 : 10347 - 10357 .

LI Y Y , HU J , WEN Y , et al . Rethinking vision transformers for MobileNet size and speed [C ] // 2023 IEEE/CVF International Conference on Computer Vision (ICCV) . Piscataway : IEEE , 2023 : 16843 - 16854 .

LIU Z , LIN Y T , CAO Y , et al . Swin transformer: Hierarchical vision transformer using shifted windows [C ] // 2021 IEEE/CVF International Conference on Computer Vision (ICCV) . Piscataway : IEEE , 2021 : 9992 - 10002 .

LIU Z , HU H , LIN Y T , et al . Swin transformer V2: Scaling up capacity and resolution [C ] // 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) . Piscataway : IEEE , 2022 : 11999 - 12009 .

DONG X Y , BAO J M , CHEN D D , et al . CSWin transformer: A general vision transformer backbone with cross-shaped windows [C ] // 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) . Piscataway : IEEE , 2022 : 12114 - 12124 .

CHEN C R , FAN Q F , PANDA R . CrossViT: Cross-attention multi-scale vision transformer for image classification [C ] // 2021 IEEE/CVF International Conference on Computer Vision (ICCV) . Piscataway : IEEE , 2021 : 347 - 356 .

WU S T , WU T Y , TAN H R , et al . Pale transformer: A general vision transformer backbone with pale-shaped attention [J ] . Proceedings of the AAAI Conference on Artificial Intelligence , 2022 , 36 ( 3 ): 2731 - 2739 .

PAN Z , CAI J , ZHUANG B . Fast vision transformers with HiLo attention [J ] . Advances in Neural Information Processing Systems , 2022 , 35 : 14541 - 14554 .

WANG W , CHEN W , QIU Q , et al . CrossFormer++: A versatile vision transformer hinging on cross-scale attention [J ] . IEEE Transactions on Pattern Analysis and Machine Intelligence , 2024 , 46 ( 5 ): 3123 - 3136 .

MEHTA S , RASTEGARI M . MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer [EB/OL ] . ( 2022-03-04 )[ 2024-07-22 ] . https://arxiv.org/abs/2110.02178v2 .

MEHTA S , RASTEGARI M . Separable self-attention for mobile vision transformers [EB/OL ] . ( 2022-06-06 )[ 2024-07-22 ] . https://arxiv.org/abs/2206.02680v1 .

LI Y , YUAN G , WEN Y , et al . EfficientFormer: Vision transformers at MobileNet speed [J ] . Advances in Neural Information Processing Systems , 2022 , 35 : 12934 - 12949 .

SHAKER A , MAAZ M , RASHEED H , et al . SwiftFormer: Efficient additive attention for transformer-based real-time mobile vision applications [C ] // 2023 IEEE/CVF International Conference on Computer Vision (ICCV) . Piscataway : IEEE , 2023 : 17379 - 17390 .

CHEN Y P , DAI X Y , CHEN D D , et al . Mobile-former: Bridging MobileNet and transformer [C ] // 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) . Piscataway : IEEE , 2022 : 5260 - 5269 .

XIAO G , LIN J , SEZNEC M , et al . SmoothQuant: Accurate and efficient post-training quantization for large language models [C ] // International Conference on Machine Learning . New York : PMLR , 2023 : 38087 - 38099 .

DETTMERS T , LEWIS M , BELKADA Y , et al . GPT3.int8(): 8-bit matrix multiplication for transformers at scale [J ] . Advances in Neural Information Processing Systems , 2022 , 35 : 30318 - 30332 .

LIN J , TANG J M , TANG H T , et al . AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration [J ] . Proceedings of Machine Learning and Systems , 2024 , 6 : 87 - 100 .

SHEN X , DONG P Y , LU L , et al . Agile-quant: Activation-guided quantization for faster inference of LLMs on the edge [J ] . Proceedings of the AAAI Conference on Artificial Intelligence , 2024 , 38 ( 17 ): 18944 - 18951 .

AINSLIE J , LEE-THORP J , DE JONG M , et al . GQA: Training generalized multi-query transformer models from multi-head checkpoints [EB/OL ] . ( 2023-12-23 )[ 2024-07-22 ] . https://arxiv.org/abs/2305.13245v3 .

GE S Y , ZHANG Y N , LIU L Y , et al . Model tells you what to discard: Adaptive KV cache compression for LLMs [EB/OL ] . ( 2024-10-29 )[ 2024-07-22 ] . https://arxiv.org/abs/2310.01801v4 .

JIANG H Q , WU Q H , LIN C Y , et al . LLMLingua: Compressing prompts for accelerated inference of large language models [EB/OL ] . ( 2023-12-06 )[ 2024-07-22 ] . https://arxiv.org/abs/2310.05736v2 .

LI Y C , DONG B , LIN C H , et al . Compressing context to enhance inference efficiency of large language models [EB/OL ] . ( 2023-10-09 )[ 2024-07-22 ] . https://arxiv.org/abs/2310.06201v1 .

MU J , LI X , GOODMAN N . Learning to compress prompts with gist tokens [J ] . Advances in Neural Information Processing Systems , 2024 , 36 : 1 .

REN S Y , JIA Q , ZHU K Q . Context compression for auto-regressive transformers with sentinel tokens [EB/OL ] . ( 2023-10-15 )[ 2024-07-22 ] . https://arxiv.org/abs/2310.08152v2 .

XIAO G X , TIAN Y D , CHEN B D , et al . Efficient streaming language models with attention sinks [EB/OL ] . ( 2024-04-07 )[ 2024-07-22 ] . https://arxiv.org/abs/2309.17453v4 .

HAN C , WANG Q F , PENG H , et al . LM-infinite: Zero-shot extreme length generalization for large language models [EB/OL ] . ( 2024-06-24 )[ 2024-07-22 ] . https://arxiv.org/abs/2308.16137v7 .

ZHANG Z , SHENG Y , ZHOU T , et al . H2O: Heavy-hitter oracle for efficient generative inference of large language models [J ] . Advances in Neural Information Processing Systems , 2024 , 36 : 1 .

WU H Y , TU K W . Layer-condensed KV cache for efficient inference of large language models [EB/OL ] . ( 2024-06-04 )[ 2024-07-22 ] . https://arxiv.org/abs/2405.10637v2 .

KWON W , LI Z H , ZHUANG S Y , et al . Efficient memory management for large language model serving with PagedAttention [C ] // Proceedings of the 29th Symposium on Operating Systems Principles . New York : ACM , 2023 : 611 - 626 .

FEDUS W , ZOPH B , SHAZEER N M . Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity [J ] . Journal of Machine Learning Research , 2022 , 23 ( 120 ): 1 - 39 .

RAFFEL C , SHAZEER N , ROBERTS A , et al . Exploring the limits of transfer learning with a unified text-to-text transformer [J ] . Journal of Machine Learning Research , 2020 , 21 : 1 - 67 .

HWANG R , WEI J Y , CAO S J , et al . Pre-gated MoE: An algorithm-system co-design for fast and scalable mixture-of-expert inference [C ] // 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA) . Piscataway : IEEE , 2024 : 1018 - 1031 .

KONG R , LI Y C , FENG Q T , et al . SwapMoE: Serving off-the-shelf MoE-based large language models with tunable memory budget [C ] // Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) . Stroudsburg : ACL , 2024 : 6710 - 6720 .

RAJBHANDARI S , LI C , YAO Z , et al . DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale [C ] // International Conference on Machine Learning . New York : PMLR , 2022 : 18332 - 18346 .

YI R J , GUO L W , WEI S Y , et al . EdgeMoE: Empowering sparse large language models on mobile devices [EB/OL ] . ( 2025-03-07 )[ 2025-05-06 ] . https://arxiv.org/abs/2308.14352v2 .

SONG Y X , XIE H T , ZHANG Z Y , et al . Turbo sparse: Achieving LLM SOTA performance with minimal activated parameters [EB/OL ] . ( 2024-06-11 )[ 2024-07-22 ] . https://arxiv.org/abs/2406.05955v2 .

陶建华 , 吴飞 , 黄民烈 , 等 . 中国人工智能系列白皮书——大模型技术 [R ] . 北京 : 中国人工智能学会 , 2023 .

TAO J H , WU F , HUANG M L , et al . China AI Series White Paper: Large Model Technology [R ] . Beijing : Chinese Association for Artificial Intelligence , 2023 . (in Chinese)

MA R L , WANG J Y , QI Q , et al . Poster: PipeLLM: Pipeline LLM inference on heterogeneous devices with sequence slicing [C ] // Proceedings of the ACM SIGCOMM 2023 Conference . New York : ACM , 2023 : 1126 - 1128 .

WEI Y X , YE S Y , JIANG J Z , et al . Communication-efficient model parallelism for distributed in situ transformer inference [C ] // 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE) . Piscataway : IEEE , 2024 : 1 - 6 .

ZHAO J C , SONG Y R , LIU S M , et al . LinguaLinked: A distributed large language model inference system for mobile devices [EB/OL ] . ( 2023-12-01 )[ 2024-07-22 ] . https://arxiv.org/abs/2312.00388v1 .

YANG B F , HE L X , LING N W , et al . EdgeFM: Leveraging foundation model for open-set learning on the edge [C ] // Proceedings of the 21st ACM Conference on Embedded Networked Sensor Systems . New York : ACM , 2023 : 111 - 124 .

CHEN Y X , LI R P , ZHAO Z F , et al . NetGPT: A native-AI network architecture beyond provisioning personalized generative services [EB/OL ] . ( 2024-03-09 )[ 2024-07-22 ] . https://arxiv.org/abs/2307.06148v4 .

YAO J C , ZHANG S Y , YAO Y , et al . Edge-cloud polarization and collaboration: A comprehensive survey for AI [J ] . IEEE Transactions on Knowledge and Data Engineering , 2023 , 35 ( 7 ): 6866 - 6886 .

XU D L , YIN W S , JIN X , et al . LLMCad: Fast and scalable on-device large language model inference [EB/OL ] . ( 2023-09-08 )[ 2024-07-22 ] . https://arxiv.org/abs/2309.04255v1 .

YIN W S , XU M W , LI Y C , et al . LLM as a system service on mobile devices [EB/OL ] . ( 2024-03-18 )[ 2024-07-22 ] . https://arxiv.org/abs/2403.11805v1 .

SONG Y X , MI Z Y , XIE H T , et al . PowerInfer: Fast large language model serving with a consumer-grade GPU [EB/OL ] . ( 2024-12-12 )[ 2024-07-22 ] . https://arxiv.org/abs/2312.12456v2 .

XUE Z L , SONG Y X , MI Z Y , et al . PowerInfer-2: Fast large language model inference on a smartphone [EB/OL ] . ( 2024-12-12 )[ 2024-07-22 ] . https://arxiv.org/abs/2406.06282v3 .

LIU Z C , ZHAO C S , IANDOLA F , et al . MobileLLM: Optimizing sub-billion parameter language models for on-device use cases [EB/OL ] . ( 2024-06-27 )[ 2024-07-22 ] . https://arxiv.org/abs/2402.14905v2 .

XU D , ZHANG H , YANG L , et al . Empowering 1000 tokens/second on-device LLM prefilling with mllm-NPU [EB/OL ] . ( 2024-12-15 )[ 2024-07-22 ] . https://arxiv.org/pdf/2407.05858 .

DENG J , DONG W , SOCHER R , et al . ImageNet: A large-scale hierarchical image database [C ] // 2009 IEEE Conference on Computer Vision and Pattern Recognition . Piscataway : IEEE , 2009 : 248 - 255 .

KRIZHEVSKY A , SUTSKEVER I , HINTON G E . ImageNet classification with deep convolutional neural networks [J ] . Communications of the ACM , 2017 , 60 ( 6 ): 84 - 90 .

RUSSAKOVSKY O , DENG J , SU H , et al . ImageNet large scale visual recognition challenge [J ] . International Journal of Computer Vision , 2015 , 115 ( 3 ): 211 - 252 .

KRIZHEVSKY A , HINTON G . Learning multiple layers of features from tiny images [R ] . Toronto : University of Toronto , 2009 .

ROMERO A , BALLAS N , KAHOU S E , et al . FitNets: Hints for thin deep nets [EB/OL ] . ( 2015-03-27 )[ 2024-07-22 ] . https://arxiv.org/abs/1412.6550v4 .

CROWLEY E J , GRAY G , STORKEY A . Moonshine: Distilling with cheap convolutions [EB/OL ] . ( 2019-11-07 )[ 2024-07-22 ] . https://arxiv.org/abs/1711.02613v4 .

WAN A , DAI X L , ZHANG P Z , et al . FBNetV2: Differentiable neural architecture search for spatial and channel dimensions [C ] // 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) . Piscataway : IEEE , 2020 : 12965 - 12974 .

MISAWA M , KUDO S E , MORI Y , et al . Development of a computer-aided detection system for colonoscopy and a publicly accessible large colonoscopy video database (with video) [J ] . Gastrointestinal Endoscopy , 2021 , 93 ( 4 ): 960 - 967 .

EELBODE T , SINONQUEL P , BISSCHOPS R , et al . Convolutional LSTM [M ] // Computer-Aided Analysis of Gastrointestinal Videos . Cham : Springer International Publishing , 2021 : 121 - 126 .

WANG A , SINGH A , MICHAEL J , et al . GLUE: A multi-task benchmark and analysis platform for natural language understanding [EB/OL ] . ( 2019-02-22 )[ 2024-07-22 ] . https://arxiv.org/abs/1804.07461v3 .

MERITY S , XIONG C M , BRADBURY J , et al . Pointer sentinel mixture models [EB/OL ] . ( 2016-09-26 )[ 2024-07-22 ] . https://arxiv.org/abs/1609.07843v1 .

KIM S , HOOPER C , GHOLAMI A , et al . SqueezeLLM: Dense-and-sparse quantization [EB/OL ] . ( 2023-06-13 )[ 2024-07-22 ] . https://arxiv.org/abs/2306.07629v4 .

HUANG W , LIU Y D , QIN H T , et al . BiLLM: Pushing the limit of post-training quantization for LLMs [EB/OL ] . ( 2024-05-15 )[ 2024-07-22 ] . https://arxiv.org/abs/2402.04291v2 .

COBBE K , KOSARAJU V , BAVARIAN M , et al . Training verifiers to solve math word problems [EB/OL ] . ( 2021-11-18 )[ 2024-07-22 ] . https://arxiv.org/abs/2110.14168v2 .

LI Y C , LI Y C , DONG B , et al . Unlocking context constraints of LLMs: Enhancing context efficiency of LLMs with self-information-based content filtering [EB/OL ] . ( 2023-04-24 )[ 2024-07-22 ] . https://arxiv.org/abs/2304.12102v1 .

RAJPURKAR P , ZHANG J , LOPYREV K , et al . SQuAD: 100,000+ questions for machine comprehension of text [EB/OL ] . ( 2016-10-11 )[ 2024-07-22 ] . https://arxiv.org/abs/1606.05250v3 .

SHOEYBI M , PATWARY M , PURI R , et al . Megatron-LM: Training multi-billion parameter language models using model parallelism [EB/OL ] . ( 2020-03-13 )[ 2024-07-22 ] . https://arxiv.org/abs/1909.08053v4 .

LIN S , HILTON J , EVANS O . TruthfulQA: Measuring how models mimic human falsehoods [EB/OL ] . ( 2022-05-08 )[ 2024-07-22 ] . https://arxiv.org/abs/2109.07958v2 .

PAPERNO D , KRUSZEWSKI G , LAZARIDOU A , et al . The LAMBADA dataset: Word prediction requiring a broad discourse context [EB/OL ] . ( 2016-06-20 )[ 2024-07-22 ] . https://arxiv.org/abs/1606.06031v1 .

BAI Y S , LV X , ZHANG J J , et al . LongBench: A bilingual, multitask benchmark for long context understanding [EB/OL ] . ( 2024-06-19 )[ 2024-07-22 ] . https://arxiv.org/abs/2308.14508v2 .

SUN Z Q , YU H K , SONG X D , et al . MobileBERT: A compact task-agnostic BERT for resource-limited devices [EB/OL ] . ( 2020-04-14 )[ 2024-07-22 ] . https://arxiv.org/abs/2004.02984v2 .

JIAO X Q , YIN Y C , SHANG L F , et al . TinyBERT: Distilling BERT for natural language understanding [EB/OL ] . ( 2020-10-16 )[ 2024-07-22 ] . https://arxiv.org/abs/1909.10351v5 .

YOU K E , ZHANG H T , SCHOOP E , et al . Ferret-UI: Grounded mobile UI understanding with multimodal LLMs [EB/OL ] . ( 2024-04-08 )[ 2024-07-22 ] . https://arxiv.org/abs/2404.05719v1 .

HU Z N , ISCEN A , SUN C , et al . Reveal: Retrieval-augmented visual-language pre-training with multi-source multimodal knowledge memory [C ] // 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) . Piscataway : IEEE , 2023 : 23369 - 23379 .

HU T M , LUO B , YANG C H , et al . MO-MIX: Multi-objective multi-agent cooperative decision-making with deep reinforcement learning [J ] . IEEE Transactions on Pattern Analysis and Machine Intelligence , 2023 , 45 ( 10 ): 12098 - 12112 .

ZHANG J J , HOU Y P , XIE R B , et al . AgentCF: Collaborative learning with autonomous language agents for recommender systems [C ] // Proceedings of the ACM Web Conference 2024 . New York : ACM , 2024 : 3679 - 3689 .
