High-Precision Lightweight Quantization Inference Method for Prevalent Activation Functions in Transformer Models in Edge Device Deployment

YANG Yun-hui; CHENG Hu; WEI Jing-he; LIU Guo-zhu; SANG Xian-zhen

doi:10.12263/DZXB.20240435


PAPERS | Updated: 2026-05-07

- ACTA ELECTRONICA SINICA Vol. 52, Issue 10, Pages: 3301-3311 (2024)
- Author affiliation: The 58th Research Institute of China Electronics Technology Group Corporation, Wuxi, Jiangsu 214072
- Funding: Natural Science Foundation of Jiangsu Province (K20211041; BK20211040; BE2021003-1; BE2023005-1); National Natural Science Foundation of China (62174150)
- DOI: 10.12263/DZXB.20240435
- CLC: TP301
- Received: 10 May 2024; Revised: 11 July 2024; Published: 25 October 2024

YANG Yun-hui, CHENG Hu, WEI Jing-he, et al. High-Precision Lightweight Quantization Inference Method for Prevalent Activation Functions in Transformer Models in Edge Device Deployment[J]. Acta Electronica Sinica, 2024, 52(10): 3301-3311. DOI：10.12263/DZXB.20240435

Abstract

Transformer-based models, such as large language models (LLMs) and vision Transformers (ViTs), have achieved state-of-the-art performance in natural language processing and machine vision tasks, respectively. However, GELU (Gaussian Error Linear Unit) and Swish, the prevalent activation functions in ViTs and LLMs, suffer from insufficient precision and low computational efficiency in fully quantized Transformer inference, which constrains their deployment and application on resource-limited edge devices. This paper proposes a high-precision approximation method for activation functions based on segmented quadratic polynomial fitting (SQPF), together with its quantized inference process, to achieve high-performance deployment of nonlinear activation functions on the edge side. SQPF uses the least squares method and particle swarm optimization to solve the fitting optimization problem for nonlinear activation functions, yielding the optimal quadratic polynomial coefficients and interval divisions. The resulting quadratic polynomials are quantized with a dynamic-precision fixed-point symmetric quantization method for pure integer inference, which involves only shift operations and multiply-accumulate computations. This paper uses SQPF to compute quadratic polynomial fits of GELU and Swish, named Si-GELU and Si-Swish, and evaluates their quantized inference accuracy. Experimental results show that on the standard ImageNet dataset, Si-GELU causes an accuracy degradation of only 0.09% on classification tasks for ViT models (ViT, DeiT, and Swin), which is 27.3% of the degradation caused by other comparable methods. On MMLU, a mainstream benchmark dataset for large language models, Si-Swish causes a subcategory accuracy degradation of no more than 0.77% and a major-category accuracy degradation of no more than 0.23%. This minimal accuracy loss indicates that the optimal segmented quadratic polynomial fits computed by SQPF can directly replace the full-precision floating-point activation functions in Transformer models, without parameter fine-tuning or retraining.
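The general idea of a segmented quadratic fit followed by integer-only evaluation can be illustrated with a small sketch. This is not the authors' SQPF implementation. The breakpoints are fixed by hand (eight unit-width segments on [-4, 4]) instead of being searched with particle swarm optimization. A single fractional bit width (`FRAC_BITS = 12`) stands in for the dynamic-precision scheme. The tails are handled by a simple pass-through rule. All of these choices, and every function name below, are assumptions made for illustration only.

```python
import math
import numpy as np

def gelu(x):
    """Reference floating-point GELU: x * Phi(x)."""
    x = np.asarray(x, dtype=np.float64)
    return 0.5 * x * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))

# Assumption: hand-picked uniform breakpoints (the paper optimizes them).
EDGES = np.arange(-4.0, 5.0, 1.0)   # 9 edges -> 8 segments
FRAC_BITS = 12                      # assumption: one shared fixed-point precision

def fit_segments(f, edges, samples=256):
    """Least-squares quadratic coefficients (a, b, c) for each segment."""
    coeffs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        xs = np.linspace(lo, hi, samples)
        coeffs.append(np.polyfit(xs, f(xs), 2))  # highest power first
    return np.array(coeffs)

def quantize(values, frac_bits=FRAC_BITS):
    """Symmetric fixed-point quantization with a power-of-two scale."""
    return np.round(np.asarray(values) * (1 << frac_bits)).astype(np.int64)

def int_gelu(q_x, q_coeffs, q_edges, frac_bits=FRAC_BITS):
    """Evaluate the piecewise quadratic using integer arithmetic only."""
    q_x = np.asarray(q_x, dtype=np.int64)
    # Segment lookup by integer comparison against quantized breakpoints.
    idx = np.searchsorted(q_edges, q_x, side="right") - 1
    idx = np.clip(idx, 0, len(q_coeffs) - 1)
    a, b, c = q_coeffs[idx, 0], q_coeffs[idx, 1], q_coeffs[idx, 2]
    # Horner form (a*x + b)*x + c: two multiply-adds, two right shifts.
    t = ((a * q_x) >> frac_bits) + b
    y = ((t * q_x) >> frac_bits) + c
    # Assumed tail handling: GELU(x) ~ 0 far left, GELU(x) ~ x far right.
    y = np.where(q_x < q_edges[0], 0, y)
    y = np.where(q_x >= q_edges[-1], q_x, y)
    return y

coeffs = fit_segments(gelu, EDGES)
q_coeffs, q_edges = quantize(coeffs), quantize(EDGES)

xs = np.linspace(-6.0, 6.0, 2001)
q_x = quantize(xs)
q_out = int_gelu(q_x, q_coeffs, q_edges)

# Compare against float GELU evaluated at the dequantized inputs.
scale = float(1 << FRAC_BITS)
max_err = float(np.max(np.abs(q_out / scale - gelu(q_x / scale))))
print("max abs error:", max_err)
```

Once the inputs are quantized, the evaluation needs only integer comparisons for the segment lookup, two multiply-adds, and two right shifts per element. The comparisons are a convenience of this sketch and may not match how the paper selects segments.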
