Operator Fusion Method and Hardware Architecture Design Based on Non-Standard Operators

WANG Ying; GAO Lan; ZHANG Zhe; LIU Xin; WU Yi-xiong; ZHANG Wei-gong

doi:10.12263/DZXB.20250312

PAPERS | Updated: 2025-12-27

- ACTA ELECTRONICA SINICA Vol. 53, Issue 9, Pages: 3299-3309 (2025)
- Author affiliations:
  
  1. College of Information Engineering, Capital Normal University, Beijing 100048
  2. School of Mathematical Science, Capital Normal University, Beijing 100048
  3. Software College of Shanxi Agricultural University, Jinzhong, Shanxi 030801
- Funding:
  
  National Natural Science Foundation of China (62202317)
- DOI: 10.12263/DZXB.20250312
  CLC: TP302.8
- Received: 22 April 2025,
  
  Accepted: 08 September 2025,
  
  Published: 25 September 2025

WANG Ying, GAO Lan, ZHANG Zhe, et al. Operator Fusion Method and Hardware Architecture Design Based on Non-Standard Operators[J]. Acta Electronica Sinica, 2025, 53(09): 3299-3309. DOI: 10.12263/DZXB.20250312

Abstract

Traditional operator fusion algorithms fail when fusion must cross computing units in a heterogeneous computing system. To address this, this paper proposes an optimized operator fusion strategy and implements a hardware design for the new fusion algorithm. Starting from the original design intent of traditional operator fusion, we analyze how operator fusion coverage affects inference performance when deep learning algorithms are deployed on edge-side heterogeneous computing systems, explore the feasibility of cross-unit operator fusion, and design an improved fusion algorithm model that raises fusion coverage. In addition, we construct a heterogeneous computing platform composed of a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), and a DLA (Deep Learning Accelerator), which provides a more tightly coupled multi-level shared memory structure for the improved fusion strategy. Experimental results show that the improved strategy effectively increases operator fusion coverage compared with the fusion algorithm before optimization. In object detection inference experiments on a Xilinx FPGA (Field-Programmable Gate Array) development board, the proposed design achieves a 62.67% inference performance improvement and a 2.68× speedup for YOLOX-Nano, and a 71.10% inference performance improvement and a 3.46× speedup for YOLOv5s.
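
The percentage and speedup figures reported for each network appear to be two views of the same measurement. The sketch below checks this under one assumption that the abstract does not state explicitly: that "performance improvement" means the fractional reduction in inference time, so that speedup = 1 / (1 - reduction). The function name is hypothetical and chosen only for illustration.

```python
def speedup_from_time_reduction(reduction_pct: float) -> float:
    """Return the speedup implied by reducing run time by reduction_pct percent.

    Assumption (not stated in the abstract): the reported "performance
    improvement" is the fractional reduction in inference time.
    """
    if not 0.0 <= reduction_pct < 100.0:
        raise ValueError("reduction_pct must be in the range [0, 100)")
    remaining_fraction = 1.0 - reduction_pct / 100.0
    return 1.0 / remaining_fraction


# Figures taken from the abstract: (model, reported improvement in percent).
reported = (("YOLOX-Nano", 62.67), ("YOLOv5s", 71.10))

for model, pct in reported:
    print(f"{model}: {speedup_from_time_reduction(pct):.2f}x")
```

Under this reading, the computed values round to the 2.68× and 3.46× speedups given in the abstract, which suggests the two figures are consistent with each other rather than independent results.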
