A Sparsity-Aware Convolutional Neural Network Accelerator with Flexible Parallelism

YUAN Hai-ying; ZENG Zhi-yong; CHENG Jun-peng

doi:10.12263/DZXB.20211514


PAPERS | Updated: 2025-12-08

- ACTA ELECTRONICA SINICA Vol. 50, Issue 8, Pages: 1811-1818 (2022)
- Author affiliation: Faculty of Information Technology, Beijing University of Technology, Beijing 100124
- DOI: 10.12263/DZXB.20211514
  CLC: TN47
- Received: 12 November 2021, Revised: 8 June 2022, Published: 25 August 2022

YUAN Hai-ying, ZENG Zhi-yong, CHENG Jun-peng. A Sparsity-Aware Convolutional Neural Network Accelerator with Flexible Parallelism[J]. ACTA ELECTRONICA SINICA, 2022, 50(08): 1811-1818. DOI: 10.12263/DZXB.20211514.

Abstract

Large-scale convolutional neural networks have high computational complexity and heavy resource requirements, which greatly increases the hardware deployment cost of deep learning algorithms. Exploiting the information redundancy of sparse activations between layers during inference is a promising way to reduce inference latency and power consumption with low resource overhead and almost no loss of network accuracy. To address the low utilization of compute modules caused by coarse-grained control in sparse convolutional neural network accelerators, this paper proposes an FPGA-based sparsity-aware accelerator architecture with flexible parallelism. The convolution modules are scheduled flexibly based on the idea of operation clusters, and the parallelism across input channels and output activations is adjusted online according to the structure of each convolutional layer. In addition, a parallel propagation scheme for input data is designed according to the data consistency among output activations computed in parallel. The proposed hardware architecture is implemented on a Xilinx VC709 target device. It contains 1,024 multiply-accumulate units and provides a theoretical peak of 409.6 GOP/s. The measured speed on the VGG-16 model reaches 325.8 GOP/s, which is equivalent to 794.63 GOP/s on an accelerator without sparse-activation optimization, and the performance is more than 4.6 times that of the baseline model.
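The throughput figures in the abstract can be cross-checked with simple arithmetic. The sketch below is not taken from the paper: it assumes the common convention that one multiply-accumulate counts as two operations, and the function names are hypothetical. Under that assumption, the quoted peak implies a clock of about 200 MHz, the measured VGG-16 speed corresponds to roughly 79.5% of peak, and the dense-equivalent figure is about 2.44 times the measured one.

```python
# Back-of-the-envelope check of the throughput figures quoted in the abstract.
# Assumption (not stated in the source): one multiply-accumulate = 2 operations.
# All function names here are illustrative, not from the paper.

NUM_MACS = 1024            # multiply-accumulate units (from the abstract)
PEAK_GOPS = 409.6          # theoretical peak, GOP/s (from the abstract)
MEASURED_GOPS = 325.8      # measured on VGG-16, GOP/s (from the abstract)
EQUIVALENT_GOPS = 794.63   # dense-equivalent throughput, GOP/s (from the abstract)
OPS_PER_MAC = 2            # assumed convention: 1 multiply + 1 add


def implied_clock_mhz(peak_gops: float, num_macs: int, ops_per_mac: int) -> float:
    """Clock frequency in MHz at which num_macs units would reach peak_gops."""
    ops_per_cycle = num_macs * ops_per_mac
    return peak_gops * 1e9 / ops_per_cycle / 1e6


def utilization(measured_gops: float, peak_gops: float) -> float:
    """Fraction of the theoretical peak that is actually achieved."""
    return measured_gops / peak_gops


def sparsity_gain(equivalent_gops: float, measured_gops: float) -> float:
    """Ratio of dense-equivalent throughput to physically executed throughput."""
    return equivalent_gops / measured_gops


if __name__ == "__main__":
    print(f"implied clock: {implied_clock_mhz(PEAK_GOPS, NUM_MACS, OPS_PER_MAC):.1f} MHz")
    print(f"utilization:   {utilization(MEASURED_GOPS, PEAK_GOPS):.1%}")
    print(f"sparsity gain: {sparsity_gain(EQUIVALENT_GOPS, MEASURED_GOPS):.2f}x")
```

The sparsity gain is the ratio of work a dense accelerator would have needed to the work actually executed, so it reflects how many zero-activation operations were skipped rather than any change in clock or unit count.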


