Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
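YUAN Hai-ying, female, Ph.D., was born in 1976 in Langzhong, Sichuan. She is currently an associate professor at the Faculty of Information Technology, Beijing University of Technology. Her main research interests are energy-efficient computing systems for artificial intelligence applications, weak signal detection and intelligent information processing, and fault tolerance and communication bus technology for electronic systems.
ZENG Zhi-yong, male, was born in 1997 in Beijing. He is currently a master's student at the Faculty of Information Technology, Beijing University of Technology. His main research interest is FPGA-based convolutional neural network accelerators. E-mail: m_x_zy@126.com
CHENG Jun-peng, male, was born in 1995 in Yancheng, Jiangsu. He is currently a master's student at the Faculty of Information Technology, Beijing University of Technology. His main research interest is lightweight convolutional neural networks. E-mail: chengjp@emails.bjut.edu.cn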
Received: 2021-11-12,
Revised: 2022-06-08,
Published in print: 2022-08-25
YUAN Hai-ying, ZENG Zhi-yong, CHENG Jun-peng. A Sparsity-Aware Convolutional Neural Network Accelerator with Flexible Parallelism[J]. Acta Electronica Sinica, 2022, 50(08): 1811-1818. DOI: 10.12263/DZXB.20211514.
Large-scale convolutional neural networks have high computational complexity and heavy resource overhead, which greatly increases the hardware deployment cost of deep learning algorithms. Fully exploiting the information redundancy of sparse activations between layers during inference is a promising way to reduce inference latency and power consumption with low resource overhead and almost no loss in network accuracy. To address the low utilization of operation modules caused by coarse-grained control in sparse convolutional neural network accelerators, this paper proposes an FPGA-based sparsity-aware accelerator architecture with flexible parallelism. The convolution operation modules are flexibly scheduled based on the idea of operation clusters, and the parallelism of input channels and output activations is adjusted online according to the structure of each convolutional layer. In addition, a parallel propagation scheme for input data is designed according to the data consistency of parallel output-activation computation. The proposed hardware architecture is implemented on a Xilinx VC709 target device. It contains 1024 multiply-accumulate units and provides a theoretical peak of 409.6 GOP/s. The measured speed on the VGG-16 model reaches 325.8 GOP/s, which is equivalent to 794.63 GOP/s on an accelerator without sparse-activation optimization, and the performance is more than 4.6 times that of the baseline model.
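The throughput figures in the abstract can be related to each other with a little arithmetic. The sketch below is only a back-of-the-envelope check, not the authors' evaluation method. It assumes that one multiply-accumulate counts as two operations (a common convention that the abstract does not state), and that the "equivalent" figure means the dense operation count divided by the same runtime. The function names are hypothetical and chosen for illustration.

```python
# Back-of-the-envelope check of the throughput figures quoted in the abstract.
# Assumptions (not stated in the source):
#   * one MAC = 2 operations (one multiply + one add);
#   * "equivalent GOP/s" = dense operation count / measured runtime.

NUM_MACS = 1024           # multiply-accumulate units (from the abstract)
PEAK_GOPS = 409.6         # theoretical peak (from the abstract)
MEASURED_GOPS = 325.8     # measured on VGG-16 (from the abstract)
EQUIVALENT_GOPS = 794.63  # dense-equivalent throughput (from the abstract)
OPS_PER_MAC = 2           # assumed counting convention


def implied_clock_mhz(peak_gops, num_macs, ops_per_mac=OPS_PER_MAC):
    """Clock frequency (MHz) at which num_macs units reach peak_gops."""
    return peak_gops * 1e9 / (num_macs * ops_per_mac) / 1e6


def mac_utilization(measured_gops, peak_gops):
    """Fraction of the theoretical peak actually achieved."""
    return measured_gops / peak_gops


def sparsity_gain(equivalent_gops, measured_gops):
    """Ratio of dense-equivalent work to the work actually executed."""
    return equivalent_gops / measured_gops


def implied_skipped_fraction(gain):
    """Share of dense operations skipped, under the assumption above."""
    return 1.0 - 1.0 / gain


clock = implied_clock_mhz(PEAK_GOPS, NUM_MACS)
util = mac_utilization(MEASURED_GOPS, PEAK_GOPS)
gain = sparsity_gain(EQUIVALENT_GOPS, MEASURED_GOPS)
skipped = implied_skipped_fraction(gain)

print(f"implied clock:         {clock:.1f} MHz")
print(f"MAC utilization:       {util:.1%}")
print(f"sparsity gain:         {gain:.2f}x")
print(f"implied skipped share: {skipped:.1%}")
```

Under these assumptions, the peak figure corresponds to a clock of about 200 MHz, the measured speed to roughly 80% utilization of the MAC array, and the dense-equivalent figure to skipping a little under 60% of the operations in VGG-16. The 4.6x comparison against the baseline model is not reproduced here because the abstract does not give the baseline's throughput.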