Acta Electronica Sinica (电子学报) ›› 2022, Vol. 50 ›› Issue (8): 1811-1818. DOI: 10.12263/DZXB.20211514

• Research Article •

A Sparsity-Aware Convolutional Neural Network Accelerator with Flexible Parallelism

YUAN Hai-ying, ZENG Zhi-yong, CHENG Jun-peng

  1. Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
  • Received: 2021-11-12 Revised: 2022-06-08 Online: 2022-08-25
    • About the authors:
    • YUAN Hai-ying (female) was born in Langzhong, Sichuan, in 1976. She holds a Ph.D. and is an associate professor at the Faculty of Information Technology, Beijing University of Technology. Her research interests include energy-efficient computing systems for artificial intelligence applications, weak-signal detection and intelligent information processing, and fault tolerance and communication-bus technology for electronic systems.
      ZENG Zhi-yong (male) was born in Beijing in 1997. He is a master's student at the Faculty of Information Technology, Beijing University of Technology. His research interest is FPGA-based convolutional neural network accelerators. E-mail: m_x_zy@126.com
      CHENG Jun-peng (male) was born in Yancheng, Jiangsu, in 1995. He is a master's student at the Faculty of Information Technology, Beijing University of Technology. His research interest is lightweight convolutional neural networks. E-mail: chengjp@emails.bjut.edu.cn

A Sparsity-Aware Convolutional Neural Network Accelerator with Flexible Parallelism

YUAN Hai-ying, ZENG Zhi-yong, CHENG Jun-peng   

  1. Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
  • Received: 2021-11-12 Revised: 2022-06-08 Online: 2022-08-25 Published: 2022-09-08

Abstract:

Large-scale convolutional neural networks have high computational complexity and heavy resource overhead, which greatly raises the hardware deployment cost of deep learning algorithms. Fully exploiting the information redundancy of sparse inter-layer activations during model inference reduces inference latency and power consumption at low resource cost and with almost lossless network accuracy, and thus provides an efficient accelerator solution. To address the low utilization of computing modules caused by overly coarse control granularity in sparse convolutional neural network accelerators, this paper proposes an FPGA-based sparse convolutional neural network accelerator architecture with flexible parallelism. Based on the idea of operation clustering, the convolution computing modules are scheduled flexibly, and the parallelism of input channels and output activations is adjusted online according to the structure of each convolutional layer; a parallel propagation scheme for the input data is designed according to the data consistency of parallel output-activation computation. The proposed accelerator architecture is implemented on a Xilinx VC709 device; it contains 1 024 multiply-accumulate units and provides a theoretical peak throughput of 409.6 GOP/s. On the VGG-16 model it reaches an actual speed of 325.8 GOP/s, equivalent to 794.63 GOP/s for the accelerator before sparse-activation optimization, which is more than 4.6 times the performance of the baseline model.
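The online parallelism adjustment described above can be illustrated with a toy software sketch. Everything in it is an assumption for illustration only (the names `Tc`/`To`, the power-of-two search, and the utilization model); it is not the paper's actual hardware scheduler:

```python
import math

TOTAL_MACS = 1024  # MAC units in the accelerator (from the abstract)

def pick_parallelism(in_channels, out_acts):
    """Pick (Tc, To): input-channel x output-activation parallelism.

    Searches power-of-two splits of the MAC array and returns the split
    with the highest utilization of the 1 024 MAC units for a layer.
    """
    best = (1, 1, 0.0)
    tc = 1
    while tc <= TOTAL_MACS:
        to = max(1, min(TOTAL_MACS // tc, out_acts))
        # Cycles needed when channels/activations are tiled by (tc, to).
        cycles = math.ceil(in_channels / tc) * math.ceil(out_acts / to)
        # Fraction of MAC slots doing useful work over the whole layer.
        util = (in_channels * out_acts) / (cycles * TOTAL_MACS)
        if util > best[2]:
            best = (tc, to, util)
        tc *= 2
    return best

# Deep VGG-16 layer: many channels, few activations -> channel parallelism.
print(pick_parallelism(in_channels=512, out_acts=14 * 14))
# First layer: 3 channels, many activations -> activation parallelism.
print(pick_parallelism(in_channels=3, out_acts=224 * 224))
```

The two calls show why a fixed split wastes hardware: a deep layer keeps all units busy only with high channel parallelism, while the first layer needs high output-activation parallelism, which motivates adjusting the split per layer.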

Key words: FPGA, convolutional neural network, hardware acceleration, sparsity awareness, parallel computing

Abstract:

Convolutional neural networks involve high computational complexity and excessive hardware resources, which greatly increases the hardware deployment cost of deep learning algorithms. Making full use of the information redundancy of sparse activations between layers is a promising way to reduce inference latency and power consumption with low resource overhead and almost lossless network accuracy. To solve the low utilization of the operation modules caused by coarse-grained control in sparse convolutional neural network accelerators, a sparsity-aware accelerator with flexible parallelism is designed on FPGA. The convolution operation modules are scheduled flexibly based on the idea of operation clustering, and the parallelism of input channels and output activations is adjusted online. In addition, a parallel propagation mode of the input data is designed according to the data consistency of parallel output-activation computation. The proposed hardware architecture is implemented on a Xilinx VC709. It contains 1 024 multiply-accumulate units and provides 409.6 GOP/s of peak computing power; the operation speed reaches 325.8 GOP/s on the VGG-16 model, which is equivalent to 794.63 GOP/s for an accelerator without sparse-activation optimization, more than 4.6 times the performance of the baseline model.
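The gap between peak and equivalent throughput comes from skipping zero activations. A minimal Python sketch (the synthetic data and all parameter values are illustrative assumptions, not measurements from the paper) shows how activation sparsity translates into an effective speedup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic post-ReLU feature map: thresholding is a crude stand-in
# for the zeros that ReLU introduces between convolutional layers.
acts = rng.standard_normal((64, 28, 28))
acts[acts < 0.5] = 0.0

M, K = 128, 3  # output channels and kernel size (illustrative values)

# Ignoring border effects, every input activation feeds M * K * K MACs.
dense_macs = acts.size * M * K * K
# A sparsity-aware datapath only issues MACs for nonzero activations.
sparse_macs = np.count_nonzero(acts) * M * K * K

speedup = dense_macs / sparse_macs
print(f"nonzero ratio: {np.count_nonzero(acts) / acts.size:.2f}")
print(f"effective speedup from skipping zeros: {speedup:.2f}x")
```

Because only nonzero activations consume MAC cycles, the same datapath finishes a layer in a fraction of the dense cycle count, which is how a design with a 409.6 GOP/s peak can deliver throughput equivalent to a much larger dense accelerator.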

Key words: field programmable gate array, convolutional neural network, hardware acceleration, sparsity awareness, parallel computation
