An FPGA Implementation Method for Configurable CNN Co-Accelerator

JIAN Qiang; ZHANG Pei-yong; WANG Xue-jie

doi:10.3969/j.issn.0372-2112.2019.07.017

Research article | Updated: 2025-07-16

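- An FPGA Implementation Method for Configurable CNN Co-Accelerator
- Acta Electronica Sinica, 2019, Vol. 47, No. 7, pp. 1525-1531
- Author affiliations:
  
  1. College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou, Zhejiang 310027
  2. Zhejiang University City College, Hangzhou, Zhejiang 310015
- Funding:
  
  Research on key technologies for on-chip measurement of signals with sub-picosecond precision for 14 nm and below processes (No. 61474098); Rapid wafer defect inspection for integrated circuits at 10 nm and below processes (No. 61674129)
- CLC number: TN47
- Published online: 2019-07-25; print publication: 2019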

JIAN Qiang, ZHANG Pei-yong, WANG Xue-jie. An FPGA Implementation Method for Configurable CNN Co-Accelerator[J]. Acta Electronica Sinica, 2019, 47(7): 1525-1531. DOI: 10.3969/j.issn.0372-2112.2019.07.017.

Abstract

Convolution dominates the computational cost of convolutional neural networks (CNNs) and leads to long computation times. To address this, this paper proposes an FPGA implementation of a configurable CNN co-accelerator with an eight-stage pipeline. The pooling (subsampling) controller is embedded in the convolution controller so that control logic is reused and more resources are left for the computation modules. A mirror-tree structure is used to increase parallelism, and the Map algorithm is adopted to raise computational density while also speeding up computation. Experimental results show that, at 32-bit fixed-point/floating-point precision, the implementation reaches 22.74 GOPS. Compared with the MAPLE accelerator, computational density is 283.3% higher and computation speed is 224.9% higher; compared with the MCA (Memory-Centric Accelerator), computational density is 14.47% higher and computation speed is 33.76% higher. At 8- to 16-bit fixed-point precision, performance reaches 58.3 GOPS, and computational density is 8.5% higher than that of the LBA (Layer-Based Accelerator).
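
The percentage gains quoted in the abstract may be easier to compare as multiplicative factors. The short Python sketch below does that conversion, assuming each "increased by p%" figure means the new value equals the baseline times (1 + p/100); the abstract does not state this convention explicitly, and the function and dictionary names are illustrative only.

```python
def gain_to_factor(percent_increase: float) -> float:
    """Convert an 'increased by p%' figure into a multiplicative factor.

    Assumption: new = baseline * (1 + p / 100).
    """
    return 1.0 + percent_increase / 100.0


# Percent increases quoted in the abstract, keyed by baseline and metric.
reported_gains = {
    "MAPLE density": 283.3,
    "MAPLE speed": 224.9,
    "MCA density": 14.47,
    "MCA speed": 33.76,
    "LBA density": 8.5,
}

# Rounded factors, e.g. a 283.3% increase corresponds to roughly 3.8x.
factors = {name: round(gain_to_factor(p), 4) for name, p in reported_gains.items()}

for name, factor in factors.items():
    print(f"{name}: {factor:.3f}x")
```

Under this reading, the gains over MAPLE are several-fold, while those over MCA and LBA are incremental.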
