An FPGA Implementation Method for Configurable CNN Co-Accelerator

JIAN Qiang; ZHANG Pei-yong; WANG Xue-jie

doi:10.3969/j.issn.0372-2112.2019.07.017

- Acta Electronica Sinica Vol. 47, Issue 7, Pages: 1525-1531(2019)
- Author affiliations:
  1. College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou, Zhejiang 310027
  2. Zhejiang University City College, Hangzhou, Zhejiang 310015
- DOI: 10.3969/j.issn.0372-2112.2019.07.017
  CLC: TN47
- Published Online: 25 July 2019
  Published: 2019

JIAN Qiang, ZHANG Pei-yong, WANG Xue-jie. An FPGA Implementation Method for Configurable CNN Co-Accelerator[J]. Acta Electronica Sinica, 2019, 47(7): 1525-1531. DOI: 10.3969/j.issn.0372-2112.2019.07.017.

Abstract

The high complexity of convolution operations makes convolutional neural networks (CNNs) slow to compute. To address this problem, an FPGA implementation of a configurable CNN co-accelerator with an eight-stage pipeline structure is proposed. The pooling controller is embedded in the convolution controller so that control logic is reused, which leaves more resources for the computational module. A mirror-tree structure is designed to increase parallelism, and the Map algorithm is adopted to raise computational density while also speeding up calculation. Experimental results show that the implementation reaches 22.74 GOPS at 32-bit fixed-point/floating-point precision. Compared with the MAPLE accelerator, computational density is increased by 283.3% and calculation speed by 224.9%. Compared with MCA (Memory-Centric Accelerator), computational density is increased by 14.47% and calculation speed by 33.76%. At 8-bit to 16-bit fixed-point precision, performance reaches 58.3 GOPS, and computational density is 8.5% higher than that of LBA (Layer-Based Accelerator).
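The reported percentage gains are easier to read as multiplicative factors. The short Python sketch below converts each "increased by P%" figure into a factor of 1 + P/100 and back-computes the baseline throughput that each speed gain would imply. It rests on two assumptions that the abstract does not state explicitly: that "calculation speed" means throughput in GOPS, and that the MAPLE and MCA comparisons refer to the 22.74 GOPS result at 32-bit precision. The function and variable names are illustrative only.

```python
# Illustrative arithmetic only; names below are hypothetical, not from the paper.

PROPOSED_GOPS_32BIT = 22.74  # throughput reported in the abstract

# Percentage gains reported in the abstract for the 32-bit comparisons.
REPORTED_GAINS = {
    "MAPLE": {"density_pct": 283.3, "speed_pct": 224.9},
    "MCA": {"density_pct": 14.47, "speed_pct": 33.76},
}


def gain_factor(percent_increase: float) -> float:
    """Turn 'increased by P%' into a multiplicative factor (1 + P/100)."""
    return 1.0 + percent_increase / 100.0


def implied_baseline(new_value: float, percent_increase: float) -> float:
    """Baseline value that yields new_value after the stated increase."""
    return new_value / gain_factor(percent_increase)


def summarize() -> dict:
    """Collect gain factors and implied baseline throughput per accelerator."""
    summary = {}
    for name, gains in REPORTED_GAINS.items():
        summary[name] = {
            "density_factor": gain_factor(gains["density_pct"]),
            "speed_factor": gain_factor(gains["speed_pct"]),
            # Assumption: the speed gain is measured against 22.74 GOPS.
            "implied_baseline_gops": implied_baseline(
                PROPOSED_GOPS_32BIT, gains["speed_pct"]
            ),
        }
    return summary


if __name__ == "__main__":
    for name, row in summarize().items():
        print(
            f"{name}: density x{row['density_factor']:.3f}, "
            f"speed x{row['speed_factor']:.3f}, "
            f"implied baseline {row['implied_baseline_gops']:.2f} GOPS"
        )
```

Under these assumptions, a 283.3% increase in density corresponds to a factor of about 3.83, and the speed gains imply baselines of roughly 7 GOPS for MAPLE and 17 GOPS for MCA. These baselines are derived here by arithmetic and are not figures quoted in the abstract.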
