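1. College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou, Zhejiang 310027
2. Zhejiang University City College, Hangzhou, Zhejiang 310015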
Published online: 2019-07-25
Published in print: 2019
JIAN Qiang, ZHANG Pei-yong, WANG Xue-jie. An FPGA Implementation Method for Configurable CNN Co-Accelerator[J]. Acta Electronica Sinica, 2019, 47(7): 1525-1531. DOI: 10.3969/j.issn.0372-2112.2019.07.017.
Convolution operations in convolutional neural networks (CNNs) have high computational complexity, which leads to long computation times. To address this problem, this paper proposes an FPGA implementation method for a configurable CNN co-accelerator with an eight-stage pipeline structure. As a reuse technique, the pooling controller is embedded in the convolution controller, which leaves more resources for the computation module. A mirror-tree structure is used to increase parallelism, and the Map algorithm is adopted to increase computational density while also speeding up computation. Experimental results show that at 32-bit fixed-point/floating-point precision the implementation reaches a computing performance of 22.74 GOPS. Compared with the MAPLE accelerator, computational density is increased by 283.3% and computation speed by 224.9%. Compared with the MCA (Memory-Centric Accelerator), computational density is increased by 14.47% and computation speed by 33.76%. At 8-bit to 16-bit fixed-point precision, performance reaches 58.3 GOPS, and computational density is increased by 8.5% compared with the LBA (Layer-Based Accelerator).
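The percentage gains quoted in the abstract can be easier to compare as multiplicative factors. The short Python sketch below does that conversion. It assumes that "increased by P%" means a relative increase over the baseline accelerator, so the factor is 1 + P/100; the helper name `gain_to_ratio` and the dictionary layout are illustrative and do not come from the paper. No baseline absolute values are derived, since the abstract does not state them.

```python
def gain_to_ratio(percent_increase: float) -> float:
    """Convert 'increased by P%' into a factor new/baseline.

    Assumption: P is a relative increase over the baseline,
    so a 100% increase corresponds to a factor of 2.0.
    """
    return 1.0 + percent_increase / 100.0


# Percentage gains exactly as quoted in the abstract.
reported_gains = {
    ("MAPLE", "computational density"): 283.3,
    ("MAPLE", "computation speed"): 224.9,
    ("MCA", "computational density"): 14.47,
    ("MCA", "computation speed"): 33.76,
    ("LBA", "computational density"): 8.5,
}

# Factor of the proposed design relative to each baseline.
ratios = {key: gain_to_ratio(p) for key, p in reported_gains.items()}

if __name__ == "__main__":
    for (baseline, metric), ratio in ratios.items():
        print(f"{metric} vs {baseline}: {ratio:.4f}x")
```

Under this reading, the 283.3% density gain over MAPLE corresponds to a factor of about 3.83, while the 8.5% gain over LBA corresponds to a factor of about 1.09.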