Li Bin1,2, Qi Yan-rong1,2, Zhou Qing-lei1,2
Received: 2020-12-01
Revised: 2021-08-17
Online: 2022-07-04
Corresponding author: Qi Yan-rong
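Abstract:
Convolutional neural networks (CNNs) are widely used in image processing. CNN-based object detection models such as YOLO have proven to be state of the art in many applications. CNNs place very high demands on computing power and memory bandwidth and usually have to be deployed on dedicated hardware platforms; FPGAs, with their high performance, low power consumption, and reconfigurability, are effective hardware accelerators for CNNs. Previous FPGA-based object detection accelerators mainly used the conventional convolution algorithm, whose high arithmetic complexity limits accelerator performance. This paper therefore designs an object detection accelerator based on the Winograd algorithm. Taking the connections between modules into account, a module fusion strategy merges the convolution and pooling layer modules, which reduces data movement and off-chip memory accesses and improves the overall performance of the accelerator. Using the YOLO2 model as an example, the data access pattern, pooling kernel, parameter reordering, and datapath optimization are analyzed and designed, and the design is deployed on a U280 board. Experimental results show that the mAP drops by 0.96% after quantization and that performance reaches 249.65 GOP/s, 4.4 times the figure given on the Xilinx website.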
Li Bin, Qi Yan-rong, Zhou Qing-lei. Design and optimization of target detection accelerator based on Winograd algorithm[J]. Acta Electronica Sinica, DOI: 10.12263/DZXB.20201371.
F(4,3) | Multiplication |
---|---|
Winograd algorithm | 6 |
Standard algorithm | 12 |
Arithmetic complexity reduction | 2x |
Table 1 Complexity reduction using the Winograd algorithm
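The 6-versus-12 multiplication counts in Table 1 can be checked with a small sketch of the one-dimensional F(4,3) transform. The matrices `BT`, `G`, and `AT` below are the commonly published ones for interpolation points 0, ±1, ±2 and ∞; they are an assumption for illustration and are not taken from the paper, and the function names are hypothetical.

```python
from fractions import Fraction as Fr

# Assumed transform matrices for 1-D Winograd F(4,3); not taken from the paper.
BT = [
    [4,  0, -5,  0, 1, 0],
    [0, -4, -4,  1, 1, 0],
    [0,  4, -4, -1, 1, 0],
    [0, -2, -1,  2, 1, 0],
    [0,  2, -1, -2, 1, 0],
    [0,  4,  0, -5, 0, 1],
]
G = [
    [Fr(1, 4),  Fr(0),      Fr(0)],
    [Fr(-1, 6), Fr(-1, 6),  Fr(-1, 6)],
    [Fr(-1, 6), Fr(1, 6),   Fr(-1, 6)],
    [Fr(1, 24), Fr(1, 12),  Fr(1, 6)],
    [Fr(1, 24), Fr(-1, 12), Fr(1, 6)],
    [Fr(0),     Fr(0),      Fr(1)],
]
AT = [
    [1, 1,  1, 1,  1, 0],
    [0, 1, -1, 2, -2, 0],
    [0, 1,  1, 4,  4, 0],
    [0, 1, -1, 8, -8, 1],
]


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def winograd_f43(d, g):
    """Four outputs of a 3-tap filter g over a 6-element input tile d."""
    u = matvec(G, g)    # filter transform, can be precomputed offline
    v = matvec(BT, d)   # input transform, constant coefficients only
    m = [ui * vi for ui, vi in zip(u, v)]  # the 6 element-wise multiplications
    return matvec(AT, m)  # output transform


def direct_f43(d, g):
    """Reference sliding dot product: 4 outputs x 3 taps = 12 multiplications."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(4)]


tile = [1, 2, 3, 4, 5, 6]
taps = [1, -2, 3]
assert winograd_f43(tile, taps) == direct_f43(tile, taps)
```

Exact fractions are used so that the two results can be compared with `==`; a hardware design would use fixed-point arithmetic instead.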
Layer | Input Feature | Output Feature | Kernel Size | Winograd algorithm multiplications | Standard algorithm multiplications | Non-blocking algorithm multiplications |
---|---|---|---|---|---|---|
Conv1 | 3 | 544×544 | 3×3 | 21307392 | 42614784 | 127844352 |
Conv2 | 16 | 272×272 | 3×3 | 56819712 | 113639424 | 340918272 |
Conv5 | 32 | 136×136 | 3×3 | 56819712 | 113639424 | 340918272 |
Conv6 | 64 | 68×68 | 3×3 | 56819712 | 113639424 | 340918272 |
Conv7 | 128 | 68×68 | 1×1 | 56819712 | 113639424 | 340918272 |
Conv8 | 64 | 68×68 | 3×3 | 56819712 | 113639424 | 340918272 |
Conv11 | 128 | 34×34 | 3×3 | 60162048 | 120324096 | 340918272 |
Conv12 | 256 | 34×34 | 1×1 | 60162048 | 120324096 | 37879808 |
Conv13 | 128 | 34×34 | 3×3 | 60162048 | 120324096 | 340918272 |
Conv16 | 256 | 17×17 | 3×3 | 66846720 | 133693440 | 340918272 |
Conv17 | 512 | 17×17 | 1×1 | 66846720 | 133693440 | 37879808 |
Conv21 | 256 | 17×17 | 3×3 | 33423360 | 66846720 | 170459136 |
Conv22 | 256 | 17×17 | 1×1 | 4177920 | 8355840 | 2367488 |
Total | | | | 1125113856 | 2250227712 | 4448509952 |
Table 2 Comparison of multiplication counts for selected Conv layers
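Several 3×3 rows of Table 2 appear consistent with a simple count, sketched below under stated assumptions. The non-blocking column matches the direct count H·W·K²·C_in·C_out when the output channel count is taken from the input of the next listed layer (an inference, not stated in the table). The Winograd column matches that count divided by 6 after padding the feature-map width up to a multiple of 4; the factor 6 is simply the ratio observed in the table and is not derived here. Function and parameter names are hypothetical.

```python
import math


def direct_mults(height, width, kernel, c_in, c_out):
    """Multiplications of a direct convolution over the whole feature map."""
    return height * width * kernel * kernel * c_in * c_out


def winograd_column_estimate(height, width, kernel, c_in, c_out,
                             tile=4, observed_ratio=6):
    """Estimate of the Winograd column for 3x3 layers.

    Assumptions: the width is padded to a multiple of `tile` outputs, and
    `observed_ratio` is the reduction factor read off Table 2.
    """
    padded_width = math.ceil(width / tile) * tile
    return direct_mults(height, padded_width, kernel, c_in, c_out) // observed_ratio


# Conv1: 3 input channels, 544x544 output, assumed 16 output channels.
assert direct_mults(544, 544, 3, 3, 16) == 127844352
assert winograd_column_estimate(544, 544, 3, 3, 16) == 21307392

# Conv11: width 34 is padded to 36, which explains the larger Winograd count.
assert winograd_column_estimate(34, 34, 3, 128, 256) == 60162048

# Conv16: width 17 is padded to 20.
assert winograd_column_estimate(17, 17, 3, 256, 512) == 66846720
```

The padding term suggests why layers with 34×34 and 17×17 feature maps show a slightly smaller reduction than layers whose width is a multiple of 4.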
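Stage | Before fusion | Before fusion | Before fusion | Before fusion | After fusion | After fusion |
---|---|---|---|---|---|---|
Module | memRead | Conv | maxPool | memWrite | memRead | memWrite |
Time/s | 0.016 37 | 0.047 72 | 0.032 243 46 | 0.013 35 | 0.020 12 | 0.013 35 |
Total time/s | 0.109 683 46 | | | | 0.033 47 | |
Table 3 Time before and after fusion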
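The totals in Table 3 and the resulting speedup from module fusion can be reproduced with a few lines of arithmetic. This is only a sketch over the figures listed in the table; the dictionary names are illustrative.

```python
# Per-module times in seconds, copied from Table 3.
before_fusion = {
    "memRead": 0.01637,
    "Conv": 0.04772,
    "maxPool": 0.03224346,
    "memWrite": 0.01335,
}
after_fusion = {
    "memRead": 0.02012,
    "memWrite": 0.01335,
}

total_before = sum(before_fusion.values())
total_after = sum(after_fusion.values())
speedup = total_before / total_after

print(f"before: {total_before:.8f} s, after: {total_after:.5f} s")
print(f"speedup from fusion: {speedup:.2f}x")
```

With these figures the fused design takes roughly 0.033 s instead of roughly 0.110 s, a speedup a little above 3×.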
Module | FF | LUT | DSP | BRAM |
---|---|---|---|---|
memRead | 135 576 | 221 478 | 23 | 218 |
memWrite | 20 220 | 31 079 | 27 | 3 |
Table 4 Resource usage per module
Resource | Used | Available | Utilization/% |
---|---|---|---|
FF | 155 796 | 2 607 360 | 5.98 |
LUT | 252 557 | 1 303 680 | 19.40 |
DSP | 50 | 9 024 | 0.55 |
BRAM | 221 | 2 016 | 10.96 |
Table 5 Resource consumption
Metric | Ref. [ | Ref. [ | Ref. [ | Ref. [ | Ref. [ | Xilinx website | This design |
---|---|---|---|---|---|---|---|
CNN model | Yolo | Yolo2 | Yolo2 | Yolo1 | Tiny-Yolo2 | Yolo2 | Yolo2 |
FPGA board | Pynq-z2 | ZCU102 | PYNQ-Z1 Zynq 7z020 | ZC706 | Cyclone V PCIe | U280 | U280 |
Clock frequency/MHz | 125 | 300 | 200 | 200 | 117 | 300 | 400 |
Weight precision | fixed-16 | fixed-16 | - | fixed-32 | fixed-16 | fixed-8 | fixed-8 |
Time/ms | 124 | 288 | 611 | 74.39 | 339 | 66.20 | 27.04 |
Performance/(GOP/s) | 24.17 | 102.2 | 48.23 | 18.82 | 21.6 | 56.65 | 249.65 |
Table 6 Comparison with other work
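The "4.4 times" claim in the abstract follows directly from the last two columns of Table 6 (the Xilinx website figure and this design, both Yolo2 on the U280). A short check, using only the numbers in the table and illustrative variable names:

```python
# Figures copied from Table 6.
xilinx_gops, design_gops = 56.65, 249.65
xilinx_ms, design_ms = 66.20, 27.04

throughput_ratio = design_gops / xilinx_gops
latency_ratio = xilinx_ms / design_ms

print(f"throughput ratio: {throughput_ratio:.2f}x")
print(f"latency ratio:    {latency_ratio:.2f}x")
```

The throughput ratio rounds to 4.4, while the per-image time improves by a smaller factor of about 2.4.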
1 | REDMON J, FARHADI A. YOLO9000: Better, faster, stronger[C]// IEEE Conference on Computer Vision & Pattern Recognition. Honolulu: IEEE, 2017:6517-6525. |
2 | NAKAHARA H, YONEKAWA H, FUJII T, et al. A lightweight YOLOv2: A binarized CNN with a parallel support vector regression for an FPGA[C]// the 2018 ACM/SIGDA International Symposium. Monterey: ACM,2018: 31-40. |
3 | GUO K, ZENG S, YU J, et al. [DL] A survey of FPGA-based neural network inference accelerators[J]. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 2019, 12(1): 1-26. |
4 | BAO C, XIE T, FENG W, et al. A power-efficient optimizing framework FPGA accelerator based on Winograd for YOLO[J]. IEEE Access, 2020, 8: 94307-94317. |
5 | HUANG Y, SHEN J, WANG Z, et al. A high-efficiency FPGA-based accelerator for convolutional neural networks using Winograd algorithm[J]. Journal of Physics Conference Series, 2018, 1026: 012019. |
6 | YANG A, LI Y, SHU H, et al. An OpenCL-based FPGA accelerator for compressed YOLOv2[C]//2019 International Conference on Field-Programmable Technology (ICFPT). Tianjin: IEEE, 2019: 235-238. |
7 | SHI F, LI H, GAO Y, et al. Sparse Winograd convolutional neural networks on small-scale systolic arrays[EB/OL]. (2018-10-03)[2020-12-01]. . |
8 | WU Z, AN H, JIN X, et al. Research and optimization of Winograd fast convolution algorithm based on Intel platform[J]. Computer Research and Development, 2019, 56(4): 825-835. (in Chinese) |
9 | NGUYEN D T, NGUYEN T N, KIM H, et al. A high-throughput and power-efficient FPGA implementation of YOLO CNN for object detection[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2019, 27(8): 1861-1873. |
10 | LIAN X, LIU Z, SONG Z, et al. High-performance FPGA-based CNN accelerator with block-floating-point arithmetic[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2019, 27(8): 1874-1885. |
11 | XIAO Q, LIANG Y, LU L, et al. Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs[C]// Design Automation Conference. Austin: IEEE, 2017: 1-6. |
12 | ALWANI M, CHEN H, FERDMAN M, et al. Fused-layer CNN accelerators[C]//2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). Taipei: IEEE, 2016: 1-12. |
13 | BI F, YANG J. Target detection system design and FPGA implementation based on YOLO v2 algorithm[C]//2019 3rd International Conference on Imaging, Signal Processing and Communication (ICISPC). Singapore: IEEE, 2019: 10-14. |
14 | ZHANG S, CAO J, ZHANG Q, et al. An FPGA-based reconfigurable CNN accelerator for YOLO[C]//2020 IEEE 3rd International Conference on Electronics Technology (ICET). Chengdu: IEEE, 2020: 74-78. |
15 | LU T Y, CHIN H H, WU H I, et al. A very compact embedded CNN processor design based on logarithmic computing[EB/OL]. (2020-10-13)[2020-12-01]. . |
16 | ZHAO R, NIU X, WU Y, et al. Optimizing CNN-based object detection algorithms on embedded FPGA platforms[C]//International Symposium on Applied Reconfigurable Computing. Rennes: Springer, 2017: 255-267. |
17 | WAI Y J, MOHD YUSSOF Z BIN, SALIM S I BIN, et al. Fixed point implementation of Tiny-Yolo-v2 using OpenCL on FPGA[J]. International Journal of Advanced Computer Science and Applications, 2018, 9(10): 506-512. |
18 | QI Y R. Research on acceleration and optimization of deep learning image recognition based on FPGA[D]. Zhengzhou: Zhengzhou University, 2021. (in Chinese) |