Research on Winograd Convolution Acceleration for Modern GPU

TONG Gan; HUANG Li-bo; LYU Ya-shuai

doi:10.12263/DZXB.20211641


PAPERS | Updated: 2026-04-10

- ACTA ELECTRONICA SINICA Vol. 52, Issue 1, Pages: 244-257(2024)
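- Author affiliations:
  
  1. College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan 410073
  2. Huawei Technologies Co., Ltd., Beijing 100031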
- Funding:
  
  National Natural Science Foundation of China(61872374)
- DOI: 10.12263/DZXB.20211641
  CLC: TP183
- Received: 09 December 2021
  
  Revised: 03 November 2022
  
  Published: 25 January 2024
TONG Gan, HUANG Li-bo, LYU Ya-shuai. Research on Winograd Convolution Acceleration for Modern GPU[J]. ACTA ELECTRONICA SINICA, 2024, 52(01): 244-257. DOI: 10.12263/DZXB.20211641.


Abstract

Convolution is an indispensable part of modern convolutional neural networks and is also their most time-consuming operation. To address the performance problem of convolution operators, fast convolution algorithms, including the fast Fourier transform (FFT) and Winograd, have been proposed. Winograd convolution improves the inference performance of small convolution kernels and is currently the mainstream implementation in convolutional neural networks. However, its implementation in many highly optimized deep neural network libraries and deep learning compilers is relatively inefficient. Because of the complex data dependencies among the four stages of Winograd convolution, optimizing it for GPUs is very challenging. This paper optimizes the performance of the Winograd convolution operator for modern GPU architectures. It proposes an equivalent transformation of the Winograd computation stage and a synchronization-free implementation of that stage on Tensor Cores, and further proposes PKF (Partial Kernel Fusion), a partial kernel fusion method that exploits the different levels of the GPU memory hierarchy. Based on TVM (Tensor Virtual Machine) and a code reconstructor named PKF-Reconstructor (Partial Kernel Fusion Reconstructor), a high-performance Winograd convolution is implemented. An evaluation on convolution operators taken from real-world convolutional neural networks shows that the proposed algorithm achieves a 7.58x to 13.69x performance improvement over cuDNN.
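The four stages of Winograd convolution are commonly described as the filter transform, the input transform, the element-wise multiplication, and the output transform. The sketch below illustrates them for the smallest one-dimensional case, F(2,3), using the standard textbook formulation. It is not the GPU implementation proposed in the paper, and the function names are hypothetical, chosen only for illustration.

```python
# Minimal 1D Winograd F(2,3) sketch: 2 outputs of a 3-tap filter from a
# 4-element input tile. Illustrative only; not the paper's GPU kernels.

def filter_transform(g):
    """Stage 1: transform the 3-tap filter into 4 values."""
    g0, g1, g2 = g
    return [g0, (g0 + g1 + g2) / 2, (g0 - g1 + g2) / 2, g2]

def input_transform(d):
    """Stage 2: transform the 4-element input tile into 4 values."""
    d0, d1, d2, d3 = d
    return [d0 - d2, d1 + d2, d2 - d1, d1 - d3]

def output_transform(m):
    """Stage 4: combine the 4 products into 2 outputs."""
    m1, m2, m3, m4 = m
    return [m1 + m2 + m3, m2 - m3 - m4]

def winograd_f2_3(d, g):
    u = filter_transform(g)
    v = input_transform(d)
    # Stage 3: element-wise multiplication (4 multiplies instead of 6).
    m = [ui * vi for ui, vi in zip(u, v)]
    return output_transform(m)

def direct_f2_3(d, g):
    """Reference: direct sliding-window correlation (6 multiplies)."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]

tile = [1.0, 2.0, 3.0, 4.0]
taps = [1.0, 2.0, 3.0]
print(winograd_f2_3(tile, taps), direct_f2_3(tile, taps))
```

In a full 2D layer the element-wise stage is batched over channels and tiles, so it becomes a set of matrix multiplications. Each stage consumes the output of the previous one, which is the source of the data dependencies that make kernel fusion on GPUs difficult.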
