ConvFormer: Vision Backbone Network Based on Transformer

HU Jie; CHANG Min-jie; XU Bo-yuan; XU Wen-cai

doi:10.12263/DZXB.20220735


Research article | Updated: 2026-04-10

- Acta Electronica Sinica, 2024, Vol. 52, No. 1, pp. 46-57
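- Author affiliations:
  1. School of Automotive Engineering, Wuhan University of Technology, Wuhan, Hubei 430070
  2. Hubei Key Laboratory of Advanced Technology for Automotive Components, Wuhan University of Technology, Wuhan, Hubei 430070
  3. Hubei Collaborative Innovation Center for Automotive Components Technology, Wuhan University of Technology, Wuhan, Hubei 430070
  4. Hubei Research Center for New Energy and Intelligent Connected Vehicle, Wuhan University of Technology, Wuhan, Hubei 430070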
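- About the authors:
  HU Jie, male, born in 1984, a native of Yongzhou, Hunan. Professor and doctoral supervisor at the School of Automotive Engineering, Wuhan University of Technology. His main research interests include automotive control and diagnostics, Internet of Vehicles and big data, intelligent driving, and intelligent chassis. E-mail: auto_hj@163.com
  CHANG Min-jie, male, born in 1999, a native of Honghu, Hubei. Master's student at the School of Automotive Engineering, Wuhan University of Technology. His main research interests are object detection and object tracking. E-mail: 1468139558@qq.com
  XU Bo-yuan, male, born in 1998, a native of Xiantao, Hubei. Master's student at the School of Automotive Engineering, Wuhan University of Technology. His main research interest is object detection. E-mail: 1903086417@qq.com
  XU Wen-cai, male, born in 1995, a native of Weifang, Shandong. Doctoral student at the School of Automotive Engineering, Wuhan University of Technology. His main research interests are 3D object detection, object tracking, and scene understanding. E-mail: wencaixu_val@163.com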
- Funding:
  Hubei Province Major Science and Technology Project (2020AAA001; 2022AAA001)
- DOI: 10.12263/DZXB.20220735
  CLC number: TP391.41
- Received: 2022-06-27
  Revised: 2023-04-11
  Published in print: 2024-01-25
- Citation:
  HU Jie, CHANG Min-jie, XU Bo-yuan, et al. ConvFormer: Vision Backbone Network Based on Transformer[J]. Acta Electronica Sinica, 2024, 52(01): 46-57. DOI: 10.12263/DZXB.20220735.

Abstract

Mainstream Transformer-based networks compute self-attention only on individual input patches (pixel blocks) and ignore the information exchange between different patches, and their single input scale blurs local feature details. To address these problems, this paper proposes ConvFormer, a Transformer-based backbone network for vision tasks. ConvFormer aggregates semantic information across multi-scale patches through two purpose-built modules, channel-shuffle and multi-scale attention (CSMS) and dynamic relative position coding (DRPC), and introduces depthwise convolution into the feed-forward network to improve the network's local modeling capability. Image classification, object detection, and semantic segmentation experiments were conducted on the public datasets ImageNet-1K, COCO 2017, and ADE20K, respectively. Compared with RegNetY-4G, Swin-Tiny, and ResNet50, the best networks of comparable size in the respective tasks, ConvFormer-Tiny improves accuracy by 0.3%, 1.4%, and 0.5%, respectively.
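The abstract names a channel-shuffle step inside CSMS but does not give its details on this page. As a rough illustration only, the sketch below shows the generic ShuffleNet-style channel shuffle, which interleaves channels from different groups so that later layers see information from every group. The function name `channel_shuffle`, the use of plain Python lists in place of feature-map tensors, and the assumption that CSMS shuffles in this way are all hypothetical and not taken from the paper.

```python
def channel_shuffle(channels, groups):
    """Generic ShuffleNet-style channel shuffle on a flat list of channels.

    Assumption: this is NOT the paper's CSMS module, only the standard
    shuffle operation it may build on. The channel list is viewed as a
    (groups x per_group) grid and read out column by column, which is the
    same as reshape -> transpose -> flatten on a tensor's channel axis.
    """
    n = len(channels)
    if groups <= 0 or n % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = n // groups
    return [
        channels[g * per_group + i]
        for i in range(per_group)   # position inside each group
        for g in range(groups)      # visit every group at that position
    ]


if __name__ == "__main__":
    # Six channels in two groups: [0, 1, 2] and [3, 4, 5].
    print(channel_shuffle(list(range(6)), 2))
```

With two groups of three channels, the shuffled order alternates between the groups, so each consecutive pair of output channels contains one channel from each original group.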


