Acta Electronica Sinica ›› 2022, Vol. 50 ›› Issue (1): 72-78. DOI: 10.12263/DZXB.20200443

• Research Paper •

Fine-Grained Image Classification Based on a Multi-Layer Focused Inception-V3 Convolutional Network

王波1, 黄冕2, 刘利军1,3, 黄青松1,4, 单文琦1   

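  1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650500
    2. Information Center, Yunnan Land and Resources Vocational College, Kunming, Yunnan 652501
    3. School of Information, Yunnan University, Kunming, Yunnan 650091
    4. Yunnan Key Laboratory of Computer Technology Applications, Kunming, Yunnan 650500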
  • Received: 2020-05-09 Revised: 2020-10-10 Published: 2022-01-25 Released: 2022-01-25
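  • About the authors: WANG Bo, male, was born in March 1995 in Shaoyang, Hunan Province. He is a master's student at the Faculty of Information Engineering and Automation, Kunming University of Science and Technology. His main research interests are deep learning and image processing. E-mail: 251970441@qq.com
    HUANG Qing-song (corresponding author), male, was born in April 1962 in Changsha, Hunan Province. He is vice dean, professor, and graduate supervisor at the Faculty of Information Engineering and Automation, Kunming University of Science and Technology. His main research interest is intelligent information systems. E-mail: ynkmhqs@sina.com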
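  • Supported by:
    National Natural Science Foundation of China (81860318); Open Fund of the Yunnan Key Laboratory of Computer Technology Applications (2020106)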

Multi-Layer Focused Inception-V3 Models for Fine-Grained Visual Recognition

WANG Bo1, HUANG Mian2, LIU Li-jun1,3, HUANG Qing-song1,4, SHAN Wen-qi1   

  1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650500, China
    2. Information Center, Yunnan Land and Resources Vocational College, Kunming, Yunnan 652501, China
    3. School of Information, Yunnan University, Kunming, Yunnan 650091, China
    4. Yunnan Key Laboratory of Computer Technology Applications, Kunming, Yunnan 650500, China
  • Received: 2020-05-09 Revised: 2020-10-10 Online: 2022-01-25 Published: 2022-01-25

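Abstract:

Fine-grained images exhibit variable structure, heavy background interference, small inter-class differences, and large intra-class differences, which makes it essential to locate and extract discriminative local features accurately. This paper proposes a multi-layer focused convolutional network. The first-layer focus network attends accurately and effectively to the local regions that matter for recognition and generates a localization region. The original image is then cropped and occluded according to that region, and the two resulting images are fed into the next-layer focus network for training and classification. Each single-layer focus network builds on the Inception-V3 network: a convolutional block feature attention module and a localization region selection mechanism pick out effective localization regions, bilinear attention maximum pooling extracts the features of each local part, and classification prediction comes last. Experiments on three widely used fine-grained datasets, CUB-2011, FGVC-Aircraft, and Stanford Cars, give Top-1 accuracies of 89.7%, 93.6%, and 95.1%, respectively. The experimental results show that the model's classification accuracy is higher than that of current mainstream methods.

Keywords: multi-layer focused convolutional network, Inception-V3 network, attention mechanism, bilinear attention maximum pooling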
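
The crop-and-occlude step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the localization region is derived, so everything here is an assumption made for illustration: the region is taken to be the bounding box of attention values at or above a fraction `ratio` of the peak, the attention map is assumed non-negative and already resized to the image resolution, and the names `locate_region`, `crop`, and `occlude` are hypothetical rather than the authors' own.

```python
def locate_region(attention, ratio=0.5):
    """Return an inclusive bounding box (top, left, bottom, right).

    Assumption: the region covers every cell whose non-negative attention
    value is at least `ratio` times the peak value of the map.
    """
    peak = max(max(row) for row in attention)
    cells = [(r, c)
             for r, row in enumerate(attention)
             for c, value in enumerate(row)
             if value >= ratio * peak]
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]
    return min(rows), min(cols), max(rows), max(cols)


def crop(image, box):
    """Keep only the pixels inside the box (zooms in on the located part)."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]


def occlude(image, box, fill=0):
    """Overwrite the pixels inside the box with `fill`, leaving the rest intact."""
    top, left, bottom, right = box
    return [[fill if top <= r <= bottom and left <= c <= right else value
             for c, value in enumerate(row)]
            for r, row in enumerate(image)]
```

Under these assumptions, the cropped image gives the next-layer network a closer view of the located part, while the occluded image hides that part so the next layer has to rely on other regions. A practical version would operate on image tensors and resize the crop back to the network's input size.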

Abstract:

Fine-grained images are characterized by variable structure, strong background interference, small inter-class differences, and large intra-class differences, so accurately locating and extracting discriminative local features is crucial. This paper proposes a multi-layer focused convolutional network in which the first-layer focus network accurately and effectively focuses on discriminative local areas and generates a localization region. The original image is cropped and occluded according to this region, and the results are fed to the next-layer focus network for training and classification. Each single-layer focus network is based on Inception-V3 and focuses on effective localization regions through a convolutional block feature attention module and a localization region selection mechanism. Bilinear attention maximum pooling is used to extract the features of each part, and classification prediction is performed last. Experiments on three commonly used fine-grained datasets, CUB-2011, FGVC-Aircraft, and Stanford Cars, give top-1 accuracies of 89.7%, 93.6%, and 95.1%, respectively. The results show that the classification accuracy of this model is higher than that of current mainstream methods.

Key words: multi-layer focused convolutional network, Inception-V3, attention mechanism, bilinear attention maximum pooling
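
The abstract names bilinear attention maximum pooling without giving a formula. The sketch below assumes it follows the usual bilinear attention pooling pattern, where each attention map is multiplied element-wise with each feature map, with the spatial reduction done by a global maximum instead of an average. The function name and the nested-list layout are hypothetical and chosen only to keep the example dependency-free.

```python
def bilinear_attention_max_pool(features, attentions):
    """Pool C feature maps under M attention maps into an M x C matrix.

    features:   list of C maps, each H x W (nested lists of numbers)
    attentions: list of M maps, each H x W (nested lists of numbers)
    Assumption: entry [k][c] is the global max of attentions[k] * features[c].
    """
    part_features = []
    for attention in attentions:
        row = []
        for feature in features:
            # Element-wise product of one attention map and one feature map.
            weighted = [a * f
                        for a_row, f_row in zip(attention, feature)
                        for a, f in zip(a_row, f_row)]
            # Global max pooling over all spatial positions.
            row.append(max(weighted))
        part_features.append(row)
    return part_features
```

Each row of the result is the feature vector of one attended part; flattening the rows would give a single descriptor that a classifier can consume. A real model would compute the same quantity on batched tensors in a deep learning framework.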

CLC Number: