Visual Relationship Detection Model Based on Label Hierarchy (基于标签层次结构的视觉关系检测模型)

WANG Yuan-long; LEI Ming; WANG Zhi-qiang; ZHANG Hu; LI Ru; LIANG Ji-ye

doi:10.12263/DZXB.20221050

Research article | Updated: 2025-12-11

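- Visual Relationship Detection Model Based on Label Hierarchy
- Acta Electronica Sinica, 2023, Vol. 51, No. 12, pp. 3496-3506
- Author affiliations:
  1. School of Computer and Information Technology, Shanxi University, Taiyuan, Shanxi 030006, China
  2. Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan, Shanxi 030006, China
- About the authors:
  WANG Yuan-long, male, born in May 1983 in Datong, Shanxi Province. He is currently an associate professor at the School of Computer and Information Technology, Shanxi University. E-mail: ylwang@sxu.edu.cn
  LEI Ming, male, born in February 1999 in Yuncheng, Shanxi Province. He is currently a graduate student at the School of Computer and Information Technology, Shanxi University.
- Funding:
  National Key Research and Development Program of China (2020AAA0106100); National Natural Science Foundation of China (62176145)
- DOI: 10.12263/DZXB.20221050
  CLC number: TP391.7
- Received: 2022-09-14; Revised: 2023-04-20; Published in print: 2023-12-25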
WANG Yuan-long, LEI Ming, WANG Zhi-qiang, et al. Visual Relationship Detection Model Based on Label Hierarchy[J]. Acta Electronica Sinica, 2023, 51(12): 3496-3506. DOI: 10.12263/DZXB.20221050.

Abstract

Visual relationship detection builds on object recognition to further detect the relationships between objects, and is a key technology for visual understanding and reasoning. However, because relationship labels are visually similar and the data are imbalanced, recall is low for tail relationships, which have few training samples. To improve the detection of tail relationships, this paper partitions relationship labels into coarse-grained and fine-grained levels to construct a hierarchical representation of the labels, and proposes a visual relationship detection model based on this label hierarchy. The model uses the similarity between visual relationships and the bias present in the data to build the hierarchical representation of relationship labels, thereby dividing relationships into coarse-grained and fine-grained ones, so that tail relationships receive more attention along the coarse-to-fine structure. In addition, a loss function is designed around the properties of the label hierarchy; it uses the structural information to learn the differences between relationship categories layer by layer, so that the model detects fine-grained tail relationships better. The model's effectiveness in detecting tail relationships is verified on the public datasets Visual Relationship Detection (VRD) and Visual Genome (VG). Compared with existing models, mean recall mR@20, mR@50 and mR@100 improves by 0.62%, 1.57% and 2.47%, respectively, on the VRD dataset, and by 0.67%, 0.83% and 1.15%, respectively, on the VG dataset.
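
The abstract does not give the form of the hierarchy-aware loss, so the following is only a minimal sketch of one generic way to combine coarse-level and fine-level cross-entropy over a two-level label hierarchy. The groups in `HIERARCHY`, the function `hierarchical_loss`, and the two weights are hypothetical names introduced for illustration; they are not taken from the paper.

```python
import math

# Hypothetical two-level hierarchy: coarse group -> fine-grained relationship labels.
# These groups are illustrative only; they are not the hierarchy built in the paper.
HIERARCHY = {
    "spatial": ["on", "above", "standing on"],
    "possessive": ["has", "wearing"],
}
COARSE_OF = {fine: coarse for coarse, fines in HIERARCHY.items() for fine in fines}


def softmax(scores):
    """Numerically stable softmax over a dict of label -> raw score."""
    top = max(scores.values())
    exps = {label: math.exp(value - top) for label, value in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def hierarchical_loss(fine_scores, target, coarse_weight=1.0, fine_weight=1.0):
    """Weighted coarse-level plus fine-level cross-entropy for one sample.

    fine_scores: dict mapping every fine label to a raw score (logit).
    target: the ground-truth fine label.
    """
    probs = softmax(fine_scores)
    siblings = HIERARCHY[COARSE_OF[target]]

    # Coarse level: the probability of a group is the sum over its fine labels.
    coarse_prob = sum(probs[label] for label in siblings)
    coarse_loss = -math.log(coarse_prob)

    # Fine level: renormalize inside the correct group, so the penalty targets
    # confusion between the target and its sibling labels.
    fine_loss = -math.log(probs[target] / coarse_prob)

    return coarse_weight * coarse_loss + fine_weight * fine_loss


# Usage: raise fine_weight to emphasize a tail label such as "standing on".
scores = {"on": 2.0, "above": 0.5, "standing on": 1.0, "has": -1.0, "wearing": 0.0}
print(hierarchical_loss(scores, "standing on", fine_weight=2.0))
```

With both weights set to 1, the two terms sum to the ordinary cross-entropy on the fine label, so the weights act purely as a way to shift emphasis between the coarse and fine levels in this sketch.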
