Acta Electronica Sinica


Learning Depth Signal Guided Mixed Transformer for High-Performance Unsupervised Video Object Segmentation

SU Tian-kang1,2, SONG Hui-hui1,2, FAN Jia-qing3, ZHANG Kai-hua1,2

  1. Jiangsu Key Laboratory of Big Data Analysis Technology, Nanjing University of Information Science and Technology, Nanjing, Jiangsu 210044, China
    2. Collaborative Innovation Center on Atmospheric Environment and Equipment Technology, Nanjing University of Information Science and Technology, Nanjing, Jiangsu 210044, China
    3. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, Jiangsu 211106, China
  • Received: 2022-10-11  Revised: 2023-01-18  Online: 2023-03-06
    • Corresponding author:
    • SONG Hui-hui
    • About the authors:
    • SU Tian-kang, male, born in 1999 in Wuhu, Anhui, is an M.S. candidate. His research focuses on unsupervised video object segmentation. E-mail: tiankangsu@gmail.com
      SONG Hui-hui (corresponding author), female, born in 1986 in Liaocheng, Shandong, Ph.D., is a professor. Her research interests include video object segmentation and image super-resolution.
      FAN Jia-qing, male, born in 1994 in Nantong, Jiangsu, is a Ph.D. candidate. His research interests include video object segmentation and visual tracking. E-mail: jqfan@nuaa.edu.cn
      ZHANG Kai-hua, male, born in 1983 in Rizhao, Shandong, Ph.D., is a professor. His research interests include co-saliency detection and visual tracking. E-mail: zhkhua@gmail.com
    • Supported by:
    • National Key Research and Development Program of China (2018AAA0100400); National Natural Science Foundation of China (62276141)

Abstract:

Existing unsupervised video object segmentation methods usually employ optical flow as a motion cue to improve model performance. However, optical flow estimation frequently contains errors, which introduce substantial noise, especially for static objects or objects subject to complicated motion interference. Two-stream networks easily overfit to this noise, which severely degrades the segmentation model. To alleviate this, we propose a novel mixed transformer for unsupervised video object segmentation, which efficiently fuses data of different modalities by introducing depth signals, so as to learn more robust feature representations and reduce overfitting to noise. Specifically, the video frame, optical flow, and depth map are first cropped into sets of fixed-size patches and concatenated to compose a triplet as the transformer input. A linear projection layer followed by position encoding is applied to the triplet, producing the features to be encoded. The features are then integrated by a novel mixed attention module, which obtains a global receptive field and lets the features of different modalities interact sufficiently, enhancing the global semantic features and improving the anti-interference ability of the model. To further perceive refined target edges, a local-non-local semantic enhancement module is developed, which introduces the inductive bias of local semantic information to complement the learning of non-local semantic features. In this way, the target region is delineated more precisely while the anti-interference capability of the model is improved. Finally, the enhanced features are fed into the transformer decoder to produce the predicted segmentation mask. Extensive experiments on four standard challenging benchmarks demonstrate that the proposed method achieves favorable performance against state-of-the-art methods.
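
To make the pipeline concrete, below is a minimal PyTorch sketch of the three stages the abstract describes: triplet patch embedding with linear projection and position encoding, mixed attention over the concatenated multimodal tokens, and local-non-local semantic enhancement. Every module name, shape, and hyperparameter here is an illustrative assumption, not the authors' actual implementation.

import torch
import torch.nn as nn


class TripletPatchEmbed(nn.Module):
    """Crop frame, flow, and depth into fixed-size patches, project them
    linearly, and add a learned position encoding (hypothetical design)."""

    def __init__(self, img_size=224, patch=16, in_chs=(3, 2, 1), dim=256):
        super().__init__()
        n = (img_size // patch) ** 2  # patches per modality
        # One linear (conv) projection per modality: RGB frame, flow, depth.
        self.proj = nn.ModuleList(
            [nn.Conv2d(c, dim, kernel_size=patch, stride=patch) for c in in_chs]
        )
        self.pos = nn.Parameter(torch.zeros(1, len(in_chs) * n, dim))

    def forward(self, frame, flow, depth):
        tokens = [
            p(x).flatten(2).transpose(1, 2)  # (B, N, dim) per modality
            for p, x in zip(self.proj, (frame, flow, depth))
        ]
        return torch.cat(tokens, dim=1) + self.pos  # triplet: (B, 3N, dim)


class MixedAttention(nn.Module):
    """Joint self-attention over the concatenated multimodal tokens: every
    patch attends to every patch of every modality, giving a global
    receptive field with full cross-modal interaction."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        y, _ = self.attn(x, x, x)
        return self.norm(x + y)  # residual + norm


class LocalNonLocalEnhance(nn.Module):
    """Add a convolutional (local) inductive bias on top of the non-local
    attention features, sharpening responses around object edges."""

    def __init__(self, dim=256):
        super().__init__()
        # Depthwise conv over the token sequence as a cheap local operator.
        self.local = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):  # x: (B, L, dim)
        y = self.local(x.transpose(1, 2)).transpose(1, 2)
        return self.norm(x + y)


if __name__ == "__main__":
    frame = torch.randn(2, 3, 224, 224)  # RGB frame
    flow = torch.randn(2, 2, 224, 224)   # optical flow (u, v)
    depth = torch.randn(2, 1, 224, 224)  # estimated depth map
    x = TripletPatchEmbed()(frame, flow, depth)
    x = LocalNonLocalEnhance()(MixedAttention()(x))
    print(x.shape)  # torch.Size([2, 588, 256])

Joint attention over concatenated tokens is only one plausible reading of "mixed attention"; the paper's module may instead alternate intra- and inter-modal attention, and the actual transformer decoder (omitted here) would upsample the enhanced tokens back to a pixel-wise segmentation mask.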

Key words: unsupervised video object segmentation, mixed transformer, mixed attention, multimodality, depth estimation, robust features

CLC number: