Learning Depth Signal Guided Mixed Transformer for High-Performance Unsupervised Video Object Segmentation

SU Tian-kang; SONG Hui-hui; FAN Jia-qing; ZHANG Kai-hua

doi:10.12263/DZXB.20221162

PAPERS | Updated: 2025-12-08

- ACTA ELECTRONICA SINICA Vol. 51, Issue 5, Pages: 1388-1395(2023)
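- Author affiliations:
  
  1. Jiangsu Key Laboratory of Big Data Analysis Technology, Nanjing University of Information Science and Technology, Nanjing, Jiangsu 210044
  2. Collaborative Innovation Center of Atmospheric Environment and Equipment Technology, Nanjing University of Information Science and Technology, Nanjing, Jiangsu 210044
  3. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, Jiangsu 211106
- Funding:
  
  National Key Research and Development Program of China (2018AAA0100400); National Natural Science Foundation of China (62276141; U20B2065)
- DOI: 10.12263/DZXB.20221162
  CLC: TP391.41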
- Received: 11 October 2022,
  
  Revised: 18 January 2023,
  
  Published: 25 May 2023
SU Tian-kang, SONG Hui-hui, FAN Jia-qing, et al. Learning Depth Signal Guided Mixed Transformer for High-Performance Unsupervised Video Object Segmentation[J]. ACTA ELECTRONICA SINICA, 2023, 51(05): 1388-1395. DOI: 10.12263/DZXB.20221162.

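Abstract (translated from the Chinese)

Existing unsupervised video object segmentation methods usually use optical flow as a motion cue to improve model performance. However, optical flow estimation often contains errors, which makes two-stream networks prone to overfitting to noise. To address this, this paper proposes an unsupervised video object segmentation algorithm based on a mixed transformer, which introduces depth signals to guide the transformer to efficiently fuse data from different modalities and learn more robust feature representations, thereby reducing the model's overfitting to noise. First, a novel mixed attention module is designed to obtain a global receptive field and to let features of different modalities interact fully, which enhances the global semantic information of the features and improves the model's robustness to interference. Next, to further perceive refined object edges, a local-non-local semantic enhancement module is designed, which introduces the inductive bias of local semantics to complement the learning of non-local semantic features, highlighting finer object regions while improving robustness to interference. Finally, the enhanced features are fed into the transformer decoder to predict high-quality segmentation results. Compared with state-of-the-art methods, the proposed algorithm achieves leading performance on four standard datasets, which demonstrates its effectiveness.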

Abstract

Existing unsupervised video object segmentation methods usually employ optical flow as a motion cue to improve model performance. However, optical flow estimation frequently involves errors, resulting in considerable noise, especially for objects that are static or subject to complicated motion interference. Two-stream networks easily overfit to this noise, which severely degrades the segmentation model. To alleviate this, we propose a novel mixed transformer for unsupervised video object segmentation, which efficiently fuses data from different modalities by introducing depth signals, so as to learn more robust feature representations and reduce overfitting to noise. Specifically, the video frame, optical flow, and depth map are each cropped into a set of fixed-size patches and concatenated to form a triplet that serves as the transformer input. A linear layer followed by a position-encoding layer is applied to the triplet, producing the features to be encoded. These features are then integrated by a novel mixed attention module, which obtains a global receptive field and lets the features of the different modalities interact sufficiently, enhancing the global semantic features and improving the anti-interference ability of the model. A local-non-local semantic enhancement module is further developed to perceive refined target edges by introducing the inductive bias of local semantic information to complement the learning of non-local semantic features. In this way, the target region becomes more refined while the anti-interference capability of the model improves. Finally, the enhanced features are fed into the transformer decoder to produce the predicted segmentation mask. Extensive experiments on four standard challenging benchmarks demonstrate that the proposed method performs favorably against state-of-the-art methods.
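
The input pipeline described in the abstract (patch cropping, triplet construction, linear embedding, position encoding, and attention over all modalities) can be illustrated with a small sketch. This is not the authors' implementation: it assumes single-channel inputs, assumes the three modalities are concatenated along the token axis, and replaces the mixed attention module with plain single-head self-attention using identity query/key/value maps. All function names (`patchify`, `linear`, `self_attention`, `build_triplet_tokens`) and the toy sizes are hypothetical.

```python
import math
import random

def patchify(image, patch):
    """Split an H x W single-channel image (list of rows) into flattened patches."""
    h, w = len(image), len(image[0])
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens

def linear(tokens, weight):
    """Project each token of length d_in with a d_in x d_out weight matrix."""
    return [[sum(x * w for x, w in zip(tok, col)) for col in zip(*weight)]
            for tok in tokens]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head attention with identity Q/K/V maps (a stand-in, not the
    paper's mixed attention module). Returns (outputs, attention weights)."""
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        attn = softmax(scores)
        weights.append(attn)
        outputs.append([sum(a * v[j] for a, v in zip(attn, tokens))
                        for j in range(d)])
    return outputs, weights

def build_triplet_tokens(frame, flow, depth, patch, weight, pos):
    """Patchify the three modalities, concatenate their tokens (assumed to be
    along the token axis), embed them linearly, and add position encodings."""
    tokens = patchify(frame, patch) + patchify(flow, patch) + patchify(depth, patch)
    embedded = linear(tokens, weight)
    return [[e + p for e, p in zip(tok, pos_i)] for tok, pos_i in zip(embedded, pos)]

# Toy sizes: 4 x 4 inputs, 2 x 2 patches, embedding width 3.
random.seed(0)
H = W = 4
P, D = 2, 3
frame = [[random.random() for _ in range(W)] for _ in range(H)]
flow = [[random.random() for _ in range(W)] for _ in range(H)]
depth = [[random.random() for _ in range(W)] for _ in range(H)]
weight = [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(P * P)]
n_tokens = 3 * (H // P) * (W // P)
pos = [[0.01 * i] * D for i in range(n_tokens)]  # placeholder position encoding

tokens = build_triplet_tokens(frame, flow, depth, P, weight, pos)
fused, attn = self_attention(tokens)
```

Because every token attends to every other token in the joint sequence, each frame patch receives contributions from all flow and depth patches in one step, which is the sense in which attention over the triplet provides a global receptive field across modalities.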
