

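1. Jiangsu Key Laboratory of Big Data Analysis Technology, Nanjing University of Information Science and Technology, Nanjing, Jiangsu 210044
2. Collaborative Innovation Center of Atmospheric Environment and Equipment Technology, Nanjing University of Information Science and Technology, Nanjing, Jiangsu 210044
3. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, Jiangsu 211106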
Received: 11 October 2022
Revised: 18 January 2023
Published: 25 May 2023
SU Tian-kang, SONG Hui-hui, FAN Jia-qing, et al. Learning Depth Signal Guided Mixed Transformer for High-Performance Unsupervised Video Object Segmentation[J]. Acta Electronica Sinica, 2023, 51(5): 1388-1395. DOI: 10.12263/DZXB.20221162.
Existing unsupervised video object segmentation methods usually employ optical flow as a motion cue to improve model performance. However, optical flow estimation frequently contains errors that introduce considerable noise, especially for objects that are static or subject to complicated motion interference. Two-stream networks easily overfit to this noise, which severely degrades the segmentation model. To alleviate this problem, we propose a novel mixed transformer for unsupervised video object segmentation, which introduces depth signals to guide the transformer in efficiently fusing data from different modalities, thereby learning more robust feature representations and reducing overfitting to noise. Specifically, the video frame, optical flow, and depth map are each cropped into a set of fixed-size patches and concatenated to form a triplet that serves as the transformer input. A linear layer followed by a position-encoding layer is applied to the triplet to produce the features to be encoded. These features are then integrated by a novel mixed attention module, which obtains a global receptive field and lets the features of the different modalities interact fully, enhancing the global semantic features and improving the anti-interference ability of the model. Next, to further perceive refined target edges, a local-non-local semantic enhancement module introduces the inductive bias of local semantic information to supplement the learning of non-local semantic features. In this way, the target region is delineated more finely while the anti-interference capability of the model is further improved. Finally, the enhanced features are fed into the transformer decoder to produce the predicted segmentation mask. Extensive experiments on four standard benchmarks demonstrate that the proposed method achieves leading performance compared with state-of-the-art methods.
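The input pipeline described in the abstract (patch cropping, triplet construction, linear projection with position encoding, then attention across modalities) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not specify how the patches are concatenated or how the mixed attention module is built, so the sketch assumes that the three modalities are concatenated along the token axis, and it stands in a plain single-head self-attention over all tokens for the mixed attention module. All names (`patchify`, `embed`, `joint_attention`), sizes, and random weights are hypothetical.

```python
import numpy as np


def patchify(x, p):
    """Split an (H, W, C) array into non-overlapping p x p patches, each flattened to a row."""
    h, w, c = x.shape
    assert h % p == 0 and w % p == 0, "image size must be divisible by the patch size"
    x = x.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, p * p * c)


def embed(tokens, weight, pos):
    """Linear projection followed by an additive position encoding."""
    return tokens @ weight + pos


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def joint_attention(x, wq, wk, wv):
    """Single-head self-attention over the tokens of all modalities at once.

    Every token can attend to every other token, so the receptive field is
    global and frame, flow, and depth tokens interact directly.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v, attn


rng = np.random.default_rng(0)
H, W, P, D = 32, 32, 8, 16            # hypothetical image size, patch size, embedding width

frame = rng.random((H, W, 3))         # RGB video frame
flow = rng.random((H, W, 2))          # optical flow (dx, dy), assumed precomputed
depth = rng.random((H, W, 1))         # depth map, assumed precomputed

tokens = []
for modality in (frame, flow, depth):
    patches = patchify(modality, P)                       # (16, P*P*C)
    weight = rng.normal(0.0, 0.02, (patches.shape[1], D)) # stand-in for a learned linear layer
    pos = rng.normal(0.0, 0.02, (patches.shape[0], D))    # stand-in for a learned position encoding
    tokens.append(embed(patches, weight, pos))

triplet = np.concatenate(tokens, axis=0)                  # (48, D): frame, flow, depth tokens

wq, wk, wv = (rng.normal(0.0, 0.1, (D, D)) for _ in range(3))
fused, attn = joint_attention(triplet, wq, wk, wv)        # fused: (48, D), attn: (48, 48)
```

In this sketch, the off-diagonal blocks of `attn` (for example rows 0-15 against columns 32-47) hold the weights with which frame tokens attend to depth tokens. That is one simple way cross-modal interaction can arise when all modalities share a single attention operation.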