Motion-Prompts Guided Adaptive Learning for Unsupervised Video Object Segmentation

HAN Zhi-dong; HU Sheng-long; SONG Hui-hui; ZHANG Kai-hua

doi:10.12263/DZXB.20250138

PAPERS | Updated: 2025-12-10

- ACTA ELECTRONICA SINICA Vol. 53, Issue 7, Pages: 2305-2323 (2025)
- Affiliation: School of Automation, Nanjing University of Information Science and Technology, Nanjing, Jiangsu 210044, China
- Funding: National Natural Science Foundation of China (62276141)
- CLC: TP391.41
- Received: 26 February 2025; Revised: 30 June 2025; Published: 25 July 2025

HAN Zhi-dong, HU Sheng-long, SONG Hui-hui, et al. Motion-Prompts Guided Adaptive Learning for Unsupervised Video Object Segmentation[J]. Acta Electronica Sinica, 2025, 53(07): 2305-2323. DOI: 10.12263/DZXB.20250138

Abstract

Existing unsupervised video object segmentation (UVOS) methods often employ pixel-level dense matching to enhance model performance by aligning and fusing features across multiple frames or between a single frame and its corresponding optical flow. However, in challenging scenarios such as occlusion, camera shake, and motion blur, optical flow estimation errors readily produce numerous erroneous matches, causing the fused spatio-temporal representations to overfit to motion noise. To address this issue, we propose a motion-prompt-guided adaptive learning framework for UVOS. The dense motion information encoded by optical flow is converted into sparse point and box prompts, and prompt learning then guides the Segment Anything Model (SAM) to learn adaptively through two lightweight adapters designed in this paper, yielding more robust spatio-temporal representations and improving the model’s resistance to noise. To obtain effective prompts, we design an unsupervised motion-prompt generation algorithm. It computes a series of statistics from the optical flow features to identify salient regions, uses motion edge information to remove interference from pseudo-salient regions, filters the result with an adaptive threshold, and outputs point and box coordinates that indicate where the motion-salient objects lie. To improve the generalization of SAM on the downstream UVOS task, we propose an adaptive representation learning SAM model: two lightweight feature adapters adaptively extract knowledge relevant to UVOS from SAM’s general knowledge base, enabling accurate coarse localization of objects. To compensate for the limited attention to detail of the pure Transformer-based SAM, we design a feature-focusing refinement module based on convolutional neural networks (CNNs). The localization attention map produced by SAM progressively guides the refinement process, shifting the model’s attention from global coarse localization to local refinement and ultimately producing more accurate segmentation masks. The method is validated on three mainstream datasets: DAVIS 2016 (DAVIS16), Freiburg-Berkeley Motion Segmentation (FBMS), and YouTube-Objects (YTOBJ). Compared with current state-of-the-art methods, it improves the region similarity metric by 1.8%, 1.6%, and 2.6% on these datasets, respectively, demonstrating its effectiveness.
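
The idea of turning dense optical flow into sparse point and box prompts can be illustrated with a minimal sketch. This is not the authors' algorithm. It assumes the flow statistic is simply the per-pixel flow magnitude, assumes the adaptive threshold has the form mean + k × standard deviation, and omits the motion-edge filtering step entirely. The function name `motion_prompts`, the parameter `k`, and the flow layout (an H × W grid of `(u, v)` tuples) are hypothetical choices made for illustration.

```python
import math
from statistics import mean, pstdev


def motion_prompts(flow, k=1.0):
    """Derive one point prompt and one box prompt from a dense flow field.

    flow: H x W nested list of (u, v) displacement tuples (assumed layout).
    k:    hypothetical sensitivity factor for the adaptive threshold.
    Returns ((x, y), (x0, y0, x1, y1)) or None if nothing stands out.
    """
    # Per-pixel flow magnitude, used here as the only motion statistic.
    mag = [[math.hypot(u, v) for (u, v) in row] for row in flow]
    flat = [m for row in mag for m in row]

    # Adaptive threshold (assumed form): it scales with the frame's own
    # motion statistics instead of being a fixed constant.
    thresh = mean(flat) + k * pstdev(flat)

    # Pixels whose motion clearly exceeds the frame-wide level.
    salient = [
        (x, y)
        for y, row in enumerate(mag)
        for x, m in enumerate(row)
        if m > thresh
    ]
    if not salient:
        return None

    xs = [x for x, _ in salient]
    ys = [y for _, y in salient]

    # Box prompt: tight bounding box around all salient pixels.
    box = (min(xs), min(ys), max(xs), max(ys))

    # Point prompt: the salient pixel nearest the centroid, so the point
    # always falls on a salient pixel even for non-convex regions.
    cx, cy = mean(xs), mean(ys)
    point = min(salient, key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)
    return point, box
```

For a 6 × 6 flow field that is static except for a 3 × 3 patch (columns 2 to 4, rows 1 to 3) moving by (3, 4), the sketch returns the point (3, 2) and the box (2, 1, 4, 3). A field in which every pixel moves identically, as under pure camera translation, yields no prompt because no pixel exceeds the threshold.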
