

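1. School of Information Engineering, Chang'an University, Xi'an, Shaanxi 710064
2. School of Electronics and Control Engineering, Chang'an University, Xi'an, Shaanxi 710064
3. Institute of Data Science and Artificial Intelligence, Chang'an University, Xi'an, Shaanxi 710064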
Received: 23 September 2025
Accepted: 06 January 2026
Published: 25 January 2026
WANG Rong, HU Haixiang, WEI Hongkai, et al. CIMOT3D: Chinese-Instruction-Based Monocular 3D Multi-Object Tracking[J]. Acta Electronica Sinica, 2026, 54(01): 102-114. DOI: 10.12263/DZXB.20250826
Natural language-driven object tracking parses language descriptions that follow human habits of expression and fuses them with visual information to accurately identify and continuously track specific targets in complex environments. However, existing methods focus on 2D scenes or 3D single-object tracking and have not been extended to 3D multi-object tracking; they lack the ability to align text features with multiple candidate targets in 3D visual space and to establish associations between them. In addition, existing natural language-driven 3D object tracking tasks suffer from redundant language descriptions, which makes it hard to emulate the human ability to track multiple specific targets from flexible, concise instructions. To address these challenges, this paper introduces a new task, Chinese-instruction-based monocular 3D multi-object tracking (CIMOT3D), and constructs a dataset, CIMOT3D-5k, which contains 5,562 video sequences, each annotated with Chinese descriptions that follow human habits of expression. Furthermore, this paper designs a neural network model dedicated to this task, the Chinese-instruction-based monocular 3D multi-object tracking synchronization tracker (CIMOT3D-SyncTracker), whose framework consists of three parts: a multimodal feature extractor, a vision-language encoder-decoder, and a detection-tracking module. Compared with the baseline method, the proposed approach improves the tracking accuracy and identity consistency metrics on the CIMOT3D-5k dataset by 4.1 and 5.0 percentage points, respectively, verifying its performance advantage. This work deepens research on vision-language fusion for 3D multi-object tracking and offers new ideas for further exploration in related fields.
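The three-part framework named in the abstract (feature extraction, vision-language fusion, detection and tracking) can be pictured as a simple data flow. The sketch below is only a structural illustration under stated assumptions, not the authors' model: the class `ThreeStageTracker`, the `Box3D` layout, the toy stage functions, and the greedy nearest-center association are all hypothetical stand-ins, and in the paper each stage is a neural network module whose details the abstract does not give.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

Point = Tuple[float, float, float]


@dataclass
class Box3D:
    """Hypothetical detection: a 3D center plus an instruction-match score."""
    center: Point
    score: float


@dataclass
class ThreeStageTracker:
    """Chains three placeholder stages; not the paper's actual architecture."""
    extract: Callable  # (frame, instruction) -> (visual_feats, text_feats)
    fuse: Callable     # (visual_feats, text_feats) -> fused_feats
    detect: Callable   # fused_feats -> list of Box3D
    score_threshold: float = 0.5  # assumed cut-off for "matches the instruction"
    max_distance: float = 2.0     # assumed gating distance for association
    tracks: Dict[int, Point] = field(default_factory=dict)
    next_id: int = 0

    def step(self, frame, instruction: str) -> Dict[int, Box3D]:
        visual, text = self.extract(frame, instruction)
        fused = self.fuse(visual, text)
        boxes = [b for b in self.detect(fused) if b.score >= self.score_threshold]
        assigned: Dict[int, Box3D] = {}
        for box in boxes:
            # Greedy nearest-center matching against tracks not yet used
            # in this frame (a swapped-in, deliberately simple association).
            candidates = [
                (math.dist(center, box.center), tid)
                for tid, center in self.tracks.items()
                if tid not in assigned
            ]
            best = min(candidates, default=None)
            if best is not None and best[0] <= self.max_distance:
                tid = best[1]
            else:
                tid = self.next_id  # no close track: start a new identity
                self.next_id += 1
            assigned[tid] = box
        # Remember the latest center of every matched or newly created track.
        self.tracks.update({tid: b.center for tid, b in assigned.items()})
        return assigned


# Toy stages: a "frame" is a list of (center, label) pairs.
def toy_extract(frame, instruction):
    return frame, instruction


def toy_fuse(visual, text):
    # Score 1.0 when the object's label appears in the instruction text.
    return [(center, 1.0 if label in text else 0.0) for center, label in visual]


def toy_detect(fused):
    return [Box3D(center, score) for center, score in fused]


tracker = ThreeStageTracker(toy_extract, toy_fuse, toy_detect)
instruction = "跟踪红色汽车"  # "track the red car"
frame0 = [((0.0, 0.0, 10.0), "红色汽车"), ((5.0, 0.0, 12.0), "行人")]
frame1 = [((0.5, 0.0, 10.5), "红色汽车"), ((5.2, 0.0, 12.0), "行人")]
out0 = tracker.step(frame0, instruction)
out1 = tracker.step(frame1, instruction)
```

In this toy run only the object whose label appears in the instruction is kept, and it carries the same identity across both frames. Unmatched tracks are never retired here, which a real tracker would need to handle.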