CIMOT3D: Chinese-Instruction-Based Monocular 3D Multi-Object Tracking

WANG Rong; HU Haixiang; WEI Hongkai; LIANG Haoxiang; QIAN Xiaowei; LI Kaifei; GUO Keyu; SONG Xiangyu; SUN Shijie

doi:10.12263/DZXB.20250826

PAPERS | Updated: 2026-06-04

- ACTA ELECTRONICA SINICA Vol. 54, Issue 1, Pages: 102-114 (2026)
- Author affiliations:
  
  1. School of Information Engineering, Chang'an University, Xi'an, Shaanxi 710064
  2. School of Electronics and Control Engineering, Chang'an University, Xi'an, Shaanxi 710064
  3. School of Data Science and Artificial Intelligence, Chang'an University, Xi'an, Shaanxi 710064
- Funding:
  
  National Key Research and Development Program of China (2023YFB4301800); National Natural Science Foundation of China (62576050); National Postdoctoral Researcher Program (GZC20241447); Fundamental Research Funds for the Central Universities, CHD (300102325101)
- CLC: TP391.41
- Received: 23 September 2025; Accepted: 06 January 2026; Published: 25 January 2026

WANG Rong, HU Haixiang, WEI Hongkai, et al. CIMOT3D: Chinese-Instruction-Based Monocular 3D Multi-Object Tracking[J]. Acta Electronica Sinica, 2026, 54(01): 102-114. DOI: 10.12263/DZXB.20250826

Abstract

Natural-language-driven object tracking parses descriptions phrased the way people naturally speak and fuses them with visual information to accurately identify and continuously track specific targets in complex environments. However, existing methods focus on 2D scenes or 3D single-object tracking and have not been extended to 3D multi-object tracking; they lack the ability to align text features with multiple candidate targets in 3D visual space and to establish associations between them. In addition, existing natural-language-driven 3D object tracking tasks suffer from redundant language descriptions, which makes it hard to emulate the human ability to track multiple specific targets from flexible, concise instructions. To address these challenges, this paper introduces a new task, Chinese-instruction-based monocular 3D multi-object tracking (CIMOT3D), and constructs a dataset, CIMOT3D-5k, containing 5,562 video sequences, each annotated with Chinese descriptions that follow natural human phrasing. The paper also designs a neural network model for this task, the Chinese-instruction-based monocular 3D multi-object tracking synchronization tracker (CIMOT3D-SyncTracker), whose framework consists of three parts: a multimodal feature extractor, a vision-language encoder-decoder, and a detection-tracking module. Compared with baseline methods, the proposed approach improves tracking accuracy and the identity-consistency metric on the CIMOT3D-5k dataset by 4.1 and 5.0 percentage points, respectively, verifying its performance advantage. This work extends research on vision-language fusion to 3D multi-object tracking and offers new ideas for further exploration in related fields.
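The three-part structure named in the abstract can be pictured as a simple data flow. The following is only a minimal sketch under stated assumptions: `Track3D`, `run_pipeline`, the `toy_*` stand-ins, the `match_score` field, the `min_score` threshold, and the example instruction string are all hypothetical names and values chosen for illustration, and none of the internals reflect the paper's actual CIMOT3D-SyncTracker implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Track3D:
    """One tracked object; all fields are illustrative placeholders."""
    track_id: int
    center_xyz: Tuple[float, float, float]
    match_score: float  # assumed instruction-match score in [0, 1]


def run_pipeline(
    frame: List[float],
    instruction: str,
    extract: Callable[[List[float], str], Dict[str, List[float]]],
    fuse: Callable[[Dict[str, List[float]]], List[float]],
    detect_and_track: Callable[[List[float]], List[Track3D]],
    min_score: float = 0.5,  # hypothetical threshold, not from the paper
) -> List[Track3D]:
    """Chain the three stages and keep tracks that match the instruction."""
    features = extract(frame, instruction)  # stage 1: per-modality features
    fused = fuse(features)                  # stage 2: vision-language fusion
    tracks = detect_and_track(fused)        # stage 3: 3D boxes with identities
    return [t for t in tracks if t.match_score >= min_score]


# Toy stand-ins so the sketch runs end to end; real stages would be networks.
def toy_extract(frame, instruction):
    return {"visual": list(frame), "text": [float(len(instruction))]}


def toy_fuse(features):
    return features["visual"] + features["text"]


def toy_detect_and_track(fused):
    return [
        Track3D(1, (0.0, 0.0, 10.0), 0.9),
        Track3D(2, (3.0, 0.0, 25.0), 0.2),
    ]


# Hypothetical Chinese instruction ("the white sedan on the left").
result = run_pipeline(
    [0.1, 0.2], "左侧的白色轿车", toy_extract, toy_fuse, toy_detect_and_track
)
print([t.track_id for t in result])
```

Passing the stages in as callables keeps the sketch agnostic about how each component is built; only the order of the three stages is taken from the abstract.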

Keywords
