NLOT3D: Natural-Language-Driven 3D Object Tracking in Monocular View

Yang Yang; Wei Hongkai; Sun Shijie; Song Xiangyu; Hu Hongli; Guo Keyu; Song Huansheng

doi:10.12263/DZXB.20241160

Research article | Updated: 2025-10-16

- NLOT3D: Natural-Language-Driven 3D Object Tracking in Monocular View
- Acta Electronica Sinica, 2025, Vol. 53, No. 6, pp. 2038-2049
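- Author affiliations:
  1. School of Information Engineering, Chang'an University, Xi'an, Shaanxi 710064, China
  2. Institute of Data Science and Artificial Intelligence, Chang'an University, Xi'an, Shaanxi 710064, China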
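- About the authors:
  Yang Yang, male, born in December 2000 in Hanzhong, Shaanxi. Master's student at the School of Information Engineering, Chang'an University. Research interests: visual grounding, multimodal fusion, and computer vision. E-mail: yangyy@chd.edu.cn
  Wei Hongkai, male, born in May 2001 in Nanping, Fujian. PhD student at the School of Information Engineering, Chang'an University. Research interests: 3D computer vision, traffic scene understanding, and medical image processing. E-mail: hongkaiwei@chd.edu.cn
  Sun Shijie, male, born in October 1989 in Shangqiu, Henan. Associate professor at the Institute of Data Science and Artificial Intelligence, Chang'an University, and doctoral supervisor for international students. Research interests: multi-object detection and tracking, 3D traffic reconstruction, and multi-object pose estimation. E-mail: shijieSun@chd.edu.cn
  Song Xiangyu, male, born in March 1991 in Xi'an, Shaanxi. Associate professor and doctoral supervisor at the Institute of Data Science and Artificial Intelligence, Chang'an University. Research interests: video-language large models for traffic anomaly events and applications of multimodal learning. E-mail: xiangyu.song@chd.edu.cn
  Hu Hongli, female, born in April 2002 in Puyang, Henan. Master's student at the School of Information Engineering, Chang'an University. Research interests: computer vision, pose estimation, and 3D detection. E-mail: hhlhu@chd.edu.cn
  Guo Keyu, male, born in September 1999 in Qiannan, Guizhou. PhD student at the School of Information Engineering, Chang'an University. Research interests: computer vision and scene understanding. E-mail: keyuguo@chd.edu.cn
  Song Huansheng, male, born in October 1964 in Chifeng, Inner Mongolia. Doctoral supervisor and second-grade professor at the School of Information Engineering, Chang'an University, and recipient of the State Council Special Government Allowance. Research interests: machine-vision-based traffic perception and traffic early warning. E-mail: hshsong@chd.edu.cn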
- Funding:
  Jiangxi Provincial Youth Fund (S2024QNJJL0062); Fundamental Research Funds for the Central Universities, Chang'an University (300102244202)
- CLC number: TP391.41
- Received: 2024-12-25
  Revised: 2025-05-13
  Published in print: 2025-06-25

YANG Yang, WEI Hong-kai, SUN Shi-jie, et al. NLOT3D: Natural-Language-Driven 3D Object Tracking in Monocular View[J]. Acta Electronica Sinica, 2025, 53(06): 2038-2049. DOI: 10.12263/DZXB.20241160

Abstract

Natural-language-driven object tracking uses a natural language description to guide visual object tracking. By fusing the textual description with the visual information in images, it lets a machine perceive and understand the real three-dimensional world "like a human". With the development of deep learning, new methods for natural-language-driven visual tracking keep emerging. However, most existing methods are limited to two-dimensional space and fail to fully exploit pose information in three-dimensional space, so they cannot perceive the world in three dimensions as naturally as humans do. Conventional 3D object tracking, in turn, relies on expensive sensors and is limited in data acquisition and processing, which makes 3D object tracking more complicated. To address these challenges, this paper proposes a new task, natural language-driven object tracking in 3D (NLOT3D) in monocular view, and constructs a corresponding dataset, NLOT3D-SPD. In addition, this paper designs an end-to-end model, NLOT3D-TR (Natural Language-driven Object Tracking in 3D based on Transformer), which fuses visual and textual cross-modal features and achieves excellent experimental results on the NLOT3D-SPD dataset. This paper provides a comprehensive benchmark for the NLOT3D task, together with comparative experiments and ablation studies, supporting further development in the field of 3D object tracking.
