NLOT3D: Natural-Language-Driven 3D Object Tracking in Monocular View

YANG Yang; WEI Hong-kai; SUN Shi-jie; SONG Xiang-yu; HU Hong-li; GUO Ke-yu; SONG Huan-sheng

doi:10.12263/DZXB.20241160

PAPERS | Updated: 2025-10-16

- ACTA ELECTRONICA SINICA Vol. 53, Issue 6, Pages: 2038-2049(2025)
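- Author affiliations:
  
  1. School of Information Engineering, Chang'an University, Xi'an, Shaanxi 710064
  2. School of Data Science and Artificial Intelligence, Chang'an University, Xi'an, Shaanxi 710064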
- Funding:
  
  Jiangxi Provincial Youth Fund Project (S2024QNJJL0062); Fundamental Research Funds for the Central Universities, Chang'an University (300102244202)
- DOI: 10.12263/DZXB.20241160
  CLC: TP391.41
- Received: 25 December 2024
  Revised: 13 May 2025
  Published: 25 June 2025

YANG Yang, WEI Hong-kai, SUN Shi-jie, et al. NLOT3D: Natural-Language-Driven 3D Object Tracking in Monocular View[J]. Acta Electronica Sinica, 2025, 53(06): 2038-2049.

Abstract

Natural-language-driven object tracking uses a natural language description to guide visual object tracking, fusing the textual description with visual image information so that a machine can perceive and understand the real three-dimensional world "like a human". With the development of deep learning, new methods for natural-language-driven visual object tracking continue to emerge. However, most existing methods are limited to two-dimensional space and fail to fully exploit pose information in three-dimensional space, so they cannot perceive the world in three dimensions as naturally as humans do. Conventional 3D object tracking, in turn, relies on expensive sensors and is constrained in data acquisition and processing, which makes 3D object tracking more complicated. To address these challenges, this paper proposes a new task, natural language-driven object tracking in 3D (NLOT3D) in monocular view, and constructs the corresponding dataset, NLOT3D-SPD. In addition, this paper designs an end-to-end model, NLOT3D-TR (Natural Language-driven Object Tracking in 3D based on Transformer), which fuses visual and textual cross-modal features and achieves excellent experimental results on the NLOT3D-SPD dataset. The paper provides a comprehensive benchmark for the NLOT3D task, together with comparative experiments and ablation studies, to support further development of 3D object tracking.
