Towards Unified Modeling of 3D Detection and Spatial Semantic Captioning with Monocular Images

WANG Tiantian; GUO Keyu; LUO Hanke; SUN Shijie; CHENG Huize; SUN Zhanglong

doi:10.12263/DZXB.20251129

PAPERS | Updated: 2026-06-16

- ACTA ELECTRONICA SINICA Vol. 54, Issue 3, Pages: 1105-1117 (2026)
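- Author affiliations:
  
  1. School of Information Engineering, Chang'an University, Xi'an, Shaanxi 710064
  2. School of Data Science and Artificial Intelligence, Chang'an University, Xi'an, Shaanxi 710064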
- Funding:
  
  National Key Research and Development Program of China (2023YFB4301800); National Natural Science Foundation of China (62576050); Jiangxi Provincial Youth Fund Project (S2024QNJJL0062)
- DOI: 10.12263/DZXB.20251129
  CLC: TP391.41
- Received: 02 February 2026
  
  Accepted: 07 March 2026
  
  Published: 25 March 2026

WANG Tiantian, GUO Keyu, LUO Hanke, et al. Towards Unified Modeling of 3D Detection and Spatial Semantic Captioning with Monocular Images[J]. Acta Electronica Sinica, 2026, 54(03): 1105-1117. DOI: 10.12263/DZXB.20251129

Abstract

Monocular 3D object detection provides a low-cost solution for 3D perception. However, existing methods struggle to generate scene descriptions that humans can readily understand, which limits their applicability in scenarios requiring rich semantic understanding, such as human-computer interaction and autonomous driving. As a direct manifestation of human linguistic intelligence, visual captioning offers an ideal communication medium, endowing machines with the ability to intuitively “narrate” a scene. However, existing visual captioning methods primarily focus on monocular image content and can only describe two-dimensional topological relationships between objects; they lack the capability to accurately model and express 3D geometric information (e.g., precise distance, spatial location, and motion state). A two-stage approach that first performs 3D detection and then uses a large model to generate descriptions suffers from low system efficiency and poor information consistency. Moreover, the descriptions generated by large models are limited to topological relationships and fail to accurately reflect 3D geometric information. In addition, the inherent “hallucination” of large models often leads to inaccurate spatial information accompanied by redundant descriptions. To address these issues, this paper proposes, for the first time, a novel task: monocular 3D detection and captioning (Mono3DDC). This task aims to unify monocular 3D object detection with caption generation, requiring the model to simultaneously learn depth-aware visual features and linguistic semantics. Through an end-to-end network architecture, the generated descriptions can accurately capture consistent 3D spatial information, ensuring both geometric accuracy in the descriptions and high precision in 3D object detection. To support in-depth exploration of this interdisciplinary research area, we construct KITTI3DDC, the first benchmark dataset for monocular 3D visual captioning with Chinese semantic support. Built on the KITTI dataset, it uses an efficient automated data generation pipeline that combines a large language model with structured verification templates. While ensuring descriptive diversity and linguistic fluency, the pipeline strictly maintains the geometric consistency of spatial information, providing high-quality multimodal supervision signals for subsequent research. Furthermore, we design a novel unified framework, Mono3DDC-TR (Monocular 3D Detection and Captioning based on Transformer). By deeply integrating optimized geometric and visual features, this framework achieves significant advantages in both caption generation accuracy and multi-category 3D detection performance. It attains state-of-the-art results on the constructed KITTI3DDC benchmark, validating the effectiveness of the proposed end-to-end unified framework in jointly modeling geometric information and linguistic semantics. This paper provides a comprehensive benchmark for the Mono3DDC task, effectively promoting its development and practical application.
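
The abstract does not spell out how the structured verification templates check a generated caption, so the following is only a minimal sketch of the general idea, under stated assumptions. It reads one object from the standard KITTI label format (location fields at indices 11 to 13, in camera coordinates), derives a few spatial facts, and accepts a caption only if the category, side, and stated distance agree with them. The names `Box3D`, `spatial_facts`, and `verify_caption`, the ground-plane distance, the 0.5 m tolerance, and the English captions (KITTI3DDC is a Chinese-language dataset) are all hypothetical choices made for illustration, not the authors' pipeline.

```python
import math
import re
from dataclasses import dataclass


@dataclass
class Box3D:
    category: str
    x: float  # metres, positive to the right of the camera
    y: float  # metres, positive downwards
    z: float  # metres, positive in front of the camera


def parse_kitti_label(line: str) -> Box3D:
    """Parse one line of a KITTI object label file (15 whitespace-separated fields)."""
    f = line.split()
    return Box3D(category=f[0], x=float(f[11]), y=float(f[12]), z=float(f[13]))


def spatial_facts(box: Box3D) -> dict:
    """Derive the facts a caption may state (assumed set: category, distance, side)."""
    return {
        "category": box.category.lower(),
        # Assumption: distance is measured on the ground plane (x, z).
        "distance_m": round(math.hypot(box.x, box.z), 1),
        "side": "left" if box.x < 0 else "right",
    }


def verify_caption(caption: str, facts: dict, tol_m: float = 0.5) -> bool:
    """Accept a caption only if category, side and every stated distance match."""
    text = caption.lower()
    words = set(re.findall(r"[a-z]+", text))
    wrong_side = "right" if facts["side"] == "left" else "left"
    if facts["category"] not in words or facts["side"] not in words:
        return False
    if wrong_side in words:
        return False
    # Collect every number followed by a metre unit, e.g. "5.2 m" or "12 meters".
    stated = [float(v) for v in
              re.findall(r"(\d+(?:\.\d+)?)\s*(?:m|meters?|metres?)\b", text)]
    return bool(stated) and all(abs(v - facts["distance_m"]) <= tol_m for v in stated)


# Illustrative label line (made-up values in the KITTI field layout).
LABEL = "Car 0.00 0 1.55 600.0 170.0 650.0 210.0 1.5 1.6 3.9 -3.0 1.6 4.0 1.6"
facts = spatial_facts(parse_kitti_label(LABEL))
accepted = verify_caption("A car is about 5.2 m away on the left.", facts)
rejected = verify_caption("A car is 12 meters ahead on the left.", facts)
```

In a pipeline of this kind, a caption paraphrased by a language model would be kept only when the check passes; rejected captions could be regenerated or replaced by the plain template text.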
