

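1. School of Information Engineering, Chang'an University, Xi'an, Shaanxi 710064
2. Institute of Data Science and Artificial Intelligence, Chang'an University, Xi'an, Shaanxi 710064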
Received: 02 February 2026
Accepted: 07 March 2026
Published: 25 March 2026
WANG Tiantian, GUO Keyu, LUO Hanke, et al. Towards Unified Modeling of 3D Detection and Spatial Semantic Captioning with Monocular Images[J]. Acta Electronica Sinica, 2026, 54(03): 1105-1117. (in Chinese) DOI: 10.12263/DZXB.20251129
Monocular 3D object detection offers a low-cost solution for 3D perception. However, existing methods cannot readily produce scene descriptions that humans can understand intuitively, which limits their applicability in scenarios that require rich semantic understanding, such as human-computer interaction and autonomous driving. As a direct manifestation of human linguistic intelligence, visual captioning is an ideal communication medium that gives machines the ability to “narrate” a scene. Existing visual captioning methods, however, focus mainly on monocular image content and can describe only the two-dimensional topological relationships between objects; they cannot accurately model or express 3D geometric information such as precise distance, spatial location, and motion state. A two-stage approach that first performs 3D detection and then uses a large model to generate descriptions suffers from low system efficiency and poor information consistency. Moreover, the descriptions generated by large models remain limited to topological relationships and fail to reflect 3D geometric information accurately, and the inherent “hallucination” of large models leads to inaccurate spatial information and redundant descriptions. To address these issues, this paper proposes, for the first time, a novel task: monocular 3D detection and captioning (Mono3DDC). The task unifies monocular 3D object detection with caption generation and requires the model to learn depth-aware visual features and linguistic semantics simultaneously. An end-to-end network architecture allows the generated descriptions to capture consistent 3D spatial information accurately, ensuring both geometric accuracy in the descriptions and high precision in 3D object detection. To support in-depth exploration of this interdisciplinary area, we construct KITTI3DDC, the first benchmark dataset for monocular 3D visual captioning with Chinese semantic support. Built on the KITTI dataset, it uses an efficient automated data generation pipeline in which a large language model works together with structured verification templates. The pipeline preserves the diversity and linguistic fluency of the descriptions while strictly enforcing the geometric consistency of their spatial information, providing high-quality multimodal supervision signals for subsequent research. Furthermore, we design a novel unified framework, Mono3DDC-TR (Monocular 3D Detection and Captioning based on Transformer). By deeply integrating optimized geometric and visual features, the framework shows significant advantages in both caption generation accuracy and multi-category 3D detection performance. It achieves state-of-the-art results on the constructed KITTI3DDC benchmark, validating the effectiveness of the proposed end-to-end unified framework in jointly modeling geometric information and linguistic semantics. This paper provides a comprehensive benchmark for the Mono3DDC task and promotes its development and practical application.
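The abstract does not specify how the structured verification templates work, so the following is only a minimal sketch of the general idea, not the authors' pipeline. It assumes the standard KITTI object label layout (15 whitespace-separated fields, with the 3D location in camera coordinates at fields 11 to 13), builds a template caption from one label, and rejects any caption whose stated distance disagrees with the label. The names `Box3D`, `template_caption`, and `verify_caption`, the 1.0 m lateral margin, the 0.5 m tolerance, and the English caption wording are all hypothetical choices made for illustration (the dataset itself uses Chinese captions).

```python
import math
import re
from dataclasses import dataclass


@dataclass
class Box3D:
    category: str
    x: float  # metres to the right of the camera
    y: float  # metres below the camera
    z: float  # metres in front of the camera


def parse_kitti_label(line: str) -> Box3D:
    """Read the class name and 3D location from one KITTI label line."""
    fields = line.split()
    return Box3D(fields[0], float(fields[11]), float(fields[12]), float(fields[13]))


def ground_distance(box: Box3D) -> float:
    """Distance from the camera measured in the x-z (ground) plane."""
    return math.hypot(box.x, box.z)


def template_caption(box: Box3D, lateral_margin: float = 1.0) -> str:
    """Fill a fixed sentence template with geometry taken from the label."""
    if box.x < -lateral_margin:
        side = "ahead on the left"
    elif box.x > lateral_margin:
        side = "ahead on the right"
    else:
        side = "directly ahead"
    return f"A {box.category.lower()} is about {ground_distance(box):.1f} m away, {side}."


def verify_caption(caption: str, box: Box3D, tol: float = 0.5) -> bool:
    """Accept a caption only if its class and stated distance match the label."""
    if box.category.lower() not in caption.lower():
        return False
    match = re.search(r"(\d+(?:\.\d+)?)\s*m\b", caption)
    if match is None:
        return False
    return abs(float(match.group(1)) - ground_distance(box)) <= tol


# Hypothetical label line in KITTI format, used only for this example.
label = "Car 0.00 0 -1.58 587.01 173.33 614.12 200.12 1.65 1.67 3.64 -0.65 1.71 46.70 -1.59"
box = parse_kitti_label(label)
caption = template_caption(box)
print(caption, verify_caption(caption, box))
```

In a pipeline of this shape, a caption rewritten by a language model for fluency could be passed through the same `verify_caption` check, so that a rewrite with a hallucinated distance or object class is discarded instead of entering the dataset.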