School of Astronautics, Beihang University, Beijing 102206, China
Received: 11 November 2024
Revised: 17 July 2025
Published: 25 July 2025
YUAN Ding, LI Yuan, MENG Yu-qian, et al. A Motion Planning Method for Autonomous Driving Based on Spatiotemporal Attention Transformer[J]. Acta Electronica Sinica, 2025, 53(07): 2418-2427. DOI:10.12263/DZXB.20241022
The static agents, dynamic agents, road structures, and interactions among these elements in driving scenarios are typically complex and change rapidly across time and space. Consequently, motion prediction for autonomous vehicles remains a challenging task. One open problem is how to efficiently represent and fuse multi-modal scene information, including road conditions, the states of different agents, and their historical interactions. Most current approaches rely on independently designed modules that process each modality in parallel. This design limits system flexibility and makes adjustment difficult, and the independent components often introduce high computational redundancy, which lowers overall computational efficiency. Furthermore, decoding the temporal and spatial information of a driving scene into action commands that ensure safe driving is itself challenging. This paper proposes an autonomous driving motion planning method based on a spatiotemporal attention Transformer, comprising a phased multi-modal scene encoder and a spatiotemporal fusion decoder. The model progressively constructs a multi-modal description of the driving scene and predicts the safe future motion of the ego vehicle under spatiotemporal fusion. The proposed approach establishes a new comparison baseline on the large-scale nuScenes autonomous driving dataset and achieves competitive results.