A Motion Planning Method for Autonomous Driving Based on Spatiotemporal Attention Transformer

YUAN Ding; LI Yuan; MENG Yu-qian; ZHANG Hong; YANG Yi-fan

doi:10.12263/DZXB.20241022


PAPERS | Updated: 2025-12-10

- ACTA ELECTRONICA SINICA Vol. 53, Issue 7, Pages: 2418-2427(2025)
- Affiliation: School of Astronautics, Beihang University, Beijing 102206
- Funding: National Natural Science Foundation of China (62002005; 61972015)
- DOI: 10.12263/DZXB.20241022
- CLC: TP183
- Received: 11 November 2024; Revised: 17 July 2025; Published: 25 July 2025
YUAN Ding, LI Yuan, MENG Yu-qian, et al. A Motion Planning Method for Autonomous Driving Based on Spatiotemporal Attention Transformer[J]. Acta Electronica Sinica, 2025, 53(07): 2418-2427. DOI: 10.12263/DZXB.20241022


Abstract

The static and dynamic agents, road structures, and interactions among various elements in driving scenarios are typically complex and rapidly change across time and space. Consequently, motion prediction for autonomous vehicles remains a challenging task, especially with the open problem of efficiently representing and integrating multi-modal scene information, including road conditions, various agent states, and historical interaction information. Current approaches often rely on independently designed modules to process each modality in parallel. However, this approach tends to result in limited system flexibility, challenging adjustments, and, frequently, high computational redundancy, which reduces overall system efficiency. Furthermore, decoding the spatiotemporal information from autonomous driving scenarios to generate safe driving commands is inherently challenging. This paper proposes an autonomous driving motion planning method based on a spatiotemporal attention Transformer, comprising a phased multi-modal scene encoder and a spatiotemporal fusion decoder. This model progressively constructs a multi-modal scene representation and predicts the future safe trajectory of the autonomous vehicle under spatiotemporal fusion. The proposed approach establishes a new baseline on the large-scale nuScenes autonomous driving dataset, achieving competitive results.
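The abstract does not spell out how attention is applied across time and space, so the following is only a minimal sketch of one common reading: a temporal self-attention pass over each agent's history, followed by a spatial pass across agents at each timestep. Everything here is an assumption made for illustration, including the function names, the identity query/key/value projections (a real model would learn these), and the toy scene values; it is not the encoder or decoder described in the paper.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(tokens):
    # Scaled dot-product self-attention with identity Q/K/V projections
    # (assumption: learned projections are omitted to keep the sketch short).
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out

def spatiotemporal_attention(scene):
    # scene[a][t] is the feature vector of agent a at timestep t.
    # Temporal pass: each agent attends over its own history.
    temporal = [attend(track) for track in scene]
    num_agents, num_steps = len(temporal), len(temporal[0])
    # Spatial pass: at each timestep, agents attend to one another.
    fused = [[None] * num_steps for _ in range(num_agents)]
    for t in range(num_steps):
        mixed = attend([temporal[a][t] for a in range(num_agents)])
        for a in range(num_agents):
            fused[a][t] = mixed[a]
    return fused

# Hypothetical two-agent scene with three timesteps of 2-D features.
scene = [
    [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]],  # ego vehicle
    [[5.0, 1.0], [4.0, 1.0], [3.0, 1.0]],  # another agent
]
fused = spatiotemporal_attention(scene)
```

The output keeps the input's (agent, timestep, feature) layout, and each fused vector is a weighted average of the input vectors, which is why factorizing attention along the two axes is a cheap way to mix temporal and spatial context.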


