Research on Mixed Architecture Hamilton-Jacobi-Bellman Proximal Policy Optimization Method for Autonomous Driving

WANG Jinqiang; SONG Lirong; JIANG Yuanbo; YONG Binbin; LI Yan; ZHOU Qingguo

doi:10.12263/DZXB.20250977

Research article | Updated: 2026-06-16

- Research on Mixed Architecture Hamilton-Jacobi-Bellman Proximal Policy Optimization Method for Autonomous Driving
- Acta Electronica Sinica, 2026, Vol. 54, No. 3, pp. 1024-1035
- Author affiliation:
  
  School of Information Science and Engineering, Lanzhou University, Lanzhou, Gansu 730000
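- About the authors:
  
  WANG Jinqiang, male, was born in April 1993 in Dingxi, Gansu Province. He is currently a Cuiying postdoctoral fellow at the School of Nuclear Science and Technology, Lanzhou University. His main research interests are deep reinforcement learning, AI4Science, and robotics. E-mail: jqwang16@lzu.edu.cn
  SONG Lirong, female, was born in October 2000 in Haidong, Qinghai Province. She is currently a master's student at the School of Information Science and Engineering, Lanzhou University. Her main research interests are deep reinforcement learning and autonomous driving. E-mail: songlr2023@lzu.edu.cn
  JIANG Yuanbo, male, was born in July 1999 in Pingdingshan, Henan Province. He is currently a doctoral student at the School of Information Science and Engineering, Lanzhou University. His main research interests are deep reinforcement learning and autonomous driving. E-mail: jyuanbo2025@lzu.edu.cn
  YONG Binbin, male, was born in November 1988 in Shangqiu, Henan Province. He is currently an associate professor and master's supervisor at the School of Information Science and Engineering, Lanzhou University. His main research interests are deep learning, parallel computing, and autonomous driving. E-mail: yongbb@lzu.edu.cn
  LI Yan, female, was born in October 1976 in Wuwei, Gansu Province. She is currently an associate professor and master's supervisor at the School of Information Science and Engineering, Lanzhou University. Her main research interests are natural language processing and deep reinforcement learning. E-mail: liyan_2007@lzu.edu.cn
  ZHOU Qingguo, male, was born in October 1973 in Sanming, Fujian Province. He is currently a professor and doctoral supervisor at the School of Information Science and Engineering, Lanzhou University. His main research interests are embedded systems, network security, and embodied intelligence. E-mail: zhouqg@lzu.edu.cn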
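- Funding:
  
  Fundamental Research Funds for the Central Universities, Lanzhou University (lzujbky-2024-eyt01); National Natural Science Foundation of China (61402210); Gansu Province Top Leading Talent Project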
- DOI: 10.12263/DZXB.20250977
  CLC number: TP273.5
- Received: 2025-12-11; Accepted: 2026-01-06; Published in print: 2026-03-25
WANG Jinqiang, SONG Lirong, JIANG Yuanbo, et al. Research on Mixed Architecture Hamilton-Jacobi-Bellman Proximal Policy Optimization Method for Autonomous Driving[J]. Acta Electronica Sinica, 2026, 54(03): 1024-1035. DOI: 10.12263/DZXB.20250977

Abstract

Deep reinforcement learning (DRL) provides a powerful end-to-end learning framework for the complex sequential decision-making problems that arise in autonomous driving, but the safety of vehicle control policies remains a core challenge. Physics-informed reinforcement learning (PIRL) methods based on the Hamilton-Jacobi-Bellman (HJB) equation have shown great potential. In practice, however, such methods are limited by the performance of the chosen neural network: a conventional multilayer perceptron (MLP) struggles to provide high-fidelity gradient signals for the HJB physical constraint, which leads to unstable training and low model efficiency. To address this problem, we propose a mixed architecture Hamilton-Jacobi-Bellman proximal policy optimization (MAHPO) algorithm for autonomous driving tasks. The method builds a heterogeneous Actor-Critic framework in which the policy network (Actor) uses an MLP to keep decision-making efficient, while the value function network (Critic) is approximated by a Kolmogorov-Arnold network (KAN). Furthermore, by training the learnable smooth B-spline functions inside the KAN value network, the Critic adaptively learns nonlinear transformations from trajectory data, so that it can efficiently model complex value functions and their smooth gradient fields and thereby keep policy network updates stable. Experiments in the MetaDrive autonomous driving simulation environment show that, relative to the baseline algorithms, MAHPO achieves clear improvements on key performance metrics such as task success rate, collision rate, and off-road rate: its average success rate is 5.88% higher than that of the best-performing baseline, soft actor-critic (SAC), and its off-road rate is about 78.22% lower than that of the original HJBPPO algorithm.
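The idea of a spline-based Critic whose gradient feeds an HJB constraint can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single additive layer of per-dimension cubic B-spline functions (a real KAN stacks several such layers), a finite-difference estimate of the dynamics, and one textbook form of the discounted HJB residual, r + ∇V·f + (ln γ / Δt)·V, with the reward treated as a rate. All names (`make_knots`, `SplineCritic`, `hjb_residual`) and all parameter values are hypothetical.

```python
import math
import random


def make_knots(lo, hi, n_intervals, degree):
    """Uniform knot vector extended by `degree` knots on each side of [lo, hi]."""
    h = (hi - lo) / n_intervals
    return [lo + (i - degree) * h for i in range(n_intervals + 2 * degree + 1)]


def bspline_basis(x, knots, degree):
    """Cox-de Boor recursion: all B-spline basis values of the given degree at x."""
    basis = [1.0 if knots[i] <= x < knots[i + 1] else 0.0
             for i in range(len(knots) - 1)]
    for d in range(1, degree + 1):
        nxt = []
        for i in range(len(knots) - d - 1):
            a = knots[i + d] - knots[i]
            b = knots[i + d + 1] - knots[i + 1]
            left = (x - knots[i]) / a * basis[i] if a > 0 else 0.0
            right = (knots[i + d + 1] - x) / b * basis[i + 1] if b > 0 else 0.0
            nxt.append(left + right)
        basis = nxt
    return basis


def bspline_basis_deriv(x, knots, degree):
    """Analytic derivative of each basis function, built from degree-1 bases."""
    lower = bspline_basis(x, knots, degree - 1)
    out = []
    for i in range(len(knots) - degree - 1):
        a = knots[i + degree] - knots[i]
        b = knots[i + degree + 1] - knots[i + 1]
        t1 = lower[i] / a if a > 0 else 0.0
        t2 = lower[i + 1] / b if b > 0 else 0.0
        out.append(degree * (t1 - t2))
    return out


class SplineCritic:
    """Simplified KAN-style value function: V(s) = sum_j phi_j(s_j),
    where each phi_j is a learnable B-spline (coefficients in self.coef)."""

    def __init__(self, state_dim, lo=-1.0, hi=1.0, n_intervals=5, degree=3, seed=0):
        rng = random.Random(seed)
        self.degree = degree
        self.knots = make_knots(lo, hi, n_intervals, degree)
        n_basis = n_intervals + degree
        self.coef = [[rng.uniform(-0.1, 0.1) for _ in range(n_basis)]
                     for _ in range(state_dim)]

    def value(self, s):
        return sum(c * b
                   for row, x in zip(self.coef, s)
                   for c, b in zip(row, bspline_basis(x, self.knots, self.degree)))

    def grad(self, s):
        """Exact gradient dV/ds, smooth because cubic splines are C2."""
        return [sum(c * d for c, d in
                    zip(row, bspline_basis_deriv(x, self.knots, self.degree)))
                for row, x in zip(self.coef, s)]


def hjb_residual(critic, s, s_next, reward, gamma=0.99, dt=0.1):
    """Assumed residual form: r + grad V . f + (ln gamma / dt) * V,
    with f estimated from consecutive states by finite differences."""
    f_hat = [(b - a) / dt for a, b in zip(s, s_next)]
    drift = sum(g * f for g, f in zip(critic.grad(s), f_hat))
    return reward + drift + math.log(gamma) / dt * critic.value(s)
```

In a training loop, the squared residual over sampled transitions would typically be added to the usual value loss; because the gradient here is available in closed form from the spline coefficients, the constraint term does not depend on a noisy numerical derivative of the Critic.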
