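School of Information Science and Engineering, Lanzhou University, Lanzhou, Gansu 730000
WANG Jinqiang, male, born in April 1993 in Dingxi, Gansu Province. He is currently a Cuiying Postdoctoral Fellow at the School of Nuclear Science and Technology, Lanzhou University. His research interests include deep reinforcement learning, AI4Science, and robotics. E-mail: jqwang16@lzu.edu.cn
SONG Lirong, female, born in October 2000 in Haidong, Qinghai Province. She is currently a master's student at the School of Information Science and Engineering, Lanzhou University. Her research interests include deep reinforcement learning and autonomous driving. E-mail: songlr2023@lzu.edu.cn
JIANG Yuanbo, male, born in July 1999 in Pingdingshan, Henan Province. He is currently a Ph.D. candidate at the School of Information Science and Engineering, Lanzhou University. His research interests include deep reinforcement learning and autonomous driving. E-mail: jyuanbo2025@lzu.edu.cn
YONG Binbin, male, born in November 1988 in Shangqiu, Henan Province. He is currently an associate professor and master's supervisor at the School of Information Science and Engineering, Lanzhou University. His research interests include deep learning, parallel computing, and autonomous driving. E-mail: yongbb@lzu.edu.cn
LI Yan, female, born in October 1976 in Wuwei, Gansu Province. She is currently an associate professor and master's supervisor at the School of Information Science and Engineering, Lanzhou University. Her research interests include natural language processing and deep reinforcement learning. E-mail: liyan_2007@lzu.edu.cn
ZHOU Qingguo, male, born in October 1973 in Sanming, Fujian Province. He is currently a professor and doctoral supervisor at the School of Information Science and Engineering, Lanzhou University. His research interests include embedded systems, network security, and embodied intelligence. E-mail: zhouqg@lzu.edu.cn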
Received: 2025-12-11; Accepted: 2026-01-06; Published in print: 2026-03-25
WANG Jinqiang, SONG Lirong, JIANG Yuanbo, et al. Research on Mixed Architecture Hamilton-Jacobi-Bellman Proximal Policy Optimization Method for Autonomous Driving[J]. Acta Electronica Sinica, 2026, 54(03): 1024-1035. DOI:10.12263/DZXB.20250977
Deep reinforcement learning (DRL) provides a powerful end-to-end learning framework for complex sequential decision-making problems in autonomous driving, but the safety of vehicle control policies remains a core challenge. Physics-informed reinforcement learning (PIRL) methods based on the Hamilton-Jacobi-Bellman (HJB) equation have shown significant potential in this respect. In practice, however, such methods are severely limited by the performance of the neural networks they employ: conventional multilayer perceptrons (MLPs) struggle to provide high-fidelity gradient signals for the HJB physical constraints, which leads to unstable training and inefficient models. To address this problem, we propose a Mixed Architecture Hamilton-Jacobi-Bellman Proximal Policy Optimization (MAHPO) algorithm tailored to autonomous driving tasks. MAHPO builds a heterogeneous Actor-Critic framework in which the policy network (Actor) is an MLP, to keep decision-making efficient, while the value function network (Critic) is approximated by a Kolmogorov-Arnold network (KAN). The KAN-based value network contains internal learnable smooth B-spline functions that adaptively learn nonlinear transformations from trajectory data. This allows it to model complex value functions and their smooth gradient fields efficiently, which in turn keeps the policy network updates stable. Experimental results in the MetaDrive simulation environment show that MAHPO clearly outperforms the baselines on key metrics including success rate, collision rate, and off-road rate. Compared with Soft Actor-Critic (SAC), the strongest baseline, MAHPO improves the average success rate by 5.88%, and it reduces the off-road rate by about 78.22% relative to the original HJBPPO algorithm.
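The abstract's central technical claim is that a KAN critic, whose edges carry learnable B-spline functions, supplies the smooth value gradient that an HJB constraint needs. The Python sketch below illustrates that idea only under stated assumptions and is not the authors' implementation. It uses a single KAN layer, so V(s) = Σ_p φ_p(s_p), instead of a deep KAN; each edge has the residual form φ(x) = w_b·silu(x) + w_s·Σ_i c_i·B_i(x) on a uniform cubic grid, roughly following the KAN paper; and the physics term is assumed to be the discounted continuous-time policy-evaluation HJB residual r + ∇V(s)·ṡ − ρV(s), with ṡ estimated as (s_{t+1} − s_t)/Δt from trajectory data. The names SplineEdge, KANValue, hjb_residual, rho, and dt are hypothetical.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def silu(x):
    return x * sigmoid(x)

def silu_grad(x):
    s = sigmoid(x)
    return s + x * s * (1.0 - s)

def make_knots(x_min, x_max, num_intervals, degree):
    """Uniform grid on [x_min, x_max], extended by `degree` knots on each side."""
    h = (x_max - x_min) / num_intervals
    return [x_min + (i - degree) * h for i in range(num_intervals + 2 * degree + 1)]

def bspline_basis(x, knots, degree):
    """Values of all B_{i,degree}(x) via the Cox-de Boor recursion."""
    # Degree 0: indicator of each half-open knot span.
    basis = [1.0 if knots[i] <= x < knots[i + 1] else 0.0 for i in range(len(knots) - 1)]
    for k in range(1, degree + 1):
        basis = [
            (x - knots[i]) / (knots[i + k] - knots[i]) * basis[i]
            + (knots[i + k + 1] - x) / (knots[i + k + 1] - knots[i + 1]) * basis[i + 1]
            for i in range(len(basis) - 1)
        ]
    return basis

def bspline_basis_deriv(x, knots, degree):
    """d/dx B_{i,degree}(x), written in terms of the degree-1-lower basis."""
    lower = bspline_basis(x, knots, degree - 1)
    return [
        degree / (knots[i + degree] - knots[i]) * lower[i]
        - degree / (knots[i + degree + 1] - knots[i + 1]) * lower[i + 1]
        for i in range(len(lower) - 1)
    ]

class SplineEdge:
    """Hypothetical KAN edge: phi(x) = w_b * silu(x) + w_s * sum_i c_i * B_i(x)."""
    def __init__(self, knots, degree, coeffs, w_b=1.0, w_s=1.0):
        assert len(coeffs) == len(knots) - degree - 1  # one learnable c_i per basis
        self.knots, self.degree, self.coeffs = knots, degree, coeffs
        self.w_b, self.w_s = w_b, w_s

    def __call__(self, x):
        b = bspline_basis(x, self.knots, self.degree)
        return self.w_b * silu(x) + self.w_s * sum(c * v for c, v in zip(self.coeffs, b))

    def grad(self, x):
        d = bspline_basis_deriv(x, self.knots, self.degree)
        return self.w_b * silu_grad(x) + self.w_s * sum(c * v for c, v in zip(self.coeffs, d))

class KANValue:
    """Single-layer KAN critic V(s) = sum_p phi_p(s_p); a deep KAN stacks such layers."""
    def __init__(self, edges):
        self.edges = edges

    def __call__(self, state):
        return sum(edge(x) for edge, x in zip(self.edges, state))

    def grad(self, state):
        # Closed-form dV/ds_p; continuous in s because cubic B-splines are C^2.
        return [edge.grad(x) for edge, x in zip(self.edges, state)]

def hjb_residual(value_fn, state, next_state, reward, dt, rho):
    """Assumed policy-evaluation HJB residual: r + grad V(s) . s_dot - rho * V(s)."""
    s_dot = [(b - a) / dt for a, b in zip(state, next_state)]  # finite-difference s_dot
    grad_v = value_fn.grad(state)
    return reward + sum(g * d for g, d in zip(grad_v, s_dot)) - rho * value_fn(state)

knots = make_knots(-1.0, 1.0, num_intervals=5, degree=3)
critic = KANValue([
    SplineEdge(knots, 3, [0.1 * i for i in range(8)]),
    SplineEdge(knots, 3, [-0.05 * i for i in range(8)], w_b=0.5),
])
state, next_state = [0.3, -0.5], [0.32, -0.49]
print(critic(state), critic.grad(state))
print(hjb_residual(critic, state, next_state, reward=1.0, dt=0.1, rho=0.05))
```

Because cubic B-splines are twice continuously differentiable at simple knots, the closed-form gradient returned by KANValue.grad varies continuously with the state, which is the "smooth gradient field" property the abstract attributes to the KAN critic. The sketch trains nothing: in practice the coefficients c_i and the weights w_b and w_s would be fitted by gradient descent, for example on a value loss plus a weighted squared HJB residual, but that loss and its weighting are assumptions not stated in the abstract.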