A Multiple-Goal Sarsa(λ) Algorithm Based on Lost Reward of Greatest Mass

LIU Quan1,2, LI Jin1, FU Qi-ming1, CUI Zhi-ming1, FU Yu-chen1

Acta Electronica Sinica, 2013, Vol. 41, Issue 8: 1469-1473. DOI: 10.3969/j.issn.0372-2112.2013.08.003

Research Article

Abstract

To address RoboCup as a typical multiple-goal reinforcement learning problem, a novel multiple-goal reinforcement learning algorithm based on the lost reward of greatest mass, LRGM-Sarsa(λ), is proposed. The algorithm estimates the lost reward of greatest mass for every sub-goal and, while balancing the sub-goals' long-term rewards, selects the best joint action to produce a composite policy. In each single-goal learning module, a Sarsa(λ) algorithm based on the B error function, an improved form of the MSBR error function, is adopted; the B error function guarantees convergence of value prediction under non-linear function approximation, resolving the instability and divergence that arise when reinforcement learning uses non-linear generalization, and the action-selection probability function and the step-size parameter α are also improved with respect to it. The algorithm is applied to training the local shooting policy in RoboCup 2D; the experimental results show that the proposed algorithm is effective, more stable, and converges faster.
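
A minimal sketch may help make the "lost reward of greatest mass" idea concrete. The code below is a hypothetical, tabular simplification: each sub-goal owns a plain Sarsa(λ) module with accumulating eligibility traces, and the joint action is the one that minimizes the total lost reward (the per-goal maximum Q-value minus the Q-value of the candidate action) summed over all sub-goal modules. The names (SarsaLambdaModule, lrgm_action) and all parameter values are illustrative assumptions; the paper itself uses non-linear function approximation with the improved MSBR (B) error function and an optimized action-selection probability function and step size, none of which are reproduced here.

```python
import numpy as np

class SarsaLambdaModule:
    """Tabular Sarsa(λ) learner for a single sub-goal (illustrative only)."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.95, lam=0.8):
        self.Q = np.zeros((n_states, n_actions))   # action-value estimates
        self.E = np.zeros((n_states, n_actions))   # eligibility traces
        self.alpha, self.gamma, self.lam = alpha, gamma, lam

    def start_episode(self):
        # Traces are cleared at the start of every episode.
        self.E.fill(0.0)

    def update(self, s, a, reward, s_next, a_next):
        # Standard Sarsa(λ) update with accumulating traces,
        # driven by this sub-goal's own reward signal.
        delta = reward + self.gamma * self.Q[s_next, a_next] - self.Q[s, a]
        self.E[s, a] += 1.0
        self.Q += self.alpha * delta * self.E
        self.E *= self.gamma * self.lam


def lrgm_action(modules, s):
    """Joint action selection by lost reward of greatest mass (sketch).

    For sub-goal i, the lost reward of action a is
    max_a' Q_i(s, a') - Q_i(s, a); the joint action minimizes the
    total loss summed over all sub-goal modules."""
    losses = np.zeros(modules[0].Q.shape[1])
    for m in modules:
        losses += m.Q[s].max() - m.Q[s]
    return int(np.argmin(losses))
```

In a RoboCup-style shooting task, one module per sub-goal (for example, keeping possession, evading the goalkeeper, shooting on target) would observe the same state and the same joint action chosen by lrgm_action, while each module updates its value estimates from its own sub-goal reward; greedy selection is shown here, with exploration left out for brevity.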

Key words

multiple-goal / adaptive Sarsa(λ) / lost reward of greatest mass / reinforcement learning / RoboCup 2D

Cite this article

刘全, 李瑾, 傅启明, 崔志明, 伏玉琛. 一种最大集合期望损失的多目标Sarsa(λ)算法[J]. 电子学报, 2013, 41(8): 1469-1473. https://doi.org/10.3969/j.issn.0372-2112.2013.08.003
LIU Quan, LI Jin, FU Qi-ming, CUI Zhi-ming, FU Yu-chen. A Multiple-Goal Sarsa(λ) Algorithm Based on Lost Reward of Greatest Mass[J]. Acta Electronica Sinica, 2013, 41(8): 1469-1473. https://doi.org/10.3969/j.issn.0372-2112.2013.08.003
CLC number: TP181

Funding

National Natural Science Foundation of China (No.61070223, No.61103045, No.61272005, No.61170020); Natural Science Foundation of Jiangsu Province (No.BK2012616); Natural Science Research Project of Jiangsu Higher Education Institutions (No.09KJA520002, No.09KJB520012); Project of the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (No.93K172012K04)