A Multiple-Goal Sarsa(λ) Algorithm Based on Lost Reward of Greatest Mass

LIU Quan1,2, LI Jin1, FU Qi-ming1, CUI Zhi-ming1, FU Yu-chen1

Acta Electronica Sinica, 2013, Vol. 41, Issue 8: 1469-1473. DOI: 10.3969/j.issn.0372-2112.2013.08.003

Research Article

Abstract

To address RoboCup as a typical multiple-goal reinforcement learning problem, a novel multiple-goal reinforcement learning algorithm based on the lost reward of greatest mass, LRGM-Sarsa(λ), is proposed. The algorithm estimates the lost reward of greatest mass for every sub-goal and, while balancing the sub-goals' long-term rewards, selects the best joint action to produce a composite policy. In each single-goal learning module, a Sarsa(λ) algorithm based on the B error function, an improved form of the MSBR error function, is adopted; the B error function guarantees convergence of value prediction under non-linear function approximation, resolving the instability and divergence that arise when reinforcement learning uses non-linear generalization, and the action-selection probability function and the step-size parameter α are also improved with respect to it. The algorithm is applied to training the local shooting policy in RoboCup 2D; the experimental results show that the proposed algorithm is effective, more stable, and converges faster.
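
A minimal sketch may help make the "lost reward of greatest mass" idea concrete. The code below is a hypothetical, tabular simplification: each sub-goal owns a plain Sarsa(λ) module with accumulating eligibility traces, and the joint action is the one that minimizes the total lost reward (the per-goal maximum Q-value minus the Q-value of the candidate action) summed over all sub-goal modules. The names (SarsaLambdaModule, lrgm_action) and all parameter values are illustrative assumptions; the paper itself uses non-linear function approximation with the improved MSBR (B) error function and an optimized action-selection probability function and step size, none of which are reproduced here.

```python
import numpy as np

class SarsaLambdaModule:
    """Tabular Sarsa(λ) learner for a single sub-goal (illustrative only)."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.95, lam=0.8):
        self.Q = np.zeros((n_states, n_actions))   # action-value estimates
        self.E = np.zeros((n_states, n_actions))   # eligibility traces
        self.alpha, self.gamma, self.lam = alpha, gamma, lam

    def start_episode(self):
        # Traces are cleared at the start of every episode.
        self.E.fill(0.0)

    def update(self, s, a, reward, s_next, a_next):
        # Standard Sarsa(λ) update with accumulating traces,
        # driven by this sub-goal's own reward signal.
        delta = reward + self.gamma * self.Q[s_next, a_next] - self.Q[s, a]
        self.E[s, a] += 1.0
        self.Q += self.alpha * delta * self.E
        self.E *= self.gamma * self.lam


def lrgm_action(modules, s):
    """Joint action selection by lost reward of greatest mass (sketch).

    For sub-goal i, the lost reward of action a is
    max_a' Q_i(s, a') - Q_i(s, a); the joint action minimizes the
    total loss summed over all sub-goal modules."""
    losses = np.zeros(modules[0].Q.shape[1])
    for m in modules:
        losses += m.Q[s].max() - m.Q[s]
    return int(np.argmin(losses))
```

In a RoboCup-style shooting task, one module per sub-goal (for example, keeping possession, evading the goalkeeper, shooting on target) would observe the same state and the same joint action chosen by lrgm_action, while each module updates its value estimates from its own sub-goal reward; greedy selection is shown here, with exploration left out for brevity.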

Key words

multiple-goal / adaptive Sarsa(λ) / lost reward of greatest mass / reinforcement learning / RoboCup 2D

Cite this article

刘全, 李瑾, 傅启明, 崔志明, 伏玉琛. 一种最大集合期望损失的多目标Sarsa(λ)算法[J]. 电子学报, 2013, 41(8): 1469-1473. https://doi.org/10.3969/j.issn.0372-2112.2013.08.003
LIU Quan, LI Jin, FU Qi-ming, CUI Zhi-ming, FU Yu-chen. A Multiple-Goal Sarsa(λ) Algorithm Based on Lost Reward of Greatest Mass[J]. Acta Electronica Sinica, 2013, 41(8): 1469-1473. https://doi.org/10.3969/j.issn.0372-2112.2013.08.003
CLC number: TP181

Funding

National Natural Science Foundation of China (No.61070223, No.61103045, No.61272005, No.61170020); Natural Science Foundation of Jiangsu Province (No.BK2012616); Natural Science Research Project of Jiangsu Higher Education Institutions (No.09KJA520002, No.09KJB520012); Project of the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (No.93K172012K04)