Acta Electronica Sinica, 2016, Vol. 44, Issue 11: 2752-2757. DOI: 10.3969/j.issn.0372-2112.2016.11.026
Research Article

A Bayesian Temporal Difference Algorithm Based on Random Projection

LIU Quan1,2,3, YU Jun1,3, WANG Hui1,3, FU Qi-ming1,3, ZHU Fei1,3

Abstract

Most reinforcement learning algorithms are based on value-function evaluation. The Gaussian process temporal difference (GPTD) algorithm evaluates the value function in a Bayesian way: through the Bellman equation and Bayes' rule it builds a probabilistic generative model linking immediate rewards to the value function. To speed up the algorithm, the value of a new sample is approximated by a linear combination over a dictionary of states obtained by online kernel sparsification and solved with least squares in the state space; even so, the time complexity remains high. To address the problem of selecting approximating states in the state space, this paper proposes a Bayesian temporal difference algorithm based on random projection within the Gaussian process framework. The algorithm uses a hash function to map the elements of the dictionary state set to hash values, groups the states by these values, and thereby reduces the number of comparisons between states. Experimental results show that the method not only improves execution speed but also strikes a good balance between the accuracy of the estimated state value function and the running time of the algorithm.
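The two technical ingredients summarized above can be sketched concretely. First, the kind of generative model that GPTD methods place between immediate rewards and the value function, following Engel et al.'s formulation; the symbols (value function v, kernel k, noise n_t, discount factor γ) are standard GPTD notation assumed here, not quoted from this paper:

```latex
% GPTD-style generative model (a sketch, after Engel et al.); notation assumed, not taken from the paper.
\begin{align*}
  r(x_t) &= v(x_t) - \gamma\, v(x_{t+1}) + n_t, \\
  v(\cdot) &\sim \mathcal{GP}\bigl(0,\, k(\cdot,\cdot)\bigr), \qquad n_t \sim \mathcal{N}(0, \sigma^2).
\end{align*}
```

Conditioning on the observed rewards via Bayes' rule yields a Gaussian posterior over v, whose mean serves as the value-function estimate. Online kernel sparsification keeps only a dictionary of states whose kernel features approximately span those of all visited states, and the approximate-linear-dependence (ALD) test that maintains this dictionary is where most of the state-to-state comparisons arise.

The proposed speed-up groups the dictionary states by hash value so that a new state is compared only against the states in its own bucket. The sketch below is a minimal illustration of that idea, assuming a sign-of-random-projection (SimHash-style) hash, an RBF kernel, and a plain similarity threshold standing in for the ALD test; the names RandomProjectionHash and BucketedDictionary and all parameter values are hypothetical, not the paper's implementation.

```python
import numpy as np


class RandomProjectionHash:
    """Sign-of-random-projection (SimHash-style) hash for states.

    States that are close in feature space fall on the same side of most
    random hyperplanes, so they tend to share a bucket key.
    """

    def __init__(self, dim, n_bits=8, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((n_bits, dim))  # random projection directions

    def __call__(self, state):
        bits = (self.planes @ np.asarray(state, dtype=float)) >= 0.0
        return bits.tobytes()  # hashable bucket key


class BucketedDictionary:
    """Dictionary states grouped by hash value.

    A new state is compared only against the dictionary states in its own
    bucket (a kernel-similarity threshold stands in for the ALD test here),
    instead of against the whole dictionary.
    """

    def __init__(self, hash_fn, kernel, threshold=0.99):
        self.hash_fn = hash_fn
        self.kernel = kernel
        self.threshold = threshold  # similarity above which a state counts as already covered
        self.buckets = {}           # bucket key -> list of dictionary states

    def covered(self, state):
        key = self.hash_fn(state)
        return any(self.kernel(state, d) >= self.threshold for d in self.buckets.get(key, []))

    def add_if_novel(self, state):
        if not self.covered(state):
            self.buckets.setdefault(self.hash_fn(state), []).append(np.asarray(state, dtype=float))
            return True
        return False


if __name__ == "__main__":
    rbf = lambda a, b: float(np.exp(-np.sum((np.asarray(a) - np.asarray(b)) ** 2) / 2.0))
    dictionary = BucketedDictionary(RandomProjectionHash(dim=4, n_bits=8), rbf)
    for state in np.random.default_rng(1).uniform(-1.0, 1.0, size=(500, 4)):
        dictionary.add_if_novel(state)
    print(sum(len(b) for b in dictionary.buckets.values()), "dictionary states in",
          len(dictionary.buckets), "buckets")
```

With b hash bits the dictionary splits into at most 2^b buckets, so when states spread reasonably evenly each novelty check touches only a small fraction of the dictionary; this is the source of the reduced number of state comparisons described in the abstract.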

Key words

reinforcement learning / Markov decision process / Gaussian process / random projection / temporal difference learning

Cite this article

LIU Quan, YU Jun, WANG Hui, FU Qi-ming, ZHU Fei. A Bayesian Temporal Difference Algorithm Based on Random Projection[J]. Acta Electronica Sinica, 2016, 44(11): 2752-2757. https://doi.org/10.3969/j.issn.0372-2112.2016.11.026
CLC Number: TP181

Funding

National Natural Science Foundation of China (No.61272005, No.61303108, No.61373094, No.61472262, No.61502323, No.61502329); Natural Science Foundation of Jiangsu Province (No.BK2012616); Natural Science Research Program of Higher Education Institutions of Jiangsu Province (No.13KJB520020); Program of the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (No.93K172014K04); Suzhou Applied Basic Research Program, Industrial Part (No.SYG201422, No.SY201308)
