Acta Electronica Sinica (电子学报) ›› 2016, Vol. 44 ›› Issue (11): 2752-2757. DOI: 10.3969/j.issn.0372-2112.2016.11.026

• Research Papers •

A Bayesian Temporal Difference Algorithm Based on Random Projection

LIU Quan1,2,3, YU Jun1,3, WANG Hui1,3, FU Qi-ming1,3, ZHU Fei1,3

  1. School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China;
    2. Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin 130012, China;
    3. Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing, Jiangsu 210023, China
  • Received: 2015-04-08  Revised: 2015-08-17  Online: 2016-11-25  Published: 2016-11-25
  • About the authors: LIU Quan, male, born in Inner Mongolia in 1969, Ph.D., professor and doctoral supervisor. His main research interests are reinforcement learning, wireless sensor networks, and intelligent information processing. E-mail: quanliu@suda.edu.cn. YU Jun, male, born in Taizhou, Jiangsu, in 1989, M.S. His main research interests are reinforcement learning and Bayesian inference.
  • Supported by:

    National Natural Science Foundation of China (No.61272005, No.61303108, No.61373094, No.61472262, No.61502323, No.61502329); Natural Science Foundation of Jiangsu Province (No.BK2012616); Natural Science Research Project of Jiangsu Higher Education Institutions (No.13KJB520020); Project of the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (No.93K172014K04); Suzhou Applied Basic Research Program, Industrial Part (No.SYG201422, No.SY201308)



Abstract:

Most algorithms in reinforcement learning are based on value function evaluation. Gaussian process temporal difference (GPTD) evaluates the value function in a Bayesian way: through the Bellman equation and Bayes' rule, a Gaussian process builds a probabilistic generative model between the immediate reward and the value function. To speed up the algorithm, an approximate linear representation of each new sample in the state space is computed by online kernel sparsification combined with least squares, but the time complexity remains high. To address the problem of selecting approximate states in the state space, a Bayesian temporal difference algorithm based on random projection is proposed within the Gaussian process framework. A hash function maps the elements of the dictionary state set to hash values, the states are grouped according to these hash values, and the number of comparisons between states is thereby reduced. Experimental results show that the proposed algorithm not only improves execution speed but also strikes a good balance between the accuracy of the estimated state value function and the execution time.
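In the standard Gaussian process temporal difference formulation, the generative model ties the observed reward to the unknown values through R(x_t) = V(x_t) - γV(x_{t+1}) + N(x_t, x_{t+1}), and online sparsification has to compare every new state against the states already stored in the dictionary. The sketch below is not the paper's implementation; it is a minimal illustration, assuming a SimHash-style sign-of-random-projection hash, of how random projections can map dictionary states to hash values so that a new sample is compared only against the states in its own bucket. All identifiers (RandomProjectionHasher, n_bits, hash_value, candidates) are illustrative names, not symbols from the paper.

    import numpy as np

    class RandomProjectionHasher:
        """Bucket states by the sign pattern of a few random projections."""

        def __init__(self, state_dim, n_bits=8, seed=0):
            rng = np.random.default_rng(seed)
            # Each row is a random hyperplane; the sign of the projection
            # onto each hyperplane contributes one bit of the hash value.
            self.planes = rng.standard_normal((n_bits, state_dim))
            self.buckets = {}  # hash value -> list of dictionary states

        def hash_value(self, state):
            bits = (self.planes @ np.asarray(state, dtype=float)) > 0.0
            return bits.astype(np.uint8).tobytes()  # hashable bucket key

        def add(self, state):
            # Insert a dictionary state into the bucket given by its hash value.
            self.buckets.setdefault(self.hash_value(state), []).append(
                np.asarray(state, dtype=float))

        def candidates(self, state):
            # A new sample only needs to be compared with the states that
            # share its hash value, not with the whole dictionary.
            return self.buckets.get(self.hash_value(state), [])

Because nearby states tend to fall on the same side of most random hyperplanes, they usually share a bucket, so the per-sample comparison cost drops from the size of the whole dictionary to the size of one bucket, at the cost of occasionally missing a close state that lands in a neighbouring bucket.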

Key words: reinforcement learning, Markov decision process, Gaussian process, random projection, temporal difference learning

CLC Number: