A Policy Search and Transfer Approach in the Non-stationary Environment

ZHU Fei; LIU Quan; FU Qi-ming; CHEN Dong-huo; WANG Hui; FU Yu-chen

doi:10.3969/j.issn.0372-2112.2017.02.001


Academic paper | Updated: 2025-07-16

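- A Policy Search and Transfer Approach in the Non-stationary Environment
- Acta Electronica Sinica, 2017, Vol. 45, No. 2, pp. 257-266
- Author affiliations:
  
  1. School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006
  2. Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, Jiangsu 215006
  3. Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education (Jilin University), Changchun, Jilin 130012
  4. School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, Jiangsu 215011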
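- Funding:
  
  National Natural Science Foundation of China (No.61303108, No.61373094, No.61272005, No.61472262, No.61502329); Natural Science Research Foundation of Jiangsu Higher Education Institutions (No.13KJB520020); Foundation of the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (No.93K172014K04); Suzhou Applied Basic Research Program (No.SYG201422); Provincial Key Laboratory Foundation of Soochow University (No.KJS1524); China Scholarship Council (No.201606920013)
- DOI: 10.3969/j.issn.0372-2112.2017.02.001
  CLC number: TP181
- Published online: 2017-02-25
  
  Print publication: 2017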
ZHU Fei, LIU Quan, FU Qi-ming, et al. A Policy Search and Transfer Approach in the Non-stationary Environment[J]. Acta Electronica Sinica, 2017, 45(2): 257-266. DOI: 10.3969/j.issn.0372-2112.2017.02.001.

Abstract (translated from Chinese)

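Reinforcement learning is an online learning method in which an agent seeks an optimal policy by maximizing cumulative reward while interacting with the environment. In a non-stationary environment, the MDP model at a given moment changes once the agent has interacted with it, so traditional reinforcement learning methods built on a stationary MDP model cannot solve for the optimal policy. To address policy solving in non-stationary environments, the non-stationary environment is modeled as a distribution over MDPs, and FSPS, a policy search algorithm based on a formula set, is proposed. During learning, FSPS collects historical samples, extracts feature information from them, uses these features to construct different formulas for action selection, and applies policy search to find the optimal formula. On this basis, an optimality bound for the resulting policy is given, and it is proved theoretically that the optimality of a policy transferred to a new MDP distribution depends mainly on the distance between the MDP distributions and on the performance of the policy in the original MDP distribution. Finally, FSPS is applied to the classic Markov Chain problem, and the experimental results show that the resulting policy performs well.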

Abstract

As an online learning algorithm, reinforcement learning, which obtains the optimal policy with the maximum expected cumulative reward by interacting with the environment, is mostly based on the stationary Markov Decision Process (MDP). It cannot handle the non-stationary case, however: traditional reinforcement learning algorithms cannot learn an optimal policy directly, because the MDP model no longer holds once the agent has interacted with the environment. A novel policy search algorithm (FSPS) is therefore proposed, based on a formula set generated from features extracted from the collected historical sample trajectories. The algorithm adopts the formula with the best performance as the optimal policy. It also takes advantage of the concept of transfer learning by transferring the learned policy between two similar MDP distributions, where the performance of the transferred policy depends mainly on the distance between the two MDP distributions and on the performance of the learned policy in the original MDP distribution. Simulation results on the Markov Chain problem show that the algorithm handles the non-stationary case well.
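
The idea of searching a formula set for the best action-selection rule over a distribution of MDPs can be illustrated with a minimal Python sketch. It is not the authors' FSPS implementation. The toy five-state chain, the slip-probability distribution, the two features (`q`, `n`), the four candidate formulas, and all function names and constants below are assumptions made purely for illustration.

```python
import math
import random

N_STATES, ACTIONS, HORIZON = 5, (0, 1), 200

# Hypothetical formula set: each formula scores an action from two features,
# a running value estimate q and a visit count n (both are assumptions).
FORMULAS = {
    "q": lambda q, n: q,
    "q + 1/(n+1)": lambda q, n: q + 1.0 / (n + 1),
    "q + 5/sqrt(n+1)": lambda q, n: q + 5.0 / math.sqrt(n + 1),
    "-n": lambda q, n: -float(n),
}

def sample_mdp(rng):
    """Draw one chain MDP from an assumed distribution (random slip probability)."""
    return rng.uniform(0.0, 0.4)

def step(slip, state, action, rng):
    """Toy chain: action 1 advances, action 0 returns to the start; actions may slip."""
    if rng.random() < slip:
        action = 1 - action
    if action == 0:
        return 0, 2.0
    if state == N_STATES - 1:
        return state, 10.0
    return state + 1, 0.0

def run_episode(formula, slip, rng, alpha=0.2, gamma=0.95):
    """Act greedily with respect to the formula and return the total reward."""
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    n = [[0, 0] for _ in range(N_STATES)]
    state, total = 0, 0.0
    for _ in range(HORIZON):
        action = max(ACTIONS, key=lambda a: formula(q[state][a], n[state][a]))
        nxt, reward = step(slip, state, action, rng)
        n[state][action] += 1
        # Q-learning style update of the value feature (an assumed choice).
        q[state][action] += alpha * (reward + gamma * max(q[nxt]) - q[state][action])
        state, total = nxt, total + reward
    return total

def search(n_mdps=30, seed=0):
    """Score every formula by its mean return over sampled MDPs; keep the best."""
    rng = random.Random(seed)
    mdps = [sample_mdp(rng) for _ in range(n_mdps)]
    scores = {
        name: sum(run_episode(f, slip, rng) for slip in mdps) / n_mdps
        for name, f in FORMULAS.items()
    }
    return max(scores, key=scores.get), scores
```

Calling `search()` returns the name of the best-scoring formula together with the mean return of every candidate. Re-scoring that same formula on MDPs drawn from a second, nearby slip distribution would be one simple way to probe the transfer setting described above.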
