A Probability-Based Value Iteration on Optimal Policy Algorithm for POMDP

LIU Feng; WANG Chong-jun; LUO Bin

doi:10.3969/j.issn.0372-2112.2016.05.010


Research article | Updated: 2025-07-16

- Chinese title: 一种基于最优策略概率分布的POMDP值迭代算法
- English title: A Probability-Based Value Iteration on Optimal Policy Algorithm for POMDP
- Acta Electronica Sinica, 2016, Vol. 44, No. 5, pp. 1078-1084
- Author affiliations:
  
  1. School of Software, Nanjing University, Nanjing, Jiangsu 210093
  2. Department of Computer Science and Technology, Nanjing University, Nanjing, Jiangsu 210093
  3. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, Jiangsu 210093
- Funding: National Natural Science Foundation of China (No.61375069); Natural Science Foundation of Jiangsu Province (No.BK20131277)
- DOI: 10.3969/j.issn.0372-2112.2016.05.010
  Chinese Library Classification (CLC) number: TP319
- Published online: 2016-05-25; print publication: 2016
LIU Feng, WANG Chong-jun, LUO Bin. A Probability-Based Value Iteration on Optimal Policy Algorithm for POMDP[J]. Acta Electronica Sinica, 2016, 44(5): 1078-1084. DOI: 10.3969/j.issn.0372-2112.2016.05.010.

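Abstract (translated from the Chinese)

As POMDP problems in applications keep growing in scale, heuristic methods based on the region reachable under the optimal policy have become a research focus. Although existing algorithms guarantee global optimality, their selection of the optimal action is not precise enough, which hurts their efficiency. This paper proposes PBVIOP, a value iteration method based on the probability of the optimal policy. During depth-first heuristic exploration, the method takes the distribution of each action's value function between its upper and lower bounds, computes by the Monte Carlo method the probability that each action is optimal, and selects the action with the highest probability as the exploration policy. Experiments on four benchmark problems show that PBVIOP converges to the global optimal solution and markedly improves convergence efficiency.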

Abstract

As the scale of POMDP problems in applications grows, heuristic methods based on the region reachable under the optimal policy have become a research hotspot. However, the criterion that existing algorithms use to choose the best action is not precise enough, which reduces their efficiency. This paper proposes a new value iteration method, PBVIOP (Probability-based Value Iteration on Optimal Policy). In depth-first heuristic exploration, the method uses the Monte Carlo method to calculate the probability that each action is optimal, according to the distribution of each action's Q-function value between its upper and lower bounds, and chooses the action with the maximum probability. Experimental results on four benchmarks show that PBVIOP obtains the global optimal solution and significantly improves convergence efficiency.
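The action-selection step described in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation: the abstract does not say which distribution is placed between the bounds, so the code assumes, purely for illustration, that each action's Q value is independent and uniformly distributed on its [lower, upper] interval. The function names `optimal_action_probabilities` and `select_action`, the sample count, and the example bounds are all hypothetical.

```python
import random


def optimal_action_probabilities(bounds, n_samples=10000, seed=0):
    """Monte Carlo estimate of P(action a has the largest Q value).

    bounds: list of (lower, upper) pairs, one per action.
    ASSUMPTION (not from the paper): each Q value is drawn independently
    and uniformly from its own [lower, upper] interval.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    wins = [0] * len(bounds)
    for _ in range(n_samples):
        # Draw one plausible Q value per action.
        samples = [rng.uniform(lo, hi) for lo, hi in bounds]
        # Credit the action whose sampled value is largest.
        best = max(range(len(samples)), key=samples.__getitem__)
        wins[best] += 1
    return [w / n_samples for w in wins]


def select_action(bounds, n_samples=10000, seed=0):
    """Return the index of the action most likely to be optimal."""
    probs = optimal_action_probabilities(bounds, n_samples, seed)
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical bounds for three actions at one belief point.
example_bounds = [(0.0, 10.0), (4.0, 6.0), (5.5, 6.5)]
print(optimal_action_probabilities(example_bounds))
print(select_action(example_bounds))
```

With these made-up bounds, the estimated probabilities come out at roughly 0.40, 0.04 and 0.56, so the sketch picks the third action even though the first has the largest upper bound. That is the kind of case where a probability-based rule and a pure upper-bound rule (mentioned here only for contrast) would explore differently.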
