Acta Electronica Sinica, 2022, Vol. 50, Issue (4): 954-966. DOI: 10.12263/DZXB.20211268

• Machine Learning Cross-Disciplinary Integration and Innovation •

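Integrated Modular Avionics System Reconstruction Method Based on Sequential Game Multi-agent Reinforcement Learning

ZHANG Tao, ZHANG Wen-tao, DAI Ling, CHEN Jing-yi, WANG Li, WEI Qian-ru

  1. School of Software, Northwestern Polytechnical University, Xi'an, Shaanxi 710065, China
  • Received: 2021-09-17 Revised: 2022-03-08 Online: 2022-04-25 Published: 2022-04-25
  • About the authors: ZHANG Tao, male, born in 1976, a native of Fufeng, Shaanxi. He holds a Ph.D. and is an associate professor at the School of Software, Northwestern Polytechnical University. His research interests include intelligent robots, artificial intelligence, reinforcement learning, and software testing. E-mail: tao_zhang@nwpu.edu.cn
    ZHANG Wen-tao, male, born in 1995, a native of Fu'an, Ningde, Fujian. He is a master's student at the School of Software, Northwestern Polytechnical University. His research interests include reinforcement learning, evolutionary computation, and optimal scheduling. E-mail: 820132512@qq.com
  • Funding:
    National Natural Science Foundation of China (61901388); Aeronautical Science Foundation of China (2015ZD53055); Shanghai Aerospace Science and Technology Innovation Fund (SAST2021-054)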

Abstract:

Dynamic reconfiguration is an effective fault-tolerance approach for integrated modular avionics (IMA) systems. A reconfiguration blueprint defines the application migration and resource reconfiguration scheme to apply when the system fails, and it is the key to restoring system functions at minimum cost. The difficulty lies in generating effective reconfiguration blueprints rapidly and automatically under complex, multi-level, correlated failure modes. To address this problem, this paper proposes an IMA system reconfiguration method based on sequential game multi-agent reinforcement learning. The method introduces a sequential game model in which each application that must be migrated because of a fault is defined as an agent, and the order of play is determined by application priority. To handle competition and cooperation among the agents during the sequential game, the algorithm uses the policy gradient method from reinforcement learning, optimizing the reconfiguration result by controlling the action selection probabilities during interaction with the environment. A policy gradient Monte Carlo tree search algorithm based on biased estimation is applied to update the game strategy, which addresses the oscillation, poor convergence, and long computation time of traditional policy gradient algorithms. Experimental results show that, compared with methods such as differential evolution and Q-learning, the proposed algorithm has significant advantages in optimization performance and stability, as well as in convergence and efficiency.

Key words: integrated modular avionics (IMA) system, sequential game, policy gradient, multi-agent reinforcement learning, Monte Carlo tree search, reconfiguration
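
The priority-ordered, sequential decision process outlined in the abstract can be illustrated with a toy sketch. It is not the authors' algorithm: it swaps in plain REINFORCE with a running-average baseline in place of the policy gradient Monte Carlo tree search described above. The resource model (`demand`, `capacity`), the +1/-1 reward, and every function name are assumptions made purely for illustration.

```python
import math
import random


def softmax(logits):
    """Turn a list of logits into action-selection probabilities."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def priority_order(priorities):
    """Application indices, highest priority first (ties keep input order)."""
    return sorted(range(len(priorities)), key=lambda i: -priorities[i])


def run_episode(logits, order, demand, capacity, rng):
    """Agents move one after another, each sampling a target module."""
    free = list(capacity)
    choices = {}
    reward = 0.0
    for app in order:
        probs = softmax(logits[app])
        module = rng.choices(range(len(free)), weights=probs)[0]
        choices[app] = module
        if free[module] >= demand[app]:
            free[module] -= demand[app]
            reward += 1.0  # hypothetical reward: application hosted
        else:
            reward -= 1.0  # hypothetical penalty: module lacks capacity
    return choices, reward


def train(priorities, demand, capacity, episodes=2000, lr=0.1, seed=0):
    """Plain REINFORCE over per-application softmax policies (toy model)."""
    rng = random.Random(seed)
    order = priority_order(priorities)
    logits = [[0.0] * len(capacity) for _ in priorities]
    baseline = 0.0
    for _ in range(episodes):
        choices, reward = run_episode(logits, order, demand, capacity, rng)
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)  # running-average baseline
        for app, module in choices.items():
            probs = softmax(logits[app])
            for m in range(len(capacity)):
                # Gradient of log softmax: indicator(chosen) - probability.
                grad = (1.0 if m == module else 0.0) - probs[m]
                logits[app][m] += lr * advantage * grad
    return logits, order
```

Calling `train(priorities=[5], demand=[2], capacity=[1, 3])` and applying `softmax` to the returned logits shows the single agent shifting probability toward the only module that can host it. Because earlier movers consume capacity before later ones act, the priority ordering directly shapes the reward that lower-priority agents receive.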

CLC number: