Acta Electronica Sinica ›› 2022, Vol. 50 ›› Issue (5): 1192-1200. DOI: 10.12263/DZXB.20210487

• Academic Paper •

Research of Adaptive Bitrate Algorithm Based on Deep Reinforcement Learning

YI Ling, LI Ze-ping

  1. School of Computer Science and Technology, Guizhou University, Guiyang, Guizhou 550025, China
  • Received: 2021-04-16 Revised: 2022-01-27 Online: 2022-05-25 Published: 2022-06-18
  • Corresponding author: LI Ze-ping
  • About the authors: YI Ling, male, born in 1994 in Tongren, Guizhou Province; master's degree. His main research interests are adaptive streaming and deep reinforcement learning. E-mail: 2821210016@qq.com
    LI Ze-ping, male, born in 1964 in Guiyang, Guizhou Province; Ph.D., graduated from the University of Electronic Science and Technology of China. He is a professor at the School of Computer Science and Technology, Guizhou University. His main research interests are network distribution and video streaming. E-mail: zpli1@gzu.edu.cn
  • Funding:
    National Natural Science Foundation of China (61462014)

Abstract:

Adaptive bitrate (ABR) algorithms are an effective way for video clients to improve user quality of experience (QoE). Existing ABR algorithms suffer from frequent rebuffering, video stalls, low video quality, and inaccurate network throughput prediction. To address these problems, this paper proposes a deep reinforcement learning based ABR (DRLA) algorithm. DRLA trains its neural network on real network bandwidth data and, by collecting the client's buffer occupancy and network throughput, requests the video at the best bitrate from the video server. First, DRLA optimizes the loss function L with the baseline function method and adds an entropy regularization term to the policy network's update rule, which encourages random exploration and keeps the loss from converging to a local optimum. Second, it uses a constraint to limit how far the new policy can diverge from the old one at each update, which improves robustness. Finally, it finds the optimal policy through trust region optimization so that QoE is maximized. Experiments comparing DRLA with existing ABR algorithms show that DRLA reduces training time and further improves both robustness and user QoE; the algorithm's effectiveness is also verified in a real environment.

Key words: adaptive bitrate algorithm, quality of experience, deep reinforcement learning, baseline function, entropy, trust region
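
The three training ingredients named in the abstract (a baseline function, entropy-based exploration, and a limit on the divergence between the new and old policies) can be illustrated with a single-sample surrogate loss. The Python sketch below is not the authors' implementation. The function names, the coefficients `kl_coef` and `entropy_coef`, and the choice to express the divergence limit as a KL penalty rather than a hard trust-region constraint are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Turn raw scores (one per candidate bitrate) into probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy of the policy; higher means more exploration."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def kl_divergence(p_old, p_new):
    """KL(old || new): how far the updated policy moved from the old one."""
    return sum(po * math.log(po / pn)
               for po, pn in zip(p_old, p_new) if po > 0.0)


def surrogate_loss(old_logits, new_logits, action, ret, baseline,
                   kl_coef=1.0, entropy_coef=0.01):
    """Single-sample loss to be minimized (hypothetical form).

    action   -- index of the bitrate level that was chosen
    ret      -- observed return (e.g. accumulated QoE) after that choice
    baseline -- estimate of the expected return, subtracted to cut variance
    """
    p_old = softmax(old_logits)
    p_new = softmax(new_logits)
    advantage = ret - baseline                # baseline function
    ratio = p_new[action] / p_old[action]     # importance weight new/old
    policy_term = -ratio * advantage          # raise prob. of good actions
    kl_term = kl_coef * kl_divergence(p_old, p_new)   # divergence penalty
    entropy_term = -entropy_coef * entropy(p_new)     # exploration bonus
    return policy_term + kl_term + entropy_term


# Three candidate bitrates, uniform old policy, action 1 did better
# than the baseline predicted.
old = [0.0, 0.0, 0.0]
loss_unchanged = surrogate_loss(old, old, action=1, ret=2.0, baseline=0.5)
loss_shifted = surrogate_loss(old, [0.0, 1.0, 0.0], action=1,
                              ret=2.0, baseline=0.5)
```

With a positive advantage, shifting probability toward the chosen bitrate lowers the policy term, while the KL penalty grows as the new policy moves away from the old one. A trust-region method would instead bound that divergence explicitly at each update.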

CLC number: