电子学报 (Acta Electronica Sinica), 2012, Vol. 40, Issue (2): 223-229. DOI: 10.3969/j.issn.0372-2112.2012.02.003

• Academic Papers •

面向异构并行计算系统的流水线式压缩检查点

刘勇鹏1, 王锋1, 卢凯1, 刘勇燕2   

  1. 国防科学技术大学计算机学院,湖南长沙 410073;2. 中国科技部信息中心,北京 100862
  • Received: 2010-09-20  Revised: 2011-07-27  Published: 2012-02-25
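    • Funding:
    • Major Project of the National High Technology Research and Development Program of China (863 Program) (No. 2009AA01A128); Open Fund of the State Key Laboratory of High-end Server & Storage Technology (No. 2009HSSA04); National Natural Science Foundation of China (No. 60603061)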

Pipelined Compressed Checkpointing for Heterogeneous Systems

LIU Yong-peng1, WANG Feng1, LU Kai1, LIU Yong-yan2   

  1. College of Computer, National University of Defense Technology, Changsha, Hunan 410073, China; 2. Information Center, Ministry of Science and Technology of China, Beijing 100862, China
  • Received: 2010-09-20  Revised: 2011-07-27  Online: 2012-02-25  Published: 2012-02-25

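Abstract (translated from the Chinese): In large-scale parallel computing systems, parallel checkpointing triggers a large number of nodes to save their computation state at the same time, which incurs a large file storage overhead and places heavy access pressure on the communication and storage systems. Data compression can shrink checkpoint files and thereby reduce both the storage overhead and the access pressure on the communication and storage systems. However, it also introduces additional computation overhead for compression. Targeting heterogeneous parallel computing systems, this paper proposes a pipelined parallel compressed checkpointing technique that adopts a series of optimizations to reduce the computation latency introduced by compression, including pipelined double write-buffer queues, merging of file write operations, a GPU-accelerated pipelined compression algorithm, and multi-process scheduling of GPU resources. The paper describes the implementation of this technique on the Tianhe-1 (天河一号) system and presents a comprehensive evaluation of the resulting checkpointing system. The experimental data show that the method is feasible, efficient, and practical in large-scale heterogeneous parallel computing systems.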

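Keywords (translated from the Chinese): heterogeneous parallel architecture, checkpoint, data compression, software pipeline, graphics processing unit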

Abstract: Checkpointing is an effective technique for improving the reliability of large-scale parallel computing systems. Data compression is a promising way to reduce both the size of the checkpoint files saved in the storage subsystem and the amount of data that passes through the communication subsystem. However, compression incurs substantial time overhead, which is the main technical barrier to its practical use. In this paper, we propose a parallel compressed checkpointing technique that reduces the time overhead of compression on heterogeneous architectures. It integrates a number of optimization techniques, including transmitting checkpoint data between host and GPU through buffered pipelines, aggregating file write operations, employing a pipelined parallel compression algorithm, and delegating compression operations to the GPU. The paper reports an implementation of the technique on the TH-1 system and evaluation experiments on that system. The experimental data show that the technique is efficient and practical.

Key words: heterogeneous architecture, checkpoint, data compression, pipeline, graphics processing unit (GPU)
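
The pipeline idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' TH-1 implementation: CPU-side `zlib` stands in for the GPU compression stage, a two-slot queue stands in for the double write buffers, and the names `checkpoint`, `restore`, `CHUNK_SIZE`, and `WRITE_BATCH` are hypothetical. The sketch only shows how compressing one chunk can overlap with writing the previous one, and how small compressed outputs can be merged into fewer file writes.

```python
import io
import queue
import threading
import zlib

CHUNK_SIZE = 64 * 1024     # hypothetical input size per pipeline step
WRITE_BATCH = 256 * 1024   # hypothetical threshold for merging writes
_DONE = object()           # sentinel marking the end of the stream


def checkpoint(state: bytes, sink) -> int:
    """Compress `state` chunk by chunk and write the result to `sink`.

    Returns the number of write() calls issued on `sink`.
    """
    # Two slots: one chunk is compressed while an earlier one is being written.
    buffers = queue.Queue(maxsize=2)
    writes = 0

    def writer():
        nonlocal writes
        pending = bytearray()
        while True:
            item = buffers.get()
            if item is _DONE:
                break
            pending += item
            if len(pending) >= WRITE_BATCH:  # merge small outputs into one write
                sink.write(bytes(pending))
                writes += 1
                pending.clear()
        if pending:                          # write whatever is left over
            sink.write(bytes(pending))
            writes += 1

    thread = threading.Thread(target=writer)
    thread.start()
    # Assumption: zlib on the CPU replaces the GPU compression stage.
    compressor = zlib.compressobj()
    for offset in range(0, len(state), CHUNK_SIZE):
        out = compressor.compress(state[offset:offset + CHUNK_SIZE])
        if out:
            buffers.put(out)
    buffers.put(compressor.flush())
    buffers.put(_DONE)
    thread.join()
    return writes


def restore(data: bytes) -> bytes:
    """Recover the original state from a compressed checkpoint."""
    return zlib.decompress(data)


if __name__ == "__main__":
    state = bytes(range(256)) * 4096         # 1 MiB of sample state
    sink = io.BytesIO()
    n_writes = checkpoint(state, sink)
    assert restore(sink.getvalue()) == state
    print(len(state), len(sink.getvalue()), n_writes)
```

The bounded queue is the design point: the compressing side blocks once two buffers are full, so memory use stays fixed while compression and I/O overlap. In a real heterogeneous system the host-to-GPU transfers and GPU kernels would occupy further pipeline stages.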

CLC number: