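1. College of Computer, National University of Defense Technology, Changsha, Hunan 410073, China
2. Information Center, Ministry of Science and Technology of China, Beijing 100862, China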
Print publication: 2012
LIU Yong-peng, WANG Feng, LU Kai, et al. Pipelined Compressed Checkpointing for Heterogeneous Systems[J]. Acta Electronica Sinica, 2012, 40(2): 223-229. DOI: 10.3969/j.issn.0372-2112.2012.02.003.
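In large-scale parallel computing systems, parallel checkpointing triggers many nodes to save their computation state at the same time, which incurs a large file-storage overhead and puts heavy access pressure on the communication and storage systems. Data compression shrinks checkpoint files and thereby reduces both the storage overhead and that access pressure, but it also adds compression computation overhead. Targeting heterogeneous parallel computing systems, this paper proposes a pipelined parallel compressed checkpointing technique that uses a series of optimizations to reduce the latency introduced by compression, including pipelined double write-buffer queues, merging of file write operations, a GPU-accelerated pipelined compression algorithm, and multi-process scheduling of GPU resources. The paper describes an implementation of the technique on the Tianhe-1 (天河一号) system and presents a comprehensive evaluation of the resulting checkpoint system. The experimental data indicate that the method is feasible, efficient, and practical on large-scale heterogeneous parallel computing systems.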
Checkpointing is an effective technique for improving the reliability of large-scale parallel computing systems. Data compression is a promising way to reduce both the size of the data saved to files in the storage subsystem and the amount of data that passes through the communication subsystem. However, compression incurs a large time overhead, and this overhead is the main technical barrier to its practical use. In this paper, we propose a parallel compressed checkpointing technique that reduces the time overhead of compression on heterogeneous architectures. It integrates a number of optimization techniques, including transmitting checkpoint data between host and GPU in buffered pipelines, aggregating file write operations, employing a pipelined parallel compression algorithm, and delegating compression operations to the GPU. The paper reports an implementation of the technique in the TH-1 system and evaluation experiments on that system. The experimental data show that the technique is efficient and practically usable.
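The interplay of buffered pipelining and write aggregation described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes CPU threads and `zlib` as a stand-in for the GPU compressor, and the names `pipelined_checkpoint`, `CHUNK`, and `FLUSH_AT` as well as the buffer sizes are hypothetical. A two-slot queue plays the role of a double buffer, so one chunk can be compressed while the previous one is being written.

```python
import io
import queue
import threading
import zlib

CHUNK = 64 * 1024       # bytes handed to the compressor per step (assumed size)
FLUSH_AT = 256 * 1024   # merge compressed output up to this size per write (assumed)


def pipelined_checkpoint(state: bytes, out) -> int:
    """Compress `state` chunk by chunk, write it to `out`, return bytes written."""
    slots = queue.Queue(maxsize=2)  # two in-flight buffers act as a double buffer
    written = 0

    def writer():
        # Consumer stage: merge small compressed blocks into fewer, larger writes.
        nonlocal written
        pending = bytearray()
        while True:
            block = slots.get()
            if block is None:          # sentinel: producer is finished
                break
            pending += block
            if len(pending) >= FLUSH_AT:
                out.write(pending)
                written += len(pending)
                pending.clear()
        if pending:                    # flush whatever is left
            out.write(pending)
            written += len(pending)

    t = threading.Thread(target=writer)
    t.start()

    # Producer stage: zlib on the CPU stands in for the GPU compressor here.
    comp = zlib.compressobj()
    for i in range(0, len(state), CHUNK):
        slots.put(comp.compress(state[i:i + CHUNK]))  # blocks when both slots are full
    slots.put(comp.flush())
    slots.put(None)
    t.join()
    return written


if __name__ == "__main__":
    state = bytes(range(256)) * 4096   # 1 MiB of synthetic "checkpoint" data
    buf = io.BytesIO()
    n = pipelined_checkpoint(state, buf)
    print(len(state), "bytes in,", n, "bytes written")
```

The bounded queue gives back-pressure: if the writer falls behind, the compressor waits instead of accumulating unbounded buffers, which is the usual motivation for a fixed pair of buffers in such a pipeline.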