LIU Yong-peng, WANG Feng, LU Kai, et al. Pipelined Compressed Checkpointing for Heterogeneous Systems[J]. Acta Electronica Sinica, 2012, 40(2): 223-229. DOI: 10.3969/j.issn.0372-2112.2012.02.003.
Pipelined Compressed Checkpointing for Heterogeneous Systems
Checkpointing is an effective technique for improving the reliability of large-scale parallel computing systems. Data compression is a promising way to reduce both the size of the checkpoint files saved in the storage subsystem and the amount of data that passes through the communication subsystem. However, compression incurs a large time overhead, which is the main technical barrier to its practical use. In this paper, we propose a parallel compressed checkpointing technique that reduces the time overhead of compression on heterogeneous architectures. It integrates several optimizations, including transmitting checkpoint data between the host and the GPU through buffered pipelines, aggregating file write operations, employing a pipelined parallel compression algorithm, and delegating compression operations to the GPU. The paper reports an implementation of the technique on the TH-1 system and evaluation experiments on that system. The experimental data show that the technique is efficient and practically usable.
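The pipelining and write-aggregation ideas named in the abstract can be illustrated with a minimal CPU-only sketch. It is not the authors' implementation. A Python thread and `zlib` stand in for the GPU compression stage, and the names `pipelined_checkpoint`, `restore`, `CHUNK_SIZE`, and `WRITE_BATCH`, as well as the length-prefixed block layout, are assumptions made for illustration.

```python
import io
import queue
import threading
import zlib

CHUNK_SIZE = 64 * 1024     # hypothetical size of one pipeline buffer
WRITE_BATCH = 256 * 1024   # hypothetical threshold for aggregating writes


def pipelined_checkpoint(data, out, chunk_size=CHUNK_SIZE, write_batch=WRITE_BATCH):
    """Compress `data` chunk by chunk while a writer stage batches the output.

    Returns the number of write calls issued on `out`.
    """
    blocks = queue.Queue(maxsize=4)  # bounded queue acts as the pipeline buffer
    write_sizes = []

    def writer():
        pending = bytearray()
        while True:
            block = blocks.get()
            if block is None:        # sentinel: the compression stage is done
                break
            # Length-prefix each compressed block so it can be restored later.
            pending += len(block).to_bytes(4, "little") + block
            if len(pending) >= write_batch:
                out.write(bytes(pending))   # one aggregated write
                write_sizes.append(len(pending))
                pending.clear()
        if pending:                  # flush the final partial batch
            out.write(bytes(pending))
            write_sizes.append(len(pending))

    thread = threading.Thread(target=writer)
    thread.start()
    # Compression stage: overlaps with the writer through the queue.
    for offset in range(0, len(data), chunk_size):
        blocks.put(zlib.compress(data[offset:offset + chunk_size]))
    blocks.put(None)
    thread.join()
    return len(write_sizes)


def restore(buf):
    """Rebuild the original bytes from the length-prefixed compressed blocks."""
    restored = bytearray()
    pos = 0
    while pos < len(buf):
        size = int.from_bytes(buf[pos:pos + 4], "little")
        pos += 4
        restored += zlib.decompress(buf[pos:pos + size])
        pos += size
    return bytes(restored)


if __name__ == "__main__":
    state = bytes(range(256)) * 4096      # 1 MiB of sample "checkpoint" data
    sink = io.BytesIO()
    calls = pipelined_checkpoint(state, sink)
    print(len(state), len(sink.getvalue()), calls)
```

The bounded queue lets compression of one chunk overlap with writing of earlier chunks, and the writer issues one write per accumulated batch rather than one per chunk. In a heterogeneous setting the compression stage would run on the GPU and the queue would correspond to host-device transfer buffers, which this sketch does not attempt to model.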