MPICH-GF: Providing Fault Tolerance on Grid Environments
 
Namyoon Woo1, Soonho Choi1, Hyungsoo Jung1, Junghwan Moon1, Heon Y. Yeom1, Taesoon Park2, and Hyoungwoo Park3
 
1School of Computer Science and Engineering
Seoul National University
Seoul 151-742, KOREA
{nywoo, shchoi, jhs, jhmoon, yeom}@dcslab.snu.ac.kr
2Department of Computer Engineering
Sejong University
Seoul 143-747, KOREA
tspark@kunja.sejong.ac.kr
3Supercomputing Center, KISTI
Taejon, Korea
hwpark@hpcnet.ne.kr
 
 
Abstract
 
 
Our research objective is to provide checkpoint-based fault tolerance for message-passing processes on grids. To this end, we have implemented MPICH-GF, which stands for fault-tolerant MPICH-G2. MPICH-GF consists of hierarchical process managers and the MPICH-GF library. The hierarchical managers control parallel processes, monitor failures, and recover failed processes automatically. The MPICH-GF library provides dynamic process management, a checkpoint library, and checkpointing and message logging protocols. The current MPICH-GF implementation supports a coordinated checkpointing protocol, in which processes coordinate and then generate checkpoint files simultaneously, and all processes roll back on a single failure. We have successfully tested the NAS Parallel Benchmark applications with MPICH-GF. MPICH-GF guarantees that no messages are lost in a global checkpoint, and it supports non-blocking message transfer. While most previous related work adopts an indirect message transfer method, processes in MPICH-GF transfer messages directly to their target processes. Independent checkpointing with message logging and a highly available process manager implementation are in progress.