| MPICH-GF: Providing Fault Tolerance on Grid Environments |
| Namyoon Woo¹, Soonho Choi¹, Hyungsoo Jung¹, Junghwan Moon¹, Heon Y. Yeom¹, Taesoon Park², and Hyoungwoo Park³ |
| ¹School of Computer Science and Engineering, Seoul National University, Seoul 151-742, KOREA {nywoo, shchoi, jhs, jhmoon, yeom}@dcslab.snu.ac.kr |
| ²Department of Computer Engineering, Sejong University, Seoul 143-747, KOREA tspark@kunja.sejong.ac.kr |
| ³Supercomputing Center, KISTI, Taejon, Korea hwpark@hpcnet.ne.kr |
| Abstract |
Our research objective is to provide checkpoint-based fault tolerance for message-passing processes on grids. To this end, we have implemented MPICH-GF, a fault-tolerant version of MPICH-G2. MPICH-GF consists of hierarchical process managers and the MPICH-GF library. The hierarchical managers control parallel processes, monitor failures, and recover failed processes automatically. The MPICH-GF library includes dynamic process management, a checkpoint library, and checkpointing and message logging protocols. The current MPICH-GF implementation supports a coordinated checkpointing protocol, in which processes generate checkpoint files simultaneously after coordination and all processes roll back on a single failure. We have successfully tested NAS Parallel Benchmark applications with MPICH-GF. MPICH-GF guarantees that no message is lost in a global checkpoint and supports non-blocking message transfer. While most previous related work adopts an indirect message transfer method, processes in MPICH-GF transfer messages directly to their target processes. Independent checkpointing with message logging and a highly available process manager implementation are in progress. |