Eagle: A Novel Fault-Tolerant System for Large-Scale Computational Grids
 
Akikazu Hattori, Takashi Yokota, Kanemitsu Ootsu, Fumihito Furukawa, and Takanobu Baba
 
Department of Information Science, Faculty of Engineering
Utsunomiya University
 
 
Abstract
 
 
Very large-scale problems have emerged in some scientific and engineering fields. Large-scale computational grids are welcomed as a promising solution, but the massive number of nodes involved raises reliability concerns. We present a novel fault-tolerant system, named Eagle, for large-scale computational grids. The system is based on the pessimistic logging protocol. We introduce internal stable storages (ISSs), external stable storages (ESSs), and failure detectors. An ISS is responsible for logging communication messages and checkpoint information. Furthermore, ISSs can virtualize individual computation nodes within a site. An ESS is responsible for inter-site communication. A failure detector monitors the liveness of nodes and/or processes. The proposed system is evaluated with our simulator, EGsim. The experimental results show the high potential and robustness of the Eagle system.
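
As an illustration of the general idea behind pessimistic logging, the following minimal sketch records each incoming message in stable storage before delivering it, so that a failed process can be rebuilt from its last checkpoint plus the logged messages. It is not the Eagle implementation: the names StableStorage and Process, the integer messages, and the summing state are all assumptions made for this example, and the stable storage is simulated in memory.

```python
from dataclasses import dataclass, field


@dataclass
class StableStorage:
    """Hypothetical stand-in for an ISS: holds one checkpoint and a message log."""
    checkpoint: int = 0
    log: list = field(default_factory=list)

    def append(self, message: int) -> None:
        self.log.append(message)

    def save_checkpoint(self, state: int) -> None:
        # Messages received before the checkpoint are already reflected in it,
        # so the log can be truncated.
        self.checkpoint = state
        self.log.clear()


@dataclass
class Process:
    """Toy deterministic process whose state is the sum of received messages."""
    storage: StableStorage
    state: int = 0

    def receive(self, message: int) -> None:
        # Pessimistic rule: log the message first, deliver it second.
        self.storage.append(message)
        self.state += message

    def take_checkpoint(self) -> None:
        self.storage.save_checkpoint(self.state)

    @classmethod
    def recover(cls, storage: StableStorage) -> "Process":
        # Restart from the checkpoint and replay the logged messages in order.
        process = cls(storage=storage, state=storage.checkpoint)
        for message in storage.log:
            process.state += message
        return process


storage = StableStorage()
process = Process(storage)
process.receive(3)
process.take_checkpoint()
process.receive(4)
process.receive(5)

# Simulate a crash of `process` by rebuilding it from stable storage only.
recovered = Process.recover(storage)
```

Because every message reaches the log before it affects the process state, the recovered process ends in the same state as the one that failed. The price is a write to stable storage on every receive.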