A Scalable Implementation of Fault Tolerance for Massively Parallel Systems

Authors: Deconinck, Geert; Vounckx, Johan; Lauwereins, Rudy; Altmann, Jorn; Balbach, Frank

Citation: IEEE MPCS1996, 2nd International Conference on Massively Parallel Computing Systems, pp. 205-212, Ischia, Italy, May 1996

Abstract: For massively parallel systems, the probability of a
system failure due to a random hardware fault becomes
statistically very significant because of the huge number
of components. Besides, fault injection experiments show
that multiple failures go undetected, leading to incorrect
results. Hence, massively parallel systems require
abilities to tolerate these faults that will occur. The
FTMPS project presents a scalable implementation to
integrate the different steps to fault tolerance into existing
HPC systems. On the initial parallel system only 40% of
(randomly injected) faults do not cause the application to
crash or produce wrong results. In the resulting FTMPS
prototype more than 80% of these faults are correctly
detected and recovered. Resulting overhead for the
application is only between 10 and 20%. Evaluation of
the different, co-operating fault tolerance modules shows
the flexibility and the scalability of the approach.
