
A Scalable Implementation of Fault Tolerance for Massively Parallel Systems

Authors
Deconinck, Geert; Vounckx, Johan; Lauwereins, Rudy; Altmann, Jorn; Balbach, Frank
Issue Date
1996-05
Publisher
IEEE MPCS1996
Citation
IEEE MPCS1996, 2nd International Conference on Massively Parallel Computing Systems, pp. 205-212, Ischia, Italy, May 1996
Abstract
For massively parallel systems, the probability of a
system failure due to a random hardware fault becomes
statistically very significant because of the huge number
of components. Besides, fault injection experiments show
that multiple failures go undetected, leading to incorrect
results. Hence, massively parallel systems require
abilities to tolerate these faults that will occur. The
FTMPS project presents a scalable implementation to
integrate the different steps to fault tolerance into existing
HPC systems. On the initial parallel system only 40% of
(randomly injected) faults do not cause the application to
crash or produce wrong results. In the resulting FTMPS
prototype more than 80% of these faults are correctly
detected and recovered. Resulting overhead for the
application is only between 10 and 20%. Evaluation of
the different, co-operating fault tolerance modules shows
the flexibility and the scalability of the approach.
Language
English
URI
http://hdl.handle.net/10371/6861
Appears in Collections:
College of Engineering/Engineering Practice School (공과대학/대학원)
Program in Technology, Management, Economics and Policy (협동과정-기술·경영·경제·정책전공)
Others_협동과정-기술·경영·경제·정책전공

