A Scalable Implementation of Fault Tolerance for Massively Parallel Systems
- Authors
- Issue Date
- 1996-05
- Publisher
- IEEE MPCS1996
- Citation
- IEEE MPCS1996, 2nd International Conference on Massively Parallel Computing Systems, pp. 205-212, Ischia, Italy, May 1996
- Abstract
- For massively parallel systems, the probability of a
system failure due to a random hardware fault becomes
statistically very significant because of the huge number
of components. Besides, fault injection experiments show
that multiple failures go undetected, leading to incorrect
results. Hence, massively parallel systems require
abilities to tolerate these faults that will occur. The
FTMPS project presents a scalable implementation to
integrate the different steps to fault tolerance into existing
HPC systems. On the initial parallel system only 40% of
(randomly injected) faults do not cause the application to
crash or produce wrong results. In the resulting FTMPS
prototype more than 80% of these faults are correctly
detected and recovered. Resulting overhead for the
application is only between 10 and 20%. Evaluation of
the different, co-operating fault tolerance modules shows
the flexibility and the scalability of the approach.
- Language
- English