A Scalable Implementation of Fault Tolerance for Massively Parallel Systems

DC Field | Value | Language
dc.contributor.author | Deconinck, Geert | -
dc.contributor.author | Vounckx, Johan | -
dc.contributor.author | Lauwereins, Rudy | -
dc.contributor.author | Altmann, Jorn | -
dc.contributor.author | Balbach, Frank | -
dc.date.accessioned | 2009-08-10T23:53:34Z | -
dc.date.available | 2009-08-10T23:53:34Z | -
dc.date.issued | 1996-05 | -
dc.identifier.citation | IEEE MPCS1996, 2nd International Conference on Massively Parallel Computing Systems, pp. 205-212, Ischia, Italy, May 1996 | en
dc.identifier.uri | https://hdl.handle.net/10371/6861 | -
dc.description.abstract | For massively parallel systems, the probability of a system failure due to a random hardware fault becomes statistically very significant because of the huge number of components. Besides, fault injection experiments show that multiple failures go undetected, leading to incorrect results. Hence, massively parallel systems require abilities to tolerate these faults that will occur. The FTMPS project presents a scalable implementation to integrate the different steps to fault tolerance into existing HPC systems. On the initial parallel system only 40% of (randomly injected) faults do not cause the application to crash or produce wrong results. In the resulting FTMPS prototype more than 80% of these faults are correctly detected and recovered. Resulting overhead for the application is only between 10 and 20%. Evaluation of the different, co-operating fault tolerance modules shows the flexibility and the scalability of the approach. | en
dc.description.sponsorship | This project is partly sponsored by ESPRIT project 6731 (FTMPS): "Fault Tolerance in Massively Parallel Systems". Geert Deconinck and Johan Vounckx have a grant from the Flemish Institute for the Advancement of Scientific and Technological Research in Industry (IWT). Rudy Lauwereins is a Senior Research Associate of the Belgian Fund for Scientific Research. | en
dc.language.iso | en | en
dc.publisher | IEEE MPCS1996 | en
dc.title | A Scalable Implementation of Fault Tolerance for Massively Parallel Systems | en
dc.type | Conference Paper | en
Items in S-Space are protected by copyright, with all rights reserved, unless otherwise indicated.