Detailed Information
A Scalable Implementation of Fault Tolerance for Massively Parallel Systems
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Deconinck, Geert | - |
dc.contributor.author | Vounckx, Johan | - |
dc.contributor.author | Lauwereins, Rudy | - |
dc.contributor.author | Altmann, Jorn | - |
dc.contributor.author | Balbach, Frank | - |
dc.date.accessioned | 2009-08-10T23:53:34Z | - |
dc.date.available | 2009-08-10T23:53:34Z | - |
dc.date.issued | 1996-05 | - |
dc.identifier.citation | IEEE MPCS1996, 2nd International Conference on Massively Parallel Computing Systems, pp. 205-212, Ischia, Italy, May 1996 | en |
dc.identifier.uri | https://hdl.handle.net/10371/6861 | - |
dc.description.abstract | For massively parallel systems, the probability of a system failure due to a random hardware fault becomes statistically very significant because of the huge number of components. Besides, fault injection experiments show that multiple failures go undetected, leading to incorrect results. Hence, massively parallel systems require abilities to tolerate these faults that will occur. The FTMPS project presents a scalable implementation to integrate the different steps to fault tolerance into existing HPC systems. On the initial parallel system only 40% of (randomly injected) faults do not cause the application to crash or produce wrong results. In the resulting FTMPS prototype more than 80% of these faults are correctly detected and recovered. Resulting overhead for the application is only between 10 and 20%. Evaluation of the different co-operating fault tolerance modules shows the flexibility and the scalability of the approach. | en |
dc.description.sponsorship | This project is partly sponsored by ESPRIT project 6731 (FTMPS): "Fault Tolerance in Massively Parallel Systems". Geert Deconinck and Johan Vounckx have a grant from the Flemish Institute for the Advancement of Scientific and Technological Research in Industry (IWT). Rudy Lauwereins is a Senior Research Associate of the Belgian Fund for Scientific Research. | en |
dc.language.iso | en | en |
dc.publisher | IEEE MPCS1996 | en |
dc.title | A Scalable Implementation of Fault Tolerance for Massively Parallel Systems | en |
dc.type | Conference Paper | en |
Items in S-Space are protected by copyright, with all rights reserved, unless otherwise indicated.