Fault Tolerance in Massively Parallel Systems

Authors

Deconinck, G.; Vounckx, J.; Cuyvers, R.; Lauwereins, R.; Bieker, B.; Wileke, H.; Maehle, E.; Hein, A.; Balbach, F.; Altmann, Jorn; Dal Cin, M.; Madeira, H.; Silva, J.G.; Wagner, R.; Viehover, G.

Issue Date
1994-12
Publisher
John Wiley & Sons
Citation
Transputer Communications, 2(4), 241-257
Keywords
fault tolerance; massively parallel system; error detection; fault diagnosis; backward error recovery; reconfiguration
Abstract
In massively parallel systems (MPS), fault tolerance is indispensable for the proper completion
of long-running, computation-intensive applications. To achieve this at reasonably low cost,
we present a global approach. A flexible and powerful backbone is provided through the combination
of hardware and software error detection techniques, fault diagnosis and operator-site
software, together with reconfiguration of the system. Application recovery is based on checkpointing
and rollback. The guiding thread (i.e. applicability to a massively parallel system) comprises
scalability as well as simplicity. A unifying system model is introduced that allows the mapping
of a global concept for fault tolerance onto a wide variety of MPS. The framework for
implementation in an existing MPS is discussed.
ISSN
1070-454X
Language
English
URI
https://hdl.handle.net/10371/6897