
Fault Tolerance in Massively Parallel Systems

Authors
Deconinck, G.; Vounckx, J.; Cuyvers, R.; Lauwereins, R.; Bieker, B.; Wileke, H.; Maehle, E.; Hein, A.; Balbach, F.; Altmann, Jorn; Dal Cin, M.; Madeira, H.; Silva, J.G.; Wagner, R.; Viehover, G.
Issue Date
1994-12
Publisher
John Wiley & Sons
Citation
Transputer Communications, 2(4), 241-257
Keywords
fault tolerance; massively parallel system; error detection; fault diagnosis; backward error recovery; reconfiguration
Abstract
In massively parallel systems (MPS), fault tolerance is indispensable for the proper completion
of long-running, computation-intensive applications. To achieve this at reasonably low cost,
we present a global approach. A flexible and powerful backbone is provided through the combination
of hardware and software error detection techniques, fault diagnosis, and operator-site
software, together with reconfiguration of the system. Application recovery is based on checkpointing
and rollback. The guiding principle (i.e. applicability to a massively parallel system) comprises
scalability as well as simplicity. A unifying system model is introduced that allows the mapping
of a global concept for fault tolerance to a wide variety of MPS. The framework for
implementation in an existing MPS is discussed.
ISSN
1070-454X
Language
English
URI
http://hdl.handle.net/10371/6897
Appears in Collections:
College of Engineering/Engineering Practice School (공과대학/대학원) > Program in Technology, Management, Economics and Policy (협동과정-기술·경영·경제·정책전공) > Journal Papers (저널논문_협동과정-기술·경영·경제·정책전공)

