Fault Tolerance in Massively Parallel Systems
- Authors
- Issue Date
- 1994-12
- Publisher
- John Wiley & Sons
- Citation
- Transputer Communications, 2(4), 241-257
- Keywords
- fault tolerance ; massively parallel system ; error detection ; fault diagnosis ; backward error recovery ; reconfiguration
- Abstract
- In massively parallel systems (MPS), fault tolerance is indispensable for the proper completion
of long-running, computation-intensive applications. To achieve this at reasonably low cost,
we present a global approach. A flexible and powerful backbone is provided through the combination
of hardware and software error detection techniques, fault diagnosis and operator-site
software, together with reconfiguration of the system. Application recovery is based on checkpointing
and rollback. The guiding thread (i.e. applicability to a massively parallel system) comprises
scalability as well as simplicity. A unifying system model is introduced that allows the mapping
of a global concept for fault tolerance to a wide variety of MPS. The framework for
implementation in an existing MPS is discussed.
- ISSN
- 1070-454X
- Language
- English
Items in S-Space are protected by copyright, with all rights reserved, unless otherwise indicated.