Fault Tolerance in Massively Parallel Systems
- Deconinck, G.; Vounckx, J.; Cuyvers, R.; Lauwereins, R.; Bieker, B.; Wileke, H.; Maehle, E.; Hein, A.; Balbach, F.; Altmann, Jorn; Dal Cin, M.; Madeira, H.; Silva, J.G.; Wagner, R.; Viehover, G.
- John Wiley & Sons
- Transputer Communications, 2(4), 241-257
- fault tolerance; massively parallel system; error detection; fault diagnosis; backward error recovery; reconfiguration
- In massively parallel systems (MPS), fault tolerance is indispensable to obtain proper completion
of long-running computation-intensive applications. To achieve this at reasonably low cost,
we present a global approach. A flexible and powerful backbone is provided through the combination
of hardware and software error detection techniques, fault diagnosis and operator-site
software together with reconfiguration of the system. Application recovery is based on checkpointing
and rollback. The guiding thread (i.e. applicability for a massively parallel system) comprises
scalability as well as simplicity. A unifying system model is introduced that allows the mapping
of a global concept for fault tolerance to a wide variety of MPS. The framework for
implementation in an existing MPS is discussed.