Fault Tolerance in Massively Parallel Systems

Authors: Deconinck, G.; Vounckx, J.; Cuyvers, R.; Lauwereins, R.; Bieker, B.; Wileke, H.; Maehle, E.; Hein, A.; Balbach, F.; Altmann, Jorn; Dal Cin, M.; Madeira, H.; Silva, J.G.; Wagner, R.; Viehover, G.

Keywords: fault tolerance; massively parallel system; error detection; fault diagnosis; backward error recovery; reconfiguration

Abstract: In massively parallel systems (MPS), fault tolerance is indispensable for the proper completion of long-running, computation-intensive applications. To achieve this at reasonably low cost, we present a global approach. A flexible and powerful backbone is provided by combining hardware and software error detection techniques, fault diagnosis, and operator-site software with reconfiguration of the system. Application recovery is based on checkpointing and rollback. The guiding principle (i.e. applicability to a massively parallel system) comprises scalability as well as simplicity. A unifying system model is introduced that allows a global fault tolerance concept to be mapped onto a wide variety of MPS. The framework for implementation in an existing MPS is discussed.
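
The abstract names checkpointing and rollback (backward error recovery) as the basis for application recovery. As an illustration of the general idea only, and not of the authors' implementation, the following minimal single-process Python sketch saves the application state periodically and restores the last saved state when an error is detected. The class name CheckpointedTask, the checkpoint_every parameter, and the simulated transient faults are all hypothetical and do not come from the paper.

```python
import copy


class CheckpointedTask:
    """Toy long-running computation: sums the integers 0..n-1 one step at a time."""

    def __init__(self, n, checkpoint_every=10):
        self.n = n
        self.checkpoint_every = checkpoint_every  # hypothetical checkpoint interval
        self.state = {"step": 0, "total": 0}
        # The initial state is the first checkpoint, so a rollback is always possible.
        self._checkpoint = copy.deepcopy(self.state)

    def save_checkpoint(self):
        # Keep an independent copy of the current state.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Backward error recovery: discard the current state, restore the saved one.
        self.state = copy.deepcopy(self._checkpoint)

    def run(self, faulty_steps=()):
        """Run to completion; each step in faulty_steps fails once (transient fault)."""
        pending_faults = set(faulty_steps)
        rollbacks = 0
        while self.state["step"] < self.n:
            step = self.state["step"]
            if step in pending_faults:
                pending_faults.discard(step)  # the fault does not recur after recovery
                self.state["total"] = -1      # simulated state corruption
                self.rollback()               # error detected: return to last checkpoint
                rollbacks += 1
                continue
            self.state["total"] += step
            self.state["step"] = step + 1
            if self.state["step"] % self.checkpoint_every == 0:
                self.save_checkpoint()
        return self.state["total"], rollbacks


task = CheckpointedTask(n=100, checkpoint_every=10)
total, rollbacks = task.run(faulty_steps={15, 57})
print(total, rollbacks)
```

In this sketch the work lost to a fault is bounded by the checkpoint interval: a shorter interval means less recomputation after a rollback but more time spent saving state. A real MPS would additionally have to coordinate checkpoints across many nodes, which this single-process sketch does not attempt.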
