On Integrating Error Detection into a Fault Diagnosis Algorithm for Massively Parallel Computers
- Authors
- Altmann, Jorn ; Bartha, Tamas ; Pataricza, Andras
- Issue Date
- 1995-04
- Publisher
- IPDS1995
- Citation
- IPDS1995, 1st International Computer Performance and Dependability Symposium, pp. 154-164, Erlangen, Germany, April 1995
- Keywords
- Error detection; distributed diagnosis; syndrome decoding; massively parallel systems
- Abstract
- Scalable fault diagnosis is necessary for constructing
fault tolerance mechanisms in large massively parallel multiprocessor
systems. The diagnosis algorithm must operate
efficiently even if the system consists of several thousand
processors. In this paper we introduce an event-driven, distributed
system-level diagnosis algorithm. It uses a small
number of messages and is based on a general diagnosis
model that places no limit on the number of simultaneously
existing faults (an important requirement for massively
parallel computers). The algorithm integrates both message-based
error detection techniques and built-in
hardware mechanisms. The structure of the implemented
algorithm is presented, and the essential program modules
are described. The paper also discusses the use of test results
generated by error detection mechanisms for fault localization.
Measurement results illustrate the effect of the
diagnosis algorithm, in particular of the message-based error
detection mechanism, on application performance.
- Language
- English