On Integrating Error Detection into a Fault Diagnosis Algorithm for Massively Parallel Computers
- Authors
- Issue Date
- 1995-04
- Publisher
- IPDS1995
- Citation
- IPDS1995, 1st International Computer Performance and Dependability Symposium, pp. 154-164, Erlangen, Germany, April 1995
- Keywords
- Error detection; distributed diagnosis; syndrome decoding; massively parallel systems
- Abstract
- Scalable fault diagnosis is necessary for constructing
fault tolerance mechanisms in large massively parallel multiprocessor
systems. The diagnosis algorithm must operate
efficiently even if the system consists of several thousand
processors. In this paper we introduce an event-driven, distributed
system-level diagnosis algorithm. It uses a small
number of messages and is based on a general diagnosis
model that places no limit on the number of simultaneously
existing faults (an important requirement for massively
parallel computers). The algorithm integrates both message-based
error detection techniques and built-in
hardware mechanisms. The structure of the implemented
algorithm is presented, and the essential program modules
are described. The paper also discusses the use of test results
generated by error detection mechanisms for fault localization.
Measurement results illustrate the effect of the
diagnosis algorithm, in particular of its message-based error
detection mechanism, on application performance.
- Language
- English