On Integrating Error Detection into a Fault Diagnosis Algorithm for Massively Parallel Computers

Authors

Altmann, Jorn; Bartha, Tamas; Pataricza, Andras

Issue Date
1995-04
Publisher
IPDS1995
Citation
IPDS1995, 1st International Computer Performance and Dependability Symposium, pp. 154-164, Erlangen, Germany, April 1995
Keywords
Error detection, distributed diagnosis, syndrome decoding, massively parallel systems
Abstract
Scalable fault diagnosis is necessary for constructing fault tolerance mechanisms in large, massively parallel multiprocessor systems. The diagnosis algorithm must operate efficiently even if the system consists of several thousand processors. In this paper we introduce an event-driven, distributed system-level diagnosis algorithm. It uses a small number of messages and is based on a general diagnosis model that places no limit on the number of simultaneously existing faults (an important requirement for massively parallel computers). The algorithm integrates both message-based error detection techniques and built-in hardware mechanisms. The structure of the implemented algorithm is presented, and the essential program modules are described. The paper also discusses the use of test results generated by error detection mechanisms for fault localization. Measurement results illustrate the effect of the diagnosis algorithm, in particular of the message-based error detection mechanism, on application performance.
Language
English
URI
https://hdl.handle.net/10371/6875