On Integrating Error Detection into a Fault Diagnosis Algorithm for Massively Parallel Computers
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Altmann, Jorn | - |
dc.contributor.author | Bartha, Tamas | - |
dc.contributor.author | Pataricza, Andras | - |
dc.date.accessioned | 2009-08-11 | - |
dc.date.available | 2009-08-11 | - |
dc.date.issued | 1995-04 | - |
dc.identifier.citation | IPDS1995, 1st International Computer Performance and Dependability Symposium, pp. 154-164, Erlangen, Germany, April 1995 | en |
dc.identifier.uri | https://hdl.handle.net/10371/6875 | - |
dc.description.abstract | Scalable fault diagnosis is necessary for constructing fault tolerance mechanisms in large massively parallel multiprocessor systems. The diagnosis algorithm must operate efficiently even if the system consists of several thousand processors. In this paper we introduce an event-driven, distributed system-level diagnosis algorithm. It uses a small number of messages and is based on a general diagnosis model without the limitation of the number of simultaneously existing faults (an important requirement for massively parallel computers). The algorithm integrates both error detection techniques like messages, and built-in hardware mechanisms. The structure of the implemented algorithm is presented, and the essential program modules are described. The paper also discusses the use of test results generated by error detection mechanisms for fault localization. Measurement results illustrate the effect of the diagnosis algorithm, in particular the error detection mechanism by messages, on the application performance. | en |
dc.description.sponsorship | Supported by the EU (European Union) as part of the Esprit Project 6731, Fault Tolerance for Massively Parallel Systems, and the Hungarian-German Joint Scientific Research Project #70, with additional support from OTKA-F007414. | en |
dc.language.iso | en | en |
dc.publisher | IPDS1995 | en |
dc.subject | Error detection | en |
dc.subject | distributed diagnosis | en |
dc.subject | syndrome decoding | en |
dc.subject | massively parallel systems | en |
dc.title | On Integrating Error Detection into a Fault Diagnosis Algorithm for Massively Parallel Computers | en |
dc.type | Conference Paper | en |
Items in S-Space are protected by copyright, with all rights reserved, unless otherwise indicated.