Tests and tolerances for high-performance software-implemented fault detection

Turmon, M.; Granat, R.; Katz, D. S.; Lou, J. Z.

Published in

Institute of Electrical and Electronics Engineers, IEEE Transactions on Computers, 5(52), p. 579-591, 2003

DOI: 10.1109/tc.2003.1197125

Journal article published in 2003 by M. Turmon, R. Granat, D. S. Katz, J. Z. Lou

Abstract

We describe and test a software approach to fault detection in common numerical algorithms. Such result checking or algorithm-based fault tolerance (ABFT) methods may be used, for example, to overcome single-event upsets in computational hardware or to detect errors in complex, high-efficiency implementations of the algorithms. Following earlier work, we use checksum methods to validate results returned by a numerical subroutine operating subject to unpredictable errors in data. We consider common matrix and Fourier algorithms which return results satisfying a necessary condition having a linear form; the checksum tests compliance with this condition. We discuss the theory and practice of setting numerical tolerances to separate errors caused by a fault from those inherent in finite-precision floating-point calculations. We concentrate on comprehensively defining and evaluating tests having various accuracy/computational burden tradeoffs, and we emphasize average-case algorithm behavior rather than using worst-case upper bounds on error.
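As an illustration of the kind of linear checksum test the abstract describes, the following minimal sketch checks a claimed matrix product C = AB by comparing Cw against A(Bw) for a checksum vector w. It is not the paper's implementation: the all-ones choice of w, the function names, the `scale` safety factor, and the particular tolerance formula (machine epsilon times a dimension factor times infinity norms) are assumptions made here for illustration only.

```python
import sys


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(A, B):
    """Plain triple-loop matrix product, standing in for the routine under test."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def inf_norm(M):
    """Matrix infinity norm: the largest absolute row sum."""
    return max(sum(abs(x) for x in row) for row in M)


def checksum_ok(A, B, C, scale=10.0):
    """Return True if C passes the linear check C w == A (B w) within tolerance.

    Assumptions (not taken from the paper): w is the all-ones vector, and the
    tolerance is a heuristic multiple of machine epsilon and the input norms.
    """
    inner, n = len(B), len(B[0])
    w = [1.0] * n                          # checksum vector, infinity norm 1
    lhs = matvec(C, w)                     # checksum of the returned result
    rhs = matvec(A, matvec(B, w))          # same quantity computed cheaply
    residual = max(abs(l - r) for l, r in zip(lhs, rhs))
    # Heuristic tolerance: rounding error grows with the summation lengths
    # and with the magnitudes of A and B.
    eps = sys.float_info.epsilon
    tol = scale * (inner + n) * eps * inf_norm(A) * inf_norm(B)
    return residual <= tol


A = [[0.1, 0.2], [0.3, 0.4]]
B = [[0.5, 0.6], [0.7, 0.8]]

C_good = matmul(A, B)
C_bad = [row[:] for row in C_good]
C_bad[0][1] += 1e-3                        # simulated fault in one entry

print(checksum_ok(A, B, C_good))
print(checksum_ok(A, B, C_bad))
```

The check costs three matrix-vector products rather than a second matrix product, which is what makes it attractive as a result test. The condition is necessary but not sufficient: with a fixed w, errors that happen to cancel within a row sum would pass unnoticed, and a tolerance set too tightly would flag ordinary rounding error as a fault.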
