Fault-Tolerant High-Performance Matrix Multiplication: Theory and Practice.

Gunnels, John A.; van de Geijn, Robert A.; Katz, Daniel S.; Quintana Ortí, Enrique S.

Published in

Proceedings International Conference on Dependable Systems and Networks

DOI: 10.1109/dsn.2001.941390

Proceedings article published in 2001 by John A. Gunnels, Robert A. van de Geijn, Daniel S. Katz, Enrique S. Quintana Ortí

This paper is made freely available by the publisher.

Preprint: archiving allowed

Postprint: archiving allowed

Published version: archiving forbidden

Abstract

We extend the theory and practice regarding algorithmic fault-tolerant matrix-matrix multiplication, C=AB, in a number of ways. First, we propose low-overhead methods for detecting errors introduced not only in C but also in A and/or B. Second, we show that, theoretically, these methods will detect all errors as long as only one entry is corrupted. Third, we propose a low-overhead roll-back approach to correct errors once detected. Finally, we give a high-performance implementation of matrix-matrix multiplication that incorporates these error detection and correction methods. Empirical results demonstrate that these methods work well in practice while imposing an acceptable level of overhead relative to high-performance implementations without fault-tolerance.
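The abstract does not spell out the detection scheme itself. As a generic illustration of checksum-based detection for C=AB, and not necessarily the authors' method, the sketch below compares C w with A (B w) for a vector w of ones; a single corrupted entry of C shows up as a mismatch in its row. The function names, the choice of w, and the relative tolerance `tol` are all assumptions made for this example.

```python
def matmul(A, B):
    """Plain product of two list-of-lists matrices."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def matvec(M, x):
    """Matrix-vector product M x."""
    return [sum(row[j] * x[j] for j in range(len(x))) for row in M]


def checksum_mismatch_rows(A, B, C, tol=1e-9):
    """Return row indices i where (C w)_i differs from (A (B w))_i, with w = ones.

    Both sides cost only matrix-vector products, which is cheap next to the
    full product. `tol` is an assumed relative tolerance that absorbs
    floating-point rounding.
    """
    w = [1.0] * len(B[0])
    lhs = matvec(C, w)
    rhs = matvec(A, matvec(B, w))
    return [i for i, (l, r) in enumerate(zip(lhs, rhs))
            if abs(l - r) > tol * max(1.0, abs(l), abs(r))]


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = matmul(A, B)
clean = checksum_mismatch_rows(A, B, C)    # no rows flagged for a correct C

C[1][0] += 0.5                             # simulate one corrupted entry of C
flagged = checksum_mismatch_rows(A, B, C)  # the corrupted row is flagged
```

In a roll-back style of correction, a flagged result would simply be recomputed from A and B; checking smaller blocks as they are produced would limit how much work has to be redone.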
