Dissemin is shutting down on January 1st, 2025

Published in

Springer, Lecture Notes in Computer Science, p. 207-216, 2013

DOI: 10.1007/978-3-642-36949-0_23

Links

Tools

Export citation

Search in Google Scholar

Systematic approach in optimizing numerical memory-bound kernels on GPU

Proceedings article published in 2013 by Ahmad M. Abdelfattah, David E. Keyes ORCID, Hatem Ltaief
This paper is available in a repository.
This paper is available in a repository.

Full text: Download

Red circle
Preprint: archiving forbidden
Orange circle
Postprint: archiving restricted
Red circle
Published version: archiving forbidden
Data provided by SHERPA/RoMEO

Abstract

The use of GPUs has been very beneficial in accelerating dense linear algebra computational kernels (DLA). Many high performance numerical libraries like CUBLAS, MAGMA, and CULA provide BLAS and LAPACK implementations on GPUs as well as hybrid computations involving both, CPUs and GPUs. GPUs usually score better performance than CPUs for compute-bound operations, especially those characterized by a regular data access pattern. This paper highlights a systematic approach for efficiently implementing memory-bound DLA kernels on GPUs, by taking advantage of the underlying device's architecture (e.g., high throughput). This methodology proved to outperform existing state-of-the-art GPU implementations for the symmetric matrix-vector multiplication (SYMV), characterized by an irregular data access pattern, in a recent work (Abdelfattah et. al, VECPAR 2012). We propose to extend this methodology to the general matrix-vector multiplication (GEMV) kernel. The performance results show that our GEMV implementation achieves better performance for relatively small to medium matrix sizes, making it very influential in calculating the Hessenberg and bidiagonal reductions of general matrices (radar applications), which are the first step toward computing eigenvalues and singular values, respectively. Considering small and medium size matrices (≤4500), our GEMV kernel achieves an average 60% improvement in single precision (SP) and an average 25% in double precision (DP) over existing open-source and commercial software solutions. These results improve reduction algorithms for both small and large matrices. The improved GEMV performances engender an averge 30% (SP) and 15% (DP) in Hessenberg reduction and up to 25% (SP) and 14% (DP) improvement for the bidiagonal reduction over the implementation provided by CUBLAS 5.0. © 2013 Springer-Verlag.