Batched Triangular Dense Linear Algebra Kernels for Very Small Matrix Sizes on GPUs

Charara, Ali; Keyes, David Elliot; Ltaief, Hatem

Published in

Association for Computing Machinery (ACM), ACM Transactions on Mathematical Software, 2(45), p. 1-28, 2019

DOI: 10.1145/3267101

Journal article published in 2019 by Ali Charara, David Elliot Keyes, Hatem Ltaief

Abstract

Batched dense linear algebra kernels are becoming ubiquitous in scientific applications, ranging from tensor contractions in deep learning to data compression in hierarchical low-rank matrix approximation. Within a single API call, these kernels can launch up to thousands of similar matrix computations simultaneously, removing the expensive overhead of multiple API calls while increasing the occupancy of the underlying hardware. A challenge is that, across the existing hardware landscape (x86, GPUs, etc.), vendors implement only a subset of the required batched operations, with limited support for very small problem sizes. We describe the design and performance of a new class of batched triangular dense linear algebra kernels on very small data sizes (up to 256) using single and multiple GPUs. By deploying recursive formulations, stressing register usage, maintaining data locality, reducing thread synchronization, and fusing successive kernel calls, the new batched kernels outperform existing state-of-the-art implementations.
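As a rough illustration of two ideas named in the abstract, a batched interface and a recursive formulation of a triangular operation, the following CPU-only sketch solves many small lower-triangular systems L x = b in one call. It is not the authors' GPU implementation: the function names `solve_lower_recursive` and `batched_trsv_lower`, the `base` cutoff, and the list-of-lists storage are all assumptions made for this example.

```python
def solve_lower_recursive(L, b, lo, hi, base=2):
    """Solve L[lo:hi, lo:hi] x = b[lo:hi] in place; b is overwritten with x.

    Hypothetical helper for illustration only. L is a lower-triangular
    matrix stored as a list of rows.
    """
    n = hi - lo
    if n <= base:
        # Base case: plain forward substitution on a tiny block.
        for i in range(lo, hi):
            s = b[i]
            for j in range(lo, i):
                s -= L[i][j] * b[j]
            b[i] = s / L[i][i]
        return
    mid = lo + n // 2
    # Split L into [[L11, 0], [L21, L22]] and recurse on the diagonal blocks.
    solve_lower_recursive(L, b, lo, mid, base)   # solve L11 x1 = b1
    for i in range(mid, hi):                     # update b2 -= L21 x1
        for j in range(lo, mid):
            b[i] -= L[i][j] * b[j]
    solve_lower_recursive(L, b, mid, hi, base)   # solve L22 x2 = b2


def batched_trsv_lower(batch_L, batch_b, base=2):
    """Solve every system in the batch with a single call (assumed interface)."""
    out = []
    for L, b in zip(batch_L, batch_b):
        x = list(b)  # copy so the caller's right-hand side is untouched
        solve_lower_recursive(L, x, 0, len(x), base)
        out.append(x)
    return out


if __name__ == "__main__":
    mats = [[[2, 0, 0], [1, 1, 0], [3, 2, 4]], [[1, 0], [1, 1]]]
    rhs = [[2, 3, 15], [1, 3]]
    print(batched_trsv_lower(mats, rhs))
```

On a GPU, the outer loop over the batch would typically map to independent thread blocks, and the off-diagonal update is where most of the arithmetic is concentrated; this sketch only shows the structure of the recursion, not any performance technique.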
