Massively Parallel Polar Decomposition on Distributed-memory Systems

Ltaief, Hatem; Sukkari, Dalal E.; Esposito, Aniello; Nakatsukasa, Yuji; Keyes, David Elliot

Published in

Association for Computing Machinery (ACM), ACM Transactions on Parallel Computing, 1(6), p. 1-15, 2019

DOI: 10.1145/3328723

Tools

Export citation

Search in Google Scholar

Massively Parallel Polar Decomposition on Distributed-memory Systems

Journal article published in 2019 by Hatem Ltaief, Dalal E. Sukkari, Aniello Esposito, Yuji Nakatsukasa, David Elliot Keyes

This paper was not found in any repository, but could be made available legally by the author.

Full text: Unavailable

Preprint: archiving allowed

Upload

Postprint: archiving allowed

Upload

Published version: archiving forbidden

Policy details

Data provided by

Abstract

We present a high-performance implementation of the Polar Decomposition (PD) on distributed-memory systems. Building upon on the QR-based Dynamically Weighted Halley (QDWH) algorithm, the key idea lies in finding the best rational approximation for the scalar sign function, which also corresponds to the polar factor for symmetric matrices, to further accelerate the QDWH convergence. Based on the Zolotarev rational functions—introduced by Zolotarev (ZOLO) in 1877—this new PD algorithm ZOLO-PD converges within two iterations even for ill-conditioned matrices, instead of the original six iterations needed for QDWH. ZOLO-PD uses the property of Zolotarev functions that optimality is maintained when two functions are composed in an appropriate manner. The resulting ZOLO-PD has a convergence rate up to 17, in contrast to the cubic convergence rate for QDWH. This comes at the price of higher arithmetic costs and memory footprint. These extra floating-point operations can, however, be processed in an embarrassingly parallel fashion. We demonstrate performance using up to 102,400 cores on two supercomputers. We demonstrate that, in the presence of a large number of processing units, ZOLO-PD is able to outperform QDWH by up to 2.3× speedup, especially in situations where QDWH runs out of work, for instance, in the strong scaling mode of operation.

Published in

Links

Tools

Massively Parallel Polar Decomposition on Distributed-memory Systems

Abstract