Published in

Proceedings of the 7th Workshop on Languages, Compilers, and Run-time Support for Scalable Systems (LCR '04)

DOI: 10.1145/1066650.1066667

Runtime support for integrating precomputation and thread-level parallelism on simultaneous multithreaded processors

Conference paper published in 2004 by Tanping Wang, Filip Blagojevic, Dimitrios S. Nikolopoulos
This paper is available in a repository.

Preprint: archiving allowed
Postprint: archiving allowed
Published version: archiving forbidden
Data provided by SHERPA/RoMEO

Abstract

This paper presents runtime mechanisms that enable flexible use of speculative precomputation in conjunction with thread-level parallelism on SMT processors. The mechanisms were implemented and evaluated on a real multi-SMT system. So far, speculative precomputation and thread-level parallelism have been used disjunctively on SMT processors, and no attempts have been made to compare and possibly combine these techniques for further optimization. We present runtime support mechanisms for coordinating precomputation with its sibling computation, so that precomputation is regulated to avoid cache pollution and sufficient runahead distance is allowed from the targeted computation. We also present a task queue mechanism to orchestrate precomputation and thread-level parallelism, so that they can be used conjunctively in the same program. The mechanisms are motivated by the observation that different parts of a program may benefit from different modes of multithreaded execution. Furthermore, idle periods during TLP execution or sequential sections can be used for precomputation and vice versa. We apply the mechanisms in loop-structured scientific codes. We present experimental results that verify that no single technique (precomputation or TLP) in isolation achieves the best performance in all cases. Efficient combination of precomputation and TLP is most often the best solution.
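The idea of letting idle periods during TLP execution be spent on precomputation can be illustrated with a minimal sketch (our own illustration, not the paper's implementation; the names `run_phase`, `worker`, and `precompute` are assumptions): worker threads drain a shared queue of compute tasks, and a worker that finds the queue empty runs a precomputation callback once (e.g., warming data for the next phase) instead of spinning idle.

```python
import queue
import threading


def run_phase(compute_tasks, precompute, n_workers=4):
    """Drain compute_tasks with n_workers threads; a worker that finds
    the queue empty runs the precompute callback once, then exits."""
    q = queue.Queue()
    for t in compute_tasks:
        q.put(t)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task = q.get_nowait()
            except queue.Empty:
                # Idle period: spend it on precomputation instead of waiting.
                precompute()
                return
            r = task()
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results


# Usage: 8 small compute tasks; each idle worker records one warm-up call.
warmed = []
out = run_phase([lambda i=i: i * i for i in range(8)],
                precompute=lambda: warmed.append(True))
print(sorted(out))   # [0, 1, 4, 9, 16, 25, 36, 49]
print(len(warmed))   # 4 (one precompute call per worker)
```

The paper's actual mechanism additionally regulates the runahead distance of the precomputation thread to avoid cache pollution, which this sketch omits.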