Dynamic and Fault-Tolerant Clustering for Scientific Workflows

Chen, Weiwei; da Silva, Rafael Ferreira; Deelman, Ewa; Fahringer, Thomas

Published in

IEEE Transactions on Cloud Computing (Institute of Electrical and Electronics Engineers), 4(1), pp. 49-62, 2016

DOI: 10.1109/tcc.2015.2427200

Journal article published in 2015 by Weiwei Chen, Rafael Ferreira da Silva, Ewa Deelman, Thomas Fahringer

This paper is made freely available by the publisher.

Preprint: archiving allowed

Postprint: archiving allowed

Published version: archiving forbidden

Abstract

Task clustering has proven to be an effective method to reduce execution overhead and to improve the computational granularity of scientific workflow tasks executing on distributed resources. However, a job composed of multiple tasks may have a higher risk of failure than a single-task job. In this paper, we conduct a theoretical analysis of the impact of transient failures on the runtime performance of scientific workflow executions. We propose a general task failure modeling framework that models workflow performance using maximum likelihood estimation (MLE) of its parameters. We further propose three fault-tolerant clustering strategies to improve the runtime performance of workflow executions in faulty execution environments. Experimental results show that failures can have a significant impact on executions where task clustering policies are not fault-tolerant, and that our solutions improve the makespan in such scenarios. In addition, we propose a dynamic task clustering strategy that optimizes the workflow’s makespan by adjusting the clustering granularity at runtime when failures arise. A trace-based simulation of five real workflows shows that our dynamic method adapts to unexpected behaviors and yields better makespans than static methods.
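The trade-off described above (larger clusters amortize per-job overhead but make each job more likely to be hit by a failure) can be illustrated with a small model. This is a minimal sketch, not the paper's framework: it assumes failures arrive as a Poisson process (exponential inter-failure times) whose rate is fitted by maximum likelihood, that a failed job is retried in full, and that every attempt is charged the job's full duration. All function names (`mle_failure_rate`, `expected_job_time`, `best_cluster_size`, `recluster`) are hypothetical. Re-running the choice whenever a new failure interval is observed mimics, in a very simplified way, dynamically adjusting the clustering granularity.

```python
import math


def mle_failure_rate(inter_failure_times):
    """MLE of the rate of an exponential distribution: n / sum(intervals).

    Assumption: inter-failure times are i.i.d. exponential.
    """
    if not inter_failure_times:
        raise ValueError("need at least one observed inter-failure interval")
    return len(inter_failure_times) / sum(inter_failure_times)


def expected_job_time(k, task_runtime, overhead, rate):
    """Expected wall time to finish one job of k clustered tasks.

    Assumptions (simplified, not the paper's model): the job lasts
    d = overhead + k * task_runtime; it fails if any failure arrives during d;
    a failed job is retried from scratch; each attempt is charged the full d.
    Attempts are then geometric with success probability exp(-rate * d),
    so the expected time is d / exp(-rate * d) = d * exp(rate * d).
    """
    d = overhead + k * task_runtime
    return d * math.exp(rate * d)


def best_cluster_size(n_tasks, task_runtime, overhead, rate):
    """Pick the cluster size k in [1, n_tasks] minimizing expected time per task."""
    return min(
        range(1, n_tasks + 1),
        key=lambda k: expected_job_time(k, task_runtime, overhead, rate) / k,
    )


def recluster(remaining_tasks, observed_intervals, task_runtime, overhead):
    """Hypothetical dynamic step: re-fit the failure rate, then re-pick k."""
    rate = mle_failure_rate(observed_intervals)
    return best_cluster_size(remaining_tasks, task_runtime, overhead, rate)


if __name__ == "__main__":
    # Illustrative numbers only: 20 tasks of 10 s each, 30 s overhead per job.
    intervals = [120.0, 80.0, 100.0]  # made-up seconds between failures
    print(mle_failure_rate(intervals))        # about 0.01 failures per second
    print(recluster(20, intervals, 10.0, 30.0))
```

With these made-up inputs the fitted rate is about 0.01 failures per second and the search merges 4 tasks per job. With no failures (rate 0) the per-task cost `overhead / k + task_runtime` keeps falling, so all 20 tasks end up in one job; at a rate of 0.05 the choice drops to single-task jobs. That shift is the qualitative behavior the abstract motivates, under the stated simplifying assumptions.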