HAIPipe: Combining Human-generated and Machine-generated Pipelines for Data Preparation

Chen, Sibei; Tang, Nan; Fan, Ju; Yan, Xuemi; Chai, Chengliang; Li, Guoliang; Du, Xiaoyong

Published in

Proceedings of the ACM on Management of Data, 1(1), p. 1-26, 2023

DOI: 10.1145/3588945

Journal article published in 2023 by Sibei Chen, Nan Tang, Ju Fan, Xuemi Yan, Chengliang Chai, Guoliang Li, Xiaoyong Du


Abstract

Data preparation is crucial for achieving good results in machine learning (ML). However, building a good data preparation pipeline is highly non-trivial for ML practitioners, because a good pipeline is not only domain-specific but also dataset-specific. There are two common practices. Human-generated pipelines (HI-pipelines) typically draw on a wide range of operations and libraries but are highly experience- and heuristic-based. In contrast, machine-generated pipelines (AI-pipelines), a.k.a. AutoML, often adopt a predefined set of sophisticated operations and are search-based and optimized. These two practices are complementary. In this paper, we study a new problem: given an HI-pipeline and an AI-pipeline for the same ML task, can we combine them into a new pipeline (HAI-pipeline) that is better than both the provided HI-pipeline and the provided AI-pipeline? We propose HAIPipe, a framework to address this problem, which adopts an enumeration-sampling strategy to carefully select the best-performing combined pipeline. We also introduce a reinforcement learning (RL) based approach to search for an optimized AI-pipeline. Extensive experiments using 1400+ real-world HI-pipelines (Jupyter notebooks from Kaggle) verify that HAIPipe can significantly outperform approaches that use either HI-pipelines or AI-pipelines alone.
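
The abstract only names the enumeration-sampling strategy, so the following Python sketch is a toy illustration of the general idea and not the HAIPipe implementation. It assumes that a pipeline is an ordered list of named operations, that candidates are order-preserving interleavings of the two pipelines, and that a hypothetical `toy_score` function stands in for downstream model accuracy. All operation names, the data, and the scoring rule are invented for illustration.

```python
import math
import random
import statistics
from itertools import combinations

def drop_negative(xs):
    return [x for x in xs if x >= 0]

def log1p_all(xs):
    return [math.log1p(x) for x in xs]

def clip_0_10(xs):
    return [min(max(x, 0.0), 10.0) for x in xs]

def min_max_scale(xs):
    if not xs:
        return xs
    low, high = min(xs), max(xs)
    if high == low:
        return [0.0] * len(xs)
    return [(x - low) / (high - low) for x in xs]

# Hypothetical pipelines: HI is hand-written, AI comes from a search tool.
HI = [("drop_negative", drop_negative), ("log1p", log1p_all)]
AI = [("clip", clip_0_10), ("scale", min_max_scale)]
DATA = [1.0, -2.0, 3.0, 50.0, 4.0]

def interleavings(hi_ops, ai_ops):
    """Enumerate every merge that keeps each pipeline's internal order."""
    n, m = len(hi_ops), len(ai_ops)
    for hi_slots in combinations(range(n + m), n):
        slots = set(hi_slots)
        hi_iter, ai_iter = iter(hi_ops), iter(ai_ops)
        yield [next(hi_iter) if i in slots else next(ai_iter)
               for i in range(n + m)]

def run(pipeline, data):
    for _name, op in pipeline:
        data = op(data)
    return data

def toy_score(values):
    # Assumption: a stand-in for validation accuracy that simply
    # prefers symmetric value distributions (mean close to median).
    return -abs(statistics.mean(values) - statistics.median(values))

def select_best(hi_ops, ai_ops, data, score, sample_size, seed=0):
    """Sample enumerated candidates, score them, and keep the best one."""
    rng = random.Random(seed)
    candidates = list(interleavings(hi_ops, ai_ops))
    sampled = rng.sample(candidates, min(sample_size, len(candidates)))
    # The two input pipelines stay in the pool as baselines.
    pool = [hi_ops, ai_ops] + sampled
    return max(pool, key=lambda p: score(run(p, data)))

best = select_best(HI, AI, DATA, toy_score, sample_size=4)
print([name for name, _op in best])
```

Because both input pipelines remain in the candidate pool, the selected pipeline in this sketch can never score below either of them under the chosen scoring function. Sampling keeps the number of evaluations bounded when the number of interleavings grows large.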
