Published in

Association for Computing Machinery (ACM), ACM Transactions on Knowledge Discovery from Data, 5(16), p. 1-24, 2022

DOI: 10.1145/3511896

Links

Tools

Export citation

Search in Google Scholar

Who will Win the Data Science Competition? Insights from KDD Cup 2019 and Beyond

Journal article published in 2022 by Hao Liu ORCID, Qingyu Guo, Hengshu Zhu, Fuzhen Zhuang, Shenwen Yang, Dejing Dou, Hui Xiong
This paper was not found in any repository, but could be made available legally by the author.
This paper was not found in any repository, but could be made available legally by the author.

Full text: Unavailable

Green circle
Preprint: archiving allowed
Green circle
Postprint: archiving allowed
Red circle
Published version: archiving forbidden
Data provided by SHERPA/RoMEO

Abstract

Data science competitions are becoming increasingly popular for enterprises collecting advanced innovative solutions and allowing contestants to sharpen their data science skills. Most existing studies about data science competitions have a focus on improving task-specific data science techniques, such as algorithm design and parameter tuning. However, little effort has been made to understand the data science competition itself. To this end, in this article, we shed light on the team’s competition performance, and investigate the team’s evolving performance in the crowd-sourcing competitive innovation context. Specifically, we first acquire and construct multi-sourced datasets of various data science competitions, including the KDD Cup 2019 machine learning competition and beyond. Then, we conduct an empirical analysis to identify and quantify a rich set of features that are significantly correlated with teams’ future performances. By leveraging team’s rank as a proxy, we observe “the stronger, the stronger” rule; that is, top-ranked teams tend to keep their advantages and dominate weaker teams for the rest of the competition. Our results also confirm that teams with diversified backgrounds tend to achieve better performances. After that, we formulate the team’s future rank prediction problem and propose the Multi-Task Representation Learning (MTRL) framework to model both static features and dynamic features. Extensive experimental results on four real-world data science competitions demonstrate the team’s future performance can be well predicted by using MTRL. Finally, we envision our study will not only help competition organizers to understand the competition in a better way, but also provide strategic implications to contestants, such as guiding the team formation and designing the submission strategy.