Association for Computing Machinery (ACM), ACM Transactions on Software Engineering and Methodology, 6(32), p. 1-34, 2023
DOI: 10.1145/3593802
Full text: Unavailable
Upon receiving a new issue report, practitioners start by investigating the defect type, the potential fixing effort needed to resolve the defect and the change impact. Moreover, issue reports contain valuable information, such as, the title, description and severity, and researchers leverage the topics of issue reports as a collective metric portraying similar characteristics of a defect. Nonetheless, none of the existing studies leverage the defect topic, i.e., a semantic cluster of defects of the same nature, such as Performance, GUI, and Database , to estimate the change impact that represents the amount of change needed in terms of code churn and the number of files changed. To this end, in this article, we conduct an empirical study on 298,548 issue reports belonging to three large-scale open-source systems, i.e., Mozilla, Apache, and Eclipse, to estimate the change impact in terms of code churn or the number of files changed while leveraging the topics of issue reports. First, we adopt the Embedded Topic Model (ETM), a state-of-the-art topic modelling algorithm, to identify the topics. Second, we investigate the feasibility of predicting the change impact using the identified topics and other information extracted from the issue reports by building eight prediction models that classify issue reports requiring small or large change impact along two dimensions, i.e., the code churn size and the number of files changed. Our results suggest that XGBoost is the best-performing algorithm for predicting the change impact, with an AUC of 0.84, 0.76, and 0.73 for the code churn and 0.82, 0.71, and 0.73 for the number of files changed metric for Mozilla, Apache, and Eclipse, respectively. Our results also demonstrate that the topics of issue reports improve the recall of the prediction model by up to 45%.