Imbalanced text classification: A term weighting approach

Liu, Ying; Loh, Han Tong; Sun, Aixin

Published in

Elsevier, Expert Systems with Applications, 1(36), p. 690-701

DOI: 10.1016/j.eswa.2007.10.042


Journal article published in 2009 by Ying Liu, Han Tong Loh, Aixin Sun


Abstract

The natural distribution of textual data used in text classification is often imbalanced. Categories with fewer examples are under-represented, and their classifiers often perform far below a satisfactory level. We tackle this problem using a simple probability-based term weighting scheme to better distinguish documents in minor categories. This new scheme directly utilizes two critical information ratios, i.e. relevance indicators. These relevance indicators are supported by probability estimates that embody category membership. Our experimental study, using both Support Vector Machines and Naïve Bayes classifiers and an extensive comparison with other classic weighting schemes over two benchmark data sets, including Reuters-21578, shows significant improvement for minor categories, while the performance for major categories is not jeopardized. Our approach suggests a simple and effective solution to boost the performance of text classification over skewed data sets.

Department of Industrial and Systems Engineering
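The abstract does not spell out the two relevance ratios, so the following is only a minimal sketch of what a probability-based term weight of this kind could look like. It assumes the ratios are A/B and A/C, where A is the number of documents in the category that contain the term, B the number of documents outside the category that contain it, and C the number of documents in the category that lack it, and it assumes they are combined as tf * ln(1 + (A/B) * (A/C)). The function names `count_abc` and `prob_based_weight`, the flooring of zero denominators to 1, and the toy corpus are all illustrative assumptions, not details taken from this record.

```python
import math


def count_abc(docs, labels, term, category):
    """Count A, B, C for one (term, category) pair.

    docs   : list of token sets, one per document
    labels : list of category labels, aligned with docs
    A = in-category documents containing the term
    B = out-of-category documents containing the term
    C = in-category documents NOT containing the term
    """
    a = b = c = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_category = label == category
        if has_term and in_category:
            a += 1
        elif has_term:
            b += 1
        elif in_category:
            c += 1
    return a, b, c


def prob_based_weight(tf, a, b, c):
    """Assumed weight: tf * ln(1 + (A/B) * (A/C)).

    Zero denominators are floored to 1 to avoid division by zero;
    that smoothing choice is an assumption made for this sketch.
    """
    ratio_outside = a / max(b, 1)  # A/B: term favours the category over the rest
    ratio_missing = a / max(c, 1)  # A/C: term covers most of the category
    return tf * math.log(1.0 + ratio_outside * ratio_missing)


# Toy corpus: "grain" is the minor category (2 of 5 documents).
docs = [
    {"wheat", "price"},
    {"wheat"},
    {"oil", "price"},
    {"oil"},
    {"oil", "wheat"},
]
labels = ["grain", "grain", "crude", "crude", "crude"]

wheat_weight = prob_based_weight(1, *count_abc(docs, labels, "wheat", "grain"))
price_weight = prob_based_weight(1, *count_abc(docs, labels, "price", "grain"))
```

Under these assumptions, a term concentrated in the minor category ("wheat") receives a larger weight than a term spread evenly across categories ("price"), which is the behaviour the abstract attributes to the scheme.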
