Elsevier, Information Sciences, 10(177), p. 2167-2187
DOI: 10.1016/j.ins.2006.12.005
In this paper we study supervised and semi-supervised classification of e-mails. We consider two tasks: filing e-mails into folders and spam e-mail filtering. Firstly, in a supervised learning setting, we investigate the use of random forest for automatic e-mail filing into folders and spam e-mail filtering. We show that random forest is a good choice for these tasks as it runs fast on large and high-dimensional databases, is easy to tune and is highly accurate, outperforming popular algorithms such as decision trees, support vector machines and naive Bayes. We introduce a new accurate feature selector with linear time complexity. Secondly, we examine the applicability of the semi-supervised co-training paradigm for spam e-mail filtering by employing random forests, support vector machines, decision trees and naive Bayes as base classifiers. The study shows that a classifier trained on a small set of labelled examples can be successfully boosted using unlabelled examples to an accuracy only 5% lower than that of a classifier trained on all labelled examples. We investigate the performance of co-training with one natural feature split and show that in the domain of spam e-mail filtering it can be as competitive as co-training with two natural feature splits. © 2006 Elsevier Inc. All rights reserved.
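The co-training paradigm mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the tiny naive Bayes class, the function names (`fit_view`, `co_train`, `predict`), the toy e-mails, and the choice of header tokens and body tokens as the two views are all assumptions made for illustration. Each round, a classifier trained on one view labels the unlabelled example it is most confident about and adds it to the shared labelled set.

```python
import math
from collections import Counter


class NaiveBayes:
    """Tiny multinomial naive Bayes over token lists (Laplace smoothing).
    An illustrative stand-in for a base classifier, not the paper's code."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for tokens, y in zip(docs, labels):
            self.counts[y].update(tokens)
        self.vocab = {t for c in self.classes for t in self.counts[c]}
        return self

    def predict_proba(self, tokens):
        # Score each class in log space, then normalise to probabilities.
        n = sum(self.prior.values())
        scores = {}
        for c in self.classes:
            total = sum(self.counts[c].values()) + len(self.vocab)
            s = math.log(self.prior[c] / n)
            for t in tokens:
                if t in self.vocab:  # tokens never seen in training are skipped
                    s += math.log((self.counts[c][t] + 1) / total)
            scores[c] = s
        m = max(scores.values())
        exp = {c: math.exp(s - m) for c, s in scores.items()}
        z = sum(exp.values())
        return {c: v / z for c, v in exp.items()}


def fit_view(labelled, view):
    """Train a classifier on a single view of the labelled examples."""
    return NaiveBayes().fit([x[view] for x, _ in labelled],
                            [y for _, y in labelled])


def co_train(labelled, unlabelled, rounds=5, per_round=1):
    """labelled: [((view0, view1), label)]; unlabelled: [(view0, view1)]."""
    labelled, pool = list(labelled), list(unlabelled)
    for _ in range(rounds):
        for view in (0, 1):
            if not pool:
                break
            clf = fit_view(labelled, view)
            scored = []
            for x in pool:
                proba = clf.predict_proba(x[view])
                label = max(proba, key=proba.get)
                scored.append((proba[label], label, x))
            # Move the most confidently labelled examples into the labelled set.
            scored.sort(key=lambda s: s[0], reverse=True)
            for _conf, label, x in scored[:per_round]:
                labelled.append((x, label))
                pool.remove(x)
    return labelled, [fit_view(labelled, v) for v in (0, 1)]


def predict(classifiers, example):
    """Combine the two view classifiers by multiplying class probabilities."""
    combined = {}
    for view, clf in enumerate(classifiers):
        for label, p in clf.predict_proba(example[view]).items():
            combined[label] = combined.get(label, 1.0) * p
    return max(combined, key=combined.get)


# Toy data (hypothetical): view 0 = header tokens, view 1 = body tokens.
seed = [((["bank", "offer"], ["win", "money", "now"]), "spam"),
        ((["alice", "team"], ["meeting", "agenda", "monday"]), "ham")]
pool = [(["offer", "promo"], ["win", "prize", "money"]),
        (["team", "bob"], ["agenda", "notes", "meeting"])]

final_labelled, classifiers = co_train(seed, pool)
print(predict(classifiers, (["promo", "offer"], ["prize", "money"])))
```

In this sketch the two view classifiers share one growing labelled set, which is the simplest form of co-training; a real experiment would use a held-out test set and a stronger base learner such as a random forest.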