Published in

Elsevier, Information Sciences, 10(177), p. 2167-2187

DOI: 10.1016/j.ins.2006.12.005

Links

Tools

Export citation

Search in Google Scholar

Learning to classify e-mail

Journal article published in 2007 by Irena Koprinska ORCID, Josiah Poon, James Clark, Jason Chan
This paper is available in a repository.
This paper is available in a repository.

Full text: Download

Green circle
Preprint: archiving allowed
Red circle
Postprint: archiving forbidden
Red circle
Published version: archiving forbidden
Data provided by SHERPA/RoMEO

Abstract

In this paper we study supervised and semi-supervised classification of e-mails. We consider two tasks: filing e-mails into folders and spam e-mail filtering. Firstly, in a supervised learning setting, we investigate the use of random forest for auto- matic e-mail filing into folders and spam e-mail filtering. We show that random forest is a good choice for these tasks as it runs fast on large and high dimensional databases, is easy to tune and is highly accurate, outperforming popular algo- rithms such as decision trees, support vector machines and naive Bayes. We introduce a new accurate feature selector with linear time complexity. Secondly, we examine the applicability of the semi-supervised co-training paradigm for spam e-mail filtering by employing random forests, support vector machines, decision tree and naive Bayes as base classifiers. The study shows that a classifier trained on a small set of labelled examples can be successfully boosted using unlabelled examples to accuracy rate of only 5% lower than a classifier trained on all labelled examples. We investigate the performance of co- training with one natural feature split and show that in the domain of spam e-mail filtering it can be as competitive as co-training with two natural feature splits. � 2006 Elsevier Inc. All rights reserved.