Web Image Classification for Information Extraction
We describe an approach to classifying images found on the WWW for the purpose of information extraction (IE). The features used for classification include image size, colour histograms, and the similarity of the classified image's content to images in a training collection. Our content similarity metric is based on latent semantic indexing (LSI). Results are presented for a collection of 1624 image occurrences found on bicycle shop websites, where the task is to distinguish bicycle images from the rest.
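To make the colour-histogram and content-similarity features concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes images are already resized to a common shape and stored as NumPy arrays, that the LSI space is obtained from a truncated SVD of the pixels-by-images matrix, and that similarity is measured by the cosine in that space. The function names, the bin count, and the rank `k` are hypothetical choices made for illustration.

```python
import numpy as np


def colour_histogram(image, bins=8):
    """Normalised per-channel histogram of an (H, W, 3) uint8 image.

    The bin count is an illustrative assumption.
    """
    channels = [
        np.histogram(image[..., c], bins=bins, range=(0, 256))[0]
        for c in range(3)
    ]
    hist = np.concatenate(channels).astype(float)
    return hist / hist.sum()


def fit_lsi(training_images, k):
    """Build a rank-k latent space from equally sized training images.

    Each image is flattened into one column of the matrix A, which is
    then factorised by SVD (assumed here to be the basis of the LSI metric).
    """
    A = np.stack([img.ravel().astype(float) for img in training_images], axis=1)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k = U[:, :k]
    # Coordinates of every training image in the latent space: (n_images, k).
    train_coords = (s[:k, None] * Vt[:k, :]).T
    return U_k, train_coords


def lsi_similarity(image, U_k, train_coords):
    """Cosine similarity between a query image and each training image."""
    q = U_k.T @ image.ravel().astype(float)  # project query into latent space
    numerator = train_coords @ q
    denominator = np.linalg.norm(train_coords, axis=1) * np.linalg.norm(q)
    return numerator / np.maximum(denominator, 1e-12)
```

A classifier could then combine the image width and height, the histogram vector, and a summary of the similarity scores (for example the maximum over the known bicycle images) into one feature vector. How the features are actually combined is not specified in the text above.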