High-Reproducibility and High-Accuracy Method for Automated Topic Classification

Lancichinetti, Andrea; Sirer, M. Irmak; Wang, Jane X.; Acuna, Daniel; Körding, Konrad; Amaral, Luís A. Nunes

Published in

American Physical Society, Physical Review X, 5(1), 2015

DOI: 10.1103/physrevx.5.011007

Journal article published in 2015 by Andrea Lancichinetti, M. Irmak Sirer, Jane X. Wang, Daniel Acuna, Konrad Körding, Luís A. Nunes Amaral

This paper is available in a repository.

Preprint: archiving allowed

Postprint: archiving allowed

Published version: archiving allowed

Abstract

Much of human knowledge sits in large databases of unstructured text. Leveraging this knowledge requires algorithms that extract and record metadata on unstructured text documents. Assigning topics to documents will enable intelligent searching, statistical characterization, and meaningful classification. Latent Dirichlet allocation (LDA) is the state of the art in topic modeling. Here, we perform a systematic theoretical and numerical analysis that demonstrates that current optimization techniques for LDA often yield results that are not accurate in inferring the most suitable model parameters. Adapting approaches from community detection in networks, we propose a new algorithm that displays high reproducibility and high accuracy and also has high computational efficiency. We apply it to a large set of documents in the English Wikipedia and reveal its hierarchical structure.
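As a rough illustration of the network view of topics mentioned in the abstract, the sketch below links words that co-occur in several documents and reads each connected group of words as a candidate topic. It is not the authors' algorithm: connected components stand in for a real community-detection method, and the function names, the `min_count` threshold, and the toy corpus are all assumptions made for this example. Because the procedure is deterministic, it returns the same grouping regardless of document order, which is the kind of reproducibility the abstract is concerned with.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(docs, min_count=2):
    """Link two words if they appear together in at least min_count documents.

    Hypothetical helper: the threshold is an illustrative choice,
    not a parameter taken from the paper.
    """
    pair_counts = Counter()
    for doc in docs:
        words = sorted(set(doc.lower().split()))
        pair_counts.update(combinations(words, 2))
    graph = {}
    for (a, b), n in pair_counts.items():
        if n >= min_count:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph


def connected_components(graph):
    """Group words into connected components (a crude stand-in for communities)."""
    seen, components = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(graph[node] - comp)
        seen |= comp
        components.append(sorted(comp))
    return components


# Toy corpus, invented for this example.
docs = [
    "neuron synapse cortex",
    "neuron cortex brain",
    "synapse neuron brain",
    "galaxy star orbit",
    "star orbit planet",
    "galaxy star planet",
]

topics = connected_components(cooccurrence_graph(docs))
for words in topics:
    print(words)
```

On real text, a weighted graph and a proper community-detection algorithm would be needed; the sketch only shows how a word network can separate vocabularies that rarely mix.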
