Published in

BioMed Central, Genome Biology, 9(2), p. R31

DOI: 10.1186/gb-2008-9-2-r31

Text-mining assisted regulatory annotation

This paper is made freely available by the publisher.

Preprint: archiving allowed
Postprint: archiving allowed
Published version: archiving allowed
Data provided by SHERPA/RoMEO

Abstract

Background

Decoding transcriptional regulatory networks and the genomic cis-regulatory logic implemented in their control nodes is a fundamental challenge in genome biology. High-throughput computational and experimental analyses of regulatory networks and sequences rely heavily on positive control data from prior small-scale experiments, but the vast majority of previously discovered regulatory data remains locked in the biomedical literature.

Results

We develop text-mining strategies to identify relevant publications and extract sequence information to assist the regulatory annotation process. Using a vector space model to identify Medline abstracts from papers likely to have high cis-regulatory content, we demonstrate that document relevance ranking can assist the curation of transcriptional regulatory networks and estimate that, minimally, 30,000 papers harbor unannotated cis-regulatory data. In addition, we show that DNA sequences can be extracted from primary text with high cis-regulatory content and mapped to genome sequences as a means of identifying the location, organism and target gene information that is critical to the cis-regulatory annotation process.

Conclusion

Our results demonstrate that text-mining technologies can be successfully integrated with genome annotation systems, thereby increasing the availability of annotated cis-regulatory data needed to catalyze advances in the field of gene regulation.
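The two text-mining steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the term weighting or the scoring rule, so TF-IDF weights and cosine similarity to the centroid of a few known relevant abstracts are assumed here, and the function names (`rank_by_relevance`, `extract_dna`) and the 12-base minimum length are hypothetical choices made for illustration. Mapping the extracted sequences to a genome would require an alignment tool and is not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict term -> weight) per document."""
    token_lists = [tokenize(d) for d in docs]
    n_docs = len(token_lists)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in token_lists for term in set(tokens))
    # Smoothed inverse document frequency (an assumed weighting scheme).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for tokens in token_lists
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_by_relevance(seed_abstracts, candidates):
    """Rank candidates by similarity to the centroid of the seed abstracts."""
    vectors = tfidf_vectors(seed_abstracts + candidates)
    seeds = vectors[: len(seed_abstracts)]
    centroid = Counter()
    for vec in seeds:
        for term, weight in vec.items():
            centroid[term] += weight / len(seeds)
    scored = [
        (cosine(centroid, vec), doc)
        for vec, doc in zip(vectors[len(seeds):], candidates)
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def extract_dna(text, min_len=12):
    """Find uninterrupted runs of A/C/G/T of at least min_len characters."""
    pattern = re.compile(r"[ACGTacgt]{%d,}" % min_len)
    return [match.upper() for match in pattern.findall(text)]


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    seeds = [
        "the enhancer binds transcription factor sites upstream of the promoter",
        "a cis regulatory enhancer drives reporter expression",
    ]
    candidates = [
        "deletion of the enhancer abolished promoter reporter expression",
        "patients received the drug for twelve weeks",
    ]
    for score, doc in rank_by_relevance(seeds, candidates):
        print(round(score, 3), doc)
    print(extract_dna("The probe 5'-GGGACTTTCCGCTGGGGACTTTCC-3' was used."))
```

A curator would read the ranked list from the top, so the ordering matters more than the absolute scores. The sequence pattern only catches uninterrupted runs; sequences printed with spaces or line breaks would need extra normalization.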