Updating and extending the concept annotations of the CRAFT corpus

Abstract

With the ever-rising amount of biomedical literature, it is increasingly difficult for scientists to keep up with the published work in their fields of research, much less related ones. The use of natural language processing (NLP) tools can make the literature more accessible by aiding concept recognition and information extraction. As NLP-based approaches have been increasingly used for biocuration, so too have biomedical ontologies, whose use enables semantic integration across disparate curated resources, and millions of biomedical entities have been annotated with them. Particularly important are the Open Biomedical Ontologies (OBOs), a set of open, orthogonal, interoperable ontologies formally representing knowledge over a wide range of biology, medicine, and related disciplines. Manually annotated document corpora have become critical gold-standard resources for the training and testing of biomedical NLP systems. This was the motivation for the creation of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access journal articles from the biomedical literature. Within these articles, each mention of the concepts explicitly represented in eight prominent OBOs has been annotated, resulting in gold-standard markup of genes and gene products, chemicals and molecular entities, biomacromolecular sequence features, cells and cellular and extracellular components and locations, organisms, biological processes and molecular functionalities. With these ~100,000 concept annotations among the ~800,000 words in the 67 articles of the 1.0 release, it is one of the largest gold-standard biomedical semantically annotated corpora. In addition to this substantial conceptual markup, the corpus is fully annotated along a number of syntactic and other axes, notably by sentence segmentation, tokenization, part-of-speech tagging, syntactic parsing, text formatting, and document sectioning.
In the several years since its initial release, the CRAFT Corpus has been used within our group and in collaboration with others, including for the first comprehensive gold-standard evaluation of current prominent concept-recognition systems, and it has already been used in multiple external projects to drive development of higher-performing systems. Here we present our continuing work on the corpus along several fronts. First, to keep the corpus relevant, we are updating the concept annotations using newer versions of the ontologies already used to mark up the articles, removing annotations of obsoleted classes and editing previous annotations or creating new annotations of newly added classes. Additionally, to extend the domain of annotated concept types, we are also marking up mentions of concepts using the Molecular Process Ontology (for types of chemical processes) and the Uberon Anatomy Ontology (for anatomical components and life-cycle stages). Finally, to capture even more content, we are generating new annotations for roots of prefixed/suffixed words as well as annotations made with extension classes we have created. We will present updated annotation counts and interannotator agreement statistics for these continuing efforts, as well as our future plans. All of this work is designed to further increase the potential of the CRAFT Corpus to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems.