Learning Signals in Genomic Sequence Alignments for Identification of Functional Elements

Book published in 2006 by James Taylor
This paper was not found in any repository; the policy of its publisher is unknown or unclear.

Full text: Unavailable

Preprint: policy unknown
Postprint: policy unknown
Published version: policy unknown

Abstract

The structure of genomes, how they encode function, and how they evolve are still quite mysterious. Even the best understood functional elements, the regions that code for proteins, are far from exhaustively annotated. Other functional elements, such as the cis-regulatory modules that control gene transcription, are even more poorly understood. Comparisons between the genomes of different species can be a useful tool for understanding the structure of these elements and improving our ability to identify them. However, such comparisons also raise new questions, as we observe regions with distinctly atypical evolutionary patterns but no clear relationship to any known function. Other sequence signals, such as base composition and specific motifs, are also useful for identifying functional regions, but the specific signals to use for identifying a given class of elements are not always obvious. When training data for a class of functional elements is available, applying a machine learning method to learn the relevant sequence and evolutionary patterns has the potential to better identify functional elements. In this work we describe a computational method, called ESPERR (Evolutionary and Sequence Pattern Extraction through Reduced Representations), which uses training examples to learn encodings of multi-genome alignments into a reduced form for predicting a chosen class of functional elements. We show that ESPERR gives excellent performance on several problems. We first describe using ESPERR to discriminate between two classes of regions, with particular focus on discriminating cis-regulatory regions from neutral DNA, producing a score called Regulatory Potential that has excellent predictive power. We also consider additional pairwise discrimination problems: identifying DNase I hypersensitive sites using training data produced by the ENCODE project, and screening highly conserved regions for developmental enhancer activity using training data from the VISTA Enhancer Browser.
We also demonstrate the flexibility in the ESPERR procedure with respect to the type of problem addressed by showing a generalization to multi-class classification: predicting whether cDNA 5