Probabilistic base calling of Solexa sequencing data

Rougemont, Jacques; Amzallag, Arnaud; Iseli, Christian; Farinelli, Laurent; Xenarios, Ioannis; Naef, Felix

Published in

BioMed Central, BMC Bioinformatics, 9(1), 2008

DOI: 10.1186/1471-2105-9-431


Journal article published in 2008 by Jacques Rougemont, Arnaud Amzallag, Christian Iseli, Laurent Farinelli, Ioannis Xenarios, Felix Naef

This paper is made freely available by the publisher.

Abstract

Background: Solexa/Illumina short-read ultra-high throughput DNA sequencing technology produces millions of short tags (up to 36 bases) by parallel sequencing-by-synthesis of DNA colonies. The processing and statistical analysis of such high-throughput data pose new challenges; currently a fair proportion of the tags are routinely discarded because they cannot be matched to a reference sequence, which reduces the effective throughput of the technology.

Results: We propose a novel base calling algorithm that uses model-based clustering and probability theory to identify ambiguous bases and code them with IUPAC symbols. We also select optimal sub-tags using a score based on information content to remove uncertain bases towards the ends of the reads.

Conclusion: We show that the method improves genome coverage and the number of usable tags by an average of 15% compared with Solexa's data processing pipeline. An R package is provided which allows fast and accurate base calling of Solexa's fluorescence intensity files and the production of informative diagnostic plots.
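The two ideas named in the Results, coding ambiguous bases with IUPAC symbols and trimming reads by information content, can be illustrated with a minimal Python sketch. It is not the authors' R package or their exact algorithm: it assumes the per-base probabilities (A, C, G, T) have already been estimated from the fluorescence intensities, and the inclusion rule (`cutoff`), the per-position score (2 bits minus Shannon entropy), the penalty `min_bits`, and all function names are hypothetical choices made for illustration.

```python
import math

BASES = "ACGT"

# Standard IUPAC nucleotide codes, keyed by the set of bases each one covers.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def call_base(probs, cutoff=0.25):
    """Map (P(A), P(C), P(G), P(T)) to an IUPAC symbol.

    Assumed rule: keep every base whose probability reaches `cutoff`;
    the most probable base is always kept so the set is never empty.
    """
    top = max(probs)
    kept = frozenset(b for b, p in zip(BASES, probs) if p >= cutoff or p == top)
    return IUPAC[kept]


def information_bits(probs):
    """Information content of one position: 2 bits minus Shannon entropy."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2.0 - entropy


def best_subtag(read_probs, min_bits=1.0):
    """Return (start, end) of the contiguous window with the highest score.

    Assumed score: sum over positions of (information - min_bits), so
    uncertain positions count against a window. The maximum is found with
    a single pass (Kadane's maximum-subarray algorithm).
    """
    best_score, best_span = float("-inf"), (0, 0)
    score, start = 0.0, 0
    for i, probs in enumerate(read_probs):
        gain = information_bits(probs) - min_bits
        if score <= 0:
            score, start = gain, i   # start a new window here
        else:
            score += gain            # extend the current window
        if score > best_score:
            best_score, best_span = score, (start, i + 1)
    return best_span


# Toy read: two confident calls, one A/C ambiguity, one uninformative position.
read = [
    (0.97, 0.01, 0.01, 0.01),
    (0.01, 0.01, 0.97, 0.01),
    (0.50, 0.45, 0.03, 0.02),
    (0.25, 0.25, 0.25, 0.25),
]
tag = "".join(call_base(p) for p in read)
start, end = best_subtag(read)
print(tag, tag[start:end])
```

On the toy read, the low-information positions at the 3' end are trimmed while the confident prefix is kept; raising or lowering `min_bits` makes the trimming more or less aggressive.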
