Scaffolding low quality genomes using orthologous protein sequences

Journal article published in 2012 by Yang I. Li, Richard R. Copley
This paper is available in a repository.

Abstract

Motivation: The ready availability of next-generation sequencing has made it easy to produce very fragmentary genome assemblies. We present a pipeline, SWiPS (Scaffolding With Protein Sequences), that uses orthologous proteins to improve low quality genome assemblies. The protein sequences are used as guides to scaffold existing contigs, while simultaneously allowing the gene structure to be predicted by homology.

Results: SWiPS does not depend on a high N50 or on whole proteins being encoded on a single contig. We tested our algorithm on simulated next-generation data from Ciona intestinalis, real next-generation data from Drosophila melanogaster, a complex genome assembly of Homo sapiens and the low coverage Sanger sequence assembly of Callorhinchus milii. The improvements in N50 are of the order of 20% for the C.intestinalis and H.sapiens assemblies, which is significant considering the large size of intergenic regions in these eukaryotes. Using the CEGMA pipeline to assess the gene space represented in the genome assemblies, the number of genes retrieved increased by >110% for C.milii and from 20 to 40% for C.intestinalis. The scaffold error rates are low: 85–90% of scaffolds are fully correct, and >95% of local contig joins are correct.
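For readers unfamiliar with the N50 statistic quoted above, the following is a minimal sketch of how it is commonly computed: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The function name `n50` and the example lengths are hypothetical and are not taken from SWiPS or from the paper's data. The example also assumes that joining two contigs yields a scaffold whose length is simply their sum, ignoring gap padding.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths in base pairs (illustration only).
contigs = [500, 400, 300, 200, 100]

# Assumption: a scaffolder joins the 400 bp and 300 bp contigs into one
# 700 bp scaffold; gap sizes are ignored in this toy example.
scaffolds = [700, 500, 200, 100]

print(n50(contigs), n50(scaffolds))
```

In this toy case the total assembly length is unchanged, yet the N50 rises because fewer, longer sequences are needed to cover half of it. This is the sense in which scaffolding can improve N50 without adding any new sequence.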