MISSEL: a method to identify a large number of small species-specific genomic subsequences and its application to viruses classification

Fiscon, Giulia; Weitschek, Emanuel; Cella, Eleonora; Lo Presti, Alessandra; Giovanetti, Marta; Babakir-Mina, Muhammed; Ciotti, Marco; Ciccozzi, Massimo; Pierangeli, Alessandra; Bertolazzi, Paola; Felici, Giovanni

Published in

BioMed Central, BioData Mining, 1(9), 2016

DOI: 10.1186/s13040-016-0116-2


Journal article published in 2016 by Giulia Fiscon, Emanuel Weitschek, Eleonora Cella, Alessandra Lo Presti, Marta Giovanetti, Muhammed Babakir-Mina, Marco Ciotti, Massimo Ciccozzi, Alessandra Pierangeli, Paola Bertolazzi, Giovanni Felici

This paper is made freely available by the publisher.

Archiving allowed for the preprint, the postprint, and the published version.

Abstract

Background: Continuous improvements in next generation sequencing technologies have led to ever-increasing collections of genomic sequences, which have not been easily characterized by biologists, and whose analysis requires huge computational effort. The classification of species has emerged as one of the main applications of DNA analysis and has been addressed with several approaches, e.g., multiple-alignment, phylogenetic-tree, statistical, and character-based methods.

Results: We propose a supervised method based on a genetic algorithm to identify small genomic subsequences that discriminate among different species. The method identifies multiple subsequences of bounded length with the same information power in a given genomic region. The algorithm has been successfully evaluated through its integration into a rule-based classification framework and applied to three different biological data sets: Influenza, Polyoma, and Rhino virus sequences.

Conclusions: We discover a large number of small subsequences that can be used to identify each virus type with high accuracy and low computational time, and that moreover help to characterize different genomic regions. Bounding their length to 20, our method found 1164 characterizing subsequences for all the Influenza virus subtypes, 194 for all the Polyoma viruses, and 11 for Rhino viruses. The abundance of small separating subsequences extracted for each genomic region may be an important support for quick and robust virus identification. Finally, useful biological information can be derived from the relative location and abundance of such subsequences along the different regions.
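To make the notion of a species-specific subsequence concrete, the following is a minimal sketch and not the MISSEL algorithm itself: it replaces the genetic algorithm described in the abstract with a brute-force enumeration of fixed-length subsequences, which is only feasible for tiny inputs. The function names (`kmers`, `specific_kmers`, `classify`), the class labels, and the toy sequences are all hypothetical and invented for illustration; they are not data or code from the paper.

```python
from typing import Dict, List, Optional, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def specific_kmers(classes: Dict[str, List[str]], k: int) -> Dict[str, Set[str]]:
    """For each class, keep the k-mers found in every sequence of that
    class and in no sequence of any other class (brute force, an
    assumption standing in for the paper's genetic algorithm)."""
    shared: Dict[str, Set[str]] = {}
    seen: Dict[str, Set[str]] = {}
    for label, seqs in classes.items():
        sets = [kmers(s, k) for s in seqs]
        shared[label] = set.intersection(*sets)  # present in all members
        seen[label] = set.union(*sets)           # present in any member
    result: Dict[str, Set[str]] = {}
    for label in classes:
        others = set().union(*(seen[o] for o in classes if o != label))
        result[label] = shared[label] - others
    return result


def classify(seq: str, signatures: Dict[str, Set[str]]) -> Optional[str]:
    """Rule-based prediction: IF seq contains a signature of exactly one
    class THEN return that class, otherwise return None."""
    hits = [label for label, sigs in signatures.items()
            if any(s in seq for s in sigs)]
    return hits[0] if len(hits) == 1 else None


# Toy sequences, invented for illustration only.
toy = {
    "type_A": ["ACGTACGGTA", "TTACGGTACC"],
    "type_B": ["ACGTTTGCAA", "GGTTTGCAAC"],
}
signatures = specific_kmers(toy, k=5)
print(classify("CCTACGGTAA", signatures))  # prints: type_A
```

Each class can end up with several equally discriminating subsequences, which mirrors the abstract's point that multiple subsequences with the same information power can exist in one region. The sketch fixes the length at `k`, whereas the abstract describes a bound on length (20 in the reported experiments).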
