Recent years have seen major changes in the classification criteria and taxonomy of viruses. The current classification scheme, also called the “megataxonomy of viruses”, recognizes six viral realms, defined by the presence of viral hallmark genes (VHGs). Within the realms, viruses are classified into hierarchical taxa, ideally defined by the phylogeny of their shared genes. To enable the detection of shared genes, viruses must first be clustered, and there is currently a need for tools that assist with virus clustering and classification. Here, VirClust is presented. It is a novel, reference-free tool capable of performing: (i) protein clustering, based on BLASTp and Hidden Markov Model (HMM) similarities; (ii) hierarchical clustering of viruses, based on intergenomic distances calculated from their shared protein content; (iii) identification of core proteins; and (iv) annotation of viral proteins. VirClust has flexible parameters both for protein clustering and for splitting the viral genome tree into smaller genome clusters, corresponding to different taxonomic levels. Benchmarking on a phage dataset showed that the genome trees produced by VirClust match the current International Committee on Taxonomy of Viruses (ICTV) classification at the family, subfamily and genus levels. VirClust is freely available, both as a web service and as a stand-alone tool.
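To make steps (ii) and (iii) concrete, the following is a minimal illustrative sketch, not VirClust's implementation. It assumes that each genome has already been reduced to a set of protein-cluster identifiers, and it uses the Jaccard distance and average-linkage agglomerative clustering as stand-ins, since the abstract does not specify the distance formula or linkage method. All genome names, protein-cluster IDs, function names and the threshold value are hypothetical.

```python
from itertools import combinations

# Hypothetical toy input: each genome is the set of protein clusters it encodes.
genomes = {
    "phageA": {"PC1", "PC2", "PC3", "PC4"},
    "phageB": {"PC1", "PC2", "PC3", "PC5"},
    "phageC": {"PC1", "PC6", "PC7"},
    "phageD": {"PC6", "PC7", "PC8"},
}

def jaccard_distance(a, b):
    """Assumed intergenomic distance: 1 - |shared| / |union| of protein clusters."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def cluster_genomes(genomes, threshold):
    """Average-linkage agglomerative clustering.

    Repeatedly merges the two closest clusters and stops once the closest
    pair is farther apart than `threshold`, which plays the role of the
    cut that splits the genome tree into genome clusters.
    """
    clusters = [[name] for name in sorted(genomes)]

    def linkage(c1, c2):
        # Mean pairwise distance between members of the two clusters.
        dists = [jaccard_distance(genomes[x], genomes[y]) for x in c1 for y in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        (i, j), best = min(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best > threshold:
            break
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(clusters)

def core_proteins(genomes, members):
    """Protein clusters present in every genome of a genome cluster."""
    return set.intersection(*(genomes[m] for m in members))

genome_clusters = cluster_genomes(genomes, threshold=0.6)
for members in genome_clusters:
    print(members, sorted(core_proteins(genomes, members)))
```

Raising the threshold merges more genomes into each cluster and shrinks the core, which mirrors, in a simplified way, how a looser cut of the genome tree corresponds to a higher taxonomic level.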