Published in

Oxford University Press (OUP), Bioinformatics, 10(28), p. 1314-1323

DOI: 10.1093/bioinformatics/bts121

Links

Tools

Export citation

Search in Google Scholar

Probabilistic Suffix Array: Efficient Modeling and Prediction of Protein Families

Journal article published in 2012 by Jie Lin, Donald Adjeroh, Bing-Hua Jiang ORCID
This paper is made freely available by the publisher.
This paper is made freely available by the publisher.

Full text: Download

Green circle
Preprint: archiving allowed
Green circle
Postprint: archiving allowed
Red circle
Published version: archiving forbidden
Data provided by SHERPA/RoMEO

Abstract

MOTIVATION: Markov models are very popular for analyzing complex sequences such as protein sequences, whose sources are unknown, or whose underlying statistical characteristics are not well understood. A major problem is the computational complexity involved with using Markov models, especially the exponential growth of their size with the order of the model. The probabilistic suffix tree (PST) and its improved variant sparse probabilistic suffix tree (SPST) have been proposed to address some of the key problems with Markov models. The use of the suffix tree, however, implies that the space requirement for the PST/SPST could still be high. RESULTS: We present the probabilistic suffix array (PSA), a data structure for representing information in variable length Markov chains. The PSA essentially encodes information in a Markov model by providing a time and space-efficient alternative to the PST/SPST. Given a sequence of length N, construction and learning in the PSA is done in O(N) time and space, independent of the Markov order. Prediction using the PSA is performed in O(mlog{N is divided by /Σ/}) time, where m is the pattern length, and Σ is the symbol alphabet. In terms of modeling and prediction accuracy, using protein families from Pfam 25.0, SPST and PSA produced similar results (SPST 89.82%, PSA 89.56%), but slightly lower than HMMER3 (92.55%). A modified algorithm for PSA prediction improved the performance to 91.7%, or just 0.79% from HMMER3 results. The average (maximum) practical construction space for the protein families tested was 21.58±6.32N (41.11N) bytes using the PSA, 27.55±13.16N (63.01N) bytes using SPST and 47±24.95N (140.3N) bytes for HMMER3. The PSA was 255 times faster to construct than the SPST, and 11 times faster than HMMER3.