Published in

World Scientific Publishing, International Journal of Foundations of Computer Science, 07(28), p. 889-914

DOI: 10.1142/s0129054117500307

Links

Tools

Export citation

Search in Google Scholar

IDPM: An Improved Degenerate Pattern Matching Algorithm for Biological Sequences

Journal article published in 2017 by Jie Lin, Yue Jiang, E. James Harner, Bing-Hua Jiang ORCID, Don Adjeroh
This paper was not found in any repository, but could be made available legally by the author.
This paper was not found in any repository, but could be made available legally by the author.

Full text: Unavailable

Green circle
Preprint: archiving allowed
Green circle
Postprint: archiving allowed
Red circle
Published version: archiving forbidden
Data provided by SHERPA/RoMEO

Abstract

Let [Formula: see text] be a string, with symbols from an alphabet. [Formula: see text] is said to be degenerate if for some positions, say [Formula: see text], [Formula: see text] can contain a subset of symbols from the symbol alphabet, rather than just one symbol. Given a text string [Formula: see text] and a pattern [Formula: see text], both with symbols from an alphabet [Formula: see text], the degenerate string matching problem, is to find positions in [Formula: see text] where [Formula: see text] occured, such that [Formula: see text], [Formula: see text], or both are allowed to be degenerate. Though some algorithms have been proposed, their huge computational cost pose a significant challenge to their practical utilization. In this work, we propose IDPM, an improved degenerate pattern matching algorithm based on an extension of the Boyer–Moore algorithm. At the preprocessing phase, the algorithm defines an alphabet-independent compatibility rule, and computes the shift arrays using respective variants of the bad character and good suffix heuristics. At the search phase, IDPM improves the matching speed by using the compatibility rule. On average, the proposed IDPM algorithm has a linear time complexity with respect to the text size, and to the overall size of the pattern. IDPM demonstrates significance performance improvement over state-of-the-art approaches. It can be used in fast practical degenerate pattern matching with large data sizes, with important applications in flexible and scalable searching of huge biological sequences.