Knuth-Morris-Pratt (KMP) Algorithm

Thursday, December 3, 2020

Knuth-Morris-Pratt (KMP) algorithm is an algorithm that is used to find whether a pattern exists in some text or string. There is a bunch of usage and application of this algorithm, for example in spelling checker, similarity checker, and also plagiarism checker. At a more advanced level, this algorithm also can be used to check whether a person's DNA matches other DNA.

The naive approach to do string matching instead of KMP actually relatively simple, but the problem with the naïve approach is time-consuming. Naïve approach can be done using two loops to check each element or character of the pattern with the string itself. Let define the length of the string is M and the length of the pattern is N, then the time complexity of the naive approach is O(M*N) for the worst case.

KMP algorithm can do much better than the naive approach. KMP uses a generated table called failure function/failure table. A failure table is a table that holds an index when if the matching between elements in a given string with the pattern fails. When this happen, failure table will provide where or what index the matching should start over again.

To generate failure table, it is quite simple. We just need to track the match between suffix of the pattern. There is a simulator to understand this process much easier by accessing this link over here: https://people.ok.ubc.ca/ylucet/DS/KnuthMorrisPratt.html

But, if you are willing to calculate it manually, you can refer to this code and understand how the calculation works for the failure table.

Let us consider a problem where we need to determine whether pattern AAB exists in string AAAAAB. First, we need to generate the failure table as shown below.

j	0	1	2
P[j]	A	A	B
F[j]	0	1	0

*Note:
P[] is an array of char of pattern.
F[] is a DP (dynamic programming) array to hold failure index.
j = iteration index.

The process to get that failure table can be described as below.

F[0] is beginning of failure table, then F[0] = 0.

Set len = 0, j = 1.

When j = 1, because of Pat[j] == Pat[len] (A == A), then F[j] = ++len. (At this state len = 1)

When j = 2, because of Pat[j] != Pat[len] (A != B), then F[j] = F[len-1]. (Which is F[2] = F[0] = 0)

Okay, now we are going to check the match using KMP algorithm. I’ll use index i for text string iteration, and index j for failure table iteration. Blue table is the string, and Orange table is the failure table. Green cells mean match, and Red cells mean mismatch. Here’s the detail.

· Step 1: i = 0, j = 0 (match)

A	A	A	A	B
A	A	B

· Step 2: i = 1, j = 1 (match)

A	A	A	A	B
A	A	B

· Step 3: i = 2, j = 2 (mismatch).

A	A	A	A	B
A	A	B

In this state we will shift the matching position which is F[j-1] = F[1] = 1. At this state i = 2, and j = 1. So, the table will turn like this.

A	A	A	A	B
	A	A	B

· Step 4: i = 3, j = 2 (mismatch).

A	A	A	A	B
	A	A	B

In this state we will shift the matching position which is F[j-1] = F[1] = 1. At this state i = 2, and j = 1. So, the table will turn like this.

A	A	A	A	B
		A	A	B

· Step 5: i = 4, j = 2 (match)

A	A	A	A	B
		A	A	B

Therefore, we can see that pattern AAB exists in string AAAAB. By far, we can learn that the time complexity for KMP algorithm is the sum of time to generate the failure table which is let say M, plus the loop through the element or character in the string let say N. So, the time complexity of KMP algorithm is O(M+N), which is much much faster than the naive method which is O(M*N).

LABELS :

algorithm article development

COMMENTS

ARCHIVE

Search This Blog

Thursday, December 3, 2020