Question: Search a pattern with length in a string with length . We assume that .
The naive approach will be matching the pattern at every location of the string and perform comparison at each location, so the complexity is around .
KMP algorithm preprocess the pattern to produce a table with the following entries (we assume the string and pattern are all 0-based) to help backtracing when a mismatch is found:
- The pattern will contain sub-patterns, each pattern starts with the location 0, and ends (inclusive) at location . For example, for pattern “AAABAAA“, it has 7 sub patterns as “A“, “AA“, “AAA“, “AAAB“, “AAABA“, “AAABAA” and “AAABAAA“.
- For each sub pattern, we look at its begining and end, and find its longest proper prefix which happens to be a suffix. Notice that proper prefix means that the prefix should not cover the whole string. For example, for sub-pattern “AAABAA“, the prefix which happens to be suffix is “AA”, which has a length 2. For sub-pattern “AA”, the prefix and suffix is “A”, which has the length 1. For single character pattern like “A”, since it has no proper prefix, its value is always 0. By going through all the sub patterns, we construct the table with each entry as an integer to indicate the length of maximum proper prefix as following:
With such a table constructed, now we can start the search as the following:
- Aligning the pattern with the string at string position ;
- Compare each character in pattern with the character in the string. The string within the range is called the current string window. We use to represent the index to the pattern.
- If all the characters in the current window match, we find a match and stop. By this way we find the leftmost match. The KMP algorithm could easily be adapted to find all the matches in the string, but that will make the algorithm a little more complicated and here we will try to make things easier by focus on its core idea.
- Now suppose at the location which is not equal to , the match fails. This means the text in the current window does not match the pattern. However, we don’t need to move forward the string window by only 1 and start over with the comparison from the begin of the pattern. Instead, we can use the table we construct before: since the characters from location already match the current sub-pattern (a.k.a. pattern inclusively), by looking at the size of proper prefix at the current sub-pattern , we can move the window start directly to , and confidently know that the string from that start to the position right before the current location already matches the start of the pattern (since for the sub pattern, this is a suffix and also a prefix), and we just continue to comparison at current string location and pattern location .
The above process can be illurstrated by the following diagram:
This covers the most element of the KMP algorithm, but when I read the materials listed in the reference section, I had the following question: Why do we have to shift the pattern to the start of the suffix? Isn’t possible that the next potential match could start with somewhere else within the pattern? For example, given the following sub-pattern:
If we have successfully matched the sub pattern above, do we have to just assuming only is matched, why can’t we shift the pattern to start at the in the middle?
I actually did some research on the articles I can find on Internet, most of them are focused on how to efficiently implement the step 1-4 described above, instead of reasoning about the process. Fortuntely, it is not hard to prove the new match has to start at . In the above example sub-pattern, and stands for arbitrary sub-string, stands for the same sub string, which happens to be the proper prefix, proper suffix, and also shows up in the middle. Consider the following two possibilities:
- If is not equal to , since we know the string in the current window has matched the sub-pattern so far (a.k.a. ), if we just shift the pattern to the middle A, then we are trying to match pattern with a portion of string which we already know is , and they can’t match each other, so shift pattern to middle A can not produce a match.
- If is EQUAL to , then is NOT the longest prefix any more, Instead, the longest proper prefix will be (or ), this contradicts with the fact that we assume is already the longest proper prefix.
So, if we know is the proper longest prefix, then shifting the pattern to the start of the suffix is the only way which could produce a match. The similar proof can be done for the sub pattern below.
The KMP algorithm is very smart, salute to those inventors of the algorithm!