There are a great many pattern-matching problems, some much easier than others. In two or three dimensions, this is still an area in which humans are much better than computers. We find it easy to recognise faces, or writing, even in the presence of distortions and `noise', aided by the massive parallel processing that our eyes and brain are capable of. By comparison, it takes massive computing resources and highly specialised programs to make even a modest attempt at this sort of recognition. But in one dimension, the computer is relatively much more effective. We look specifically at the string-matching problem: given a word W of m characters and a text T of n characters, find where, if anywhere, W occurs in T. [Life gets much harder if the patterns include variable elements, such as `find two vowels separated by at least three consonants'.]
Unix users often refer to this as the `grep' problem, and will talk about `grepping for W in T'. This derives from the `grep' command in Unix, which in turn is named after the command `g/re/p' in the old Unix editor `ed', meaning `globally search for a pattern [or regular expression] and print any lines where it occurs'. Many programs on PCs offer a `search' or `find' option somewhere in the menus which does much the same thing.
   T: TEXT WORD TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT
   W: WORD
   W:  WORD
   W:   WORD
   W:    WORD
   W:     WORD
   W:      WORD   <-- found here!

This is an easy program to write, and no detailed commentary is needed:
   for i = 0 to n-m
      for j = 1 to m
         if W[j] /= T[j+i] then next i endif   # mismatch
      endfor
      return i+1                               # match found, starting at T[i+1]
   endfor
   return 0                                    # W not found in T

[This assumes that your chosen language doesn't mind you jumping out of the middle of the j-loop to go to the next value of i; if it does, then you will have to do something a little more complicated on discovering the mismatch.]
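For concreteness, here is the same brute-force search as runnable Python [a sketch only; Python strings are 0-indexed, so the 1-based positions of the pseudocode are shifted inside the code, and the for...else idiom stands in for the `next i' jump]:

   def brute_force_search(W, T):
       """Return the 1-based position of the first occurrence of W in T, or 0."""
       m, n = len(W), len(T)
       for i in range(n - m + 1):           # each candidate starting offset
           for j in range(m):
               if W[j] != T[i + j]:         # mismatch: slide W one place right
                   break
           else:                            # inner loop completed: every character matched
               return i + 1
       return 0

   print(brute_force_search("WORD", "TEXT WORD TEXT"))   # prints 6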
How many operations does this take? Roughly, the number of starting positions tried [the distance to the match, or n-m+1 if there isn't one] times the average distance gone in W before the mismatch is discovered. With luck, as in the example, the distance in W is usually quite small, so is this an O(n) algorithm? No: in the worst cases we get most of the way down W before the mismatch on almost every attempt, and then the work done is O(m×n). What do these worst cases look like? Here is one:
   T: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
   W: aaaaaaaaaab

But, you will say, this is silly; no-one looks for words like that in text like that! Not so -- T could be the raster scan of a blank radar screen, and W the shape of the blip for a particular aircraft. Not quite, of course; you would expect there to be lots of different blips for all the different possible aircraft [so we have to search for all of them], and they wouldn't really look like blank screen, and if the screen is blank then you don't need to look for them. All true, and I leave it to your imagination to tweak the example to be more realistic, but still pretty inefficient. Another modern application is searching DNA sequences; here, T is a long sequence of As, Cs, Gs and Ts, and it is very easy for W to match for a long way [the sequences are rather repetitive in places] before the mismatch is found.
Anyway, this algorithm was used for ages before anyone thought to see if it was possible to do better. Moral: Any computing or mathematical algorithm which is based on assumptions or code that is more than a decade old can almost certainly be improved.
The key insight, due to Knuth, Morris and Pratt, is that when a mismatch occurs, the characters of T we have already examined tell us where the next possible match could start, and this can be precomputed from W alone. The algorithm looks like this:
   i = 0, j = 1
   while i <= n-m
      while W[j] = T[j+i] do
         if j = m then return i+1 else j = j+1 endif
      endwhile
      i = i + nexti[j], j = nextj[j]
   endwhile
   return 0

where nexti and nextj are the precomputed arrays that tell us what to do if there is a mismatch at position j. Further details of how to do the precomputation are beyond the present scope, but can be found in many texts on algorithms. Note that when W is aaaaaaaaaab, our previous worst case, a mismatch towards the end of the as will cause W to be advanced many places, while a mismatch on the b will only advance W one place, but will restart trying to match that same character in T with the last a, so the search will still run pretty efficiently. KMP always works in O(n), and indeed with no more than m+n comparisons. [There is also some work, O(m), needed to do the precomputing, but we know that m<n -- more precisely, that if m>n we know instantly that W doesn't occur in T.] Apart from the gain in efficiency, another advantage is that we never need to look back in T, which can be helpful if T is being generated by a program rather than stored in an array.
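For the curious, here is a Python sketch of the usual textbook formulation, which combines nexti and nextj into a single `failure' array computed from W alone [the variable names are mine, not from any particular text]:

   def kmp_search(W, T):
       """KMP: 1-based position of the first occurrence of W in T, or 0.
          Assumes W is non-empty."""
       m, n = len(W), len(T)
       # fail[j] = length of the longest proper prefix of W[0..j] that is
       # also a suffix of it; this is the whole O(m) precomputation.
       fail = [0] * m
       k = 0
       for j in range(1, m):
           while k > 0 and W[j] != W[k]:
               k = fail[k - 1]
           if W[j] == W[k]:
               k += 1
           fail[j] = k
       # The scan proper: i only ever moves forward, so we never look back in T.
       k = 0                                 # characters of W matched so far
       for i in range(n):
           while k > 0 and T[i] != W[k]:
               k = fail[k - 1]               # fall back within W, not within T
           if T[i] == W[k]:
               k += 1
           if k == m:
               return i - m + 2              # 1-based position of the match
       return 0

   print(kmp_search("aaaaaaaaaab", "a" * 50 + "b"))    # prints 41

Note that T is read strictly left to right, one character at a time, which is exactly why KMP suits a T arriving from a stream.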
Is this the end of the story? Of course not. Boyer and Moore observed that if you compare W with T starting from the right-hand end of W, then a mismatch against a character of T that does not occur in W at all lets you slide W completely past that character, skipping up to m characters of T in one go; on ordinary text the search then examines only a fraction of the characters of T.
Moral: It's always worth having a look at an algorithm to see if it works better from a different viewpoint. Beware, in particular, of choices that are obvious for cultural reasons -- we read left-to-right so expect to match left-to-right as well.
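To make the right-to-left idea concrete, here is a Python sketch of the simplified Horspool variant of Boyer-Moore, which keeps only the `bad character' rule [full Boyer-Moore adds a second rule, omitted here]:

   def horspool_search(W, T):
       """Boyer-Moore-Horspool: 1-based position of W in T, or 0."""
       m, n = len(W), len(T)
       # For each character c, how far W may slide when c is the character
       # of T currently aligned with the last position of W.
       shift = {}
       for j in range(m - 1):
           shift[W[j]] = m - 1 - j
       i = 0
       while i <= n - m:
           j = m - 1
           while j >= 0 and W[j] == T[i + j]:    # compare from the right
               j -= 1
           if j < 0:
               return i + 1                      # matched all of W
           i += shift.get(T[i + m - 1], m)       # characters absent from W skip m places
       return 0

   print(horspool_search("WORD", "TEXT WORD TEXT"))    # prints 6

On ordinary text, most characters of T do not occur in a short W at all, so W hops m places at a time and most of T is never looked at.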
Is that the end of the story? No, of course not. Let us look briefly at two applications which look, perhaps naively, as though they are the same problem, but for which even Boyer-Moore is not fast enough. The first is spellchecking: a spelling checker appears to test every word of your document against a dictionary of many thousands of words, more or less instantly. How does it do that? The trick is hashing. For every word in the dictionary, you compute some messy function of its letters with a value between 1 and 1000000 [say], and set the corresponding bit in a long array of bits, initially all 0.
Now, for each word in our text, you compute the same messy function. If the corresponding bit is 0, you know the word is not in your dictionary. If it is 1, then it probably is, but it could be a `Bingo!', an accidental coincidence of the function of your word with that same function of a different word which is in the dictionary. A random mis-spelling has a chance of about N in 1000000 [or whatever] of bingoing, where N is the number of words in the dictionary. If this worries you, then you run, as Unix does, several hash functions in parallel, and your word only passes if it matches several different functions of words in the dictionary. Thus, you can test each word, with very good probability of getting the right answer, simply by computing some fixed functions of the letters of the word and looking up that position in an array.
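A minimal sketch in Python [the two hash functions here are arbitrary stand-ins, chosen only to be cheap and messy; they are not the ones any real spellchecker uses]:

   SIZE = 1000000                 # number of bits in the table

   def h1(word):                  # illustrative hash function no. 1
       v = 0
       for c in word:
           v = (v * 31 + ord(c)) % SIZE
       return v

   def h2(word):                  # illustrative hash function no. 2
       v = 7
       for c in word:
           v = (v * 131 + ord(c) * 17) % SIZE
       return v

   def build_table(dictionary):
       bits = bytearray(SIZE)     # one byte per bit, for simplicity
       for word in dictionary:
           bits[h1(word)] = bits[h2(word)] = 1
       return bits

   def probably_in(word, bits):
       # A 0 at either position means `definitely not in the dictionary';
       # two 1s mean `probably in it', with a small chance of a Bingo.
       return bool(bits[h1(word)] and bits[h2(word)])

   bits = build_table(["word", "text", "search"])
   print(probably_in("word", bits))      # True
   print(probably_in("wrod", bits))      # almost certainly False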
Hash functions of a particular type can be used to improve our original brute-force search. The idea, due to Rabin and Karp, is that if the hash function can be computed incrementally -- that is, the change in the function when one letter is dropped off at the left and another added at the right is easily calculated -- then you can easily compare the hash function of W with that of sections of T as those sections shift along T. This doesn't really offer any improvement over KMP, let alone BM, for the simple string-matching problem, but it works well if you are looking for several Ws simultaneously [a sort of spellcheck!], and it generalises well to two or more dimensions.
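A sketch of such a rolling hash in Python, using the common polynomial-modulo-a-prime choice [the base and modulus are illustrative]:

   def rabin_karp_search(W, T, base=256, mod=1000003):
       """Rabin-Karp: 1-based position of the first occurrence of W in T, or 0."""
       m, n = len(W), len(T)
       if m > n:
           return 0
       high = pow(base, m - 1, mod)          # weight of the character about to drop off
       hw = ht = 0
       for j in range(m):                    # hashes of W and of T[0..m-1]
           hw = (hw * base + ord(W[j])) % mod
           ht = (ht * base + ord(T[j])) % mod
       for i in range(n - m + 1):
           # Equal hashes mean only `probably a match', so confirm directly.
           if hw == ht and T[i:i + m] == W:
               return i + 1
           if i < n - m:                     # roll: drop T[i] on the left, add T[i+m] on the right
               ht = ((ht - ord(T[i]) * high) * base + ord(T[i + m])) % mod
       return 0

   print(rabin_karp_search("WORD", "TEXT WORD TEXT"))    # prints 6

To look for several Ws at once [all of the same length], keep the hashes of all of them in a set and test each ht against the set as it rolls along T.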