G12FCO -- Formal Computation

Lecture 7 -- Pattern matching


There are a great many pattern-matching problems, some much easier than others. In two or three dimensions, this is still an area in which humans are much better than computers. We find it easy to recognise faces, or writing, even in the presence of distortions and `noise', aided by the massive parallel processing that our eyes and brain are capable of. By comparison, it takes massive computing resources and highly-specialised programs to make even a modest attempt at this sort of recognition. But in one dimension, the computer is, relatively speaking, much more effective. We look specifically at the string-matching problem. [Life gets much harder if the patterns include variable elements, such as `find two vowels separated by at least three consonants'.]

String matching

Suppose we have some text, T, and a word, W. Where, if at all, does W occur in T? For a little more definiteness, we can take T and W to be arrays of characters, of lengths n and m respectively. We shall assume that n is typically much larger than m, but m too could be quite large, much larger than in the examples. T could easily be megabytes, or even gigabytes, long in practice. W could be just a few characters, or a line or so, or many lines.
Unix users often refer to this as the `grep' problem, and will talk about `grepping for W in T'. This derives from the `grep' command in Unix, which in turn is named after the ed editor command `g/re/p', which means `globally search for a pattern [or regular expression] and print any lines where it occurs'. Many programs on PCs offer a `search' or `find' option somewhere in the menus which does much the same thing.

The `obvious' algorithm

There is an obvious solution to this problem, which we can think of as placing W under T at its left-hand end, comparing characters until we find either a `match' by running off the end of W, or a `mismatch' by finding two characters that differ. If we mismatch, then we slide W along by one, and try again:
	T: TEXT WORD TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT TEXT
	W: WORD
	W:  WORD
	W:   WORD
	W:    WORD
	W:     WORD
	W:      WORD   <-- found here!
This is an easy program to write, and no detailed commentary is needed:
	for i = 0 to n-m
	    for j = 1 to m
		if W[j] /= T[j+i] then next i endif # mismatch
	    endfor
	    return i+1  # match found, starting at T[i+1]
	endfor
	return 0	# W not found in T
[This assumes that your chosen language doesn't mind you jumping out of the middle of the j-loop to go to the next value of i; if it does, then you will have to do something a little more complicated on discovering the mismatch.]
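
For concreteness, here is a Python sketch of the same idea [my own code, not part of the original notes]; it uses Python's 0-based indexing, so it returns the position counting from 0, or -1 if W is not found:
	def naive_search(T, W):
	    """Return the 0-based index of the first occurrence of W in T, or -1."""
	    n, m = len(T), len(W)
	    for i in range(n - m + 1):        # each possible placement of W under T
	        for j in range(m):
	            if W[j] != T[i + j]:      # mismatch: slide W along by one
	                break
	        else:                         # ran off the end of W without a mismatch
	            return i
	    return -1                         # W not found in T

	print(naive_search("TEXT WORD TEXT", "WORD"))   # prints 5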

How many operations does this take? Well, obviously the distance to the match [or n-m+1 if there isn't one] times the average distance gone in W before the mismatch is discovered. With luck, as in the example, the distance in W is usually quite small, so is this an O(n) algorithm? No, because in the worst cases, we usually, or often, get most of the way down W before the mismatch, and in this case the work done is O(m×n). What do these worst cases look like? Here is one:

	T: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
	W: aaaaaaaaaab
But, you will say, this is silly; no-one looks for words like that in text like that! Not so -- T could be the raster scan of a blank radar screen, and W the shape of the blip for a particular aircraft. Not quite, of course; you would expect there to be lots of different blips for all the different possible aircraft [so we have to search for all of them], and they wouldn't really look like blank screen, and if the screen is blank then you don't need to look for them. All true, and I leave it to your imagination to tweak the example to be more realistic, but still pretty inefficient. Another modern application is searching DNA sequences; here, T is a long sequence of As, Cs, Gs and Ts, and it is very easy for W to match for a long way [the sequences are rather repetitive in places] before the mismatch is found.

Anyway, this algorithm was used for ages before anyone thought to see if it was possible to do better. Moral: Any computing or mathematical algorithm which is based on assumptions or code that is more than a decade old can almost certainly be improved.

Knuth, Morris and Pratt

These three gentlemen [actually, mostly M and P, but K is much more famous] discovered how to avoid repeating any match. The basic idea is this. When a mismatch occurs partway through W, we already know what the most recent few characters of T are. So, by comparison with the `obvious' algorithm, we can precompute what the next `interesting' values of i and j are. Suppose for example that W is coconut. Then if we mismatch on the n, we already know that T looks something like xxxxxcocoyzzzzzz, where the xs are uninteresting characters that we've already gone past, we now know that y is not n, and the zs are [possibly] interesting characters that we haven't looked at yet. So we now know that the current value of i leads to a mismatch at position 5; but we also know that the next value of i leads to a mismatch at position 1 [because it starts with an o instead of a c], and that the next value of i after that will match for the first two characters [co...]. So we don't restart with i incremented and j set to 1, we can restart with i advanced by 2 and j set to 3, to see if y is in fact a c. These values [2 and 3] can be computed very easily once and for all from the form of W. If we had failed not on the n but on the u, then T would have matched with cocon, and as none of ocon, con, on or n match the start of coconut, we can immediately slide W 5 places along and start again with j set to 1, trying to match against c the character that didn't match u.
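
To check that these restart values really can be read off from W alone, here is a small Python experiment [my own, and done by brute force rather than cleverly]: for each possible mismatch position j in coconut, it finds the longest proper prefix of the already-matched part that is also a suffix of it, and prints how far to advance i and where to restart j.
	W = "coconut"
	for j in range(2, len(W) + 1):      # a mismatch at (1-based) position j means W[1..j-1] matched
	    matched = W[:j - 1]
	    # longest proper prefix of the matched part that is also a suffix of it
	    b = max(k for k in range(len(matched)) if matched.endswith(matched[:k]))
	    print(f"mismatch at j={j}: advance i by {j - 1 - b}, restart with j = {b + 1}")
The output confirms the values claimed above: advance by 2 and restart at 3 for a mismatch on the n, advance by 5 and restart at 1 for a mismatch on the u. [A mismatch at j = 1 is the special case: just slide one place along and start again.]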

The algorithm looks like

	i = 0, j = 1
	while i <= n-m
		while W[j] = T[j+i] do			# this character of W matches T
			if j = m then return i+1 else j = j+1 endif	# all of W matched: found at T[i+1]
		endwhile
		i = i + nexti[j], j = nextj[j]	# mismatch at W[j]: shift using the precomputed tables
	endwhile
	return 0	# W not found in T
where nexti and nextj are the precomputed arrays that tell us what to do if there is a mismatch at position j. Further details of how to do the precomputation are beyond the present scope, but can be found in many texts on algorithms. Note that when W is aaaaaaaaaab, our previous worst case, a mismatch towards the end of the as will cause W to be advanced many places, while a mismatch on the b will only advance W one place, but will restart trying to match that same character in T with the last a, so the search will still run pretty efficiently. KMP always works in O(n) -- indeed, the search never makes more than about 2n character comparisons, since each comparison either moves us forwards in T or slides W along. [There is also some work, O(m), needed to do the precomputing, but we know that m<n -- more precisely, that if m>n we know instantly that W doesn't occur in T.] Apart from the gain in efficiency, another advantage is that we never need to look back in T, which can be helpful if T is being generated by a program rather than stored in an array.
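
For the curious, here is a complete Python sketch [my own code; the table-building is done in the simplest possible way rather than the clever O(m) way, and strictly it builds the plainer Morris-Pratt tables, but the shape is the same]. It follows the pseudocode above, translating its 1-based positions to Python's 0-based strings, and returns the 1-based position of the match, or 0:
	def build_tables(W):
	    """nexti[j], nextj[j]: what to do after a mismatch at (1-based) position j of W."""
	    m = len(W)
	    nexti, nextj = [0] * (m + 1), [0] * (m + 1)
	    nexti[1], nextj[1] = 1, 1                   # nothing matched yet: just slide one place
	    for j in range(2, m + 1):
	        matched = W[:j - 1]                     # the part of W known to have matched
	        b = max(k for k in range(len(matched)) if matched.endswith(matched[:k]))
	        nexti[j], nextj[j] = j - 1 - b, b + 1   # slide past all but the border, keep the border
	    return nexti, nextj

	def kmp_search(T, W):
	    """Return the 1-based position of the first occurrence of W in T, or 0 [as in the pseudocode]."""
	    n, m = len(T), len(W)
	    if m == 0 or m > n:
	        return 0
	    nexti, nextj = build_tables(W)
	    i, j = 0, 1
	    while i <= n - m:
	        while W[j - 1] == T[j + i - 1]:         # the -1s convert 1-based positions to 0-based strings
	            if j == m:
	                return i + 1                    # match found, starting at T[i+1]
	            j += 1
	        i, j = i + nexti[j], nextj[j]           # mismatch: never look back in T
	    return 0                                    # W not found in T

	print(kmp_search("she sells coconuts", "coconut"))   # prints 11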

Is this the end of the story? Of course not, as...

Boyer and Moore

... noticed a neat little twist. Suppose you try to match not, as everyone does, from the left-hand end of W, but from the right-hand end. Then if the mismatching character in T doesn't occur in W at all, we can advance W not just one place, but all the way past the wrong character. Take again our aaaaaaaaaab case. We try to match the b at the end. If at that point T has a c, for example, we can move W along 11 places [the length of W] and try again to match the b. If instead the b matches, then we look for the preceding a, and so on back down the word; but any failure will again enable us to move W along 11 places. If the b mismatches, but with an a [which does occur in W], then we can only shift W along one place, but we again start at the end, and make big strides unless the string in T at that point matches W reasonably well. In the worst case, BM is no worse than KMP, but in the best cases, it is much faster. Typically, if you ask it to search for a normal English word in a normal piece of English text, it looks at only a little more than one in m of the letters in the text [m being the length of W], so the work done is O(n/m). You can't get better than that without risking missing a match.
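
Here is a Python sketch of the simplest form of this idea [my own code; it uses only the `bad character' rule, in the variant usually credited to Horspool, rather than the full Boyer-Moore algorithm, but it shows where the big strides come from]:
	def horspool_search(T, W):
	    """Return the 0-based index of the first occurrence of W in T, or -1."""
	    n, m = len(T), len(W)
	    if m == 0 or m > n:
	        return -1
	    # How far can we slide W when character c of T lies under the last position of W?
	    # The rightmost occurrence of c in W[0..m-2] wins; characters not in W allow a full slide of m.
	    shift = {c: m - 1 - k for k, c in enumerate(W[:-1])}
	    i = 0                                       # current placement: W[0] under T[i]
	    while i <= n - m:
	        j = m - 1
	        while j >= 0 and W[j] == T[i + j]:      # compare from the right-hand end of W
	            j -= 1
	        if j < 0:
	            return i                            # ran off the left-hand end: match
	        i += shift.get(T[i + m - 1], m)         # stride along by the precomputed amount
	    return -1

	print(horspool_search("the quick brown fox jumps over the lazy dog", "lazy"))   # prints 35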

Moral: It's always worth having a look at an algorithm to see if it works better from a different viewpoint. Beware, in particular, of choices that are obvious for cultural reasons -- we read left-to-right so expect to match left-to-right as well.

Is that the end of the story? No, of course not. Let us look briefly at two applications which look, perhaps naively, as though they are the same problem, but for which even Boyer-Moore is not fast enough. How do they do that?

Spellcheckers

A spellchecker is looking for words in your text that are not in the dictionary. One way would be to take each word in your text, and grep for it in the dictionary; failures correspond to mistakes. But that's much too slow [feel free to do some estimates of how long this would take, even using Boyer-Moore, for a reasonably large dictionary]. So, what do they do? The dictionary is known in advance, so we can precompute some help. One way uses a trick known as hashing. There are many different versions of this. One is to think of some function that maps words onto integers in some range, say 0 to 999999. The function is up to you, but a simple [not very good] one would be to add up the ASCII values of the characters in the word and take the remainder on dividing by 1000000. That's not very good because all anagrams will map to the same integer, for example, and what we want is that, as far as possible, all words should map to different integers. So we want a suitably messy function, left to you to invent. What then? Well, we precompute [once and for all] that function of all the words in the dictionary. Then we set up an array of 1000000 bits [125000 bytes, a mere bagatelle these days], each set to 1 if there is a word which maps to the corresponding bit, and to 0 if there isn't. This again is precomputed once and for all; it's in the file /usr/lib/spell/hlistb for British words and /usr/lib/spell/hlista for American words on many Unix systems [or look in /usr/dict, or try putting /local in there somewhere].

Now, for each word in our text, you compute the same messy function. If the corresponding bit is 0, you know the word is not in your dictionary. If it is 1, then it probably is, but it could be a `Bingo!', an accidental coincidence of the function of your word with that same function of a different word which is in the dictionary. A random mis-spelling has a chance of about N in 1000000 [or whatever] of bingoing, where N is the number of words in the dictionary. If this worries you, then you run, as Unix does, several hash functions in parallel, and your word only passes if it matches several different functions of words in the dictionary. Thus, you can test each word, with very good probability of getting the right answer, simply by computing some fixed functions of the letters of the word and looking up that position in an array.
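
Here is a Python sketch of the whole scheme [my own; the two `messy' functions are invented purely for illustration, and real spellcheckers choose theirs much more carefully -- the arrangement is essentially what is nowadays called a Bloom filter]:
	NBITS = 1000000

	def hash1(word):
	    h = 0
	    for c in word:                      # mix each character into a running value
	        h = (h * 131 + ord(c)) % NBITS
	    return h

	def hash2(word):
	    h = 0
	    for c in word:
	        h = (h * 31 + 7 * ord(c)) % NBITS
	    return h

	HASHES = (hash1, hash2)                 # several hash functions, run in parallel

	def build_bitmaps(dictionary_words):
	    """Precomputed once and for all from the dictionary."""
	    bitmaps = [bytearray(NBITS) for _ in HASHES]    # one byte per `bit', for simplicity
	    for w in dictionary_words:
	        for bm, h in zip(bitmaps, HASHES):
	            bm[h(w)] = 1
	    return bitmaps

	def probably_in_dictionary(word, bitmaps):
	    """False means definitely not in the dictionary; True means almost certainly in it."""
	    return all(bm[h(word)] for bm, h in zip(bitmaps, HASHES))

	bitmaps = build_bitmaps(["cat", "dog", "coconut"])
	print(probably_in_dictionary("coconut", bitmaps))   # True
	print(probably_in_dictionary("coconvt", bitmaps))   # False, barring a `Bingo!'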

Hash functions of a particular type can be used to improve our original brute-force search. The idea, due to Rabin and Karp, is that if the hash function can be computed incrementally -- that is, the change in the function when one letter is dropped off at the left and another added at the right is easily calculated -- then you can easily compare the hash function of W with that of sections of T as those sections shift along T. This doesn't really offer any improvement over KMP, let alone BM, for the simple string-matching problem, but it works well if you are looking for several Ws simultaneously [a sort of spellcheck!], and it generalises well to two or more dimensions.
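
A Python sketch of the rolling hash [my own simplified version; Rabin and Karp's real scheme works modulo a randomly chosen prime, which is what makes the probabilistic guarantees go through]:
	def rabin_karp_search(T, W, base=256, mod=1000003):
	    """Return the 0-based index of the first occurrence of W in T, or -1."""
	    n, m = len(T), len(W)
	    if m == 0 or m > n:
	        return -1
	    hw = ht = 0
	    for k in range(m):                          # hash W and the first window of T
	        hw = (hw * base + ord(W[k])) % mod
	        ht = (ht * base + ord(T[k])) % mod
	    high = pow(base, m - 1, mod)                # weight of the character about to drop off the left
	    for i in range(n - m + 1):
	        if ht == hw and T[i:i + m] == W:        # hashes agree: check properly, in case of a `Bingo!'
	            return i
	        if i + m < n:                           # drop T[i] off the left, add T[i+m] on the right
	            ht = ((ht - ord(T[i]) * high) * base + ord(T[i + m])) % mod
	    return -1

	print(rabin_karp_search("TEXT WORD TEXT", "WORD"))   # prints 5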

Web searching

If you surf for something, you've probably used one of the major search engines. How do they do that? Well, not by looking through millions of Web pages every time someone comes along and says `find me some pages about topic X'. Again, the trick is to precompute something about the possible searches. Basically, a Web crawler constructs indexes of the pages it comes across; these indexes are stashed away and searched when you surf. Even that would not be practicable in full generality. So, common words are ignored; and refined compression and other database techniques are used. Well beyond the scope of this module!
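
Just to give the flavour of the indexing idea, here is a toy Python version [mine, and a caricature of what real search engines do]: the index is built once, by the crawler, and a query then never re-reads the pages at all.
	from collections import defaultdict

	STOPWORDS = {"the", "a", "of", "and"}       # `common words are ignored'

	def build_index(pages):
	    """Map each interesting word to the set of pages containing it -- an inverted index."""
	    index = defaultdict(set)
	    for name, text in pages.items():
	        for word in text.lower().split():
	            if word not in STOPWORDS:
	                index[word].add(name)
	    return index

	pages = {"page1": "the coconut palm", "page2": "pattern matching and the coconut"}
	index = build_index(pages)
	print(sorted(index["coconut"]))             # ['page1', 'page2'] -- found without re-reading the pages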



Copyright © Dr A. N. Walker, 1996, 2016.