Boyer-Moore String Search: Finding Substrings with Adaptive Jumps

What is Boyer-Moore String Search?

it finds the first occurrence of a pattern in a text — basically a custom text.find(pattern).

the trick is it skips positions instead of checking one by one. brute force checks every position left to right. Boyer-Moore starts comparing from the right side of the pattern, and when it doesn’t match, it jumps forward — sometimes by a lot.

not the same as Boyer-Moore Voting Algorithm — same guys (Robert S. Boyer & J Strother Moore, UT Austin), completely different problem. string search came first (1977), voting came later (1981).

How it works (in my own words)

imagine you’re looking for “WORLD” in a really long sentence.

you put “WORLD” at the start and check the last letter first — the ‘D’. if the text has a ‘Z’ there, you already know none of “WORLD” can match at this position. and since ‘Z’ doesn’t even appear in “WORLD”, you can skip forward by the entire length of the pattern.

that’s the whole idea. compare from right, jump smart.

The important thing: two indices

this is the thing that tripped me up the most. you need two separate indices:

start — where the pattern is sitting on top of the text
j — which character in the pattern you’re currently comparing (right to left)

the comparison: text[start + j] vs pat[j]

1text:  a  b  c  d  e
2pat:   b  c  d
3       0  1  2
4             ↑ j starts here (K-1 = 2), goes left

don’t use one variable for both. i did that and pat[3] crashed because start > 0 pushed the index past the pattern length. lesson learned the hard way.

Walkthrough

find “bcd” in “abcde”:

first, build a last map from the pattern — each char’s rightmost index:

1last = {'b': 0, 'c': 1, 'd': 2}

 1Step 1: start=0, compare from right
 2  text:  a  b  c  d  e
 3  pat:   b  c  d
 4               ↑ j=2: text[2]='c' vs pat[2]='d' → mismatch!
 5
 6  bad char = 'c', last['c'] = 1
 7  shift = j - last['c'] = 2 - 1 = 1
 8
 9Step 2: start=1, compare from right
10  text:  a  b  c  d  e
11  pat:      b  c  d
12                  ↑ j=2: 'd' vs 'd' ✓
13            ↑ j=1: 'c' vs 'c' ✓
14         ↑ j=0: 'b' vs 'b' ✓
15
16  all matched → return 1 ✅

find “xyz” in “abcde”:

1start=0: text[2]='c' vs pat[2]='z' → mismatch
2  'c' not in pattern → shift = j+1 = 3
3
4start=3: pattern needs 3 chars but only 2 left → return -1 ✅

Shift calculation

when mismatch at pattern index j:

1def compute_shift(j, bad_char, last):
2    idx = last.get(bad_char, -1)
3    if idx == -1:
4        return j + 1              # char not in pattern, big jump
5    shift = j - idx
6    return max(shift, 1)          # at least move 1

Case	What happens	Shift
bad char not in pattern	nothing to align to	`j + 1`
bad char is in pattern, before j	align it	`j - idx`
bad char is in pattern, after j	can’t go backwards	`1` (minimum)

Code: iterative (while loops)

 1def boyer_moore(text, pat):
 2    last = {}
 3    for idx, c in enumerate(pat):
 4        last[c] = idx
 5
 6    i = 0
 7    while i <= len(text) - len(pat):
 8        j = len(pat) - 1
 9        while j >= 0 and text[i + j] == pat[j]:
10            j -= 1
11
12        if j < 0:
13            return i
14
15        bad_char = text[i + j]
16        idx = last.get(bad_char, -1)
17        if idx == -1:
18            i += j + 1
19        else:
20            i += max(j - idx, 1)
21
22    return -1

Code: recursive (no loops — NUS TCX1002 T04 Q8)

same algorithm but purely recursive, since the assignment didn’t allow loops:

 1def compute_shift(j, bad_char, last):
 2    idx = last.get(bad_char, -1)
 3    if idx == -1:
 4        return j + 1
 5    shift = j - idx
 6    if shift < 1:
 7        shift = 1
 8    return shift
 9
10def find_jump(text, pat):
11    N, K = len(text), len(pat)
12
13    def build_last(i, m):
14        if i >= K: return m
15        m[pat[i]] = i
16        return build_last(i + 1, m)
17
18    last = build_last(0, {})
19
20    def find_pattern(start):
21        if start + K > N: return -1
22
23        def compare(j):
24            if j < 0: return -1
25            if text[start + j] != pat[j]: return j
26            return compare(j - 1)
27
28        diff = compare(K - 1)
29        if diff == -1:
30            return start
31
32        shift = compute_shift(diff, text[start + diff], last)
33        return find_pattern(start + shift)
34
35    return find_pattern(0)

compare(j) is nested inside find_pattern so it can see start through closure. returns -1 for “all matched” to avoid confusion with mismatch at index 0.

Mistakes i made

these are all real bugs from my implementation, not hypothetical:

what i did	why it broke	fix
one variable for text + pattern index	`pat[3]` crash when start > 0	keep `start` and `j` separate
`compare` returns 0 for match AND mismatch at 0	can’t tell them apart	use -1 for “all matched”
boundary check `start >= N`	pattern tail overflows text	`start + K > N`
`text[diff]` in compute_shift	diff is pattern index, not text position	`text[start + diff]`
treated shift as next position	shift is how far to move, not where to go	`start + shift`
built `last` from text instead of pattern	`last` should map pattern chars only	iterate `pat`
slice comparison `text[start:end] == pat`	works but doesn’t tell you WHERE mismatch is	need `j` for compute_shift

Complexity

	Best	Worst
Time	$O(N/K)$	$O(NK)$
Space	$O(K)$	$O(K)$

best case: pattern’s last char never shows up in text, so you jump K positions every time. worst case: lots of partial matches like text = "aaaaaa", pat = "aaa".

Boyer vs Boyer

	String Search (this post)	Voting
Year	1977	1981
Problem	find pattern in text	find majority element
Core idea	right-to-left + skip	cancel different elements
Time	O(N/K) best	O(N)
Space	O(K)	O(1)

same duo, two classic algorithms.

References

A Fast String Searching Algorithm (Boyer & Moore, 1977)
Wikipedia: Boyer-Moore String Search
NUS TCX1002 Tutorial 04 Q8