Boyer-Moore String Search: Finding Substrings with Adaptive Jumps

Boyer-Moore String Search algorithm explained with visual walkthroughs, iterative and recursive Python implementations, and common pitfalls from hands-on practice.

it finds the first occurrence of a pattern in a text — basically a custom text.find(pattern).

the trick is it skips positions instead of checking one by one. brute force checks every position left to right. Boyer-Moore starts comparing from the right side of the pattern, and when it doesn’t match, it jumps forward — sometimes by a lot.

not the same as Boyer-Moore Voting Algorithm — same guys (Robert S. Boyer & J Strother Moore, UT Austin), completely different problem. string search came first (1977), voting came later (1981).

How it works (in my own words)

imagine you’re looking for “WORLD” in a really long sentence.

you put “WORLD” at the start and check the last letter first — the ‘D’. if the text has a ‘Z’ there, you already know none of “WORLD” can match at this position. and since ‘Z’ doesn’t even appear in “WORLD”, you can skip forward by the entire length of the pattern.

that’s the whole idea. compare from right, jump smart.

The important thing: two indices

this is the thing that tripped me up the most. you need two separate indices:

  • start — where the pattern is sitting on top of the text
  • j — which character in the pattern you’re currently comparing (right to left)

the comparison: text[start + j] vs pat[j]

1text:  a  b  c  d  e
2pat:   b  c  d
3       0  1  2
4             ↑ j starts here (K-1 = 2), goes left

don’t use one variable for both. i did that and pat[3] crashed because start > 0 pushed the index past the pattern length. lesson learned the hard way.

Walkthrough

find “bcd” in “abcde”:

first, build a last map from the pattern — each char’s rightmost index:

1last = {'b': 0, 'c': 1, 'd': 2}
 1Step 1: start=0, compare from right
 2  text:  a  b  c  d  e
 3  pat:   b  c  d
 4               ↑ j=2: text[2]='c' vs pat[2]='d' → mismatch!
 5
 6  bad char = 'c', last['c'] = 1
 7  shift = j - last['c'] = 2 - 1 = 1
 8
 9Step 2: start=1, compare from right
10  text:  a  b  c  d  e
11  pat:      b  c  d
12                  ↑ j=2: 'd' vs 'd' ✓
13            ↑ j=1: 'c' vs 'c' ✓
14         ↑ j=0: 'b' vs 'b' ✓
15
16  all matched → return 1 ✅

find “xyz” in “abcde”:

1start=0: text[2]='c' vs pat[2]='z' → mismatch
2  'c' not in pattern → shift = j+1 = 3
3
4start=3: pattern needs 3 chars but only 2 left → return -1 ✅

Shift calculation

when mismatch at pattern index j:

1def compute_shift(j, bad_char, last):
2    idx = last.get(bad_char, -1)
3    if idx == -1:
4        return j + 1              # char not in pattern, big jump
5    shift = j - idx
6    return max(shift, 1)          # at least move 1
CaseWhat happensShift
bad char not in patternnothing to align toj + 1
bad char is in pattern, before jalign itj - idx
bad char is in pattern, after jcan’t go backwards1 (minimum)

Code: iterative (while loops)

 1def boyer_moore(text, pat):
 2    last = {}
 3    for idx, c in enumerate(pat):
 4        last[c] = idx
 5
 6    i = 0
 7    while i <= len(text) - len(pat):
 8        j = len(pat) - 1
 9        while j >= 0 and text[i + j] == pat[j]:
10            j -= 1
11
12        if j < 0:
13            return i
14
15        bad_char = text[i + j]
16        idx = last.get(bad_char, -1)
17        if idx == -1:
18            i += j + 1
19        else:
20            i += max(j - idx, 1)
21
22    return -1

Code: recursive (no loops — NUS TCX1002 T04 Q8)

same algorithm but purely recursive, since the assignment didn’t allow loops:

 1def compute_shift(j, bad_char, last):
 2    idx = last.get(bad_char, -1)
 3    if idx == -1:
 4        return j + 1
 5    shift = j - idx
 6    if shift < 1:
 7        shift = 1
 8    return shift
 9
10def find_jump(text, pat):
11    N, K = len(text), len(pat)
12
13    def build_last(i, m):
14        if i >= K: return m
15        m[pat[i]] = i
16        return build_last(i + 1, m)
17
18    last = build_last(0, {})
19
20    def find_pattern(start):
21        if start + K > N: return -1
22
23        def compare(j):
24            if j < 0: return -1
25            if text[start + j] != pat[j]: return j
26            return compare(j - 1)
27
28        diff = compare(K - 1)
29        if diff == -1:
30            return start
31
32        shift = compute_shift(diff, text[start + diff], last)
33        return find_pattern(start + shift)
34
35    return find_pattern(0)

compare(j) is nested inside find_pattern so it can see start through closure. returns -1 for “all matched” to avoid confusion with mismatch at index 0.

Mistakes i made

these are all real bugs from my implementation, not hypothetical:

what i didwhy it brokefix
one variable for text + pattern indexpat[3] crash when start > 0keep start and j separate
compare returns 0 for match AND mismatch at 0can’t tell them apartuse -1 for “all matched”
boundary check start >= Npattern tail overflows textstart + K > N
text[diff] in compute_shiftdiff is pattern index, not text positiontext[start + diff]
treated shift as next positionshift is how far to move, not where to gostart + shift
built last from text instead of patternlast should map pattern chars onlyiterate pat
slice comparison text[start:end] == patworks but doesn’t tell you WHERE mismatch isneed j for compute_shift

Complexity

BestWorst
Time$O(N/K)$$O(NK)$
Space$O(K)$$O(K)$

best case: pattern’s last char never shows up in text, so you jump K positions every time. worst case: lots of partial matches like text = "aaaaaa", pat = "aaa".

Boyer vs Boyer

String Search (this post)Voting
Year19771981
Problemfind pattern in textfind majority element
Core idearight-to-left + skipcancel different elements
TimeO(N/K) bestO(N)
SpaceO(K)O(1)

same duo, two classic algorithms.

References

comments powered by Disqus