Horspool's algorithm is a space-time tradeoff algorithm designed to efficiently find occurrences of a
pattern \(P\) of length \(m\) within a larger text \(T\) of length \(n\). Unlike simpler algorithms that compare each character repeatedly, Horspool minimizes the total number of character comparisons by intelligently skipping portions of the text where matches cannot possibly occur.
Horspool compares characters starting from the rightmost character of the pattern moving leftward. This end-to-start comparison is strategic: mismatches near the end of the pattern are more informative, as they often justify larger shifts that skip irrelevant text. The shift distance depends only on the character aligned with the last position of the pattern in the current alignment, regardless of whether any earlier characters matched.
To illustrate, suppose we want to find the pattern "RODEO" in the text "NOW WE RODE ON HORSES".
Here we have \(m=5\) and \(n=21\).
We explore three scenarios based on the character in the text aligned with the last character of the pattern:
Case 1:The aligned character does not appear in the pattern.
Suppose the alignment places the last character of "RODEO" over a W character of the text:
NOW WE RODE ON HORSES
RODEO
Since W does not appear in "RODEO", shifting fewer than \(m\) (in this case, 5)
ensures that the W will not match any character in the pattern. Thus, we might as well shift by \(m\).
In this case, we store \(m\) in the shift table for W.
This is the case for any character not appearing in the pattern.
Case 2:The aligned character appears somewhere in the pattern before the last position
(and may or may not appear in the last position).
Suppose the alignment places the last character of "RODEO" over an R character of the text:
NOW WE RODE ON
RODEO
The R appears at index 0 of the pattern. Notice that if we shift the pattern any fewer than 4 spots,
the R of the text cannot possible match anything in the pattern.
In this case we store 4 in the shift table for the letter R.
In general, for characters whose last appearance in the pattern is at index \(i\), where \(i \lt m-1\),
the shift table value is \(m - 1 - i\).
This ensures that the given occurrence of the letter in the text is aligned with its last appearance in the pattern.
Case 3:The aligned character only occurs as the final character of the pattern.
(Notice that if it also occurs elsewhere, it is in Case 2). For instance, if we replace the pattern "RODEO" with "RADEO",
and have a situation such as the following:
NOW WE RODE ON
RADEO
It is not hard to see that this is identical to Case 1. So we store \(m\) in the shift table for O
in this case. Luckily, this is not actually a special case because, as we will see, the algorithm automatically handles it.
In summary:
Case
Text character aligned with last pattern character...
Example
Action
1
is not in the pattern
NOW WE RODE ON HORSES
RODEO
Shift the pattern by \(m\).
2
appears earlier in the pattern (last occurrence at index \(i\))
NOW WE RODE ON HORSES
RODEO
Shift the pattern by \(m - 1 - i\).
3
appears only at final position in pattern
NOW WE RODE ON HORSES
RADEO
Shift the pattern by \(m\).
Two important points to re-emphasize:
First, we look at the character in the text, not the pattern.
Second, we look at the character in the text that is aligned with the last character of the pattern regardless of
whether or not any characters matched.
So it is always the character in the text that is lined up with the last character in the pattern.
These examples/observations are the idea behind the Bad Character Heuristic.
The heuristic builds a shift table that precomputes optimal shifts for each character,
leading to significant reductions in unnecessary comparisons.
More concretely, given a pattern \(P\) of length \(m\),
the preprocessing step to compute the shift table (also know as the bad character table)
is as follows.
Create a shift table \(S\) that stores a shift value for each character in the alphabet.
Set \(S[c] = m\) for all characters \(c\) that occur in the text.
This default value corresponds to shifting past the entire pattern length,
assuming the character does not appear in the pattern or only appears as the final character.
Next, refine the shift table:
For each character in the pattern from first to second-to-last (left to right), set
\[S[P[j]] = m - 1 - j.\]
The order is crucial; processing from left to right ensures that if a character appears multiple times,
the shift value from its rightmost occurrence is retained,
which guarantees the correct number of positions to shift when a mismatch occurs.
For simplicity, in most the remainder of our discussion
we will assume the characters are ASCII so we can treat them as integers 0 through 255,
making it easy to implement a lookup table.
Many languages, including Java and C++, make this easy to do.
This means we will need a shift table with 256 entries.
Given that, we can write the preprocessing step of the algorithm in formal pseudocode as follows:
int [] computeShiftTable(int P[])
m=P.length
int[] S = new int[256]; // Create an array of size 256
for (int i = 0; i < 256 ; i++) { // Initialize all entries to m
S[i] = m;
}
for (int j = 0; j < m - 1 ; j++) { // Update entries based on the pattern
S[P[j]] = m - 1 - j;
}
return S
Let's see a few examples of constructing the table.
For simplicity, the following demo restricts the characters to upper case alphabetic characters
and only shows that portion of the shift table.
Run it with strings like "HORSPOOL", "SHOELACES", and "CGTAATAAGTC".
If the characters in question are not ASCII, we can simply replace an array with a dictionary,
which languages like Python make very easy, can be done with a HashMap
in Java, or with a std::map in C++. In this case, the dictionary only needs to store entries
whose values are not the full length of the pattern (\(m\)), and if a character is not in the dictionary
the value is assumed to be \(m\).
This means no initialization is required, and less space is used most of the time.
Now that we have a shift table, we can get into the main part of the algorithm.
Begin by aligning the pattern \(P\) with the start of the text \(T\).
Character comparisons are performed from right to left, meaning the algorithm first compares the last character of the pattern to the corresponding aligned character in the text.
If all characters match, record or return the starting index of this occurrence.
If a mismatch occurs anywhere within the pattern, lookup \(k=S[c]\),
where \(c\) is the character in the text aligned with the last character of \(P\).
Shift the pattern to the right by \(k\) places.
As long as the pattern does no go past the end of the text, go back to step 2. Otherwise quit.
If more formal pseudocode, it looks something like this
(where this version simply outputs all matches):
horspoolSearch(int[]P, int[] T, int[] S)
int n=T.length
int m=P.length
int i = 0; // Line up P at the start of T
while (i ≤ n - m) // As long as pattern does not go past the end of text
int j = m - 1; // Start at the end of the pattern
while (j ≥ 0 && P[j] == T[i + j]) // As long as the characters match
j--; // go to the previous character
if (j < 0) // If we made it through the whole pattern
print(i) // print the starting index of the match location
int next = T[i + m - 1]; // Get character of text lined up with the last
// character of pattern (treated like an integer)
i += S[next]; // Shift according to the bad character table
Use this demo with several patterns and texts until you are comfortable with how the algorithm works.
This demo only shows the portion of the table with characters from the pattern. The * entry
indicates all other characters, which is why its value is \(m\).
Do not make the text too long unless you have a wide enough monitor!
Code Implementations
In the following implementations, we assume the characters in the char arrays are ASCII characters.
If they are not, the algorithms will crash (index-out-of-bounds errors). Note that for C++ we need to cast
the char into unsigned char since wether or not a char is signed or unsigned
is implementation dependent, and negative values would definitly cause problems.
This complication can be avoided by using a dictionary/map instead of an array
(but depending on your background, you may or may not know how to do that in C++, so we do not assume).
Also notice that, as usual, we need to pass in the sizes of the array in C++.
In the first Python implementation, we try to keep the implementation as similar to Java and C++ as possible.
In it we use the ord method to convert a character to its ASCII value.
We give a second version that takes advantage of the fact that dictionaries are very simple to use in Python,
as well as a few other Python-shortcuts.
void horspoolSearch(char[] P, char[] T) {
int m = P.length;
int n = T.length;
int[] S = new int[256];
for (int i = 0; i < 256; i++) {
S[i] = m;
}
for (int j = 0; j < m - 1; j++) {
S[P[j]] = m - 1 - j; // Implicit cast from char to int
}
int i = 0;
while (i <= n - m) {
int j = m - 1;
while (j >= 0 && P[j] == T[i + j]) {
j--;
}
if (j < 0) {
System.out.println(i);
}
char next = T[i + m - 1];
i += S[next]; // next implicitly cast to int
}
}
void horspoolSearch(char P[], int m, char T[], int n) {
int S[256];
for (int i = 0; i < 256; i++)
S[i] = m;
for (int j = 0; j < m - 1; j++)
S[(unsigned char)P[j]] = m - 1 - j;
int i = 0;
while (i <= n - m) {
int j = m - 1;
while (j >= 0 && P[j] == T[i + j])
j--;
if (j < 0)
std::cout << i << std::endl;
char next = T[i + m - 1];
i += S[(unsigned char)next];
}
}
def horspool_search(P, T):
m = len(P)
n = len(T)
# Step 1: Build the shift table
S = [m] * 256
for j in range(m - 1):
S[ord(P[j])] = m - 1 - j
# Step 2: Search
i = 0
while i <= n - m:
j = m - 1
while j >= 0 and P[j] == T[i + j]:
j -= 1
if j < 0:
print(i)
i += S[ord(T[i + m - 1])]
def horspool_search(P, T):
m = len(P)
n = len(T)
# Step 1: Build the shift table using a dict
shift = {c: m for c in set(T)} # default to m for all seen text chars
for j in range(m - 1):
shift[P[j]] = m - 1 - j
# Step 2: Search
i = 0
while i <= n - m:
j = m - 1
while j >= 0 and P[j] == T[i + j]:
j -= 1
if j < 0:
print(i)
c = T[i + m - 1]
i += shift.get(c, m) # default to m if char not in shift table
Time/Space Analysis
Time Complexity:
The preprocessing phase takes \(O(m + \sigma)\) time, where \(m\) is the length of the pattern and \(\sigma\) is the size of the alphabet.
The \(O(\sigma)\) term accounts for initializing the shift table, and the \(O(m)\) term comes from updating the table entries based on the pattern.
In the worst case, each shift results in only a single-character shift with up to \(m\) comparisons, leading to a total time of \(O(mn)\).
This can occur in contrived inputs where mismatches happen only at the end of each alignment (e.g., text is a repeated prefix of the pattern).
However, the average-case complexity is \(O(n)\) for typical inputs, especially when the alphabet is reasonably large and the pattern is not highly repetitive.
This is because most mismatches are detected quickly (often near the end of the pattern), and the algorithm often shifts by several characters at once.
On average, the number of character comparisons per alignment is small and independent of the pattern length, resulting in linear-time behavior for most inputs.
This behavior makes Horspool’s algorithm significantly faster in practice than brute-force search, especially when the pattern is long or the character distribution is uniform.
Space Complexity: The shift table requires \(O(\sigma)\) space, which is constant for fixed-size alphabets such as ASCII (256 characters).
Variations/Improvements
Horspool's algorithm is a simplified version of the more general Boyer-Moore algorithm, which uses both the bad character rule (as in Horspool) and an additional good suffix rule to determine shift distances.
While Horspool is easier to implement and performs well in practice, especially for longer patterns and larger alphabets, Boyer-Moore typically achieves better performance by using more information from partial matches.
Further refinements, such as the Boyer-Moore-Horspool-Sunday algorithm, adjust the shift logic to examine characters beyond the current window, often improving average-case behavior.
Other extensions and hybrid approaches balance preprocessing time and shift table complexity to optimize for specific contexts, such as small alphabets or frequent partial matches in the text.
Shift Table Purpose What is the purpose of the bad character table (the shift table) in Horspool’s algorithm?
Preprocessing Range Why does the preprocessing loop update the shift table only for the first \(m-1\) characters of the pattern? What would happen if the last character were included?
Shift Formula How is the shift value for a character in the pattern computed?
Right-to-Left Comparison Why does Horspool’s algorithm compare characters from right to left during the search?
Time Complexity What are the worst-case and average-case time complexities of Horspool’s algorithm?
Large Alphabet What effect does a large alphabet size have on the performance of the algorithm?
Shift Character What character in the text determines how far the pattern shifts after a mismatch?
Default Shift What value is assigned in the shift table for characters that do not appear in the pattern, and why?
Improvement Over Brute Force In what ways does Horspool's algorithm improve over the brute-force approach?
Repeated Characters How does having repeated characters in the pattern affect the entries in the shift table?
Answer:
The shift table tells the algorithm how far to shift the pattern when a mismatch occurs, based on the character in the text that caused the mismatch. It allows the algorithm to skip over sections of the text instead of checking every position.
Answer:
The shift table is used when a mismatch occurs at the last character of the pattern, so there is no need to define a shift for that character based on itself — it is the one being compared. If the last character were included in the preprocessing, it would receive a shift value of 0, which would prevent any progress in the search when mismatches occur on that character. This could cause infinite loops or redundant comparisons.
Answer:
For a character \(c\) at position \(j\) in the pattern (where \(j < m - 1\)), the shift value is computed as \(m - 1 - j\). This tells the algorithm how far to shift the pattern when a mismatch occurs on character \(c\).
Answer:
Horspool's algorithm compares characters from right to left so that a mismatch at the end of the pattern can immediately trigger a shift based on the “bad character” rule. This approach allows the algorithm to use a precomputed shift table indexed by the mismatched character in the text, which is aligned with the last character of the pattern. The direction of comparison does not affect how quickly mismatches are found or how large the shift is—it simply enables efficient table-based shifting.
Answer:
The worst-case time complexity is \(O(mn)\), which occurs in cases where shifts are small and many comparisons are needed. The average-case time is \(O(n)\) because mismatches often allow large shifts and most comparisons fail quickly.
Answer:
A large alphabet affects both the preprocessing and search phases. In preprocessing, it increases the time and space needed to initialize the shift table, since an entry must be created for each character in the alphabet. During the search, a larger alphabet improves average-case performance by increasing the likelihood that a mismatched character does not appear in the pattern, allowing the pattern to shift farther.
Answer:
The character in the text that aligns with the last character of the pattern determines the shift value. This is the “bad character” used to look up the shift amount in the table.
Answer:
Characters not in the pattern are assigned a shift value equal to the pattern length. This ensures the pattern moves past them entirely, since no match is possible.
Answer:
Horspool avoids rechecking characters that have already been compared by starting from the right end of the pattern. It also uses the shift table to skip ahead by several characters on mismatches, often reducing the number of comparisons significantly.
Answer:
Repeated characters in the pattern result in multiple updates to the shift table for the same character. The last occurrence closest to the end of the pattern determines the final shift value.
In-Class Activities
Build a Shift Table
Given a pattern and an alphabet (e.g., ASCII a–z), manually construct the bad-character shift table by computing \( m - 1 - j \) for each relevant character. Compare answers with a partner.
Simulate the Algorithm
Use pencil and paper to walk through Horspool’s algorithm on a short pattern and text. Mark character comparisons, mismatches, matches, and record each shift. Tally the total number of comparisons.
Compare with Naive Search
For the same pattern and text, simulate both Horspool and the brute-force (naive) approach. Compare total comparisons and shifts. Which one performs better and why?
Shift Table with Larger Alphabets
Change the alphabet to something larger (e.g., extended ASCII or Unicode emoji) and investigate how this affects preprocessing time and shift table size. What data structure would work best?
Map-Based Shift Table
Implement a shift table using a dictionary (or map) instead of a fixed-size array to support non-ASCII or dynamic alphabets.
Give full code in your chosen language. Test it on several inputs. Discuss this implementation versus the array-based
implementation presented here. Is one more/less efficient? Easier to implement? Other advantages/disadvantages?
Benchmarking Horspool
Write code to generate random text and patterns, and time the average number of comparisons Horspool makes over many trials. Vary pattern length and alphabet size. Plot and analyze the results.
Worst-Case Generator
Try to construct a pattern of length \(m\) and text of length \(n\)
that result in the worst-case behavior for Horspool. How often does this happen in practice? Discuss in small groups.
Data Type Considerations
How might the performance or usefulness of Horspool's algorithm differ when searching natural language text (e.g. English)
versus binary data (e.g., image files) or biological data (e.g., DNA)? In case you are unaware,
DNA data uses a very small alphabet of only four characters: A, C, G, and T.
Problems
Simulate Horspool by Hand
Given the text "abracadabraabracadabra" and pattern "abra", simulate Horspool's algorithm by hand
(finding all matches!). At each step, list the comparisons made and total them at the end.
Compare with Naive Algorithm
For the same input as above, simulate the brute-force (naive) string search algorithm. Count and compare the number of character comparisons with Horspool's.
Another Simulation/Comparison
By hand, go through both Horspool's and the exhaustive search algorithm on the text "thethemethatmattersmostistheme"
and pattern "theme". Make sure you show every step!
Count the number of shifts and total number of comparisons each algorithm makes.
Prove Correctness
Prove the correctness of Horspool's algorithm based on the bad-character heuristic. In particular, show that the algorithm never skips a possible match and always reports correct positions.
Support Wildcards
Modify the algorithm to support wildcard characters in the pattern (e.g., `?` matches any single character). Describe the necessary changes to the shift table and the matching loop.
Shifts on Repetitive Patterns
Consider the text "abababbabab" and the pattern "abab". Simulate Horspool’s algorithm by hand. For each alignment, indicate whether a match occurs and how far the pattern is shifted. Count the total number of character comparisons. How does this behavior compare to the naive algorithm on the same input? What conclusions, if any, can you draw about how well Horspool's algorithm
works on small alphabets and/or strings with repeated patterns?
Parallelization
Design a parallel version of Horspool's algorithm. How could different processors search different parts of the text safely? What limits parallel scalability?
Unicode Optimization
Suppose the text contains characters from a Unicode alphabet with up to 100,000 possible characters, but the pattern only uses about 40 unique characters. Discuss how to represent the shift table efficiently (e.g., fixed-size array vs. dictionary/map), and analyze the trade-offs in terms of time and space for both preprocessing and lookup during search.
Pattern with Repeats
How do repeated characters in the pattern affect the shift table? Give an example and explain how the shift values are computed when a character appears multiple times.
Simple Improvement Heuristic
Suggest and justify a simple heuristic improvement to Horspool's algorithm. For example, could the shift table be updated dynamically as the search proceeds?
Multiple Pattern Matching
Horspool's algorithm is designed for a single pattern. How might you adapt it to search for multiple patterns simultaneously within the same text (e.g., for virus scanning or spam detection)? Could you reuse portions of shift tables, merge preprocessing data, or restructure the patterns to share common information? What are the limitations of running the algorithm separately on each pattern?