@shmVirus

String Matching

Finding patterns in text efficiently: brute-force search, Rabin-Karp's rolling hash, and the Knuth-Morris-Pratt algorithm's failure function.

Every time you press Ctrl+F in a browser, type a query into a search bar, run grep over a log file, or watch a plagiarism detector compare an essay against a vast corpus of existing text, you are setting off some variant of the same fundamental question: given a long text and a short pattern, where — if anywhere — does the pattern occur within the text? This sounds almost too simple to deserve a chapter of its own — and the brute-force solution genuinely is almost embarrassingly simple to write. But that simple solution hides a worst-case performance trap that becomes a real liability the moment “long text” means a multi-gigabyte genome or a planet’s worth of web pages, and escaping that trap leads to two of the most elegant algorithmic ideas in this entire course — one borrowed directly from a data structure you already know intimately, and one that constructs, in a single linear pass, a table so cleverly self-referential that it lets the algorithm never look backward in the text, not even once.

The Brute-Force Approach: Slide and Compare

The most direct idea imaginable: try aligning the pattern against the text starting at position 0, compare character by character; if it doesn’t match all the way through, slide the pattern one position to the right, and try again.

#include <stdio.h>
#include <string.h>

void naive_search(const char *text, const char *pattern) {
    int n = (int) strlen(text);
    int m = (int) strlen(pattern);

    for (int i = 0; i + m <= n; i++) {
        int j = 0;
        while (j < m && text[i + j] == pattern[j]) {
            j++;
        }
        if (j == m) {
            printf("Match found at index %d\n", i);
        }
    }
}

In the best case — the first character of the pattern rarely matches the text at all — this runs in close to Θ(n), barely glancing at the pattern before moving on. But construct an adversarial input, and the picture changes dramatically: search for the pattern "aaaaaaaaab" inside the text "aaaaaaaaaaaaaaaaaaaaaaaaaaa...a". At every single alignment, the comparison gets all the way to the last character before failing — m comparisons per alignment, n - m + 1 alignments, giving Θ(n · m) in the worst case. For a long text and a moderately long pattern, that’s a genuinely painful bound — and the uncomfortable detail is that this isn’t some bizarre edge case dreamed up to make a textbook point: repetitive structure of exactly this kind is common in real text — DNA sequences are built from an alphabet of only four symbols, and natural-language text is dense with repeated short substrings (“the,” “and,” common letter pairs).

Pause and think: Notice the shape of this worst case — it’s not “the pattern is never found,” it’s “the pattern is almost found, over and over, in a way that wastes nearly the maximum possible amount of comparison work before finally failing.” This should feel like a familiar villain by now: it’s the same shape of trap you met in quicksort’s worst-case pivot selection (data that looks almost-sorted, repeatedly defeating the partitioning step) and in binary search trees that degrade toward a linked list (data that arrives in an order that repeatedly defeats the balancing assumption). “The input that looks almost like the easy case, but is engineered to maximize wasted comparisons” is one of the most recurring villains in this entire course — and the algorithms in the rest of this chapter exist specifically to take that villain off the table, by guaranteeing that no comparison, once made, is ever wasted.

Rabin-Karp: Compare Hashes, Not Characters

Here’s an idea that should feel like an old friend: instead of comparing the pattern against every alignment character-by-character, compute a single number — a hash — that summarizes each candidate substring, and compare those numbers first. Two substrings with different hashes are certainly different; only when the hashes match do you bother checking the characters directly (to rule out the rare case of a hash collision — two different strings that happen to hash to the same value, exactly the phenomenon the Hash Tables chapter asked you to design around). This is the Rabin-Karp algorithm, and its entire performance case rests on one beautifully engineered trick: computing each new candidate’s hash cheaply, by reusing the previous one.

Treat each m-character substring as a number written in some base B (commonly 256, since that’s the number of possible byte values): the substring "abc" becomes 'a' · B² + 'b' · B¹ + 'c' · B⁰. Computed naively, this costs Θ(m) work for every alignment — no better than brute force. But here’s the trick: when you slide the window one position to the right, you can compute the new hash from the old one in constant time, by removing the contribution of the character that just fell out of the window and adding the contribution of the one that just entered — a rolling hash:

#include <stdio.h>
#include <string.h>

#define BASE 256
#define MODULUS 1000000007L

void rabin_karp_search(const char *text, const char *pattern) {
    int n = (int) strlen(text);
    int m = (int) strlen(pattern);
    if (m > n) return;

    long high_order = 1;                          /* B^(m-1) mod MODULUS: the weight of the leading digit */
    for (int i = 0; i < m - 1; i++) {
        high_order = (high_order * BASE) % MODULUS;
    }

    long pattern_hash = 0, window_hash = 0;
    for (int i = 0; i < m; i++) {
        pattern_hash = (pattern_hash * BASE + pattern[i]) % MODULUS;
        window_hash  = (window_hash  * BASE + text[i])    % MODULUS;
    }

    for (int i = 0; i + m <= n; i++) {
        if (pattern_hash == window_hash && strncmp(text + i, pattern, m) == 0) {
            printf("Match found at index %d\n", i);             /* hashes agree AND characters verified */
        }
        if (i + m < n) {
            /* roll the window: drop text[i]'s contribution, shift left, add text[i+m] */
            window_hash = (window_hash - text[i] * high_order % MODULUS + MODULUS) % MODULUS;
            window_hash = (window_hash * BASE + text[i + m]) % MODULUS;
        }
    }
}

Each roll costs Θ(1), so the entire scan costs Θ(n) for hashing, plus the cost of full character-by-character verification whenever a hash collision occurs. With a well-chosen modulus, collisions are rare enough that the expected total running time is Θ(n + m) — though, crucially, the worst case remains Θ(n · m), if an adversary can engineer enough collisions (this is precisely why the verification step with strncmp is not optional window-dressing — skip it, and an unlucky or adversarial input can silently hand you false positives, reporting matches that don’t actually exist).

Try it yourself: That (window_hash - text[i] * high_order % MODULUS + MODULUS) % MODULUS line is doing something subtle with modular arithmetic — and it’s exactly the kind of line a careless port to a different language quietly breaks. Work out, on paper, why the + MODULUS is necessary before the final % MODULUS (hint: in C, the result of % on a negative left operand is implementation-permitted to be negative — what would happen to window_hash a few iterations later if this one value were allowed to go negative and propagate?). This is precisely the kind of “the math is right, but the arithmetic, on a real machine, has opinions of its own” trap the Complexity Analysis chapter’s discussion of integer overflow first warned you to watch for — and it will keep reappearing, in new disguises, for as long as you write code that does arithmetic on fixed-width integers.

Pause and think: Rabin-Karp’s real strength shows up in a variant of the problem brute force handles particularly badly: searching for many patterns at once. If you need to check whether a text contains any of a thousand different patterns of the same length, brute force essentially multiplies its cost by a thousand — but Rabin-Karp can hash all thousand patterns once, store the hashes in (yes — exactly what you’re thinking) a hash table, and then slide a single rolling window across the text, checking each window’s hash against the table in O(1) expected time. One pass over the text, regardless of how many patterns you’re hunting for. Sketch out, in a few sentences, how you’d structure this multi-pattern search — and notice that you’ve just independently rediscovered the core idea behind real plagiarism-detection and intrusion-detection systems, which routinely need to check incoming text against catalogues of tens of thousands of known patterns simultaneously.

Knuth-Morris-Pratt: Never Look Backward in the Text

Rabin-Karp’s expected-case speed comes from avoiding character comparisons most of the time. The Knuth-Morris-Pratt algorithm (KMP) takes a completely different and, in a sense, more ambitious route: it guarantees Θ(n + m) in the worst case, with no randomness and no collisions to worry about — by being smart enough to never re-examine a character of the text that it has already looked at, not even once, no matter how the matching attempts unfold.

Here’s the wasted-effort that brute force commits, stated precisely: suppose you’re matching the pattern "ABABC" against the text, and you’ve successfully matched "ABAB" before failing on the fifth character. Brute force’s response is to slide the pattern one position right and start comparing from scratch — but think about what that throws away. You already know, with complete certainty, what the next several characters of the text are — they’re "ABAB", because that’s exactly what just matched! Re-reading them is pure waste. The only genuinely open question is: given that the text continues with "ABAB...", how far can I slide the pattern so that its beginning lines up with some suffix of what I’ve already confirmed — without having to re-check anything?

That question has a remarkable property: its answer depends only on the pattern itself, never on the text. Which means it can be computed once, in advance, for the whole pattern — precisely the “build a table once, so that every later query becomes a fast lookup” idea you spent the entire Dynamic Programming chapter learning to recognize and exploit. The table KMP builds is called the failure function (or prefix function): failure[i] records the length of the longest proper prefix of the pattern that is also a suffix of pattern[0..i].

#include <stdlib.h>

void build_failure_table(const char *pattern, int m, int failure[]) {
    failure[0] = 0;
    int matched = 0;                     /* length of the current matching prefix */

    for (int i = 1; i < m; i++) {
        while (matched > 0 && pattern[i] != pattern[matched]) {
            matched = failure[matched - 1];     /* fall back to the next-best candidate prefix length */
        }
        if (pattern[i] == pattern[matched]) {
            matched++;
        }
        failure[i] = matched;
    }
}

Pause and think: Look very closely at that while loop — matched = failure[matched - 1]. The failure table is being used to help build itself. This is not a typo or a cleverness-for-its-own-sake flourish; it is the only way to compute this table in linear time, and it has exactly the flavour of a recurrence relying on previously solved sub-problems — the same flavour as dp[i][j] depending on dp[i-1][j-1] in the longest-common-subsequence table from the Dynamic Programming chapter, or fib_memo(n) depending on fib_memo(n-1). Convince yourself this loop genuinely runs in Θ(m) total — not Θ(m²) — by tracking a single quantity across the entire function: the value of matched. It only ever increments by one (in the if), or jumps backward via the lookup (in the while). Add up, across the whole function’s execution, the total number of increments versus the total number of “jumps backward,” and you’ll find a clean accounting argument lurking — the same style of reasoning the Complexity Analysis chapter’s amortized-analysis section used to show that vector_push costs O(1) on average, even though any individual call might trigger an expensive resize.

With the failure table built, the search itself becomes a single pass over the text that never backs up:

#include <stdio.h>
#include <string.h>

void kmp_search(const char *text, const char *pattern) {
    int n = (int) strlen(text);
    int m = (int) strlen(pattern);
    if (m == 0 || m > n) return;

    int *failure = malloc(m * sizeof(int));
    build_failure_table(pattern, m, failure);

    int matched = 0;                      /* number of pattern characters matched so far */
    for (int i = 0; i < n; i++) {
        while (matched > 0 && text[i] != pattern[matched]) {
            matched = failure[matched - 1];      /* don't restart — slide using what we already know */
        }
        if (text[i] == pattern[matched]) {
            matched++;
        }
        if (matched == m) {
            printf("Match found at index %d\n", i - m + 1);
            matched = failure[matched - 1];      /* continue searching for overlapping matches */
        }
    }

    free(failure);
}

The index i into the text only ever moves forward — the entire loop runs exactly n times, with the while loop’s “fallbacks” paid for by an amortized-analysis argument that mirrors the one you were just asked to construct for build_failure_table. The total cost is Θ(n + m) — in the worst case, with an ironclad guarantee, no hash collisions, no probabilistic hand-waving. This is what it looks like to fully close off an entire category of bad input, rather than merely making it unlikely: brute force’s adversarial nightmare — "aaaa...ab" against "aaaa...a" — is not just survivable for KMP; it’s not even interesting. The failure table for a pattern like "aaaab" immediately encodes “if a long run of as is followed by a mismatch, slide back by exactly one and keep almost all of your progress” — turning the brute-force algorithm’s worst nightmare into one of KMP’s most unremarkable, routine cases.

Try it yourself: Build the failure table, entirely by hand, for the pattern "ABABCABAB". (Watch closely for the moment the table’s self-referential lookup — failure[matched - 1] — actually gets exercised; for a pattern this short, it happens at most once or twice, and catching it in the act is the single best way to make the algorithm’s “memory” feel concrete rather than abstract.) Then trace kmp_search running this pattern against the text "ABABCABABABABCABAB" by hand, recording the value of matched and the value of i at every step. Count how many times i advances versus how many times matched falls back via the failure table. Does the text index ever move backward — not “rarely,” but literally ever, even once? That single observation — verified with your own hand-traced numbers, not taken on faith from the paragraph above — is the entire payload of this algorithm, made undeniable.

String Matching with Finite Automata

Both brute-force and Rabin-Karp still revisit decisions — at every text position, they restart or re-hash. KMP never backs up in the text, but it does carry forward one piece of state: how many characters of the pattern have been matched so far. Finite automata matching makes this idea explicit and completely general: it precomputes a deterministic finite automaton (DFA) whose state at each moment encodes exactly “how far through the pattern does the longest matching suffix of the text-so-far extend?” Once the DFA is built, scanning the text is a single loop with one table lookup per character — O(n) time with no while-loop fallbacks, no branching at all.

Building the transition table: the DFA has m + 1 states (0 through m), one per possible match length. State m means the full pattern has been found. For each state q and each character c, delta[q][c] = the length of the longest prefix of the pattern that is also a suffix of pattern[0..q-1] + c:

#include <string.h>

#define ALPHA 256                  /* ASCII alphabet size */

/* Precompute delta[q][c] = next state after reading character c from state q.
   For a pattern of length m, delta has (m+1) × ALPHA entries. */
void compute_transition(const char *pattern, int m, int delta[][ALPHA]) {
    for (int q = 0; q <= m; q++) {
        for (int c = 0; c < ALPHA; c++) {
            /* find the longest prefix of pattern that is a suffix of pattern[0..q-1]+c */
            int k = (q < m) ? q + 1 : q;          /* candidate length: try longest first */
            /* build the string pattern[0..q-1] + c and check if its last k chars
               match pattern[0..k-1].  Shrink k until they do. */
            while (k > 0) {
                /* check: does pattern[0..k-1] == (pattern[0..q-1]+c)[end-k+1..end]? */
                int match = (pattern[k - 1] == (char) c);
                if (match) {
                    for (int i = 0; i < k - 1 && match; i++)
                        if (pattern[i] != pattern[q - k + 1 + i]) match = 0;
                }
                if (match) break;
                k--;
            }
            delta[q][c] = k;
        }
    }
}

void automaton_search(const char *text, const char *pattern) {
    int m = (int) strlen(pattern);
    int n = (int) strlen(text);

    /* stack-allocate for small patterns; use malloc for large ones in production */
    int delta[m + 1][ALPHA];
    compute_transition(pattern, m, delta);

    int q = 0;                                     /* start in state 0 (no prefix matched) */
    for (int i = 0; i < n; i++) {
        q = delta[q][(unsigned char) text[i]];     /* one table lookup per character */
        if (q == m) {
            printf("Match found at index %d\n", i - m + 1);
        }
    }
}

compute_transition is O(m² · |Σ|) — quadratic in the pattern length and linear in the alphabet size. For small alphabets and patterns, or when the pattern will be searched against many texts, this precomputation cost is fully amortized: each subsequent text scan is a pure O(n) loop with no conditional fallback. The connection to KMP is direct: KMP’s failure function encodes exactly the same information as the automaton’s transitions on the pattern’s own characters — the automaton generalises this to the full alphabet, paying more in preprocessing to gain a completely branch-free scan.

Pause and think: The compute_transition function as written has a naïve inner loop that re-checks all characters each time — O(m) per (q, c) pair, giving the O(m² · |Σ|) total. The KMP failure function from the previous section lets you build the same table in O(m · |Σ|): for each character c, delta[q][c] = q + 1 if pattern[q] == c, otherwise delta[failure[q-1]][c] (follow the failure link and ask again). This observation — that KMP’s failure function and the automaton’s transition table are two representations of the same underlying structure — is the cleanest way to see that KMP is an automaton matcher, just one that implicitly stores only the on-pattern transitions and computes the off-pattern ones lazily at runtime. Understanding that equivalence is worth more than either algorithm in isolation.

Choosing a Strategy

If your situation is……reach forBecause
The pattern or text is short, or this runs rarelyBrute-force searchSimplicity wins when performance isn’t actually at stake — a lesson worth not forgetting amid all this cleverness
You need to search for many patterns over the same text, or vice versaRabin-KarpHash lookups let you check membership in a whole set of patterns in roughly the same time as checking one
You need a guaranteed worst-case bound — adversarial or highly repetitive input is plausibleKMPΘ(n + m) with no dependence on luck, randomness, or input structure
Same pattern, many different texts — and you want the fastest possible per-text scanFinite automaton`O(m² ·
You’re matching against a fixed, reusable collection of patterns (a dictionary, a set of virus signatures)A trie or Aho-Corasick automatonAmortizes the cost of preprocessing the pattern collection across unboundedly many searches — the natural extension of “build the table once” to “build it once for a whole library of patterns”

Try it yourself, as a closing reflection: Look back over the three searching algorithms in this chapter, and notice that each one embodies a different philosophy for avoiding wasted work — brute force does no preprocessing and pays for it in the worst case; Rabin-Karp preprocesses into a summary (a hash) and accepts a small risk of needing to double-check; KMP preprocesses into a complete map of its own structure (the failure table) and thereby earns the right to never look back at all. Now connect this to the much broader pattern this entire course has been quietly built around: complexity analysis taught you to measure the cost of not preprocessing (the naive O(n) search, repeated); amortized analysis showed you how to account fairly for occasional expensive preprocessing (the resizing array); dynamic programming showed you how to build a table once and query it forever after cheaply; and greedy algorithms showed you when you’re allowed to trust a single pass without ever needing to revisit a decision. String matching turns out to be a meeting point for nearly every major idea this course has built up to now — which is exactly why it sits this late in the syllabus, and exactly why noticing those connections, rather than treating this as “yet another isolated algorithm to memorize,” is the most valuable thing you can take from this chapter.

Bringing It Together: Practice as a Problem-Solver

  1. Engineer the worst case, then measure your way out of it. Implement all three search algorithms (naive_search, rabin_karp_search, kmp_search) so that each one counts the number of character comparisons it performs (not just whether it finds a match). Construct three test inputs: (a) a text and pattern with no repeated structure at all (e.g., random lowercase letters), (b) brute force’s adversarial nightmare (a long run of one character, with a pattern that almost — but doesn’t quite — match it), and (c) a text built by repeating a short pattern many times over, with a search for that exact pattern. Run all three algorithms on all three inputs and tabulate the comparison counts. Which algorithm’s count stays nearly constant across all three inputs — and is that the one you predicted, going in?

  2. Trace the table, not just the algorithm. Build the KMP failure table by hand for the pattern "AABAACAABAA". Before running any code, predict — in writing — at which index (or indices) of the table you expect the self-referential fallback (failure[matched - 1]) to be exercised during table construction, and explain your prediction in terms of the pattern’s repeated substructure. Then build the table by hand, step by step, and check your prediction against what actually happens. If your prediction was wrong, identify exactly which piece of the pattern’s structure you misjudged.

  3. Generalize to a harder shape. The Rabin-Karp idea generalizes surprisingly well to two-dimensional pattern matching — finding a small grid of characters within a larger grid (a problem that shows up in image processing and in board-game engines searching for known patterns). Sketch, in a few sentences or in pseudocode, how you would extend a rolling hash to work row-by-row and then column-by-column over a 2D grid. (You don’t need to write the full implementation — the goal is to identify precisely which property of the original 1D rolling hash — the one that makes “slide the window by one and update in O(1)” possible — needs to be preserved for the 2D extension to retain its speed, and to convince yourself that it can be preserved.)

  4. Reflection. The single sentence that most compactly captures this entire chapter is probably this one: “brute force repeats work because it forgets what it just learned; Rabin-Karp and KMP both succeed by remembering it — they simply choose to remember different things, in different ways, bought at different costs.” Write two or three sentences, in your own words, naming precisely what each of the three algorithms in this chapter chooses to “remember” between comparisons (or chooses not to, in brute force’s case), and what it costs to maintain that memory. If you can state that crisply for all three, you’ve distilled the chapter to its essence — and you’ve also rehearsed, one more time, the single question this entire course keeps circling back to from every possible angle: where, exactly, is the wasted work hiding, and what would it cost me to stop repeating it?