Search Algorithms

Linear and binary search, a rigorous proof of the O(log n) bound, search variants, and interpolation search as a case study in exploiting data distribution.

Search is, by a wide margin, the most frequently executed operation in real software — far more often than sorting, which is usually a means to an end rather than the end itself. Every database lookup, every autocomplete suggestion, every “find in page,” every spell-checker, every routing table — all of them are, underneath, a search problem wearing a costume. And search has a property that makes it a uniquely satisfying topic to study closely: the gap between the naive solution and the clever one is enormous, the clever solution is almost embarrassingly short, and yet proving that the clever solution is correct — not “seems to work on my examples,” but provably, always correct — turns out to be a genuinely subtle exercise that rewards careful thought. This chapter is as much about that proof technique as it is about the algorithms themselves, because the technique — reasoning rigorously about what a loop guarantees at every step — is one of the most transferable skills in this entire course.

Linear Search: The Baseline

If you know nothing about how your data is arranged, you have no choice but to look at it, one element at a time, until you find what you want or run out of places to look:

int linear_search(const int arr[], int n, int target) {
    for (int i = 0; i < n; i++) {
        if (arr[i] == target) {
            return i;
        }
    }
    return -1;
}

This is Θ(n) in the worst case (the target is last, or absent) and Θ(1) in the best case — and, as discussed in the Complexity Analysis chapter, Θ(n) on average too, assuming the target’s position is uniformly distributed. There is nothing wrong with linear search; it is the correct, optimal choice whenever your data is unsorted, unindexed, and you have no structural information to exploit. The entire rest of this chapter is about what becomes possible the moment that assumption — “I know nothing about the arrangement” — stops being true.

Binary Search: Halving the Problem

If the array is sorted, you don’t need to look at every element — you can rule out half of the remaining candidates with a single comparison, and repeat:

int binary_search(const int arr[], int n, int target) {
    int lo = 0;
    int hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;       /* avoids signed-integer overflow vs. (lo + hi) / 2 */
        if (arr[mid] == target) {
            return mid;
        } else if (arr[mid] < target) {
            lo = mid + 1;
        } else {
            hi = mid - 1;
        }
    }
    return -1;                               /* lo > hi: search range is empty, target absent */
}

Twelve lines of code, and yet there is a real, satisfying proof obligation hiding inside them — one that’s worth doing properly, because the technique generalizes to nearly every loop you’ll ever need to reason about with confidence.

Proving It Correct: The Loop Invariant

A loop invariant is a statement that is true before the loop begins, remains true after every iteration, and — combined with the condition that ends the loop — implies that the algorithm is correct when it terminates. Finding the right invariant is often the crux of understanding why an algorithm works, rather than merely observing that it seems to.

For binary search, the invariant is: if target appears anywhere in arr, it appears at some index within [lo, hi].

Initialization: before the first iteration, lo = 0 and hi = n - 1 — the entire array. The invariant trivially holds: if the target is anywhere in the array, it’s certainly somewhere in “the entire array.”
Maintenance: assume the invariant holds at the start of an iteration — the target, if present, is in [lo, hi]. We compute mid and compare. If arr[mid] == target, we’re done — found it, and it was indeed in range. If arr[mid] < target, then because the array is sorted, every element at an index ≤ mid is also < target — the target cannot be among them, so if it exists at all, it must be in [mid + 1, hi], and that’s exactly the new range we set lo to start from. The symmetric argument handles arr[mid] > target. Either way, the invariant is preserved for the next iteration.
Termination: the loop ends when lo > hi — the range is empty — or when we return having found the target. If the loop ends with lo > hi, the invariant tells us the target, if present, would have to be in an empty range — which is impossible. So the target is not in the array, and returning -1 is correct.

Pause and think: Notice the shape of this argument: it never once appeals to “and then we keep going and eventually we narrow it down enough.” It establishes a fact that’s true at the start, shows that fact survives every single step unchanged, and then reads off the conclusion from the one fact we know holds when the loop stops. This “it was true at the start, it stays true, therefore it’s true at the end — and the end tells us what we needed to know” pattern is mathematical induction, wearing a loop’s clothing. Once you can see a loop this way — not as “a sequence of steps that happens to work” but as “a fact, preserved” — you gain the ability to look at unfamiliar loops and ask, productively, “what invariant would make this loop obviously correct?” That question, asked early and often, will save you from an enormous number of off-by-one bugs over your career — including, very likely, ones lurking in code you’ve already written.

Proving the `O(log n)` Bound

Correctness tells you the algorithm gets the right answer; it says nothing about how fast. For that, watch what happens to the size of the search range, hi - lo + 1, on each iteration. Whichever branch executes, the new range is at most half the old one (rounding down): if mid is the midpoint, then both [lo, mid - 1] and [mid + 1, hi] contain at most ⌊(hi - lo + 1) / 2⌋ elements.

So if the range starts at size n, after one iteration it’s at most n/2; after two, at most n/4; after k iterations, at most n / 2^k. The loop must terminate by the time the range shrinks to size 0 (or 1, where it can be resolved in one more comparison) — so we need n / 2^k < 1, i.e. 2^k > n, i.e. k > log₂ n. The loop runs at most ⌈log₂ n⌉ + 1 times. That is not a hand-wavy “it feels logarithmic” — it is a derivation, with no step that requires you to trust anything you haven’t just watched yourself prove.

Try it yourself: Use this bound to compute, by hand, the maximum number of comparisons binary search needs to find an element (or determine it’s absent) in an array of one million elements. Compare that number to one million. Now do the same for an array of the estimated number of atoms in the observable universe (~10⁸⁰). The fact that these two answers are barely different is, more than any abstract description, what “logarithmic growth” really means in your bones — and it’s worth letting that sink in fully before moving on.

Variants: The Same Skeleton, Different Questions

The real test of whether you understand binary search — as opposed to merely being able to recite it — is whether you can adapt its skeleton to answer related questions. Three of the most common, and most instructive:

Finding the first occurrence of a value that appears multiple times. A plain binary search returns some matching index — not necessarily the first. To find the first, don’t stop the moment you find a match: record it, and keep searching to the left, narrowing hi as if the match were too large:

int first_occurrence(const int arr[], int n, int target) {
    int lo = 0, hi = n - 1, result = -1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (arr[mid] == target) {
            result = mid;          /* record this match... */
            hi = mid - 1;          /* ...but keep looking for an earlier one */
        } else if (arr[mid] < target) {
            lo = mid + 1;
        } else {
            hi = mid - 1;
        }
    }
    return result;
}

This still runs in O(log n) — the loop invariant and the halving argument both still apply essentially unchanged; only the “what do I do when I find a match” step is different. (Try writing last_occurrence yourself — it’s a one-line change, and finding that line is an excellent test of whether the shape of the algorithm has truly clicked for you, rather than just the specific code you’ve memorized.)

Finding the insertion point (“where would this value go, to keep the array sorted?”) — the building block behind functions like C++‘s std::lower_bound or Python’s bisect.insort. The skeleton is nearly identical; only the return value and the loop’s stopping condition shift slightly to track “the boundary between elements smaller than the target and elements not smaller.”

Searching a rotated sorted array — an array that was sorted and then “rotated” by some unknown amount, e.g. [4, 5, 6, 7, 0, 1, 2] (originally [0,1,2,4,5,6,7], rotated left by 4). This one is a genuine design challenge, not a small tweak:

Try it yourself — this is the chapter’s central design challenge, so resist the urge to look up the answer before wrestling with it. A rotated sorted array is not sorted as a whole, so you cannot directly apply “is the target less than arr[mid]?” — that comparison no longer reliably tells you which half to discard. But here is the key structural fact to lean on: at least one of the two halves split by mid, is still fully sorted (rotation can break the array into at most two sorted runs, and mid can only sit inside one of them). So: first determine which half is the sorted one (one comparison: arr[lo] <= arr[mid]?). Then ask whether the target could possibly lie within that sorted half’s known range — if so, recurse there; if not, it must be in the other half, rotated or not. Try to write this out, in pseudocode or in C, before reading any further sources on it. If you get stuck, ask yourself: “given that I know one side is a normal sorted array and I know its endpoints, what’s the one question whose answer tells me whether my target could be hiding in it?” That single question is the entire trick — and once you’ve found it, the rest of the algorithm falls out almost automatically, in the familiar O(log n) shape.

Interpolation Search: Searching Like a Human Looks Up a Word

Here’s an observation worth sitting with: when you look up “octopus” in a physical dictionary, you don’t open it exactly in the middle and bisect from there — you open it roughly three-quarters of the way through, because you know “o” is a letter near (but not at) the end of the alphabet, and you implicitly trust that words are spread out roughly evenly across the pages. You are exploiting your knowledge of the data’s distribution to make a far better first guess than “the middle.” Interpolation search formalizes exactly this intuition for sorted numeric arrays whose values are roughly uniformly distributed:

int interpolation_search(const int arr[], int n, int target) {
    int lo = 0, hi = n - 1;
    while (lo <= hi && target >= arr[lo] && target <= arr[hi]) {
        if (lo == hi) {
            return (arr[lo] == target) ? lo : -1;
        }
        /* estimate target's position assuming values are spread roughly evenly across [lo, hi] */
        int pos = lo + (int)((long long)(target - arr[lo]) * (hi - lo) / (arr[hi] - arr[lo]));

        if (arr[pos] == target) {
            return pos;
        } else if (arr[pos] < target) {
            lo = pos + 1;
        } else {
            hi = pos - 1;
        }
    }
    return -1;
}

The single line that matters is the position estimate — a linear interpolation between the known endpoints (lo, arr[lo]) and (hi, arr[hi]), computing “if the values are spread evenly, roughly what fraction of the way from lo to hi should the target be?” When that assumption holds — say, searching a sorted array of roughly-uniformly-spaced timestamps or measurement values — interpolation search achieves an average-case running time of O(log log n), a doubly-logarithmic bound that is dramatically faster than O(log n) for large n (for n = 4 billion, log₂ n ≈ 32 while log₂ log₂ n ≈ 5).

Pause and think: Look at that (long long) cast and the division in the position estimate, and ask yourself two questions a careful engineer always asks of arithmetic like this: can this overflow?, and can this divide by zero? (Hint: what must be true of arr[hi] and arr[lo] for the surrounding while condition to even let you reach this line — and could arr[hi] == arr[lo] ever still occur with lo != hi, given that the array is merely sorted, not required to contain distinct values?) Tracking down exactly when an apparently-safe expression becomes unsafe — and then deciding whether to guard against it, restrict your assumptions, or restructure the algorithm — is not a distraction from “real” algorithm design. It is what algorithm design looks like once an algorithm leaves the page and has to survive contact with real machines and real data.

But notice the cost of trusting that assumption: feed interpolation search data that isn’t roughly uniform — say, an array where most values cluster tightly near one end with a few outliers far away — and the position estimates become wildly inaccurate, degrading toward O(n) in the worst case (consider an adversarial array like [1, 2, 3, ..., 99, 1000000]: searching for 50 will repeatedly estimate a position near the very end). You have, again, traded a weaker assumption for a stronger guarantee — “comparable” traded for “comparable and roughly uniformly distributed” — and the algorithm’s performance is only as trustworthy as that trade is justified. This is precisely the same shape of trade-off you saw with counting sort and radix sort in the previous chapter: a faster algorithm, purchased by assuming more about your data, which becomes a liability rather than an asset the moment that assumption is wrong. Recognizing this recurring pattern — “this algorithm is faster because it assumes X; is X actually true of my data, and what happens to me if I’m wrong?” — is one of the most valuable instincts a problem-solver can develop, and it will keep reappearing for the rest of this course.

Choosing a Search Strategy

If your situation is…	…reach for	Because
Data is unsorted, or you’ll search only once	Linear search	Sorting first costs `O(n log n)` — not worth it for a single lookup
Data is sorted and stays sorted across many searches	Binary search	`O(log n)` per search, after a one-time `O(n log n)` sort
You need the first/last/insertion-point among duplicates	A binary-search variant	Same `O(log n)` shape, adapted termination/recording logic
Data is sorted and roughly uniformly distributed numerically	Interpolation search	`O(log log n)` average — but verify the distribution assumption first
You’ll repeatedly look up by exact key, order doesn’t matter	A hash table	`O(1)` average lookup — covered in depth in the Data Structures course; the right choice whenever you don’t need the data sorted, only findable

Try it yourself, as a closing reflection: the last row of that table is a deliberate plant. Hash tables achieve O(1) average lookup — seemingly “better” than binary search’s O(log n) — and yet binary search remains indispensable. Why? What can a sorted, ordered structure answer in O(log n) that a hash table fundamentally cannot answer quickly, no matter how cleverly it’s built — questions like “what’s the smallest element greater than X?”, “how many elements fall between A and B?”, or “what are the 10 elements nearest to Y?” Spend a few minutes listing the kinds of questions where order is the valuable property, not just presence. That list is the precise boundary of when “asymptotically faster” stops being the same question as “actually the right tool” — a distinction that separates someone who has memorized complexity classes from someone who can choose wisely between them.