Hash Tables

Hash functions, collision resolution by chaining and open addressing, load factor, rehashing, and the engineering decisions behind O(1) average-case lookup.

Every other data structure in this course offers a trade-off: arrays give O(1) access by position but O(n) search by value; binary search trees give O(log n) lookup but require a total order on keys; linked lists give O(1) insertion at a known point but O(n) search. The hash table breaks this pattern in a way that feels almost like cheating: O(1) average-case insertion, deletion, and lookup by value, with no ordering requirement on the keys whatsoever. Understanding why it works — and what “average” really means here — is the point of this chapter.

The Core Idea

The fundamental insight is to turn a search problem into an access problem. If you know the index of what you’re looking for, you can retrieve it in O(1). The question is: given a key — a string, an integer, a student ID — how do you quickly compute the array index where its associated value should live?

A hash function h(key) maps keys to array indices:

index = h(key) mod table_size

If the hash function is fast to compute and spreads keys roughly uniformly across the table, then for most keys, h(key) points directly to the right slot. Lookup becomes: compute the hash, go to that index, done. O(1).

The complication is collisions — two different keys that hash to the same index. Every hash table design is fundamentally a collision resolution strategy.

Hash Functions

A hash function must be:

Deterministic — the same key always produces the same hash
Fast — it runs on every operation, so it must not itself become a bottleneck
Uniform — keys should spread evenly across indices, minimising collisions

For integer keys, a common approach is modular arithmetic: h(k) = k mod m where m is a prime number (primes avoid the systematic cancellations that composite moduli produce). For strings, accumulate a rolling polynomial:

#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 101   /* prime */

/* polynomial rolling hash: h = (h * BASE + c) for each character */
int hash_string(const char *key) {
    const int BASE = 31;
    unsigned long long h = 0;
    while (*key) {
        h = h * BASE + (unsigned char)(*key);
        key++;
    }
    return (int)(h % TABLE_SIZE);
}

int hash_int(int key) {
    /* ensure non-negative before mod */
    return ((key % TABLE_SIZE) + TABLE_SIZE) % TABLE_SIZE;
}

The BASE = 31 choice (like Java’s String.hashCode) is a small prime with good bit-mixing properties. The key engineering rule: always use a prime table size — composite sizes introduce systematic clustering when key mod m happens to share factors with m.

Pause and think: Consider what happens if you use TABLE_SIZE = 100 and most of your integer keys are multiples of 4. Then key mod 100 can only land on multiples of 4: indices 0, 4, 8, …, 96. You’re using only 25 of 100 slots. Your “100-slot table” effectively has 25 usable buckets, and performance degrades accordingly. A prime like 101 has no such common factor — key mod 101 distributes multiples of any small integer uniformly across all 101 slots.

Chaining

Separate chaining resolves collisions by making each table slot a linked list. All keys that hash to the same index are stored in that slot’s list. Insertion is always O(1) — prepend to the list. Lookup and deletion require scanning the list at the computed index.

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#define CHAIN_TABLE_SIZE 101

typedef struct ChainNode {
    char            *key;
    int              value;
    struct ChainNode *next;
} ChainNode;

typedef struct {
    ChainNode *buckets[CHAIN_TABLE_SIZE];
} ChainHashTable;

void chain_init(ChainHashTable *ht) {
    for (int i = 0; i < CHAIN_TABLE_SIZE; i++) ht->buckets[i] = NULL;
}

static int chain_hash(const char *key) {
    const int BASE = 31;
    unsigned long long h = 0;
    while (*key) { h = h * BASE + (unsigned char)(*key++); }
    return (int)(h % CHAIN_TABLE_SIZE);
}

void chain_insert(ChainHashTable *ht, const char *key, int value) {
    int idx = chain_hash(key);
    /* check if key already exists — update in place */
    for (ChainNode *node = ht->buckets[idx]; node; node = node->next) {
        if (strcmp(node->key, key) == 0) { node->value = value; return; }
    }
    /* prepend new node */
    ChainNode *node = malloc(sizeof(ChainNode));
    node->key   = strdup(key);
    node->value = value;
    node->next  = ht->buckets[idx];
    ht->buckets[idx] = node;
}

int chain_lookup(const ChainHashTable *ht, const char *key, int *found) {
    int idx = chain_hash(key);
    for (ChainNode *node = ht->buckets[idx]; node; node = node->next) {
        if (strcmp(node->key, key) == 0) { *found = 1; return node->value; }
    }
    *found = 0;
    return 0;
}

void chain_delete(ChainHashTable *ht, const char *key) {
    int idx = chain_hash(key);
    ChainNode **pp = &ht->buckets[idx];
    while (*pp) {
        ChainNode *node = *pp;
        if (strcmp(node->key, key) == 0) {
            *pp = node->next;
            free(node->key);
            free(node);
            return;
        }
        pp = &node->next;
    }
}

void chain_free(ChainHashTable *ht) {
    for (int i = 0; i < CHAIN_TABLE_SIZE; i++) {
        ChainNode *node = ht->buckets[i];
        while (node) {
            ChainNode *next = node->next;
            free(node->key);
            free(node);
            node = next;
        }
        ht->buckets[i] = NULL;
    }
}

The double-pointer trick in chain_delete — ChainNode **pp pointing at the pointer that points to the current node — lets you splice out any node, including the head, without a special case. *pp = node->next rewires whatever pointer (the bucket pointer or a ->next) was pointing at node, removing it cleanly.

Try it yourself: Trace a deletion of the head node: pp = &ht->buckets[idx] (pointing at the bucket pointer itself). *pp is the head node. When you execute *pp = node->next, you set ht->buckets[idx] = node->next — the bucket now points past the deleted node. No special-case needed. Now trace the same code for deleting a middle node and confirm the same logic applies.

Load Factor and Performance

The load factor α = n / m (number of stored keys / table size) is the key metric. With chaining, the average chain length is α — which is exactly the average number of nodes scanned per lookup. If the table has 101 slots and 101 keys, α = 1, and lookup costs O(1) on average. If α = 10, you scan 10 nodes per lookup on average — still O(1) technically (constant, not growing with n if you also grow the table), but 10× slower.

The standard engineering rule: rehash when α > 0.75 (for chaining, α > 1.0 is fine; for open addressing, the limit is much lower).

Open Addressing

Open addressing stores all keys directly in the table array — no linked lists, no heap allocation per key. When a collision occurs, the algorithm probes a sequence of alternative indices until it finds an empty slot.

Linear probing: if slot h(key) is occupied, try h(key) + 1, h(key) + 2, …, wrapping around at the table boundary.

#define OPEN_TABLE_SIZE 101

typedef enum { EMPTY, OCCUPIED, DELETED } SlotState;

typedef struct {
    char      key[64];
    int       value;
    SlotState state;
} Slot;

typedef struct {
    Slot slots[OPEN_TABLE_SIZE];
    int  count;
} OpenHashTable;

void open_init(OpenHashTable *ht) {
    ht->count = 0;
    for (int i = 0; i < OPEN_TABLE_SIZE; i++) ht->slots[i].state = EMPTY;
}

static int open_hash(const char *key) {
    const int BASE = 31;
    unsigned long long h = 0;
    while (*key) { h = h * BASE + (unsigned char)(*key++); }
    return (int)(h % OPEN_TABLE_SIZE);
}

void open_insert(OpenHashTable *ht, const char *key, int value) {
    int idx = open_hash(key);
    for (int i = 0; i < OPEN_TABLE_SIZE; i++) {
        int probe = (idx + i) % OPEN_TABLE_SIZE;
        Slot *s = &ht->slots[probe];
        if (s->state == OCCUPIED && strcmp(s->key, key) == 0) {
            s->value = value;      /* update existing */
            return;
        }
        if (s->state == EMPTY || s->state == DELETED) {
            strncpy(s->key, key, sizeof(s->key) - 1);
            s->key[sizeof(s->key) - 1] = '\0';
            s->value = value;
            s->state = OCCUPIED;
            ht->count++;
            return;
        }
    }
    /* table is full — should rehash before reaching here */
}

int open_lookup(const OpenHashTable *ht, const char *key, int *found) {
    int idx = open_hash(key);
    for (int i = 0; i < OPEN_TABLE_SIZE; i++) {
        int probe = (idx + i) % OPEN_TABLE_SIZE;
        const Slot *s = &ht->slots[probe];
        if (s->state == EMPTY) break;                          /* key absent */
        if (s->state == OCCUPIED && strcmp(s->key, key) == 0) {
            *found = 1; return s->value;
        }
        /* DELETED: keep probing — the key might be further along */
    }
    *found = 0; return 0;
}

void open_delete(OpenHashTable *ht, const char *key) {
    int idx = open_hash(key);
    for (int i = 0; i < OPEN_TABLE_SIZE; i++) {
        int probe = (idx + i) % OPEN_TABLE_SIZE;
        Slot *s = &ht->slots[probe];
        if (s->state == EMPTY) return;
        if (s->state == OCCUPIED && strcmp(s->key, key) == 0) {
            s->state = DELETED;    /* tombstone — do not use EMPTY */
            ht->count--;
            return;
        }
    }
}

The Tombstone Problem

Deletion in open addressing requires care. You cannot simply mark a slot as EMPTY — if you delete a key that was part of a probe sequence, marking it EMPTY would make open_lookup stop early and wrongly conclude that later keys in the same probe sequence don’t exist.

The solution: mark deleted slots with a tombstone (DELETED). Lookups skip over tombstones. Insertions may reuse tombstone slots. Over time, many tombstones degrade performance — another reason to rehash periodically.

Pause and think: Suppose you insert three keys A, B, C that all hash to index 5. They land at slots 5, 6, 7. Now you delete B (slot 6 becomes DELETED). Now you look up C: you hash to 5, see A (not C), probe to 6, see DELETED (skip, don’t stop), probe to 7, see C — found. If you had set slot 6 to EMPTY, you would have stopped at 6 and incorrectly concluded C is absent. The tombstone preserves the integrity of every probe sequence that passes through the deleted slot.

Clustering

Linear probing suffers from primary clustering: consecutive occupied slots attract more insertions (because all of them probe into the same run), creating ever-longer runs. A key that hashes anywhere into a run of length L must probe past the entire run.

Quadratic probing uses offsets 1², 2², 3², …: (idx + i*i) % m. This breaks up primary clusters but can cause secondary clustering — keys with the same initial hash follow the same probe sequence.

Double hashing computes the step size with a second hash function: (h1(key) + i × h2(key)) % m. Different keys with the same h1 use different step sizes — ideal scattering, minimal clustering. The cost is one more hash function call per probe.

Rehashing

When the load factor exceeds the threshold, allocate a new table (typically double the current size, rounded up to the next prime) and reinsert every key from scratch:

/* Illustrates the rehash pattern — build a fresh table from the occupied slots */
OpenHashTable *open_rehash(const OpenHashTable *old) {
    OpenHashTable *fresh = malloc(sizeof(OpenHashTable));
    open_init(fresh);
    for (int i = 0; i < OPEN_TABLE_SIZE; i++) {
        if (old->slots[i].state == OCCUPIED)
            open_insert(fresh, old->slots[i].key, old->slots[i].value);
    }
    return fresh;
}

All tombstones vanish during rehash — every reinserted key finds a clean table. The O(n) cost of rehashing is amortised over the O(n) insertions since the last rehash, giving O(1) amortised per insertion — the same argument as dynamic array resizing.

Comparison: Chaining vs. Open Addressing

Criterion	Chaining	Open addressing
Space per key	Key + value + pointer	Key + value + state byte
Load factor sweet spot	Up to ~1.0–2.0	Keep below 0.7
Cache locality	Poor (pointer to heap)	Excellent (all in one array)
Deletion	Simple (pointer splice)	Tombstone bookkeeping
Worst-case (bad hash)	O(n) chain traversal	O(n) probing
Typical use	Dynamic keys, unknown count	Performance-critical, bounded key universe

Open addressing wins on cache performance — the entire table is a contiguous array, so probing nearby slots hits already-cached memory. Chaining scatters nodes across the heap. For tables that fit in cache, this constant-factor difference dominates the comparison.

Complexity Summary

Operation	Average	Worst case
Insert	O(1) amortised	O(n)
Lookup	O(1)	O(n)
Delete	O(1)	O(n)

The worst case occurs when all keys collide — a degenerate hash function, or adversarial input crafted to cause collisions. Production hash tables (Python dict, Java HashMap, Rust HashMap) use hash functions with a random seed (SipHash, randomised polynomial hashing) specifically to prevent this attack: an adversary who doesn’t know the seed cannot construct a collision-heavy payload.

Hash tables also provide no ordering — you cannot ask “what is the smallest key?” or “list all keys between A and B” without scanning the entire table. When you need ordered operations and fast lookup, a balanced BST or sorted array with binary search is the right tool.

Hash Tables in Practice: C, C++, and Java

Building a hash table from scratch reveals how the pieces fit together. In production, every major language ships a battle-tested implementation. Here is the same word-count program — reading words from stdin and printing each word’s frequency — written in all three languages:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 101

typedef struct Node { char *key; int count; struct Node *next; } Node;
static Node *buckets[TABLE_SIZE];

static int hash(const char *s) {
    unsigned long long h = 0;
    while (*s) h = h * 31 + (unsigned char)(*s++);
    return (int)(h % TABLE_SIZE);
}

void count_word(const char *word) {
    int idx = hash(word);
    for (Node *n = buckets[idx]; n; n = n->next)
        if (strcmp(n->key, word) == 0) { n->count++; return; }
    Node *n = malloc(sizeof(Node));
    n->key = strdup(word); n->count = 1; n->next = buckets[idx];
    buckets[idx] = n;
}

int main(void) {
    char buf[64];
    while (scanf("%63s", buf) == 1) count_word(buf);
    for (int i = 0; i < TABLE_SIZE; i++)
        for (Node *n = buckets[i]; n; n = n->next)
            printf("%s: %d\n", n->key, n->count);
}

#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<std::string, int> freq;
    std::string word;
    while (std::cin >> word) freq[word]++;
    for (const auto &[w, count] : freq)
        std::cout << w << ": " << count << "\n";
}

import java.util.*;

public class WordCount {
    public static void main(String[] args) {
        Map<String, Integer> freq = new HashMap<>();
        Scanner sc = new Scanner(System.in);
        while (sc.hasNext()) {
            String word = sc.next();
            freq.merge(word, 1, Integer::sum);
        }
        freq.forEach((w, count) -> System.out.println(w + ": " + count));
    }
}

The C version is ~30 lines implementing the full hash table manually. The C++ version is 8 lines using std::unordered_map — backed by the same chaining+rehashing logic under the hood, just hidden behind a clean interface. The Java version is similar, using HashMap with merge to handle both insert-with-1 and increment in a single call. Understanding the C version makes you a better user of the C++ and Java versions: you know what load factor means, why iteration order is undefined, and why equality-by-reference vs. equality-by-value matters when using objects as keys.

Bringing It Together: Practice as a Problem-Solver

Hash function analysis. Suppose you hash the strings "abc", "bca", and "cab" using a naive function that just sums ASCII values: h = (a + b + c) % m. What index do all three strings produce? What does this demonstrate about the requirement that good hash functions be sensitive to position (not just character content)? Show how the polynomial rolling hash distinguishes them by computing h("abc") and h("bca") with BASE = 31 and m = 101.
Trace an open-addressing table. Use a table of size 7 (a prime). Insert integer keys {14, 7, 21, 1, 35} using h(k) = k mod 7. Show the table state after each insertion, marking occupied slots. Now delete key 7 using a tombstone. Now look up key 35 — trace the full probe sequence step by step, explaining why the DELETED slot cannot be treated as EMPTY.
Load factor impact. A chaining hash table holds 10,000 keys. Compute the expected chain length (and therefore expected lookup cost) when the table size is (a) 100, (b) 1,000, (c) 10,000, (d) 100,000. At what table size does the lookup cost become effectively O(1) in practice? What is the memory cost of the extra space used at size (d)?
Design a word-frequency counter. Using the chaining hash table in this chapter as a base, implement a function count_word(ChainHashTable *ht, const char *word) that either inserts the word with count 1 if not present, or increments its count if it already exists. You will need to change int value to int count in ChainNode. Write the complete modified struct and the count_word function, then write a main that reads words from stdin using scanf("%63s", buf) and calls count_word for each, printing all words and their counts at the end.