The Deep Dive: Understanding Hash Table Collision Resolution Techniques

What Are Hash Tables and Why Do Collisions Occur?

In the world of data structures, hash tables are the unsung heroes of fast data retrieval. They allow us to store and access data in average-case constant time, or $O(1)$, which is as good as it gets in algorithmic performance.

But what happens when two different keys hash to the same index? That’s where collisions come into play. Understanding how and why collisions occur is essential for mastering collision resolution strategies.

graph TD
    A["Key: 'apple'"] --> C["Hash Function"]
    B["Key: 'banana'"] --> C
    C --> D["Index: 3"]
    E["Key: 'grape'"] --> F["Hash Function"]
    F --> G["Index: 3"]
    style D fill:#4CAF50,stroke:#388E3C,color:white
    style G fill:#FF9800,stroke:#F57C00,color:white
    classDef collision fill:#FF5252,stroke:#D32F2F,color:white;
    class D,G collision;

Hash Tables: The Engine of Fast Lookups

A hash table (also known as a hash map) is a data structure that maps keys to values using a hash function. The hash function computes an index into an array of buckets or slots, from which the desired value can be found.

Here’s a simplified example of how a hash function might look in Python:

# Simple hash function (for demonstration only)
def simple_hash(key, table_size):
    return sum(ord(char) for char in key) % table_size

However, even the best hash functions can’t prevent all collisions. Let’s explore why.

Pro-Tip: A good hash function distributes keys uniformly across the table to minimize collisions. But even then, collisions are inevitable due to the pigeonhole principle — more keys than slots.

Why Do Collisions Happen?

Collisions occur when two distinct keys produce the same hash index. For example:

  • Key "apple" hashes to index 3
  • Key "grape" also hashes to index 3

Even with a large table, if the number of keys exceeds the number of slots, collisions are mathematically guaranteed. This is where collision resolution strategies like chaining or open addressing come into play.
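To make this concrete, here is a small sketch using the toy `simple_hash` shown earlier. Because that hash just sums character codes, any two anagrams (such as "melon" and "lemon") are guaranteed to collide; the keys here are illustrative:

```python
# Toy hash from earlier: sum character codes, wrap to the table size
def simple_hash(key, table_size):
    return sum(ord(char) for char in key) % table_size

# Anagrams have identical character sums, so they always collide here
index_a = simple_hash("melon", 10)
index_b = simple_hash("lemon", 10)
print(index_a, index_b)  # 9 9 — two distinct keys, one index
```

Both keys land on index 9, so the table must decide how to store two entries in one slot.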

graph LR
    A["Hash Table"] --> B["Index 0"]
    A --> C["Index 1"]
    A --> D["Index 2"]
    A --> E["Index 3: ['apple', 'grape']"]
    A --> F["Index 4"]
    style E fill:#FFEB3B,stroke:#FBC02D,color:black

Key Takeaways

  • Hash tables provide $O(1)$ average time complexity for lookups, insertions, and deletions.
  • Collisions are inevitable due to the pigeonhole principle — more keys than slots.
  • Effective hash functions aim to distribute keys uniformly to reduce collisions.
  • Collision resolution is critical for maintaining performance and correctness.

Next, we’ll dive into how to resolve collisions using chaining and open addressing techniques.

Collision Resolution: The Core Problem in Hash Tables

Hash tables are among the most efficient data structures for fast lookups, insertions, and deletions — but only when collisions are handled effectively. A collision occurs when two or more keys hash to the same index. While inevitable due to the pigeonhole principle, collisions don't have to degrade performance if resolved correctly.

In this masterclass, we’ll explore the two primary collision resolution strategies: chaining and open addressing. You’ll see how they work under the hood, their trade-offs, and when to use each.

graph LR
    A["Key"] --> B["Hash Function"]
    B --> C["Index"]
    C --> D{Collision?}
    D -- Yes --> E["Resolve via Chaining or Open Addressing"]
    D -- No --> F["Insert/Retrieve Value"]

Strategy 1: Chaining

In chaining, each slot in the hash table stores a linked list (or dynamic array) of entries that hash to the same index. This is a simple and widely used approach.

Pros

  • Simple to implement
  • Handles high load factors gracefully
  • The load factor can exceed 1 without breaking correctness, so resizing can be deferred

Cons

  • Poor cache performance due to pointers
  • Extra memory for list nodes
  • Performance degrades with long chains

Example: Chaining in Python

# Basic implementation of a hash table with chaining
class HashNode:
    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.next = None

class HashTable:
    def __init__(self, size=10):
        self.size = size
        self.buckets = [None] * size

    def _hash(self, key):
        return hash(key) % self.size

    def insert(self, key, value):
        index = self._hash(key)
        node = self.buckets[index]
        if not node:
            self.buckets[index] = HashNode(key, value)
        else:
            while node:
                if node.key == key:
                    node.value = value  # Update value
                    return
                if not node.next:
                    node.next = HashNode(key, value)
                    return
                node = node.next

Strategy 2: Open Addressing

Instead of chaining, open addressing stores all elements directly in the hash table. When a collision occurs, it probes for the next available slot using a probing sequence (linear, quadratic, or double hashing).

Pros

  • Better cache locality
  • No extra memory for pointers
  • Good for fixed-size tables

Cons

  • Requires careful deletion handling
  • Performance degrades as load factor increases
  • Clustering issues with linear probing

Example: Linear Probing

class OpenAddressingHashTable:
    def __init__(self, size=10):
        self.size = size
        self.keys = [None] * size
        self.values = [None] * size

    def _hash(self, key):
        return hash(key) % self.size

    def insert(self, key, value):
        index = self._hash(key)
        while self.keys[index] is not None:
            if self.keys[index] == key:
                self.values[index] = value  # Update
                return
            index = (index + 1) % self.size  # Linear probing
        self.keys[index] = key
        self.values[index] = value

Comparative Analysis

| Feature               | Chaining            | Open Addressing |
|-----------------------|---------------------|-----------------|
| Memory Overhead       | Moderate (pointers) | Low             |
| Cache Performance     | Poor                | Good            |
| Load Factor Tolerance | High                | Low             |

Key Takeaways

  • Collision resolution is essential to maintain the performance of hash tables.
  • Chaining is robust and simple but incurs memory overhead.
  • Open addressing offers better cache performance but is sensitive to load factor and clustering.
  • Choose your strategy based on memory constraints, expected load, and performance needs.

Next, we’ll explore advanced hashing techniques and how to optimize hash functions for real-world applications.

Separate Chaining: Storing Multiple Values in One Bucket

Hash tables are a cornerstone of efficient data storage and retrieval, but what happens when two keys hash to the same index? This is where collision resolution strategies come into play. One of the most robust and widely used methods is separate chaining.

In this section, we’ll explore how separate chaining allows multiple key-value pairs to coexist in the same bucket using linked lists, ensuring that your hash table remains both functional and performant—even under load.

What is Separate Chaining?

Separate chaining resolves collisions by storing a linked list (or another dynamic data structure) at each bucket of the hash table. When a collision occurs (i.e., two keys hash to the same index), both entries are stored in the list at that index.

graph TD
    A["Bucket 0"] --> B["Node(Key1, Value1)"]
    A --> C["Node(Key2, Value2)"]
    D["Bucket 1"] --> E["Node(Key3, Value3)"]
    F["Bucket 2"] --> G["Empty"]
    H["Bucket 3"] --> I["Node(Key4, Value4)"]
    I --> J["Node(Key5, Value5)"]

Pro Tip: Separate chaining is especially useful when the load factor (ratio of elements to buckets) is unpredictable. It gracefully handles collisions without requiring rehashing or probing.

How It Works: A Step-by-Step Example

Let’s walk through a simplified implementation of a hash table using separate chaining in Python:

# Node class to represent key-value pairs
class Node:
    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.next = None

# Hash Table with Separate Chaining
class HashTable:
    def __init__(self, size=10):
        self.size = size
        self.buckets = [None] * self.size

    def _hash(self, key):
        return hash(key) % self.size

    def insert(self, key, value):
        index = self._hash(key)
        new_node = Node(key, value)
        if not self.buckets[index]:
            self.buckets[index] = new_node
        else:
            current = self.buckets[index]
            while current:
                if current.key == key:
                    current.value = value  # Update existing
                    return
                if not current.next:
                    current.next = new_node
                    return
                current = current.next

    def find(self, key):
        index = self._hash(key)
        current = self.buckets[index]
        while current:
            if current.key == key:
                return current.value
            current = current.next
        return None

Performance Characteristics

Let’s analyze the time complexity of operations in a hash table using separate chaining:

| Operation | Average Case | Worst Case |
|-----------|--------------|------------|
| Insert    | $O(1)$       | $O(n)$     |
| Search    | $O(1)$       | $O(n)$     |
| Delete    | $O(1)$       | $O(n)$     |

Note: The worst case occurs when all keys hash to the same bucket, effectively turning the hash table into a linked list.
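The degenerate case is easy to demonstrate. In this standalone sketch (the names are illustrative, not taken from the implementation above), a hash function that ignores its key funnels every entry into one bucket:

```python
# Worst case: a "hash" that ignores the key sends everything to bucket 0
def constant_hash(key, table_size):
    return 0

table_size = 10
buckets = [[] for _ in range(table_size)]
for key in ["apple", "banana", "grape", "melon"]:
    buckets[constant_hash(key, table_size)].append(key)

# One chain holds every entry; a search now scans the whole list, O(n)
print([len(bucket) for bucket in buckets])  # [4, 0, 0, ...]
```

A good hash function avoids this by spreading keys across buckets, keeping each chain short.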

Advantages and Disadvantages

✅ Advantages

  • Simple to implement
  • Handles high load factors gracefully
  • No clustering issues
  • Supports dynamic resizing

❌ Disadvantages

  • Memory overhead from pointers
  • Potential for long chains under poor hash functions
  • Cache performance may suffer

Key Takeaways

  • Separate chaining is a robust method for handling hash collisions using linked lists.
  • It maintains $O(1)$ average time complexity for insert, search, and delete operations.
  • It’s ideal for applications where load factor is unpredictable or memory is not a constraint.
  • For performance-sensitive applications, consider optimizing your hash function or exploring advanced hashing techniques.

Open Addressing: Finding the Next Available Slot

When a hash collision occurs, open addressing offers an elegant alternative to separate chaining. Instead of storing colliding elements in external structures like linked lists, open addressing stores all elements directly in the hash table itself. When a slot is occupied, the algorithm probes for the next available slot using a predefined sequence.

💡 Pro Tip: Open addressing is ideal for environments where memory is scarce and cache performance is critical.

Core Concept: Probing Strategies

Open addressing relies on a probing sequence to resolve collisions. The three most common strategies are:

  • Linear Probing: Check the next slot in a linear sequence.
  • Quadratic Probing: Use a quadratic function to determine the next slot.
  • Double Hashing: Apply a second hash function to calculate the step size.
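The three strategies differ only in how the probe offset grows with the probe number $i$. A small sketch makes the difference visible; the fixed `step` below stands in for a second hash value and is purely illustrative:

```python
def probe_sequence(start, table_size, offset, num_probes=5):
    """First few probe indices, given an offset function of the probe number i."""
    return [(start + offset(i)) % table_size for i in range(num_probes)]

start, size, step = 3, 10, 3  # `step` plays the role of hash2(key)
linear    = probe_sequence(start, size, lambda i: i)
quadratic = probe_sequence(start, size, lambda i: i * i)
double    = probe_sequence(start, size, lambda i: i * step)
print(linear)     # [3, 4, 5, 6, 7]
print(quadratic)  # [3, 4, 7, 2, 9]
print(double)     # [3, 6, 9, 2, 5]
```

Linear probing scans consecutive slots, while the quadratic and double-hashing sequences jump around the table, which is what breaks up clusters.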

Linear Probing: Step by Step

Consider inserting key 13 into a table where it hashes to index 1, but that slot is already occupied. Linear probing checks index 2 next; if that slot is also taken, it moves on to index 3, and so on, wrapping around the table until a free slot is found.

Code Example: Linear Probing Insertion

Here’s a simplified implementation of linear probing in Python:


def insert(table, key, value):
    index = hash(key) % len(table)
    original_index = index

    while table[index] is not None:
        index = (index + 1) % len(table)  # Linear probing
        if index == original_index:       # Table full
            raise Exception("Hash table is full")

    table[index] = (key, value)

Mermaid.js: Probing Sequence Visualization

Below is a flow diagram showing how linear probing resolves a collision:

graph TD
    A["Insert Key 13"] --> B["Hash to Index 1"]
    B --> C["Slot 1 Occupied?"]
    C -- Yes --> D["Probe to Index 2"]
    D --> E["Slot 2 Occupied?"]
    E -- Yes --> F["Probe to Index 3"]
    F --> G["Slot 3 Free?"]
    G -- Yes --> H["Insert at Index 3"]

Advantages and Disadvantages

✅ Advantages

  • No extra memory for pointers
  • Better cache locality
  • Simpler memory management

❌ Disadvantages

  • Clustering issues (primary clustering)
  • Requires load factor management
  • Deletion is complex (requires tombstones)

Key Takeaways

  • Open addressing resolves collisions by probing within the hash table itself.
  • Linear probing is simple but can lead to clustering.
  • For high-performance systems, consider advanced hashing techniques or dynamic resizing strategies.
  • Open addressing is ideal for scenarios with predictable load factors and memory constraints.

Linear Probing: The Simplest Form of Open Addressing

Linear probing is the most straightforward collision resolution strategy in open addressing. When a hash collision occurs, it searches for the next available slot in a sequential manner. It's elegant in its simplicity but can lead to performance issues if not managed carefully.

🔍 How Linear Probing Works

When a hash function maps a key to an already occupied slot, linear probing checks the next slot in sequence. If that's also occupied, it continues until it finds an empty slot.

✅ Advantages

  • Simple to implement
  • Cache-friendly access pattern
  • Good for small load factors

❌ Disadvantages

  • Primary clustering (performance degrades)
  • Deletion requires tombstones
  • Not ideal for high load factors

🧠 Conceptual Walkthrough

Imagine inserting the key 25 into a hash table of size 10 using the hash function:

$$ \text{hash}(\text{key}) = \text{key} \bmod 10 $$

If hash(25) = 5, but slot 5 is occupied, linear probing checks 6, 7, 8, and so on, until it finds an empty slot.

🧮 Linear Probing Algorithm

def insert_linear_probing(hash_table, key, value):
    index = key % len(hash_table)
    original_index = index
    while hash_table[index] is not None:
        if hash_table[index][0] == key:
            hash_table[index] = (key, value)  # Update existing
            return
        index = (index + 1) % len(hash_table)
        if index == original_index:
            raise Exception("Hash table is full")
    hash_table[index] = (key, value)

Explanation: This function attempts to insert a key-value pair. If the slot is occupied, it moves to the next slot linearly until it finds an empty one.

📊 Visualizing Linear Probing with Mermaid

graph LR
    A["Key: 25 → hash(25) = 5"] --> B["Slot 5 Occupied?"]
    B --> C["Check Slot 6"]
    C --> D["Slot 6 Occupied?"]
    D --> E["Check Slot 7"]
    E --> F["Slot 7 Free! Insert Here"]


📌 Key Takeaways

  • Linear probing resolves collisions by sequentially scanning the table.
  • It's simple but can cause clustering, especially under high load.
  • For better performance, consider advanced indexing strategies or dynamic resizing.

Quadratic Probing: Reducing Clustering with Non-Linear Steps

In the previous section, we explored how linear probing can lead to primary clustering, where consecutive occupied slots form long chains, degrading performance. Quadratic probing is a smarter alternative that uses a non-linear step sequence to reduce clustering and distribute keys more evenly across the hash table.

🔍 Why Quadratic Probing?

Instead of probing at fixed intervals (like index + 1), quadratic probing uses a sequence like:

$$ \text{probe}_i = (\text{hash}(key) + i^2) \mod \text{tableSize} $$

This non-linear step sequence helps avoid the clustering issues of linear probing by spreading out probe positions.

📊 Visualizing Probe Sequences

graph TD
    A["Linear Probing"] --> B["index + 1"]
    B --> C["index + 2"]
    C --> D["index + 3"]
    D --> E["..."]
    F["Quadratic Probing"] --> G["index + 1²"]
    G --> H["index + 2²"]
    H --> I["index + 3²"]
    I --> J["..."]

🧮 How It Works

Let’s walk through a simplified implementation of quadratic probing in Python:


class QuadraticHashTable:
    def __init__(self, size):
        self.size = size
        self.keys = [None] * size
        self.values = [None] * size

    def hash(self, key):
        return key % self.size

    def put(self, key, value):
        index = self.hash(key)
        i = 1
        while self.keys[index] is not None:
            if self.keys[index] == key:
                self.values[index] = value  # Update existing key
                return
            index = (self.hash(key) + i * i) % self.size
            i += 1
            if i > self.size:
                raise Exception("Hash table is full")
        self.keys[index] = key
        self.values[index] = value

📌 Key Takeaways

  • Quadratic probing reduces primary clustering by using a non-linear probe sequence.
  • It's more efficient than linear probing under moderate load factors.
  • However, it may still suffer from secondary clustering if not all permutations are explored.
  • For more advanced collision resolution, consider dynamic resizing or cuckoo hashing.

Double Hashing: A Smarter Way to Probe

While collision resolution strategies like linear and quadratic probing are useful, they can still suffer from clustering issues. Double hashing offers a more robust solution by using a secondary hash function to determine the step size, reducing clustering and distributing keys more evenly across the hash table.

🔍 How Double Hashing Works

Double hashing uses two hash functions:

  • Primary hash function: Determines the initial index.
  • Secondary hash function: Determines the step size for probing.

The probe sequence is calculated as:

$$ \text{probe}(key, i) = (\text{hash}_1(key) + i \cdot \text{hash}_2(key)) \mod \text{tableSize} $$

This ensures that even if two keys collide at the same initial index, their probe sequences will differ due to the secondary hash function.
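We can check this numerically. The sketch below mirrors the hash functions used in this section's implementation (a table of size 10, with $\text{hash}_2(k) = 7 - (k \bmod 7)$); keys 15 and 25 collide at the same initial index yet follow different probe paths:

```python
def hash1(key, size=10):
    return key % size

def hash2(key):
    return 7 - (key % 7)  # step size, always in 1..7

def probes(key, size=10, n=4):
    # Probe sequence: (hash1(key) + i * hash2(key)) mod size
    return [(hash1(key, size) + i * hash2(key)) % size for i in range(n)]

# Both keys start at index 5, but their step sizes (6 vs 3) differ
print(probes(15))  # [5, 1, 7, 3]
print(probes(25))  # [5, 8, 1, 4]
```

Because the sequences diverge after the first probe, the two keys do not fight over the same run of slots the way they would under linear probing.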

📊 Mermaid.js Diagram: Double Hashing Probe Sequence

graph TD
    A["Key"] --> B["Hash Function 1"]
    A --> C["Hash Function 2"]
    B --> D["Initial Index"]
    C --> E["Step Size"]
    D --> F["Probe Sequence"]
    E --> F

💻 Python Implementation

# Double Hashing Implementation
class DoubleHashTable:
    def __init__(self, size):
        self.size = size
        self.keys = [None] * self.size
        self.values = [None] * self.size

    def hash1(self, key):
        return key % self.size

    def hash2(self, key):
        # Secondary hash: 7 is prime, so the step is always in 1..7 and
        # never zero; a prime table size keeps step and size coprime
        return 7 - (key % 7)

    def put(self, key, value):
        index = self.hash1(key)
        step = self.hash2(key)
        i = 0
        while self.keys[index] is not None:
            if self.keys[index] == key:
                self.values[index] = value  # Update existing key
                return
            i += 1
            index = (self.hash1(key) + i * step) % self.size
            if i >= self.size:
                raise Exception("Hash table is full")
        self.keys[index] = key
        self.values[index] = value

📌 Key Takeaways

  • Double hashing reduces both primary and secondary clustering by using a secondary hash function for step size.
  • It provides a more uniform distribution of keys, especially under high load factors.
  • For more advanced collision resolution, consider dynamic resizing or cuckoo hashing.

Load Factor and Its Impact on Performance

In hash table design, the load factor is a critical performance metric that determines how full your table is. It's defined as the ratio of the number of stored elements to the total number of slots in the table:

$$ \text{Load Factor} = \frac{\text{Number of Elements}}{\text{Table Size}} $$

As the load factor increases, so does the likelihood of collisions, which directly impacts the performance of operations like insertion, search, and deletion. Let's explore how different collision resolution strategies behave under varying load factors.

📊 Load Factor vs. Average Probe Length

graph TD
    A["Load Factor (α)"] --> B["Linear Probing"]
    A --> C["Quadratic Probing"]
    A --> D["Double Hashing"]
    B --> E["Avg Probe Length = 1/2 * (1 + 1/(1 - α))"]
    C --> F["Avg Probe Length = -ln(1 - α) / α"]
    D --> G["Avg Probe Length = 1 / (1 - α)"]
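These closed-form estimates can be evaluated directly. The sketch below uses Knuth's approximation for a successful search under linear probing and the standard $1/(1-\alpha)$ estimate for an unsuccessful search under double (uniform) hashing; both are approximations, not exact costs:

```python
def linear_probe_cost(alpha):
    # Expected probes, successful search, linear probing (Knuth's estimate)
    return 0.5 * (1 + 1 / (1 - alpha))

def double_hash_cost(alpha):
    # Expected probes, unsuccessful search, uniform/double hashing
    return 1 / (1 - alpha)

for alpha in (0.25, 0.50, 0.75, 0.90):
    print(f"α={alpha:.2f}  linear≈{linear_probe_cost(alpha):.2f}  "
          f"double≈{double_hash_cost(alpha):.2f}")
```

At α = 0.5 linear probing averages about 1.5 probes, but at α = 0.9 it jumps to about 5.5, which is why resizing thresholds are usually set well below a full table.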

🔍 Performance Comparison at Different Load Factors

Linear Probing

  • Fast at low load factors
  • Performance degrades significantly as α approaches 1
  • Primary clustering becomes a major issue

Quadratic Probing

  • Better distribution than linear probing
  • Still suffers from secondary clustering
  • Limited to load factors below 0.5 for optimal performance

Double Hashing

  • Most uniform key distribution
  • Maintains performance even at high load factors
  • Preferred for high-performance systems

🧮 Code: Calculating Load Factor

# HashTable class with load factor tracking
class HashTable:
    def __init__(self, size):
        self.size = size
        self.keys = [None] * self.size
        self.values = [None] * self.size
        self.count = 0  # Number of elements inserted

    def load_factor(self):
        return self.count / self.size

    def put(self, key, value):
        if self.load_factor() >= 0.75:
            print("Warning: Load factor is high. Consider resizing.")
        # Insert logic here...
        self.count += 1

📌 Key Takeaways

  • The load factor is a key determinant of hash table performance.
  • As load factor increases, so does the average probe length, especially for linear probing.
  • Double hashing maintains better performance under high load due to its uniform probe sequences.
  • For scalable systems, consider dynamic resizing or cuckoo hashing to manage load factor efficiently.

Comparing Collision Resolution Techniques: Trade-offs and Use Cases

In the world of hash tables, collisions are inevitable. But how you resolve them can make or break your system's performance. In this masterclass, we'll dissect the most common collision resolution strategies—chaining, linear probing, quadratic probing, and double hashing—and explore their trade-offs in real-world applications.

📊 Comparison Table

| Technique         | Time Complexity (Avg)         | Space Usage | Cache Performance | Use Case                            |
|-------------------|-------------------------------|-------------|-------------------|-------------------------------------|
| Chaining          | $O(1 + \alpha)$               | Dynamic     | Moderate          | General-purpose, unpredictable load |
| Linear Probing    | $O(1)$ (if $\alpha < 0.5$)    | Fixed       | High              | Cache-sensitive systems             |
| Quadratic Probing | $O(1)$ (with low clustering)  | Fixed       | High              | Avoids primary clustering           |
| Double Hashing    | $O(1)$ (uniform distribution) | Fixed       | High              | High-load, low-collision tolerance  |

🧠 Conceptual Flow: Hash Table Collision Resolution

graph TD
    A["Insert Key"] --> B{"Collision?"}
    B -- Yes --> C[Resolve Collision]
    C --> D[Chaining]
    C --> E[Open Addressing]
    E --> F[Linear Probing]
    E --> G[Quadratic Probing]
    E --> H[Double Hashing]
    B -- No --> I[Success]

🛠️ Code Example: Linear Probing Insertion

def insert_linear_probing(hash_table, key, value, table_size):
    index = hash(key) % table_size
    original_index = index

    while hash_table[index] is not None:
        if hash_table[index][0] == key:
            hash_table[index] = (key, value)  # Update value
            return
        index = (index + 1) % table_size
        if index == original_index:
            raise Exception("Hash table is full")

    hash_table[index] = (key, value)

📌 Key Takeaways

  • Chaining is robust and simple but may suffer from poor cache performance due to pointer indirection.
  • Linear probing offers excellent cache locality but risks primary clustering under high load.
  • Quadratic probing reduces clustering but may not guarantee full table coverage.
  • Double hashing provides the best probe distribution, ideal for high-load scenarios.
  • For scalable systems, consider dynamic resizing or cuckoo hashing to manage load factor efficiently.

Real-World Applications of Hash Tables and Collision Resolution

Hash tables are the unsung heroes of modern software. From caching layers in web servers to database indexing and even scalable e-commerce systems, hash tables power the speed and efficiency of countless applications. But their real-world use goes beyond just fast lookups. Let’s explore how they’re used in production systems and how collision resolution strategies play a critical role in performance.

Hash Tables in the Wild

  • Databases: Indexing with hash maps for $O(1)$ key lookups.
  • Caching Systems: Redis and Memcached use hash tables for fast key-value access.
  • Compilers: Symbol tables for variable and function name resolution.
  • Networking: Flow tables in routers and switches for packet routing.
  • Blockchain: Transaction lookups and Merkle tree hashing.

graph LR
    A["Web Server"] --> B["Hash Table Cache"]
    B --> C["Fast Response"]
    D["Compiler"] --> E["Symbol Table"]
    E --> F["Fast Lookup"]
    G["Router"] --> H["Flow Table"]
    H --> I["Packet Routing"]

🔍 Collision Resolution in Production

In high-performance systems, how you resolve collisions can make or break your application. Let’s look at how different strategies are applied in real-world scenarios.

Chaining

Used in Redis and Memcached for dynamic key-value storage. Each bucket stores a linked list of entries, allowing for flexible growth.

Open Addressing

Used in Python dictionaries, where CPython resolves collisions with open addressing (a perturbation-based probe sequence rather than plain linear probing), and in many other language runtimes. Storing entries in one contiguous array preserves cache locality.

🛠️ Code Example: Custom Hash Table with Chaining

Here’s a simplified implementation of a hash table using chaining to resolve collisions. This is a common pattern in production-grade systems.

# Hash Table with Chaining
class HashNode:
    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.next = None

class HashTable:
    def __init__(self, size=10):
        self.size = size
        self.buckets = [None] * self.size

    def _hash(self, key):
        return hash(key) % self.size

    def insert(self, key, value):
        index = self._hash(key)
        node = self.buckets[index]
        if not node:
            self.buckets[index] = HashNode(key, value)
        else:
            while node:
                if node.key == key:
                    node.value = value  # Update existing
                    return
                if not node.next:
                    node.next = HashNode(key, value)
                    return
                node = node.next

    def get(self, key):
        index = self._hash(key)
        node = self.buckets[index]
        while node:
            if node.key == key:
                return node.value
            node = node.next
        raise KeyError("Key not found")

# Example usage
ht = HashTable()
ht.insert("user123", {"name": "Alice", "role": "admin"})
print(ht.get("user123"))

📊 Performance Trade-offs

Choosing the right collision resolution strategy depends on your system’s constraints:

| Strategy       | Best For                | Load Factor | Cache Locality |
|----------------|-------------------------|-------------|----------------|
| Chaining       | Dynamic datasets        | High        | Low            |
| Linear Probing | Cache-sensitive systems | Moderate    | High           |
| Double Hashing | High-load systems       | High        | Moderate       |

📌 Key Takeaways

  • Hash tables are foundational in databases, caches, and compilers.
  • Chaining is ideal for dynamic datasets but may suffer from poor cache performance.
  • Open addressing (e.g., the probing scheme in CPython's dict) avoids pointer overhead for memory efficiency.
  • Collision resolution strategy directly impacts performance under load.
  • For scalable systems, consider dynamic resizing or cuckoo hashing to manage load factor efficiently.

Advanced Considerations: Dynamic Resizing and Tombstones

In the previous sections, we explored how hash tables resolve collisions using chaining and open addressing. But real-world systems demand more than just collision resolution—they require scalability, memory efficiency, and robustness under dynamic conditions. Two critical concepts in this domain are:

  • Dynamic Resizing – Adapting the table size to maintain performance.
  • Tombstones – Handling deletions in open addressing without breaking lookups.

In this section, we’ll dive into how these mechanisms work under the hood, and why they’re essential for production-grade hash tables.

🔍 Conceptual Overview

Dynamic resizing ensures that the load factor (ratio of elements to slots) remains within acceptable bounds. Tombstones are markers used in open addressing to indicate deleted slots, preserving the integrity of probing sequences.

Dynamic Resizing: Keeping Load Factor in Check

Hash tables perform best when the load factor is kept below a threshold (commonly 0.75). When this threshold is exceeded, performance degrades due to increased collisions. To prevent this, the table must resize dynamically.

Resizing involves:

  • Creating a new, larger array (typically double the size).
  • Rehashing all existing keys into the new array.
  • Replacing the old table with the new one.

💡 Pro Tip: Resizing is expensive—$O(n)$—but amortized over many insertions, it’s still $O(1)$ per operation.

🔁 Resizing in Action

(Visualization: a full 4-slot table is replaced by a new 8-slot table, and every existing key is rehashed into the larger array.)
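The three resize steps above can be sketched as a minimal open-addressing table with linear probing that doubles itself whenever an insert would push the load factor past 0.75. The class and method names here are illustrative, not from any particular library:

```python
class ResizableHashTable:
    """Minimal linear-probing table that doubles when load factor > 0.75."""

    def __init__(self, size=4):
        self.size = size
        self.slots = [None] * size   # Each slot holds a (key, value) pair
        self.count = 0

    def _hash(self, key):
        return hash(key) % self.size

    def insert(self, key, value):
        if (self.count + 1) / self.size > 0.75:
            self._resize()                        # Grow before overfilling
        index = self._hash(key)
        while self.slots[index] is not None:
            if self.slots[index][0] == key:
                self.slots[index] = (key, value)  # Update in place
                return
            index = (index + 1) % self.size       # Linear probe
        self.slots[index] = (key, value)
        self.count += 1

    def get(self, key):
        index = self._hash(key)
        while self.slots[index] is not None:
            if self.slots[index][0] == key:
                return self.slots[index][1]
            index = (index + 1) % self.size
        raise KeyError(key)

    def _resize(self):
        old_slots = self.slots
        self.size *= 2                            # 1. Allocate a larger array
        self.slots = [None] * self.size
        self.count = 0
        for entry in old_slots:                   # 2. Rehash every existing key
            if entry is not None:
                self.insert(*entry)               # 3. Old array is discarded
```

Inserting six keys into the default 4-slot table triggers one doubling to 8 slots; each resize costs $O(n)$, but doubling keeps the amortized cost per insert at $O(1)$.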

Tombstones: Handling Deletions in Open Addressing

In open addressing, deleting a key naively can break the probing chain, leading to missed lookups. To solve this, we use tombstones—special markers that indicate a slot was once occupied but is now available for reuse.

Here’s how tombstones work:

  • When a key is deleted, its slot is marked as a tombstone.
  • Probing continues past tombstones during lookups.
  • Insertions can overwrite tombstones.

🪦 Tombstone Visualization

(Visualization: a slot holding Key A, then a tombstone left by a deleted key, then Key C; a lookup for Key C must probe past the tombstone rather than stopping at it.)

Code Example: Tombstone-Based Hash Table


class TombstoneHashTable:
    def __init__(self, size=8):
        self.size = size
        self.keys = [None] * size
        self.values = [None] * size
        self.deleted = [False] * size  # Tombstone markers
        self.count = 0

    def _hash(self, key):
        return hash(key) % self.size

    def insert(self, key, value):
        index = self._hash(key)
        original_index = index
        first_tombstone = None
        while self.keys[index] is not None:
            if self.deleted[index]:
                if first_tombstone is None:
                    first_tombstone = index       # Remember slot for reuse
            elif self.keys[index] == key:
                self.values[index] = value        # Update existing key
                return
            index = (index + 1) % self.size
            if index == original_index:
                break
        # Prefer reusing the earliest tombstone the probe passed
        if first_tombstone is not None:
            index = first_tombstone
        elif self.keys[index] is not None:
            raise Exception("Hash table is full")
        self.keys[index] = key
        self.values[index] = value
        self.deleted[index] = False
        self.count += 1

    def delete(self, key):
        index = self._hash(key)
        original_index = index
        while self.keys[index] is not None:
            if self.keys[index] == key and not self.deleted[index]:
                self.deleted[index] = True
                self.count -= 1
                return
            index = (index + 1) % self.size
            if index == original_index:
                break
        raise KeyError("Key not found")

    def get(self, key):
        index = self._hash(key)
        original_index = index
        while self.keys[index] is not None:
            if self.keys[index] == key and not self.deleted[index]:
                return self.values[index]
            index = (index + 1) % self.size
            if index == original_index:
                break
        raise KeyError("Key not found")

Mermaid.js: Tombstone Deletion Flow

graph TD
    A["Start Lookup"] --> B{Slot Occupied?}
    B -- Yes --> C{Is Tombstone?}
    C -- Yes --> D[Continue Probing]
    C -- No --> E{Key Match?}
    E -- Yes --> F[Return Value]
    E -- No --> D
    B -- No --> G[Key Not Found]
    D --> H{End of Probe?}
    H -- No --> B
    H -- Yes --> G

Key Takeaways

  • Dynamic resizing is essential for maintaining performance under load. Learn more about efficient resizing strategies.
  • Tombstones allow safe deletion in open addressing without breaking probe chains.
  • Resizing and tombstones are used in real-world systems like Python’s dict and JavaScript’s object maps.
  • For advanced collision handling, explore collision resolution strategies and their trade-offs.

Frequently Asked Questions

What is a hash table collision and why does it happen?

A hash table collision occurs when two different keys produce the same hash index. This happens because the number of possible keys is typically much larger than the number of available slots in the table.

Which collision resolution technique is best for performance?

It depends on the use case. Separate chaining is simpler and handles high load factors well, while open addressing techniques like double hashing offer better cache performance but require careful load factor management.

What is the load factor in a hash table?

The load factor is the ratio of the number of stored elements to the total number of slots in the hash table. It indicates how full the table is and affects performance and collision frequency.

How does separate chaining work in hash tables?

Separate chaining resolves collisions by storing multiple elements in the same slot using a data structure like a linked list. Each bucket holds a list of entries that hash to the same index.

What is the difference between linear probing and quadratic probing?

Linear probing resolves collisions by checking the next slot sequentially (index+1, index+2, etc.), while quadratic probing uses a quadratic step size (index+1², index+2², etc.) to reduce primary clustering.

What are tombstones in hash tables?

Tombstones are markers used in open addressing to indicate that a slot was previously occupied but is now deleted. They help maintain correct probing sequences during lookup operations.

Can hash table collisions affect performance?

Yes, frequent collisions increase the time required for insertions and lookups. Techniques like resizing and choosing a good hash function help maintain performance.
