The goal of this Hash Table is to implement the following operations in O(1) average time complexity:
Insert(), Remove(), Get(), GetFirst(), GetLast()
To implement O(1) complexity for GetFirst() and GetLast(), a least-recently-used (LRU) cache was used, which is typically implemented with a map and a doubly linked list. GetFirst() reads one end of that list, and GetLast() reads the opposite end.
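For illustration, here is a minimal sketch of that map + doubly linked list bookkeeping. The class and member names are assumptions for this sketch, not the actual dw::HashTable internals; a successful Insert() would call Touch(), and Remove() would call Erase().

// Sketch of the LRU bookkeeping behind GetFirst()/GetLast().
// Names here are illustrative, not the real dw::HashTable members.
#include <iterator>
#include <list>
#include <string>
#include <unordered_map>

class LruOrder {
public:
    // Called on every successful insert; re-touching a key moves it to the back.
    void Touch(const std::wstring& key) {
        auto it = m_pos.find(key);
        if (it != m_pos.end()) m_order.erase(it->second); // unlink old position, O(1)
        m_order.push_back(key);
        m_pos[key] = std::prev(m_order.end());
    }
    void Erase(const std::wstring& key) {
        auto it = m_pos.find(key);
        if (it == m_pos.end()) return;
        m_order.erase(it->second); // O(1) unlink via the stored iterator
        m_pos.erase(it);
    }
    // Least recently touched key (the GetFirst() end).
    const std::wstring* First() const { return m_order.empty() ? nullptr : &m_order.front(); }
    // Most recently touched key (the GetLast() end).
    const std::wstring* Last() const { return m_order.empty() ? nullptr : &m_order.back(); }
private:
    std::list<std::wstring> m_order; // front = least recent, back = most recent
    std::unordered_map<std::wstring, std::list<std::wstring>::iterator> m_pos;
};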
The Hash Table is designed such that if a new key cannot be found in any bucket and all buckets are occupied, the new key is not added and the user is alerted. Collisions are resolved with linear probing: starting from the key's home bucket, the table steps forward one bucket at a time, wrapping around, until it finds an empty bucket.
By default, the hash function is naively implemented as the sum of the key's characters. A more robust implementation, such as one seeded with random numbers at program launch, would spread keys more evenly and allow more meaningful runtime statistical analysis of the Hash Table implementation.
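For illustration, a sketch of the naive character-sum hash alongside one possible randomized variant. The project's actual computeHash() may differ in detail, and the randomized scheme shown here is an assumption, not the author's implementation.

// Naive default: sum of character codes, wrapped into the table.
#include <cstddef>
#include <cstdint>
#include <random>
#include <string>
#include <vector>

std::size_t naiveHash(const std::wstring& key, std::size_t tableSize) {
    std::size_t sum = 0;
    for (wchar_t c : key) sum += static_cast<std::size_t>(c);
    return sum % tableSize;
}

// Randomized variant: per-position multipliers drawn once at launch,
// so anagrams such as L"abc" and L"cba" no longer collide by construction.
struct RandomizedHash {
    std::vector<std::uint64_t> coeff;
    explicit RandomizedHash(std::size_t maxLen = 64) {
        std::mt19937_64 rng(std::random_device{}());
        coeff.resize(maxLen);
        for (auto& c : coeff) c = rng();
    }
    std::size_t operator()(const std::wstring& key, std::size_t tableSize) const {
        std::uint64_t h = 0;
        for (std::size_t i = 0; i < key.size(); ++i)
            h += coeff[i % coeff.size()] * static_cast<std::uint64_t>(key[i]);
        return static_cast<std::size_t>(h % tableSize);
    }
};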
Linear Probing Collision Handling
bool HashTable::insertIdx(const std::wstring& key, int& outIdx, bool& isKeyUpdated) const {
    outIdx = computeHash(key);
    isKeyUpdated = false;
    bool toInsert = false; // do not insert the key if no bucket can be found (hash table is saturated)
    if (!m_tableEntries[outIdx].hasValue) { // if the home bucket is not occupied, take it for the key
        return true;
    }
    else { // Collision: search all buckets for the key and stop when found
        for (auto [idx, i] = std::tuple{ outIdx, 0 }; i < m_tableSize; (idx = (idx + 1) % m_tableSize), i++) { // linear probe (max iterations = table size)
            if (m_tableEntries[idx].hasValue && m_tableEntries[idx].key.compare(key) == 0) {
                outIdx = idx;
                toInsert = true;
                isKeyUpdated = true;
                break; // The key already exists in the table. Update it in place.
            }
        }
        if (!toInsert) { // The key is not in the table. Occupy the next free bucket, if available.
            for (auto [idx, i] = std::tuple{ outIdx, 0 }; i < m_tableSize; (idx = (idx + 1) % m_tableSize), i++) { // linear probe (max iterations = table size)
                if (!m_tableEntries[idx].hasValue) {
                    outIdx = idx;
                    toInsert = true;
                    break; // Insert the key into the free bucket
                }
                // If no free bucket can be found, the hash table is saturated and cannot insert new keys.
                // Without this check, a new key could overwrite an existing key and silently lose data.
            }
        }
    }
    return toInsert;
}
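For context, Insert() plausibly wraps insertIdx() along these lines. This is a sketch under assumptions: m_lru (an instance of the LruOrder sketch above) and the entry fields are hypothetical names, not confirmed members.

// Hypothetical sketch of Insert() built on insertIdx().
bool HashTable::Insert(const std::wstring& key, int value) {
    int idx = 0;
    bool isKeyUpdated = false; // could drive update statistics; unused here
    if (!insertIdx(key, idx, isKeyUpdated)) {
        return false; // saturated: the caller is alerted instead of data being lost
    }
    m_tableEntries[idx].key = key;
    m_tableEntries[idx].value = value;
    m_tableEntries[idx].hasValue = true;
    m_lru.Touch(key); // keep GetFirst()/GetLast() bookkeeping in sync
    return true;
}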
Test Bench
std::cout << "Hello World Hash Table!" << '\n';
unsigned int tableSize = 150000;
if (argc != 3) {
    std::cerr << "Usage: " << argv[0] << " <filepath> <tableSize>" << std::endl;
    return 1;
}
std::string filePath = argv[1];
int inputSize = std::atoi(argv[2]);
if (inputSize < 0) {
    std::cerr << "Error: tableSize must be a non-negative integer." << std::endl;
    return 1; // Exit with an error code
}
else {
    tableSize = static_cast<unsigned int>(inputSize); // Update tableSize if valid
}
std::cout << "File path: " << filePath << '\n';
std::cout << "Table size: " << tableSize << '\n';
// On Linux platforms, wstring is wasteful
// Note: MSVC (UTF-16) | Linux (UTF-32) support using wstring
std::vector<std::wstring> tokens;
if (!dw::parseFileIntoTokens(std::filesystem::path(filePath), tokens)) {
    return -1;
}
// debug -> ensure that parsing a unicode file into tokens also writes the tokens back out as unicode
//dw::writeTokensToFile(DW_DATA_DIR "/debug/debug_token_output.txt", tokens);
dw::HashTable ht(tableSize);
int i = 0;
int numFailedInserts = 0;
for (const auto& token : tokens) {
    if (!ht.Insert(token, i)) {
        // Here, the saturation check prevents a new key from inadvertently overwriting an older key.
        // Rather than losing data silently, the Hash Table alerts the user when it cannot insert.
        std::wcout << "Unable to enter " << token << " idx " << i << " into hash table. It is saturated." << '\n';
        numFailedInserts++;
    }
    i++;
}
std::cout << "Num of tokens " << i << '\n';
std::cout << "Num of failed token inserts " << numFailedInserts << '\n';
{
    std::wcout << '\n' << "View First and Last Entries" << '\n';
    dw::HashTableViewer htv(&ht);
    int viewAmount = 5;
    std::wcout << "First " << viewAmount << " in Hash Table." << '\n';
    htv.ViewFirst(viewAmount);
    std::wcout << "Last " << viewAmount << " in Hash Table." << '\n';
    htv.ViewLast(viewAmount);
}
{
    std::wcout << '\n' << "Getting a known key" << '\n';
    std::wstring key{ L"French" }; // very likely to be in the book
    auto [value, success] = ht.Get(key);
    if (success) {
        std::wcout << "Key " << key << " Value " << value << '\n';
    }
    else {
        std::wcout << "Key " << key << " not found." << '\n';
    }
}
{
    std::wcout << '\n' << "Getting an unknown key" << '\n';
    std::wstring key{ L"francaiserevolution" }; // unlikely to be in the book (spelt incorrectly)
    auto [value, success] = ht.Get(key);
    if (success) {
        std::wcout << "Key " << key << " Value " << value << '\n';
    }
    else {
        std::wcout << "Key " << key << " not found." << '\n';
    }
}
{
    std::wcout << '\n' << "Getting a first key" << '\n';
    auto [key, value, success] = ht.GetFirst();
    if (success) {
        std::wcout << "First Key " << key << " Value " << value << '\n';
    }
}
{
    std::wcout << '\n' << "Getting a last key" << '\n';
    auto [key, value, success] = ht.GetLast();
    if (success) {
        std::wcout << "Last Key " << key << " Value " << value << '\n';
    }
}
{
    std::wcout << '\n' << "Viewing First and Last Entries before and after entering new key" << '\n';
    dw::HashTableViewer htv(&ht);
    int viewAmount = 5;
    std::wcout << "First " << viewAmount << " in Hash Table." << '\n';
    htv.ViewFirst(viewAmount);
    std::wcout << "Last " << viewAmount << " in Hash Table." << '\n';
    htv.ViewLast(viewAmount);
    auto [key, value, success] = ht.GetFirst();
    if (success) {
        ht.Remove(key);
    }
    auto [lastkey, lastvalue, lastsuccess] = ht.GetLast();
    if (lastsuccess) {
        ht.Remove(lastkey);
        ht.Insert(L"francaiserevolution", 1789); // insert unlikely key
    }
    std::wcout << "First " << viewAmount << " in Hash Table." << '\n';
    htv.ViewFirst(viewAmount);
    std::wcout << "Last " << viewAmount << " in Hash Table." << '\n';
    htv.ViewLast(viewAmount);
}
Results
Results with a table of 150,000 buckets.
In this case, the table does not saturate because the table size (150,000) exceeds even the total number of tokens in the data file (141,690), let alone the number of unique keys.
Hello World Hash Table!
File path: .\data\98-0.txt
Table size: 150000
File ".\\data\\98-0.txt" exists.
Num of tokens 141690
Num of failed token inserts 0
View First and Last Entries
First 5 in Hash Table.
Front 0 Key Title Value 98
Front 1 Key Tale Value 100
Front 2 Key Author Value 110
Front 3 Key Release Value 113
Front 4 Key Date Value 114
Last 5 in Hash Table.
Last 0 Key eBooks Value 141689
Last 1 Key new Value 141688
Last 2 Key about Value 141687
Last 3 Key hear Value 141686
Last 4 Key to Value 141685
Getting a known key
Key French Value 130776
Getting an unknown key
Key francaiserevolution not found.
Getting a first key
First Key Title Value 98
Getting a last key
Last Key eBooks Value 141689
Viewing First and Last Entries before and after entering new key
First 5 in Hash Table.
Front 0 Key Title Value 98
Front 1 Key Tale Value 100
Front 2 Key Author Value 110
Front 3 Key Release Value 113
Front 4 Key Date Value 114
Last 5 in Hash Table.
Last 0 Key eBooks Value 141689
Last 1 Key new Value 141688
Last 2 Key about Value 141687
Last 3 Key hear Value 141686
Last 4 Key to Value 141685
First 5 in Hash Table.
Front 0 Key Tale Value 100
Front 1 Key Author Value 110
Front 2 Key Release Value 113
Front 3 Key Date Value 114
Front 4 Key January Value 115
Last 5 in Hash Table.
Last 0 Key francaiserevolution Value 1789
Last 1 Key new Value 141688
Last 2 Key about Value 141687
Last 3 Key hear Value 141686
Last 4 Key to Value 141685
Results with a table of 100 buckets.
In this case, the table saturates quickly because it only has space for 100 unique keys. Note that of the 141,690 tokens, 104,510 inserts failed, so 37,180 succeeded: beyond the 100 that claimed buckets, the remaining successes were updates to keys already in the table, showing how often keys repeat in the text.
Hello World Hash Table!
File path: .\data\98-0.txt
Table size: 100
File ".\\data\\98-0.txt" exists.
Unable to enter TWO idx 148 into hash table. It is saturated.
Unable to enter CITIES idx 149 into hash table. It is saturated.
Unable to enter TWO idx 153 into hash table. It is saturated.
Unable to enter CITIES idx 154 into hash table. It is saturated.
Unable to enter STORY idx 156 into hash table. It is saturated.
... REMOVING EXCESS ENTRIES
Unable to enter newsletter idx 141684 into hash table. It is saturated.
Unable to enter hear idx 141686 into hash table. It is saturated.
Unable to enter about idx 141687 into hash table. It is saturated.
Unable to enter new idx 141688 into hash table. It is saturated.
Unable to enter eBooks idx 141689 into hash table. It is saturated.
Num of tokens 141690
Num of failed token inserts 104510
View First and Last Entries
First 5 in Hash Table.
Front 0 Key Title Value 98
Front 1 Key Tale Value 100
Front 2 Key Author Value 110
Front 3 Key Release Value 113
Front 4 Key Date Value 114
Last 5 in Hash Table.
Last 0 Key to Value 141685
Last 1 Key and Value 141677
Last 2 Key Gutenberg Value 141666
Last 3 Key Project Value 141665
Last 4 Key the Value 141664
Getting a known key
Key French Value 130776
Getting an unknown key
Key francaiserevolution not found.
Getting a first key
First Key Title Value 98
Getting a last key
Last Key to Value 141685
Viewing First and Last Entries before and after entering new key
First 5 in Hash Table.
Front 0 Key Title Value 98
Front 1 Key Tale Value 100
Front 2 Key Author Value 110
Front 3 Key Release Value 113
Front 4 Key Date Value 114
Last 5 in Hash Table.
Last 0 Key to Value 141685
Last 1 Key and Value 141677
Last 2 Key Gutenberg Value 141666
Last 3 Key Project Value 141665
Last 4 Key the Value 141664
First 5 in Hash Table.
Front 0 Key Tale Value 100
Front 1 Key Author Value 110
Front 2 Key Release Value 113
Front 3 Key Date Value 114
Front 4 Key January Value 115
Last 5 in Hash Table.
Last 0 Key francaiserevolution Value 1789
Last 1 Key and Value 141677
Last 2 Key Gutenberg Value 141666
Last 3 Key Project Value 141665
Last 4 Key the Value 141664
Notes for improvement
- Template the Hash Table to support built-in types other than wstring and int (see the sketch after this list).
  - Note: wstring is wasteful on Linux since it may use 4 bytes per character instead of 2 bytes as on Windows.
- Class support and class hashing functions.
- File logging instead of standard output / error
- Inject logging and specify verbosity using its CLI
- Unit Tests
- Statistical analysis to describe Hash Table performance.
- The LRU cache grows with the number of unique tokens inserted into the Hash Table. If memory needs to be saved, we could instead keep small bounded caches of, say, only the first 5 and last 5 entries.
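As a starting point for the templating note above, one possible shape for the interface, mirroring the operations used in the test bench. This is a hypothetical sketch; the current dw::HashTable is hard-coded to std::wstring keys and int values.

// Hypothetical templated interface; the Hash parameter would also cover
// the "class hashing functions" note by letting user types supply a hasher.
#include <functional>
#include <string>
#include <tuple>
#include <utility>

namespace dw {
template <typename Key = std::wstring,
          typename Value = int,
          typename Hash = std::hash<Key>>
class HashTable {
public:
    explicit HashTable(unsigned int tableSize);
    bool Insert(const Key& key, const Value& value);
    bool Remove(const Key& key);
    std::pair<Value, bool> Get(const Key& key) const;
    std::tuple<Key, Value, bool> GetFirst() const;
    std::tuple<Key, Value, bool> GetLast() const;
};
} // namespace dw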