December 13, 2020

Working with hashes in C++

Introduction/background

Contrary to other programming languages, C++ doesn't have a root base type like Java and C# have the Object class.

In those 2 languages, there are two special methods in the Object class:

  • Equality comparison
  • Hash code calculation

These methods have certain properties that must hold true (although this isn't enforced by any of the compilers or runtime).

For two given instances A and B, if A == B then Hash(A) == Hash(B).

For example:

Person A("John", "Doe");
Person B("John", "Doe");
Person C("Jane", "Doe");

If A == B returns true, it's expected that Hash(A) and Hash(B) return the same value K

In the case where A != C, then there is nothing special about Hash(A) and Hash(C)

Extending the same concept for collections, if we have an ordered collection (either explicitly, like an array where items have an index, or explicitly, like a sorted collection), same thing is generally true:

var A = new int[] { 1, 2, 3 };
var B = new int[] { 1, 2, 3 };
var C = new int[] { 3, 2, 1 };

We have the exact same situation as above, while in the next case:

var A = new HashSet<int>() { 1, 2, 3 };
var B = new HashSet<int>() { 3, 2, 1 };

It should be true that A == B and the hashes of A and B are also the same (in this case the collection is not ordered). Ignore syntax and particularities of specific languages, this should be the expected behavior of == and hash

Since those two languages have the common root class Object, the methods Equals and HashCode are defined as virtual methods there. Each derived class that plans to use equality and hash-related operations, should override those methods. If you don't do that, you'll have to live with the default implementation of Equals and HashCode, which, without knowing the business logic behind your class, will end up just comparing memory addresses. If you don't override Equals for the class Person above:

  • A == A will return true
  • A == B will return false

Why does it matter? In two words: object identity.

Identity matters

Objects are considered identical not because they occupy the same memory address (because two objects can't occupy the same memory address), but because they represent the same entity. In which case, an object that is constructed with the data typed by a user in a form (like the Person with first name "John" and last name "Doe") is the same as the object that is constructed with the data read from a database. Otherwise you would never be able to do any checks with the objects that represent the same entity but are created differently.

The case for the hash

In addition to the equality comparer, hash structures rely on a small digest (the hash) being calculated for a bigger object. So if you want to add objects to a hash table you need to be able to calculate the hash values correctly and in an expected way. Still using the examples above, if you want to build a hash table whose key is the type Person, if the hash of A and hash of B are different values (let's say, you forgot to override the HashCode method), the "same person" will potentially end up in two positions in the hash. So an operation like:

var myHash = new Dictionary<Person, int>(); // int is the age
myHash[A] = 0; // person is born
Person.WriteToDatabase(A); // save to the database
…
var B = Person.ReadFromDatabase(…); // read previously stored
myHash[B]++; // happy birthday

throws an exception because myHash[B] doesn't exist.

But why do I need both Equals and HashCode? Isn't HashCode sufficient?

HashCode would be enough if the computer had enough memory to store all possible hashed values and the hash values were 100% evenly distributed, which is not true. There will be collisions where two different objects yield the same hash value. In those cases, how do you tell if they are the same object or not? By using the equality operation. If the objects have the same hash value but they are different, then they are treated as different entities and some collision management has to happen in the hash structure to give the different object a different home.

Summarizing

If you are creating a new class XYZ that you intend to use as a key to a hash table, you need to override both Equals and HashCode operations, otherwise the hash table will be either inefficient or simply not work.

What about C++?

Now let's look at the particular case of C++… C++ has no common ancestor to all objects, so while all classes do have a default operator== that can be overridden, that only solves half of our problem. Without a common hash method, there are no easy ways to get them to work with the hash structures.

Going back to C#, that language has a very interesting feature called "extension methods". Extension methods in C# can be used to overcome a good practice of OO design that is to make your classes final so that behavior is not overridden by the consumers (imagine an Authenticate class that is not sealed and a hacker can override the CheckPassword method…). Most of the time you won't be able to extend framework classes or 3rd party classes. In order to add new operations to existing classes, one possibility is to define these extension methods. They are a syntactic sugar for creating static methods with an implicit this parameter. So you can add a method WriteToLog to any class, even if that class doesn't define an abstract or virtual WriteToLog method.

The equivalent to an extension method in C++ is something like this:

template<>
struct TheExtensionYouWantToAdd {
// add your extension implementation here
};

The std::unordered_map container

This template is defined in header <unordered_map> with a couple optional parameters:

template<
    class Key,
    class T,
    class Hash = std::hash<Key>,
    class KeyEqual = std::equal_to<Key>,
    class Allocator = std::allocator< std::pair<const Key, T> >
> class unordered_map;

Note that there is a default std::hash type for the Hash template parameter defined in header <functional>.

template< class Key >
struct hash;

std::hash<T> is defined for all the primitive types plus the pointer to any type. It is also defined for some of the STL types, like std::string and std::unique_ptr (full list here). This template defines the operator()(const T&) so that you can call std::hash<T>()(something with type T). That's exactly the operation that is called by std::unordered_map to calculate the hash value.

You can also define a specific std::equal_to<T> if you are using a 3rd party class that didn't override the operator==

See that everything in C++ is planned towards supporting legacy code, since we can't just go ahead and rebuild all the C++ code to add a base Object class everywhere.

A practical (broken) example

How can things go wrong

Suppose you defined a class Person like the following and try to use Person as the key.

struct Person {
    Person(const Person& other) : Person(other.First, other.Last) {}
    Person(const std::string& first, const std::string& last) 
        : First(first), Last(last) {}

    const std::string First;
    const std::string Last;
};

int main()
{
    std::unordered_map<Person, int> myMap;
}

This is the error you'll get:

Error        C2280        'std::_Uhash_compare<_Kty,_Hasher,_Keyeq>::_Uhash_compare(const std::_Uhash_compare<_Kty,_Hasher,_Keyeq> &)': attempting to reference a deleted function        cppscratch        C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Tools\MSVC\14.28.29333\include\unordered_map        51        

Very helpful and user friendly… Not! It doesn't even tell you where the error happened.

How to start making things right

You need to define a hash operation for the Person type, like this (add this code between the struct definition and main):

template<>
struct std::hash<Person> {
    std::size_t operator()(const Person& person) const {
        // std::string already has an implementation for std::hash
        // so we can combine the hash values of first and last name
        std::hash<std::string> stringHasher;

        return stringHasher(person.First) + stringHasher(person.Last);
    }
};

Now the code builds fine! But it's still broken…

It is missing the equality operator and you'll notice that as soon as you attempt to insert anything in the hash:

int main()
{
    std::unordered_map<Person, int> myMap;
    myMap.emplace(Person{ "John","Doe" }, 1);
}

This will fail with:

Severity        Code        Description        Project        File        Line        Suppression State
Error        C2676        binary '==': 'const _Ty' does not define this operator or a conversion to a typeacceptable to the predefined operator        cppscratch        C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Tools\MSVC\14.28.29333\include\xstddef        91        

If you add the code below also before main():

template<>
struct std::equal_to<Person> {
    bool operator()(const Person& lhs, const Person& rhs) const {
        // std::string already has an implementation for std::equal_to
        // so we leverage that for first and last name
        std::equal_to<std::string> stringEqualTo;

        return stringEqualTo(lhs.First, rhs.First)
            && stringEqualTo(lhs.Last, rhs.Last);
    }
};

Then it finally succeeds! To compile… But there is a bigger problem lingering around…

Let's check the behavior of the hash function by looking at 3 different people:

int main()
{
    std::hash<Person> personHasher;
    std::equal_to<Person> personEqual;

    Person p1{ "John", "Doe" };
    Person p2{ "John", "Doe" };
    Person p3{ "Doe", "John" };

    std::cout 
        << "equal=" << personEqual(p1,p2) 
        << " p1 hash=" << personHasher(p1) 
        << " p2 hash=" << personHasher(p2) << std::endl;

    std::cout 
        << "equal=" << personEqual(p1,p3) 
        << " p3 hash=" << personHasher(p3) << std::endl;
}

That code outputs this:

equal=1 p1 hash=2540568255 p2 hash=2540568255
equal=0 p3 hash=2540568255

Wait a minute! The first two are expected, right? It's the equal/hash property: if A == B, then hash(A)==hash(B)

How about p3? They are different but have the same hash… Also expected, right? Because we can't guarantee that different values will have different hashes.

But then comes trouble… Looking closer to the Person class, it has an order. "John Doe" is not the same as "Doe John", so we should have taken the order into consideration also for the hash calculation!

And this is where C++ STL is missing a critical functionality!

First, let's look at what went wrong…

Let's recap our definition of the Person hash:

return stringHasher(person.First) + stringHasher(person.Last);

See that if First and Last are simply switched the hash will be the same. This will also happen if the First and Last names are different but they yield the same hash.

What could be the impact of that?

Hash collisions are expected, but we should try to avoid them as much as possible, because they are expensive to work around. Techniques like open addressing, buckets of keys, fix the problem but then they require a O(N) cost (maybe slightly lower, if you make the code much more complex), which throws the O(1) cost of the hash out of the window.

So it is super important to have a great hashing function in order to minimize those collisions! Not doing that can cause hard-to-find performance issues.

How is this solved in C++?

While both C# and Java offer a standard hash combination routine, so that you can easily combine multiple hash values, C++ STL doesn't provide a hash combination utility function!

There are alternatives… We can define our own hash combinator, or we can use one from the STL competitor: boost. If you don't want to take another dependency (on boost), here's a convenient variadic template:

template <typename T, typename... TS>
inline void HashCombine(std::size_t& hashValue, T value, TS... values)

A fully functional example

We only need to replace our naïve hash combination (using addition) with the utility function:
template<>
struct std::hash<Person> {
    std::size_t operator()(const Person& person) const {
        std::size_t result = 0;

        HashCombine(result, person.First, person.Last);

        return result;
    }
};

Now running the same example as before, we get this:

equal=1 p1 hash=861162551 p2 hash=861162551

equal=0 p3 hash=1964666581

See that p3 hash is no longer the same as p1 and p2.

Problem solved!

Appendix – Show me the code!

inline void HashCombine(std::size_t& hashValue)
{
}

template <typename T, typename... TS>
inline void HashCombine(std::size_t& hashValue, const T value, const TS... values)
{
 static std::hash<T> hasher;

 // The constants below are arbitrary, but we follow the same logic as boost::hash_combine,
 // but with different rotation values 13 and 19.
 // (magic constant comes from here https://en.wikipedia.org/wiki/Tiny_Encryption_Algorithm)
 hashValue ^= hasher(value) + 0x9e3779b9 + (hashValue << 13) + (hashValue >> 19);
 HashCombine(hashValue, values...);
}

SeeSharpr by SeeSharpr is licensed under CC BY-SA 4.0