Hashing algorithms were invented in the mid-1970s and remain popular today. This landmark discovery was originally invented to accomodate computational storage constraints of the time period, but has widespread applications the original inventors could not have fathomed. To name a few, hashes verify cryptocurrency transactions on the blockchain, obfuscate sensitive information in record-linkage and even minimize exposure to data breaches. In this post I will delve into the technical details of hashing algorithms, introduce a popular hash function (SHA256), and discuss, in detail, applications of this ubiquitous tool.
At a high level, a hash function is a mathematical operation that maps an input value to a near-unique output, called a hash value or hash for short. An input string can vary in length, from a single character to paragraphs of text, but hashed outputs remain a fixed length. For example, the figure below demonstrates how slight changes to an input string, "every good boy does fine.", alters the output hash psuedo-randomly.
Hash functions are underpinned by a number of desireable statistical and mathematical
properties. In hash functions, the same one-way transformation is applied identically
to an input string which means hashed values depend only on the algorithm and the
input text. This differs from conventional cryptographic encryption which is
bi-directional and depends on a unique key that can be used for decryption. After
passing through a hash function it becomes nearly impossible to recover the original
input. Uniqueness is a desireable statistical property that gurantees each unique input
has a corresponding unique output. Most production hash functions are psuedo-unique
and there exists the possibilty of two input strings sharing an identical output.
In computer memory, all information is represented as bits, nothing more than a sequence of binary numbers. Bits are strung together to form bytes and 8 bits = 1 byte. You might recognize bytes from a file explorer and when you open a file with a recognizable file extension, your computer translates the underlying bytes into human readable text. Hash functions work by shifting these underlying bits using a mathematical operator (usually XOR) with a large prime number, which is generally fixed to ensure identifiablility. Different hash functions use different operators and one of the most popular is the SHA256.
SHA was first developed by the NSA in the 1990s and stands for Secure Hashing Algorithm. SHA1 is the predecessor to SHA2, has been phased out because of a recent vulnerability that makes it easier to crack using a brute force guess & check attack. For something to be considered a production grade hash function it must be fast, deterministic, “irreversible”, and pseudo-random. Slight changes in input should dramatically change the output. The SHA2 family includes SHA224, SHA256, SHA384 and SHA512, where the three-digit number indicates the output size in bits. These variations provide the ability to generate longer hashes, considered more secure. Detailed algorithmic information regarding the SHA mandated by NIST can be found here, but basic details are below:
The utility of hash functions is derived from the near unique, one-way transformation of an input value to an unrecognizable output. A wide variety of applications make use of hash functions and I'll briefly review some examples.
Hashes obfuscate sensitive information, while preserving the uniqueness of the original value linking records like a primary key. Personally identifiable information ,or PII, often uniquely defines an observation in real world studies. PII can be sensitive, like social security numbers, and needs to be masked for compliance. Hashed copies of PII can be used in place of the original data fields for accurate record linkage that could otherwise be very difficult.
When you login on a secure application, like a mobile banking app, you unknowingly are taking advantage of hash functions! In typical applications, your password is locally converted to a hash and transferred to an application server for authentication. The server compares the user supplied hash with a hashed copy stored in a database. By storing a hashed copy of a password, instead of the original, it preserves user privacy and avoids the unnecessary risk in the event of a data breach as hashed credentials are useless to hackers. Meta (formerly Facebook) has been in the news for storing millions of user passwords in plain text!
Hash functions define data structures in programming languages called hashmaps and an example in Python is a dictionary. When a key is passed to dictonary it's hashed and uniquely identifies a location in memory of the dictonary "value". This lookup is computationally fast (O(c)) and hashmaps should always be used to reduce search times if possible.
In this article I reviewed the basic structure of hash functions, why they are desireable and reviewed some basic applications.