Basics of Hashing Algorithms

11/16/2019

Andrew Amore

Overview

Hashing algorithms were invented in the mid-1970s and remain popular today. This landmark discovery was originally invented to accomodate computational storage constraints of the time period, but has widespread applications the original inventors could not have fathomed. To name a few, hashes verify cryptocurrency transactions on the blockchain, obfuscate sensitive information in record-linkage and even minimize exposure to data breaches. In this post I will delve into the technical details of hashing algorithms, introduce a popular hash function (SHA256), and discuss, in detail, applications of this ubiquitous tool.

What are Hash Functions?

At a high level, a hash function is a mathematical operation that maps an input value to a near-unique output, called a hash value or hash for short. An input string can vary in length, from a single character to paragraphs of text, but hashed outputs remain a fixed length. For example, the figure below demonstrates how slight changes to an input string, "every good boy does fine.", alters the output hash psuedo-randomly.

Hash functions are underpinned by a number of desireable statistical and mathematical properties. In hash functions, the same one-way transformation is applied identically to an input string which means hashed values depend only on the algorithm and the input text. This differs from conventional cryptographic encryption which is bi-directional and depends on a unique key that can be used for decryption. After passing through a hash function it becomes nearly impossible to recover the original input. Uniqueness is a desireable statistical property that gurantees each unique input has a corresponding unique output. Most production hash functions are psuedo-unique and there exists the possibilty of two input strings sharing an identical output. When two inputs share an output it's referred as a collision and is generally ignored as collisons are very low probability events.

In computer memory, all information is represented as bits, nothing more than a sequence of binary numbers. Bits are strung together to form bytes and 8 bits = 1 byte. You might recognize bytes from a file explorer and when you open a file with a recognizable file extension, your computer translates the underlying bytes into human readable text. Hash functions work by shifting these underlying bits using a mathematical operator (usually XOR) with a large prime number, which is generally fixed to ensure identifiablility. Different hash functions use different operators and one of the most popular is the SHA256.

SHA256

SHA was first developed by the NSA in the 1990s and stands for Secure Hashing Algorithm. SHA1 is the predecessor to SHA2, has been phased out because of a recent vulnerability that makes it easier to crack using a brute force guess & check attack. For something to be considered a production grade hash function it must be fast, deterministic, “irreversible”, and pseudo-random. Slight changes in input should dramatically change the output. The SHA2 family includes SHA224, SHA256, SHA384 and SHA512, where the three-digit number indicates the output size in bits. These variations provide the ability to generate longer hashes, considered more secure. Detailed algorithmic information regarding the SHA mandated by NIST can be found here, but basic details are below:

1. The input string is transformed to ASCII char codes
2. Char codes are converted to binary and the binary value is padded until the length of the message is a multiple of 512. Padding adds bits in a known pattern, typically 10000... until the length matches the requirement
3. Padded message is broken down to blocks, each with size 512 bits (each block is composed of 16 data blocks of length 32 bits)
4. Initial hash values are initialized which consist of prime numbers with total length of 256 bits (8 inputs of 32 bits, which equals 256)
5. The message blocks are fed into the hashing algorithm, which shuffles the data in the blocks around (using xor, etc.)
6. Cycle repeats until all input blocks have been used and final hash is computed

Popular Uses for Hash Functions

The utility of hash functions is derived from the near unique, one-way transformation of an input value to an unrecognizable output. A wide variety of applications make use of hash functions and I'll briefly review some examples.

Record-Linkage

Hashes obfuscate sensitive information, while preserving the uniqueness of the original value linking records like a primary key. Personally identifiable information ,or PII, often uniquely defines an observation in real world studies. PII can be sensitive, like social security numbers, and needs to be masked for compliance. Hashed copies of PII can be used in place of the original data fields for accurate record linkage that could otherwise be very difficult.

User Authentication

When you login on a secure application, like a mobile banking app, you unknowingly are taking advantage of hash functions! In typical applications, your password is locally converted to a hash and transferred to an application server for authentication. The server compares the user supplied hash with a hashed copy stored in a database. By storing a hashed copy of a password, instead of the original, it preserves user privacy and avoids the unnecessary risk in the event of a data breach as hashed credentials are useless to hackers. Meta (formerly Facebook) has been in the news for storing millions of user passwords in plain text!

Data Structures

Hash functions define data structures in programming languages called hashmaps and an example in Python is a dictionary. When a key is passed to dictonary it's hashed and uniquely identifies a location in memory of the dictonary "value". This lookup is computationally fast (O(c)) and hashmaps should always be used to reduce search times if possible.

Conclusion

In this article I reviewed the basic structure of hash functions, why they are desireable and reviewed some basic applications.