Skip to content
On this page

Introduction to Hash Algorithms

Python's hashlib provides common hash algorithms like MD5, SHA1, and more.

What is a hash algorithm? A hash algorithm, also known as a digest algorithm or hashing algorithm, converts data of arbitrary length into a fixed-length string (usually represented in hexadecimal) through a function.

For example, if you wrote an article with the content 'how to use python hashlib - by Michael' and its hash is '2d73d4f15c0db7f5ecb321b6a65e5d6d', any tampering of the article would change the hash, allowing you to identify the modification.

In essence, a hash algorithm uses a hash function hash(data) to compute a fixed-length hash digest from any length of data, aimed at detecting if the original data has been altered.

The ability of hash algorithms to identify tampering comes from the fact that hash functions are one-way; calculating digest=hash(data) is easy, but reversing it to find data is extremely difficult. Furthermore, modifying even a single bit of the original data results in a completely different hash.

Taking the common MD5 hash algorithm as an example, you can compute the MD5 value of a string:

python
import hashlib

md5 = hashlib.md5()
md5.update('how to use md5 in python hashlib?'.encode('utf-8'))
print(md5.hexdigest())

The output will be:

d26a53750bc40b38b65a520292f69306

For large datasets, you can call update() multiple times, and the final result will be the same:

python
import hashlib

md5 = hashlib.md5()
md5.update('how to use md5 in '.encode('utf-8'))
md5.update('python hashlib?'.encode('utf-8'))
print(md5.hexdigest())

Try changing a letter and see if the result is completely different.

MD5 is the most common hash algorithm, known for its speed, producing a fixed result of 128 bits/16 bytes, typically represented as a 32-character hexadecimal string.

Another common hash algorithm is SHA1, which is called similarly:

python
import hashlib

sha1 = hashlib.sha1()
sha1.update('how to use sha1 in '.encode('utf-8'))
sha1.update('python hashlib?'.encode('utf-8'))
print(sha1.hexdigest())

SHA1 produces a result of 160 bits/20 bytes, usually represented as a 40-character hexadecimal string.

More secure algorithms include SHA256 and SHA512, but these are slower and produce longer hashes.

Is it possible for two different pieces of data to produce the same hash through a hash algorithm? Absolutely, as any hash algorithm maps an infinite set of data to a finite set. This scenario is called a collision; for example, if Bob tries to generate an article titled 'how to learn hashlib in python - by Bob' with the same hash as yours, it is theoretically possible, but very difficult.

Applications of Hash Algorithms

Where can hash algorithms be applied? A common example is:

Any website allowing user login will store usernames and passwords. How to store them? One way is to save them in a database table:

name	password
michael	123456
bob	abc999
alice	alice2008

If passwords are stored in plain text, a database breach could expose all users' passwords. Additionally, site administrators can access the database, thus retrieving all user passwords.

The correct way to store passwords is to save the hash of the passwords, like MD5:

username	password
michael	e10adc3949ba59abbe56e057f20f883e
bob	878ef96e86145580c38c87f0410ad153
alice	99b1c2188db85afee403b1536010c2c9

When a user logs in, the entered plain text password is hashed using MD5 and compared with the stored hash; if they match, the password is correct.

Exercise

Calculate the MD5 hash of a password based on user input:

python
def calc_md5(password):
    pass

Storing the MD5 hash means that even if administrators access the database, they cannot know the users' plain text passwords.

Design a function to validate user login, returning True or False based on the correctness of the entered password:

python
db = {
    'michael': 'e10adc3949ba59abbe56e057f20f883e',
    'bob': '878ef96e86145580c38c87f0410ad153',
    'alice': '99b1c2188db85afee403b1536010c2c9'
}

def login(user, password):
    pass

Tests

python
assert login('michael', '123456')
assert login('bob', 'abc999')
assert login('alice', 'alice2008')
assert not login('michael', '1234567')
assert not login('bob', '123456')
assert not login('alice', 'Alice2008')
print('ok')

Is storing passwords with MD5 secure? Not necessarily. If a hacker obtains a database of stored MD5 passwords, how can they reverse-engineer users' plain text passwords? Brute-force cracking is labor-intensive; experienced hackers have better methods.

Many users tend to use simple passwords like 123456, 888888, or password. Therefore, hackers can precompute the MD5 values of these common passwords, creating a reverse table:

python
hash_to_plain = {
    'e10adc3949ba59abbe56e057f20f883e': '123456',
    '21218cca77804d2ba1922c33e0151105': '888888',
    '5f4dcc3b5aa765d61d8327deb882cf99': 'password',
    '...': '...'
}

By comparing MD5 hashes in the database, hackers can obtain user accounts with common passwords.

Users should avoid using overly simple passwords, but can we enhance protection in program design against simple passwords?

Since common password MD5 values are easily computed, we must ensure that stored user passwords aren't those common password MD5s. This can be done by adding a complex string to the original password, known as "salting":

python
def calc_md5(password):
    return get_md5(password + 'the-Salt')

MD5 hashes processed with salt ensure that even if a user inputs a simple password, it becomes challenging to reverse-engineer the plain text password as long as the salt remains unknown.

However, if two users use the same simple password, the database will store two identical MD5 values, indicating they share the same password. Can we have different MD5 values for users with the same password?

Assuming users cannot modify their login names, we can use the username as part of the salt to compute the MD5, ensuring that users with the same password have different MD5s.

Exercise

Simulate user registration based on input username and password, calculating a more secure MD5:

python
db = {}

def register(username, password):
    db[username] = get_md5(password + username + 'the-Salt')

Then, implement user login verification with the modified MD5 algorithm:

python
import hashlib, random

class User(object):
    def __init__(self, username, password):
        self.username = username
        self.salt = ''.join([chr(random.randint(48, 122)) for i in range(20)])
        self.password = get_md5(password + self.salt)

db = {
    'michael': User('michael', '123456'),
    'bob': User('bob', 'abc999'),
    'alice': User('alice', 'alice2008')
}

def get_md5(user, pws):
    return ???

def login(username, password):
    user = db[username]
    return user.password == get_md5(user, password)

Tests

python
assert login('michael', '123456')
assert login('bob', 'abc999')
assert login('alice', 'alice2008')
assert not login('michael', '1234567')
assert not login('bob', '123456')
assert not login('alice', 'Alice2008')
print('ok')

Summary

Hash algorithms are widely applied in various areas. It's important to note that hash algorithms are not encryption algorithms and cannot be used for encryption (since plain text cannot be reversed from a hash); they are solely for tamper detection, leveraging their one-way calculation property to verify user passwords without storing plain text.

Introduction to Hash Algorithms has loaded