Hash Algorithms

A hash algorithm, also known as a digest algorithm, functions by computing a fixed-length output summary from any given set of input data.

Key Characteristics of Hash Algorithms

The most important characteristics of hash algorithms are:

Deterministic Output: The same input always produces the same output.
High Probability of Unique Outputs: Different inputs are highly likely to produce different outputs.

The purpose of a hash algorithm is to verify whether the original data has been tampered with.

Java’s `hashCode()`

Java strings use the hashCode() method as a hash algorithm. Its input is any string, and the output is a fixed 4-byte int:

java

"hello".hashCode(); // 0x5e918d2
"hello, java".hashCode(); // 0x7a9d88e8
"hello, bob".hashCode(); // 0xa0dbae2f

Two identical strings will always produce the same hashCode(). Otherwise, hash-based structures like HashMap would not function correctly. This is why when we override the equals() method in a custom class, we must also correctly override the hashCode() method.

Hash Collisions

A hash collision occurs when two different inputs produce the same output:

java

"AaAaAa".hashCode(); // 0x7460e8c0
"BBAaBB".hashCode(); // 0x7460e8c0

Some may ask: Can collisions be avoided? The answer is no. Collisions are inevitable because the output byte length is fixed. For example, Java’s hashCode() produces a 4-byte integer, allowing for a maximum of 4,294,967,296 unique outputs. However, the input data length is unbounded, resulting in an infinite number of possible inputs. Thus, a hash algorithm maps an infinite input set to a finite output set, guaranteeing collisions.

Collision Probability and Security

Collisions themselves are not inherently problematic; the concern lies in the probability of collisions. The probability of collisions affects the security of the hash algorithm. A secure hash algorithm must satisfy:

Low Collision Probability: Collisions should be rare.
Unpredictable Output: It should be impossible to predict the output based on any input. This means that any single bit change in the input should produce a completely different output, making it difficult to reverse-engineer the input from the output (except by brute force).

For example, consider the following hypothetical hash algorithms:

plaintext

hashA("java001") = "123456"
hashA("java002") = "123457"
hashA("java003") = "123458"

With hashA, it’s easy to reverse-engineer the input from the output because there’s a predictable pattern. This makes hashA insecure. In contrast:

plaintext

hashB("java001") = "123456"
hashB("java002") = "580271"
hashB("java003") = ???

With hashB, the outputs appear random and lack any discernible pattern, making it secure.

Common Hash Algorithms

Algorithm	Output Length (bits)	Output Length (bytes)
MD5	128 bits	16 bytes
SHA-1	160 bits	20 bytes
RipeMD-160	160 bits	20 bytes
SHA-256	256 bits	32 bytes
SHA-512	512 bits	64 bytes

The longer the output length, the harder it is to produce collisions, enhancing the algorithm's security.

Using Hash Algorithms in Java

Java’s standard library provides commonly used hash algorithms and a unified interface for them. Let’s take the MD5 algorithm as an example to see how to compute a hash from input:

java

import java.security.MessageDigest;
import java.util.HexFormat;

public class Main {
    public static void main(String[] args) throws Exception {
        // Create a MessageDigest instance:
        MessageDigest md = MessageDigest.getInstance("MD5");
        // Repeatedly call update to input data:
        md.update("Hello".getBytes("UTF-8"));
        md.update("World".getBytes("UTF-8"));
        byte[] result = md.digest(); // 16 bytes: 68e109f0f40ca72a15e05cc22786f8e6
        System.out.println(HexFormat.of().formatHex(result));
    }
}

When using MessageDigest, first obtain a MessageDigest instance for the desired hash algorithm. Then, repeatedly call update(byte[]) to input data. After all inputs are processed, call the digest() method to obtain the summary as a byte[] array, and finally, convert it to a hexadecimal string.

Running the above code will produce the MD5 hash of the input "HelloWorld" as 68e109f0f40ca72a15e05cc22786f8e6.

Applications of Hash Algorithms

Since the same input always produces the same output, any modification to the input will result in a different output.

Verifying Downloaded Files

When downloading software from a website, download pages often display the hash of the file:

To verify that the downloaded software is original and untampered, you can compute the hash of the local file and compare it with the hash provided by the official website. If they match, the file is downloaded correctly; otherwise, it has been tampered with.

Storing User Passwords

Another important use of hash algorithms is in storing user passwords. Storing raw passwords in a database poses significant security risks:

Database Administrators: Can view user plaintext passwords.
Data Breaches: If the database is compromised, hackers can obtain user plaintext passwords.

To authenticate users without storing raw passwords, store the hash of the passwords instead (e.g., MD5).

When a user inputs their raw password, the system computes the hash of the input and compares it with the stored hash. If they match, the password is correct; otherwise, it is incorrect.

Thus, the table storing usernames and password hashes should look like this:

username	password
bob	f30aa7a662c728b7407c54ae6bfd27d1
alice	25d55ad283aa400af464c76d713c07ad
tim	bed128365216c019988915ed3add75fb

This way, database administrators cannot see user raw passwords. Even if the database is leaked, hackers cannot obtain the original passwords. To retrieve a user's original password, they must use brute-force methods, trying one password at a time until a hash matches the specified value.

Preventing Rainbow Table Attacks

When using hashed passwords, it’s crucial to prevent rainbow table attacks.

What is a Rainbow Table?

As mentioned earlier, if you only have the MD5 hash, you can reverse-engineer the plaintext password using brute-force methods. However, hackers can use precomputed tables of common passwords and their corresponding hashes, known as rainbow tables, to quickly reverse-engineer passwords.

For example, if there is a precomputed table like:

Common Password	MD5 Hash
hello123	f30aa7a662c728b7407c54ae6bfd27d1
12345678	25d55ad283aa400af464c76d713c07ad
passw0rd	bed128365216c019988915ed3add75fb
19700101	570da6d5277a646f6552b8832012f5dc
…	…
20201231	6879c0ae9117b50074ce0a0d4c843060

A hacker can easily find that:

bob’s MD5: f30aa7a662c728b7407c54ae6bfd27d1 corresponds to the original password hello123.
alice’s MD5: 25d55ad283aa400af464c76d713c07ad corresponds to 12345678.
tim’s MD5: bed128365216c019988915ed3add75fb corresponds to passw0rd.

This is why you should avoid using common passwords and avoid using birthdays as passwords.

Salting to Prevent Rainbow Table Attacks

Even if users use common passwords, you can mitigate rainbow table attacks by adding a random salt to each password before hashing. This method is known as salting:

plaintext

digest = md5(salt + inputPassword)

After salting, the database table should look like this:

username	salt	password
bob	H1r0a	a5022319ff4c56955e22a74abcc2c210
alice	7$p2w	e5de688c99e961ed6e560b972dab8b6a
tim	z5Sk9	1eee304b92dc0d105904e7ab58fd2f64

Salting ensures that even if users use common passwords, the rainbow tables become ineffective. Hackers cannot reverse-engineer the original passwords from the salted hashes.

SHA-1

SHA-1 is another hash algorithm with an output of 160 bits (20 bytes). Developed by the U.S. National Security Agency, the SHA family includes SHA-0 (deprecated), SHA-1, SHA-256, SHA-512, among others.

Using SHA-1 in Java is identical to using MD5; you only need to change the algorithm name to "SHA-1":

java

import java.security.MessageDigest;
import java.util.HexFormat;

public class Main {
    public static void main(String[] args) throws Exception {
        // Create a MessageDigest instance:
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        // Repeatedly call update to input data:
        md.update("Hello".getBytes("UTF-8"));
        md.update("World".getBytes("UTF-8"));
        byte[] result = md.digest(); // 20 bytes: db8ac1c259eb89d4a131b253bacfca5f319d54f2
        System.out.println(HexFormat.of().formatHex(result));
    }
}

Similarly, to compute SHA-256, pass "SHA-256" as the algorithm name, and for SHA-512, pass "SHA-512". You can find all hash algorithms supported by the Java standard library here.

Note

Due to its shorter output length, MD5 is vulnerable to collisions within a short time frame and is no longer recommended for security purposes.

Summary

Hash Algorithms can verify data integrity and detect tampering.
Common Hash Algorithms include MD5 and SHA-1.
When Storing Passwords as Hashes, consider rainbow table attacks and implement salting to enhance security.

Hash Algorithms ​

Key Characteristics of Hash Algorithms ​

Java’s hashCode() ​

Hash Collisions ​

Collision Probability and Security ​

Common Hash Algorithms ​

Using Hash Algorithms in Java ​

Applications of Hash Algorithms ​

Verifying Downloaded Files ​

Storing User Passwords ​

Preventing Rainbow Table Attacks ​

SHA-1 ​

Summary ​

Hash Algorithms

Key Characteristics of Hash Algorithms

Java’s `hashCode()`

Hash Collisions

Collision Probability and Security

Common Hash Algorithms

Using Hash Algorithms in Java

Applications of Hash Algorithms

Verifying Downloaded Files

Storing User Passwords

Preventing Rainbow Table Attacks

SHA-1

Summary