Appearance
Hash Algorithms
A hash algorithm, also known as a digest algorithm, functions by computing a fixed-length output summary from any given set of input data.
Key Characteristics of Hash Algorithms
The most important characteristics of hash algorithms are:
- Deterministic Output: The same input always produces the same output.
- High Probability of Unique Outputs: Different inputs are highly likely to produce different outputs.
The purpose of a hash algorithm is to verify whether the original data has been tampered with.
Java’s hashCode()
Java strings use the hashCode()
method as a hash algorithm. Its input is any string, and the output is a fixed 4-byte int
:
java
"hello".hashCode(); // 0x5e918d2
"hello, java".hashCode(); // 0x7a9d88e8
"hello, bob".hashCode(); // 0xa0dbae2f
Two identical strings will always produce the same hashCode()
. Otherwise, hash-based structures like HashMap
would not function correctly. This is why when we override the equals()
method in a custom class, we must also correctly override the hashCode()
method.
Hash Collisions
A hash collision occurs when two different inputs produce the same output:
java
"AaAaAa".hashCode(); // 0x7460e8c0
"BBAaBB".hashCode(); // 0x7460e8c0
Some may ask: Can collisions be avoided? The answer is no. Collisions are inevitable because the output byte length is fixed. For example, Java’s hashCode()
produces a 4-byte integer, allowing for a maximum of 4,294,967,296 unique outputs. However, the input data length is unbounded, resulting in an infinite number of possible inputs. Thus, a hash algorithm maps an infinite input set to a finite output set, guaranteeing collisions.
Collision Probability and Security
Collisions themselves are not inherently problematic; the concern lies in the probability of collisions. The probability of collisions affects the security of the hash algorithm. A secure hash algorithm must satisfy:
- Low Collision Probability: Collisions should be rare.
- Unpredictable Output: It should be impossible to predict the output based on any input. This means that any single bit change in the input should produce a completely different output, making it difficult to reverse-engineer the input from the output (except by brute force).
For example, consider the following hypothetical hash algorithms:
plaintext
hashA("java001") = "123456"
hashA("java002") = "123457"
hashA("java003") = "123458"
With hashA
, it’s easy to reverse-engineer the input from the output because there’s a predictable pattern. This makes hashA
insecure. In contrast:
plaintext
hashB("java001") = "123456"
hashB("java002") = "580271"
hashB("java003") = ???
With hashB
, the outputs appear random and lack any discernible pattern, making it secure.
Common Hash Algorithms
Algorithm | Output Length (bits) | Output Length (bytes) |
---|---|---|
MD5 | 128 bits | 16 bytes |
SHA-1 | 160 bits | 20 bytes |
RipeMD-160 | 160 bits | 20 bytes |
SHA-256 | 256 bits | 32 bytes |
SHA-512 | 512 bits | 64 bytes |
The longer the output length, the harder it is to produce collisions, enhancing the algorithm's security.
Using Hash Algorithms in Java
Java’s standard library provides commonly used hash algorithms and a unified interface for them. Let’s take the MD5 algorithm as an example to see how to compute a hash from input:
java
import java.security.MessageDigest;
import java.util.HexFormat;
public class Main {
public static void main(String[] args) throws Exception {
// Create a MessageDigest instance:
MessageDigest md = MessageDigest.getInstance("MD5");
// Repeatedly call update to input data:
md.update("Hello".getBytes("UTF-8"));
md.update("World".getBytes("UTF-8"));
byte[] result = md.digest(); // 16 bytes: 68e109f0f40ca72a15e05cc22786f8e6
System.out.println(HexFormat.of().formatHex(result));
}
}
When using MessageDigest
, first obtain a MessageDigest
instance for the desired hash algorithm. Then, repeatedly call update(byte[])
to input data. After all inputs are processed, call the digest()
method to obtain the summary as a byte[]
array, and finally, convert it to a hexadecimal string.
Running the above code will produce the MD5 hash of the input "HelloWorld" as 68e109f0f40ca72a15e05cc22786f8e6
.
Applications of Hash Algorithms
Since the same input always produces the same output, any modification to the input will result in a different output.
Verifying Downloaded Files
When downloading software from a website, download pages often display the hash of the file:
To verify that the downloaded software is original and untampered, you can compute the hash of the local file and compare it with the hash provided by the official website. If they match, the file is downloaded correctly; otherwise, it has been tampered with.
Storing User Passwords
Another important use of hash algorithms is in storing user passwords. Storing raw passwords in a database poses significant security risks:
- Database Administrators: Can view user plaintext passwords.
- Data Breaches: If the database is compromised, hackers can obtain user plaintext passwords.
To authenticate users without storing raw passwords, store the hash of the passwords instead (e.g., MD5).
When a user inputs their raw password, the system computes the hash of the input and compares it with the stored hash. If they match, the password is correct; otherwise, it is incorrect.
Thus, the table storing usernames and password hashes should look like this:
username | password |
---|---|
bob | f30aa7a662c728b7407c54ae6bfd27d1 |
alice | 25d55ad283aa400af464c76d713c07ad |
tim | bed128365216c019988915ed3add75fb |
This way, database administrators cannot see user raw passwords. Even if the database is leaked, hackers cannot obtain the original passwords. To retrieve a user's original password, they must use brute-force methods, trying one password at a time until a hash matches the specified value.
Preventing Rainbow Table Attacks
When using hashed passwords, it’s crucial to prevent rainbow table attacks.
What is a Rainbow Table?
As mentioned earlier, if you only have the MD5 hash, you can reverse-engineer the plaintext password using brute-force methods. However, hackers can use precomputed tables of common passwords and their corresponding hashes, known as rainbow tables, to quickly reverse-engineer passwords.
For example, if there is a precomputed table like:
Common Password | MD5 Hash |
---|---|
hello123 | f30aa7a662c728b7407c54ae6bfd27d1 |
12345678 | 25d55ad283aa400af464c76d713c07ad |
passw0rd | bed128365216c019988915ed3add75fb |
19700101 | 570da6d5277a646f6552b8832012f5dc |
… | … |
20201231 | 6879c0ae9117b50074ce0a0d4c843060 |
A hacker can easily find that:
- bob’s MD5:
f30aa7a662c728b7407c54ae6bfd27d1
corresponds to the original passwordhello123
. - alice’s MD5:
25d55ad283aa400af464c76d713c07ad
corresponds to12345678
. - tim’s MD5:
bed128365216c019988915ed3add75fb
corresponds topassw0rd
.
This is why you should avoid using common passwords and avoid using birthdays as passwords.
Salting to Prevent Rainbow Table Attacks
Even if users use common passwords, you can mitigate rainbow table attacks by adding a random salt to each password before hashing. This method is known as salting:
plaintext
digest = md5(salt + inputPassword)
After salting, the database table should look like this:
username | salt | password |
---|---|---|
bob | H1r0a | a5022319ff4c56955e22a74abcc2c210 |
alice | 7$p2w | e5de688c99e961ed6e560b972dab8b6a |
tim | z5Sk9 | 1eee304b92dc0d105904e7ab58fd2f64 |
Salting ensures that even if users use common passwords, the rainbow tables become ineffective. Hackers cannot reverse-engineer the original passwords from the salted hashes.
SHA-1
SHA-1 is another hash algorithm with an output of 160 bits (20 bytes). Developed by the U.S. National Security Agency, the SHA family includes SHA-0 (deprecated), SHA-1, SHA-256, SHA-512, among others.
Using SHA-1 in Java is identical to using MD5; you only need to change the algorithm name to "SHA-1"
:
java
import java.security.MessageDigest;
import java.util.HexFormat;
public class Main {
public static void main(String[] args) throws Exception {
// Create a MessageDigest instance:
MessageDigest md = MessageDigest.getInstance("SHA-1");
// Repeatedly call update to input data:
md.update("Hello".getBytes("UTF-8"));
md.update("World".getBytes("UTF-8"));
byte[] result = md.digest(); // 20 bytes: db8ac1c259eb89d4a131b253bacfca5f319d54f2
System.out.println(HexFormat.of().formatHex(result));
}
}
Similarly, to compute SHA-256, pass "SHA-256"
as the algorithm name, and for SHA-512, pass "SHA-512"
. You can find all hash algorithms supported by the Java standard library here.
Note
Due to its shorter output length, MD5 is vulnerable to collisions within a short time frame and is no longer recommended for security purposes.
Summary
- Hash Algorithms can verify data integrity and detect tampering.
- Common Hash Algorithms include MD5 and SHA-1.
- When Storing Passwords as Hashes, consider rainbow table attacks and implement salting to enhance security.