Base64

Base64 is a method for representing arbitrary binary data using only 64 printable characters.

When we open files like exe, jpg, or pdf with Notepad, we see a bunch of garbled characters, because binary files contain many bytes that do not correspond to any displayable or printable character. Therefore, to let text-processing software like Notepad handle binary data, we need a way to convert binary data into a string. Base64 is one of the most common binary-to-text encoding methods.

The principle of Base64 is straightforward. First, prepare an array containing 64 characters:

['A', 'B', 'C', ... 'a', 'b', 'c', ... '0', '1', ... '+', '/']

Next, process the binary data in groups of three bytes, totaling 3x8=24 bits, which is divided into four groups of six bits each:

┌───────────────┬───────────────┬───────────────┐
│      b1       │      b2       │      b3       │
├─┬─┬─┬─┬─┬─┬─┬─┼─┬─┬─┬─┬─┬─┬─┬─┼─┬─┬─┬─┬─┬─┬─┬─┤
│ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │
├─┴─┴─┴─┴─┴─┼─┴─┴─┴─┴─┴─┼─┴─┴─┴─┴─┴─┼─┴─┴─┴─┴─┴─┤
│    n1     │    n2     │    n3     │    n4     │
└───────────┴───────────┴───────────┴───────────┘

This gives us four numbers, each between 0 and 63, which serve as indexes into the array. Looking up the four corresponding characters yields the encoded string.
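The grouping described above can be sketched by hand for a single three-byte chunk. This is only an illustration, not how the standard library implements it; the names `ALPHABET` and `encode_three_bytes` are made up for this example.

```python
import base64

# The standard Base64 character array: A-Z, a-z, 0-9, '+', '/'.
ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'

def encode_three_bytes(chunk):
    """Encode exactly three bytes into four Base64 characters (illustrative helper)."""
    assert len(chunk) == 3
    b1, b2, b3 = chunk
    # Join the three bytes into one 24-bit integer.
    bits = (b1 << 16) | (b2 << 8) | b3
    # Cut the 24 bits into four 6-bit indexes n1..n4.
    indexes = [(bits >> shift) & 0x3F for shift in (18, 12, 6, 0)]
    # Look up each index in the character array.
    return ''.join(ALPHABET[n] for n in indexes)

print(encode_three_bytes(b'Man'))          # TWFu
print(base64.b64encode(b'Man').decode())   # TWFu
```

The hand-written result matches what the built-in base64 module produces for the same three bytes.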

Thus, Base64 encoding transforms every three bytes of binary data into four bytes of text data, increasing the length by about 33% (one third). The benefit is that the encoded text can be displayed directly in email bodies, web pages, etc.
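To make the size growth concrete, the small check below compares a predicted length with the actual output of the base64 module. The helper `encoded_length` is a hypothetical name used only for this sketch, and it assumes padded output, so an input that is not a multiple of three bytes is rounded up to the next group of four characters.

```python
import base64

def encoded_length(n):
    """Length of the padded Base64 text for n input bytes (illustrative helper)."""
    # Every started group of three bytes produces four characters.
    return 4 * ((n + 2) // 3)

for n in (3, 30, 300):
    actual = len(base64.b64encode(bytes(n)))  # bytes(n) is n zero bytes
    print(n, encoded_length(n), actual)
# 3 4 4
# 30 40 40
# 300 400 400
```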

What if the length of the binary data is not a multiple of three bytes, so that one or two bytes are left over at the end? Base64 pads the last group with \x00 bytes and appends '=' signs to the encoded data to indicate how many bytes were padded: two '=' when one byte is left over, one '=' when two bytes are left over. During decoding, the padding is removed automatically.
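A quick way to see the padding rule is to encode inputs of one, two, and three bytes and compare the endings. The variable name `results` is only for this illustration.

```python
import base64

# Encode inputs whose lengths leave 1, 2, and 0 bytes over after grouping by three.
results = {data: base64.b64encode(data) for data in (b'a', b'ab', b'abc')}

for data, encoded in results.items():
    print(data, encoded)
# b'a' b'YQ=='    one leftover byte, two '=' signs
# b'ab' b'YWI='   two leftover bytes, one '=' sign
# b'abc' b'YWJj'  full group, no padding
```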

Python's built-in base64 module allows for straightforward Base64 encoding and decoding:

python

>>> import base64
>>> base64.b64encode(b'binary\x00string')
b'YmluYXJ5AHN0cmluZw=='
>>> base64.b64decode(b'YmluYXJ5AHN0cmluZw==')
b'binary\x00string'

Since standard Base64 encoding may result in the characters '+' and '/', which cannot be used directly as parameters in URLs, there is a "URL safe" Base64 encoding that substitutes '+' and '/' with '-' and '_' respectively:

python

>>> base64.b64encode(b'i\xb7\x1d\xfb\xef\xff')
b'abcd++//'
>>> base64.urlsafe_b64encode(b'i\xb7\x1d\xfb\xef\xff')
b'abcd--__'
>>> base64.urlsafe_b64decode('abcd--__')
b'i\xb7\x1d\xfb\xef\xff'

You can also define your own order of the 64 characters, allowing for custom Base64 encoding, but this is typically unnecessary.
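As a rough sketch of what a custom character order could look like, the standard output can be remapped with a translation table. The reversed alphabet and the names `custom_b64encode` and `custom_b64decode` are assumptions made up for this example; the base64 module itself has no such functions.

```python
import base64

STANDARD = b'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'
# Hypothetical custom order: the standard alphabet reversed.
CUSTOM = STANDARD[::-1]

# Translation tables between the two alphabets; '=' is not listed, so it is left unchanged.
ENCODE_TABLE = bytes.maketrans(STANDARD, CUSTOM)
DECODE_TABLE = bytes.maketrans(CUSTOM, STANDARD)

def custom_b64encode(data):
    """Encode with the standard alphabet, then remap to the custom one."""
    return base64.b64encode(data).translate(ENCODE_TABLE)

def custom_b64decode(text):
    """Map back to the standard alphabet, then decode normally."""
    return base64.b64decode(text.translate(DECODE_TABLE))

encoded = custom_b64encode(b'binary\x00string')
print(encoded)
print(custom_b64decode(encoded))  # b'binary\x00string'
```

Anyone who knows or guesses the character order can reverse the mapping just as easily, so a scheme like this only changes the appearance of the output.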

Base64 is a lookup-based encoding method and is not suitable for encryption, even with a custom encoding table.

Base64 is suitable for encoding small segments of content, such as digital certificate signatures, cookie contents, etc.

Standard Base64 output may end with the '=' character, which has a special meaning in URLs and cookies (it separates a key from its value), so using it there can cause ambiguity. Therefore, many Base64 encodings remove the trailing '=':

Standard Base64:

'abcd' -> 'YWJjZA=='

Automatically removing '=':

'abcd' -> 'YWJjZA'
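In Python, the stripped form can be produced by removing the trailing '=' from the encoder's output. The sketch below also shows that base64.b64decode rejects the stripped string as it stands; the variable names are only for illustration.

```python
import base64
import binascii

encoded = base64.b64encode(b'abcd')   # b'YWJjZA=='
stripped = encoded.rstrip(b'=')       # b'YWJjZA'
print(encoded, stripped)

# Decoding the stripped form directly fails, because its length is not a multiple of four.
try:
    base64.b64decode(stripped)
    decode_failed = False
except binascii.Error:
    decode_failed = True
print(decode_failed)                  # True
```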

How do we decode without the '='? Since Base64 transforms every three bytes into four characters, the length of a properly padded Base64 string is always a multiple of four. Therefore, we only need to append '=' until the length is a multiple of four, after which the string can be decoded normally.
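The number of missing '=' signs follows directly from the length of the string. The helper below, with the made-up name `add_padding`, is a minimal sketch of that arithmetic only; it does not validate the input or perform the decoding itself.

```python
def add_padding(s):
    """Append '=' until len(s) is a multiple of four (illustrative helper)."""
    # -len(s) % 4 is the number of characters missing to reach the next multiple of four.
    return s + '=' * (-len(s) % 4)

print(add_padding('YWJjZA'))  # YWJjZA==
print(add_padding('YWI'))     # YWI=
print(add_padding('YWJj'))    # YWJj
```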

Summary

Base64 is a method for encoding arbitrary binary data into text strings, commonly used for transmitting small amounts of binary data in URLs, cookies, and web pages.

Exercise

Please write a function that can decode Base64 strings whose trailing '=' has been removed:

python

import base64

def safe_base64_decode(s):
    pass

# Test:
assert b'abcd' == safe_base64_decode('YWJjZA=='), safe_base64_decode('YWJjZA==')
assert b'abcd' == safe_base64_decode('YWJjZA'), safe_base64_decode('YWJjZA')
print('ok')
