Strings and Encoding

Character Encoding

As we have discussed, strings are also a data type. However, strings are somewhat special because they involve an encoding issue.

Since computers can only process numbers, to handle text, the text must first be converted into numbers. The earliest computers were designed using 8 bits as one byte, so the largest integer a single byte could represent was 255 (binary 11111111 = decimal 255). To represent larger integers, more bytes are needed. For example, two bytes can represent a maximum integer of 65,535, and four bytes can represent a maximum integer of 4,294,967,295.
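
The byte-size limits quoted above follow from one formula: n bytes hold 8n bits, so the largest unsigned integer is 2^(8n) - 1. A minimal sketch, where the helper name `max_unsigned` is invented for illustration:

```python
def max_unsigned(n_bytes):
    """Largest unsigned integer that fits in n_bytes bytes."""
    return 2 ** (8 * n_bytes) - 1

for n in (1, 2, 4):
    # Prints 255, 65535 and 4294967295 for 1, 2 and 4 bytes.
    print(n, 'byte(s):', max_unsigned(n))
```
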

Because computers were invented by Americans, the first standard encoding covered only 128 characters (codes 0 to 127): uppercase and lowercase English letters, digits, some symbols, and control codes. This encoding table is known as ASCII. For example, the uppercase letter 'A' has the code 65, and the lowercase letter 'z' has the code 122.

However, to handle Chinese characters, one byte is obviously insufficient. At least two bytes are needed, and they must not conflict with ASCII encoding. Therefore, China developed the GB2312 encoding to include Chinese characters.

As you can imagine, there are hundreds of languages worldwide. Japan encodes Japanese in Shift_JIS, South Korea encodes Korean in EUC-KR, and each country has its own standard, which inevitably leads to conflicts. As a result, garbled characters can appear when text mixes several languages.
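
To make the conflict concrete, the following sketch writes Japanese text as Shift_JIS bytes and then reads those bytes back with the wrong table (EUC-KR). It relies on Python's `encode()` and `decode()` methods, which are introduced later in this section. The sample text and the use of `errors='replace'` (so that undecodable bytes do not raise) are choices made for this illustration.

```python
text = 'こんにちは'

# Store the text using the Japanese Shift_JIS table.
raw = text.encode('shift_jis')

# Reading the same bytes with the Korean table yields garbage.
wrong = raw.decode('euc_kr', errors='replace')
# Reading them with the matching table recovers the text.
right = raw.decode('shift_jis')

print(wrong)
print(right)
```
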


Therefore, the Unicode character set was created. Unicode unifies all languages into a single encoding set, eliminating garbled character issues.

The Unicode standard is continually evolving. A common representation is UTF-16, which uses two bytes for most characters and four bytes for rarer ones. Modern operating systems and most programming languages support Unicode directly.

Now, let's summarize the differences between ASCII and this two-byte Unicode representation: ASCII uses 1 byte per character, while the Unicode representation typically uses 2 bytes.

The letter 'A' in ASCII is decimal 65, binary 01000001.
The character '0' in ASCII is decimal 48, binary 00110000. Note that the character '0' is different from the integer 0.
Chinese characters are outside the ASCII range. In Unicode, the character '中' is decimal 20013, binary 01001110 00101101.

You might guess that if you use Unicode encoding for an ASCII 'A', you simply need to add a leading zero, making the Unicode encoding of 'A' 00000000 01000001.

A new problem arises: if everything is unified into Unicode encoding, the garbled character issue disappears. However, if your text is almost entirely English, Unicode encoding requires twice the storage space compared to ASCII encoding, making it very inefficient for storage and transmission.
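
The storage cost is easy to measure. This sketch uses Python's `utf-16-be` codec as a stand-in for the two-byte representation described above (big-endian, no byte-order mark); the sample sentence is arbitrary.

```python
text = 'Hello, world'

ascii_size = len(text.encode('ascii'))      # 1 byte per character
utf16_size = len(text.encode('utf-16-be'))  # 2 bytes per character

print(ascii_size, utf16_size)  # 12 24
```
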

Therefore, to save space, a variable-length encoding of Unicode called UTF-8 was introduced. UTF-8 encodes each Unicode character into 1 to 4 bytes depending on the size of its code. Common English letters take 1 byte, Chinese characters usually take 3 bytes, and only rarer characters take 4 bytes. (The original UTF-8 design allowed up to 6 bytes, but the current standard limits it to 4.) If the text you need to transmit contains a large number of English characters, using UTF-8 encoding can save space:

Character	ASCII	Unicode	UTF-8
A	`01000001`	`00000000 01000001`	`01000001`
中		`01001110 00101101`	`11100100 10111000 10101101`
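
The bit patterns in the table can be reproduced in Python. In this sketch, `to_bits` is a helper name invented here, and the two-byte column is obtained by writing the code point as a 2-byte big-endian integer.

```python
def to_bits(data):
    """Render a bytes object as space-separated 8-bit groups."""
    return ' '.join(format(b, '08b') for b in data)

for ch in ('A', '中'):
    two_byte = ord(ch).to_bytes(2, 'big')  # the "Unicode" column
    utf8 = ch.encode('utf-8')              # the "UTF-8" column
    print(ch, '|', to_bits(two_byte), '|', to_bits(utf8))
```
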

From the table above, we can also see an additional advantage of UTF-8 encoding: ASCII encoding can actually be considered a subset of UTF-8 encoding. Therefore, many legacy software systems that only support ASCII encoding can continue to operate under UTF-8 encoding.
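
A quick way to check the subset relationship: for text that contains only ASCII characters, the ASCII and UTF-8 encodings should produce identical bytes. The sample string is arbitrary.

```python
text = 'Hello, ASCII!'

as_ascii = text.encode('ascii')
as_utf8 = text.encode('utf-8')

print(as_ascii == as_utf8)      # True
print(as_utf8.decode('ascii'))  # an ASCII-only reader still understands it
```
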

Having clarified the relationship between ASCII, Unicode, and UTF-8, we can summarize the current common character encoding workflow in computer systems:

In computer memory, Unicode encoding is used uniformly. When saving to the hard drive or transmitting data, it is converted to UTF-8 encoding.
When editing with Notepad, UTF-8 characters read from the file are converted to Unicode characters in memory. After editing, Unicode is converted back to UTF-8 and saved to the file.
When browsing the web, the server converts dynamically generated Unicode content to UTF-8 before transmitting it to the browser.

Therefore, you often see information like <meta charset="UTF-8" /> in the source code of many web pages, indicating that the page uses UTF-8 encoding.
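
The read, edit, and save workflow described above can be sketched with Python's built-in `open()`. The temporary file name `note.txt` and the sample text are made up for this example; the point is only that text is Unicode while in memory and UTF-8 bytes on disk.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'note.txt')

# Save: Unicode text in memory -> UTF-8 bytes on disk.
with open(path, 'w', encoding='utf-8') as f:
    f.write('中文 text')

# Load: UTF-8 bytes on disk -> Unicode text in memory.
with open(path, 'r', encoding='utf-8') as f:
    content = f.read()

# Edit in memory, then save again as UTF-8.
edited = content + '!'
with open(path, 'w', encoding='utf-8') as f:
    f.write(edited)

# Inspect the raw bytes that actually ended up in the file.
with open(path, 'rb') as f:
    on_disk = f.read()

print(on_disk)
```
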

Python's Strings

After understanding the perplexing character encoding issues, let's study Python's strings.

In Python 3, strings are Unicode, which means Python's strings support multiple languages. For example:

```python
>>> print('包含中文的str')
包含中文的str
```

For the encoding of individual characters, Python provides the ord() function to get the integer representation of a character and the chr() function to convert an encoding to the corresponding character:

```python
>>> ord('A')
65
>>> ord('中')
20013
>>> chr(66)
'B'
>>> chr(25991)
'文'
```

If you know the integer encoding of a character, you can also write the string using hexadecimal \u escapes:

```python
>>> '\u4e2d\u6587'
'中文'
```

The two spellings, '\u4e2d\u6587' and '中文', are completely equivalent.
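
The equivalence can be checked directly. This sketch compares the literal, the escaped form, and a string built from code points with `chr()`; the variable names are illustrative only.

```python
literal = '中文'
escaped = '\u4e2d\u6587'
built = chr(0x4e2d) + chr(0x6587)

print(literal == escaped == built)         # True
print([hex(ord(ch)) for ch in literal])    # ['0x4e2d', '0x6587']
```
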

Python's string type is str. It is represented in memory as Unicode, where one character may correspond to several bytes. To transmit a string over a network or save it to disk, you must convert the str to bytes.

Python writes bytes literals with a b prefix before the single or double quotes:

```python
x = b'ABC'
```

It's important to distinguish between 'ABC' and b'ABC'. The former is str, while the latter, although it looks the same, has each character occupying only one byte.

A str represented in Unicode can be encoded into specified bytes using the encode() method. For example:

```python
>>> 'ABC'.encode('ascii')
b'ABC'
>>> '中文'.encode('utf-8')
b'\xe4\xb8\xad\xe6\x96\x87'
>>> '中文'.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
```

A purely English str can be encoded into bytes using ASCII encoding with identical content. A str containing Chinese characters can be encoded into bytes using UTF-8 encoding. However, a str containing Chinese characters cannot be encoded using ASCII encoding because the range of Chinese encodings exceeds that of ASCII encoding, causing Python to raise an error.

In bytes, any byte that cannot be displayed as an ASCII character is shown as a \x## escape, where ## is the byte's two-digit hexadecimal value.
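
A small sketch of this display rule: the bytes object below is built from raw integer values, and Python shows the printable ones as characters and the rest as hexadecimal escapes. The particular values are chosen for illustration.

```python
# 0x41 and 0x42 are printable ASCII ('A', 'B'); the other three are not.
data = bytes([0x41, 0x42, 0xe4, 0xb8, 0xad])

print(data)        # b'AB\xe4\xb8\xad'
print(list(data))  # [65, 66, 228, 184, 173]
```
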

Conversely, if you read a byte stream from the network or disk, the data read is bytes. To convert bytes to str, you need to use the decode() method:

```python
>>> b'ABC'.decode('ascii')
'ABC'
>>> b'\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8')
'中文'
```

If the bytes contain bytes that cannot be decoded, the decode() method will raise an error:

```python
>>> b'\xe4\xb8\xad\xff'.decode('utf-8')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 3: invalid start byte
```

If only a small portion of the bytes are invalid, you can pass errors='ignore' to ignore the erroneous bytes:

```python
>>> b'\xe4\xb8\xad\xff'.decode('utf-8', errors='ignore')
'中'
```
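
Besides `errors='ignore'`, `decode()` also accepts `errors='replace'`, which keeps a visible marker (U+FFFD, the replacement character) where the bad byte was. The comparison below is a sketch; which option suits you depends on whether silently dropping data is acceptable.

```python
raw = b'\xe4\xb8\xad\xff'

ignored = raw.decode('utf-8', errors='ignore')    # bad byte dropped
replaced = raw.decode('utf-8', errors='replace')  # bad byte -> U+FFFD

print(repr(ignored))
print(repr(replaced))
```
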

To calculate how many characters a str contains, you can use the len() function:

```python
>>> len('ABC')
3
>>> len('中文')
2
```

The len() function counts the number of characters in a str. If used on bytes, the len() function counts the number of bytes:

```python
>>> len(b'ABC')
3
>>> len(b'\xe4\xb8\xad\xe6\x96\x87')
6
>>> len('中文'.encode('utf-8'))
6
```

As seen, one Chinese character typically occupies three bytes after UTF-8 encoding, while one English character occupies only one byte.
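
The per-character cost can be measured one character at a time. The sample characters below (an accented Latin letter and an emoji in addition to 'A' and '中') are additions chosen for this sketch to show the 2-byte and 4-byte cases as well.

```python
# 'A', 'é' (U+00E9), '中' (U+4E2D) and an emoji (U+1F600).
samples = 'A\u00e9\u4e2d\U0001F600'

# Map each character to the number of bytes in its UTF-8 encoding.
sizes = {ch: len(ch.encode('utf-8')) for ch in samples}

for ch, n in sizes.items():
    print(ch, n)
```
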

When working with strings, we often encounter the need to convert between str and bytes. To avoid garbled characters, always use UTF-8 encoding when converting between str and bytes.
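
One way to follow this advice consistently is to route every conversion through a pair of small helpers that default to UTF-8. The names `to_bytes` and `to_str` are hypothetical, not part of Python's standard library.

```python
def to_bytes(text, encoding='utf-8'):
    """Encode str to bytes, defaulting to UTF-8."""
    return text.encode(encoding)

def to_str(data, encoding='utf-8'):
    """Decode bytes to str, defaulting to UTF-8."""
    return data.decode(encoding)

payload = to_bytes('Hello, 中文')
print(payload)
print(to_str(payload))
```
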

Since Python source code is also a text file, when your source code contains Chinese characters, you must ensure that the file is saved in UTF-8 encoding. To tell the Python interpreter to read the source as UTF-8, we usually add the following two lines at the beginning of the file:

```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
```

The first comment line tells Linux/macOS systems that this is an executable Python program. Windows itself ignores this comment.

The second comment line tells the Python interpreter to read the source code using UTF-8 encoding. Python 3 already assumes UTF-8 by default, so the line is optional there, but it makes the intent explicit. If the interpreter reads the file with the wrong encoding, Chinese characters written in the source code may appear garbled when output.

Declaring UTF-8 encoding does not mean your .py file is actually saved in UTF-8. You must ensure that your text editor is using UTF-8 encoding.

If the .py file itself is saved in UTF-8 and also declares # -*- coding: utf-8 -*-, Chinese characters display correctly when you run it from the command prompt.

Formatting

The last common issue is how to output formatted strings. We often need to output strings like 'Dear xxx, hello! Your phone bill for xx month is xx, and your balance is xx.', where the xxx content changes based on variables. Therefore, a convenient way to format strings is needed.

Python's % formatting follows the same conventions as C's printf. For example:

```python
>>> 'Hello, %s' % 'world'
'Hello, world'
>>> 'Hi, %s, you have $%d.' % ('Michael', 1000000)
'Hi, Michael, you have $1000000.'
```

As you might guess, the % operator formats strings. Inside the string, %s is replaced by a string and %d by an integer. The number of %? placeholders must match the number of variables or values that follow, and the order must correspond. If there is only one %?, the parentheses can be omitted.
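
Applied to the phone-bill message from the start of this section, the rule looks like this. The customer name, month, and amounts are invented sample values.

```python
name, month, bill, balance = 'Michael', 3, 45.5, 120.0

# Four placeholders, four values, in the same order.
template = ('Dear %s, hello! Your phone bill for month %d is %.2f, '
            'and your balance is %.2f.')
message = template % (name, month, bill, balance)

print(message)
```
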

Common placeholders include:

Placeholder	Replacement Content
`%d`	Integer
`%f`	Floating-point number
`%s`	String
`%x`	Hexadecimal integer

For formatting integers and floating-point numbers, you can also specify whether to pad with zeros and the number of digits before and after the decimal point:

```python
print('%2d-%02d' % (3, 1))   # prints ' 3-01' (without the quotes)
print('%.2f' % 3.1415926)    # prints 3.14
```

If you're unsure which placeholder to use, %s always works as it converts any data type to a string:

```python
>>> 'Age: %s. Gender: %s' % (25, True)
'Age: 25. Gender: True'
```

What if the string itself must contain a % character? In that case, escape it as %%, which represents a single %:

```python
>>> 'growth rate: %d %%' % 7
'growth rate: 7 %'
```

format()

Another way to format strings is the format() method of strings. It replaces the placeholders {0}, {1}, etc., with the passed arguments in order. However, it is more cumbersome to write than %:

```python
>>> 'Hello, {0}, your score has increased by {1:.1f}%'.format('Xiao Ming', 17.125)
'Hello, Xiao Ming, your score has increased by 17.1%'
```
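
Because the placeholders carry indices, format() can reorder its arguments, and it also accepts keyword arguments. The keyword names `name` and `delta` below are chosen for this sketch.

```python
# Indices let the arguments appear in a different order.
positional = '{1} before {0}'.format('a', 'b')

# Keyword arguments make longer templates easier to read.
named = 'Hello, {name}, your score has increased by {delta:.1f}%'.format(
    name='Xiao Ming', delta=17.125)

print(positional)
print(named)
```
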

f-string

The last method for formatting strings is to use strings that start with f, known as f-strings (available since Python 3.6). In an f-string, any {xxx} is replaced by the value of the corresponding variable:

```python
>>> r = 2.5
>>> s = 3.14 * r ** 2
>>> print(f'The area of a circle with radius {r} is {s:.2f}')
The area of a circle with radius 2.5 is 19.62
```

In the code above, {r} is replaced by the value of variable r, and {s:.2f} is replaced by the value of variable s with formatting parameters specified after the : (i.e., retaining two decimal places). Therefore, {s:.2f} is replaced by 19.62.
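
The braces of an f-string are not limited to plain variable names: they accept arbitrary expressions, and the format specification after the : can also set width and alignment. A minimal sketch reusing r from the example above:

```python
r = 2.5

# Expressions are evaluated in place; ':>6.2f' right-aligns in 6 columns.
line = f'diameter={2 * r}, r squared={r ** 2:.2f}, padded=[{r:>6.2f}]'

print(line)
```
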

Exercise

Xiao Ming's score improved from 72 points last year to 85 points this year. Please calculate the percentage increase in Xiao Ming's score and use string formatting to display it as 'xx.x%', keeping only one decimal place:

```python
s1 = 72
s2 = 85
r = ???
print('???')
```

Summary

Python 3's strings use Unicode and directly support multiple languages.

When converting between str and bytes, you need to specify the encoding. The most commonly used encoding is UTF-8. Python also supports other encoding methods, such as encoding Unicode into GB2312:

```python
>>> '中文'.encode('gb2312')
b'\xd6\xd0\xce\xc4'
```

However, doing so only invites trouble. Unless there are special business requirements, always use UTF-8 encoding.
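
The trouble shows up as soon as the reader and the writer disagree about the encoding. In this sketch, bytes produced with GB2312 are handed to a UTF-8 decoder, which rejects them; the flag name `decoded_ok` is illustrative.

```python
gb_bytes = '中文'.encode('gb2312')

# A UTF-8 reader cannot make sense of GB2312 bytes.
try:
    gb_bytes.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Only a reader that knows the original encoding recovers the text.
round_trip = gb_bytes.decode('gb2312')

print(decoded_ok, round_trip)
```
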

When formatting strings, you can use Python's interactive environment for testing, which is convenient and fast.
