Serialization
During the execution of a program, all variables exist in memory. For example, if we define a dictionary:
python
d = dict(name='Bob', age=20, score=88)
We can modify variables at any time, for example by changing name to 'Bill'. However, once the program ends, the memory occupied by the variables is reclaimed by the operating system. If we do not store the modified value 'Bill' on disk, the variable will be initialized to 'Bob' again the next time we run the program.
The process of transforming variables from memory into a format that can be stored or transmitted is known as serialization. In Python, this process is referred to as pickling, while in other programming languages, it may be called serialization, marshalling, flattening, etc. They all refer to the same concept.
After serialization, the serialized content can be written to disk or transmitted over a network to another machine.
Conversely, the process of reading variable content from a serialized object back into memory is called deserialization, or unpickling.
Python provides the pickle module to implement serialization.
Serializing an Object
Let's first serialize an object to bytes; we will write it to a file in a moment:
python
import pickle
d = dict(name='Bob', age=20, score=88)
pickle.dumps(d) # Serialize the object to bytes
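This produces a byte representation of the object. The exact bytes depend on the Python version and the pickle protocol; the output below was produced with protocol 3: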
plaintext
b'\x80\x03}q\x00(X\x03\x00\x00\x00ageq\x01K\x14X\x05\x00\x00\x00scoreq\x02KXX\x04\x00\x00\x00nameq\x03X\x03\x00\x00\x00Bobq\x04u.'
The pickle.dumps() function serializes most Python objects into bytes, which can then be written to a file. Alternatively, the pickle.dump() function serializes the object directly into a file-like object:
python
with open('dump.txt', 'wb') as f:
    pickle.dump(d, f)
If we inspect the contents of the dump.txt file, we see a series of seemingly random characters. These are the internal state of the Python object as encoded by pickle.
Deserializing an Object
When we want to read the object back from disk into memory, we can either read the content into bytes and deserialize it with the pickle.loads() function, or call the pickle.load() function directly on a file-like object. Here is how to deserialize the object we just saved:
python
with open('dump.txt', 'rb') as f:
    d = pickle.load(f)
Now d contains the original dictionary:
python
print(d) # Output: {'name': 'Bob', 'age': 20, 'score': 88}
It's important to note that this variable and the original variable are completely unrelated objects; they simply contain the same content.
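The point about the two variables being unrelated objects can be checked with a small round trip through bytes. This is a minimal sketch; the variable names original and restored are chosen here for illustration only.

```python
import pickle

original = dict(name='Bob', age=20, score=88)

# Round trip through bytes: serialize, then deserialize.
restored = pickle.loads(pickle.dumps(original))

print(restored == original)  # True: same content
print(restored is original)  # False: a distinct object in memory

# Changing the restored copy leaves the original untouched.
restored['name'] = 'Bill'
print(original['name'])  # Bob
```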
Limitations of Pickle
A limitation of pickle, shared by the language-specific serialization mechanisms of many other languages, is that it can only be used from Python. Moreover, data pickled by one version of Python may not load in another. Therefore, you should only use pickle to save unimportant data, where it doesn't matter if deserialization fails. You should also never unpickle data from an untrusted source, because loading a pickle can execute arbitrary code.
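One concrete source of the version differences mentioned above is the pickle protocol: the standard library's dumps() accepts an optional protocol argument, and newer Python versions support higher protocol numbers. The sketch below simply loops over the protocols available in the running interpreter; the sizes dictionary is a name introduced here for illustration.

```python
import pickle

d = dict(name='Bob', age=20, score=88)

sizes = {}
# Each protocol number yields a different byte encoding of the same object.
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    data = pickle.dumps(d, protocol=protocol)
    assert pickle.loads(data) == d  # every encoding restores the same content
    sizes[protocol] = len(data)

for protocol, size in sizes.items():
    print(f'protocol {protocol}: {size} bytes')
```

An older interpreter cannot read a protocol it does not know, so pinning a lower protocol is one way to stay readable across versions, at the cost of a less compact encoding.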
JSON
If we need to pass objects between different programming languages, we must serialize them into a standard format, such as XML. A better choice is usually JSON: it is plain text that virtually every language can read, and it is easy to store on disk or transmit over a network. JSON is not only a standard format; it is also typically more compact and faster to parse than XML, and it can be consumed directly by JavaScript in web pages.
The mapping between JSON and Python's built-in data types is as follows:
JSON Type | Python Type |
---|---|
{} | dict |
[] | list |
"string" | str |
1234.56 | int or float |
true/false | True/False |
null | None |
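The mapping in the table can be observed by decoding a small JSON document that contains one value of each kind. The key names in this sketch are arbitrary and chosen only for illustration.

```python
import json

text = ('{"obj": {"k": "v"}, "arr": [1, 2], "s": "string", '
        '"i": 1234, "f": 1234.56, "t": true, "n": null}')

value = json.loads(text)

# Print the Python type that each JSON value was decoded into.
for key, item in value.items():
    print(key, type(item).__name__)
# obj dict
# arr list
# s str
# i int
# f float
# t bool
# n NoneType
```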
The built-in json module in Python provides robust conversion from Python objects to JSON format. Let's see how to convert a Python object into JSON:
python
import json
d = dict(name='Bob', age=20, score=88)
json_str = json.dumps(d) # Serialize to JSON string
print(json_str) # Output: {"name": "Bob", "age": 20, "score": 88}
The dumps() function returns a str containing standard JSON. Similarly, the dump() function writes JSON directly to a file-like object.
To deserialize JSON back into Python objects, use loads() or the corresponding load() function. The former deserializes a JSON string, while the latter reads from a file-like object and deserializes its content:
python
json_str = '{"age": 20, "score": 88, "name": "Bob"}'
print(json.loads(json_str)) # Output: {'age': 20, 'score': 88, 'name': 'Bob'}
Since the JSON standard specifies UTF-8 as the encoding for exchanged text, and Python's str is a Unicode string, we can always convert correctly between Python's str and JSON strings.
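The file-oriented dump() and load() functions can be tried without touching the disk by using an in-memory text buffer. This is a minimal sketch; io.StringIO stands in for a real file opened in text mode, and the names buffer and restored are illustrative.

```python
import io
import json

d = dict(name='Bob', age=20, score=88)

buffer = io.StringIO()   # in-memory stand-in for a text file
json.dump(d, buffer)     # write JSON text to the file-like object
buffer.seek(0)           # rewind before reading back

restored = json.load(buffer)
print(restored == d)  # True
```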
Advanced JSON
Python's dict objects can be serialized directly into JSON's {}, but we often prefer to represent objects using classes. For example, let's define a Student class and try to serialize an instance:
python
import json
class Student(object):
    def __init__(self, name, age, score):
        self.name = name
        self.age = age
        self.score = score

s = Student('Bob', 20, 88)
print(json.dumps(s)) # This will raise TypeError
Running this code raises a TypeError (the exact wording of the message depends on the Python version):
TypeError: <__main__.Student object at 0x10603cc50> is not JSON serializable
The error occurs because a Student object is not JSON-serializable.
This seems unreasonable, considering that class instances should also be serializable to JSON!
Don't worry. If we look closely at the parameter list of dumps(), we see that it accepts several optional parameters in addition to the required obj parameter.
One of these optional parameters, default, takes a function that converts an otherwise unsupported object into something JSON-serializable. We can write a conversion function specifically for the Student class:
python
def student2dict(std):
    return {
        'name': std.name,
        'age': std.age,
        'score': std.score
    }
Now we can serialize the Student instance:
python
print(json.dumps(s, default=student2dict)) # Output: {"name": "Bob", "age": 20, "score": 88}
However, if we later encounter an instance of a Teacher class, we will still be unable to serialize it to JSON. To make the conversion more general, we can turn any class instance into a dictionary:
python
print(json.dumps(s, default=lambda obj: obj.__dict__))
Typically, class instances have a __dict__ attribute, which is a dictionary that stores the instance variables. There are a few exceptions, such as classes that define __slots__.
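For the __slots__ exception, one possible approach is a default function that falls back to the slot names when __dict__ is missing. The helper to_jsonable and the Point class below are hypothetical names introduced for this sketch, and the fallback only looks at __slots__ declared on the object's own class, not on its base classes.

```python
import json

class Student:
    def __init__(self, name, age, score):
        self.name = name
        self.age = age
        self.score = score

class Point:
    __slots__ = ('x', 'y')  # instances of this class have no __dict__

    def __init__(self, x, y):
        self.x = x
        self.y = y

def to_jsonable(obj):
    """Hypothetical helper: convert an instance to a dict for json.dumps()."""
    if hasattr(obj, '__dict__'):
        return dict(vars(obj))
    slots = getattr(type(obj), '__slots__', None)
    if slots is not None:
        if isinstance(slots, str):  # __slots__ may be a single string
            slots = (slots,)
        return {name: getattr(obj, name) for name in slots if hasattr(obj, name)}
    raise TypeError(f'Object of type {type(obj).__name__} is not JSON serializable')

print(json.dumps(Student('Bob', 20, 88), default=to_jsonable))
# {"name": "Bob", "age": 20, "score": 88}
print(json.dumps(Point(1, 2), default=to_jsonable))
# {"x": 1, "y": 2}
```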
Similarly, if we want to deserialize JSON back into an instance of the Student class, loads() first converts the JSON string into a dictionary, and the object_hook function we pass in then converts that dictionary into a Student instance:
python
def dict2student(d):
    return Student(d['name'], d['age'], d['score'])
Passing this function as object_hook gives the following result:
python
json_str = '{"age": 20, "score": 88, "name": "Bob"}'
print(json.loads(json_str, object_hook=dict2student)) # Output: <__main__.Student object at 0x10cd3c190>
The printed output is an instance of the Student class.
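One detail worth keeping in mind is that object_hook is called for every JSON object in the input, including nested ones. The sketch below is a guarded variant of the conversion function, written under the assumption that a Student is recognizable by exactly the keys name, age and score; the __repr__ method and the surrounding "class"/"students" document are additions made here for illustration.

```python
import json

class Student:
    def __init__(self, name, age, score):
        self.name = name
        self.age = age
        self.score = score

    def __repr__(self):
        return f'Student({self.name!r}, {self.age!r}, {self.score!r})'

def dict2student(d):
    # Called once per decoded JSON object, innermost objects first.
    # Only convert dictionaries that look like a Student; pass others through.
    if set(d) == {'name', 'age', 'score'}:
        return Student(d['name'], d['age'], d['score'])
    return d

json_str = '{"class": "A", "students": [{"age": 20, "score": 88, "name": "Bob"}]}'
data = json.loads(json_str, object_hook=dict2student)

print(type(data).__name__)   # dict
print(data['students'][0])   # Student('Bob', 20, 88)
```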
Exercise
When serializing Chinese characters to JSON, observe the effect of the ensure_ascii parameter of json.dumps():
python
import json
obj = dict(name='小明', age=20)
s = json.dumps(obj, ensure_ascii=True)
print(s) # Check the output
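For comparison, the two settings of ensure_ascii can be placed side by side. This is a small sketch; the variable names escaped and readable are illustrative.

```python
import json

obj = dict(name='小明', age=20)

escaped = json.dumps(obj, ensure_ascii=True)    # non-ASCII characters become \uXXXX escapes
readable = json.dumps(obj, ensure_ascii=False)  # non-ASCII characters are kept as-is

print(escaped)   # {"name": "\u5c0f\u660e", "age": 20}
print(readable)  # {"name": "小明", "age": 20}

# Both strings decode to the same Python object.
print(json.loads(escaped) == json.loads(readable))  # True
```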
Summary
The Python-specific serialization module is pickle, but if we want serialization to be more universal and compliant with web standards, we can use the json module.
The dumps() and loads() functions in the json module are excellent examples of well-designed interfaces. When using them, we only need to provide one required parameter. However, when the default serialization or deserialization mechanism does not meet our needs, we can pass additional parameters to customize the rules, so the interface is both simple and extensible.
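As a small illustration of this design, the same call works with only the required argument, while optional keyword arguments adjust the result. The sketch uses the indent and sort_keys parameters of json.dumps(); the variable names compact and pretty are illustrative.

```python
import json

d = dict(name='Bob', age=20, score=88)

compact = json.dumps(d)                           # only the required argument
pretty = json.dumps(d, indent=2, sort_keys=True)  # optional arguments customize the output

print(compact)
print(pretty)
```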