
Serialization

During the execution of a program, all variables exist in memory. For example, if we define a dictionary:

python
d = dict(name='Bob', age=20, score=88)

We can modify variables at any time, such as changing name to 'Bill'. However, once the program ends, the memory occupied by the variables is reclaimed by the operating system. If we do not store the modified value of 'Bill' on disk, the next time we run the program, the variable will be reinitialized to 'Bob'.

The process of transforming variables from memory into a format that can be stored or transmitted is known as serialization. In Python, this process is referred to as pickling, while in other programming languages, it may be called serialization, marshalling, flattening, etc. They all refer to the same concept.

After serialization, the serialized content can be written to disk or transmitted over a network to another machine.

Conversely, the process of reading variable content from a serialized object back into memory is called deserialization, or unpickling.

Python provides the pickle module to implement serialization.

Serializing an Object

Let's first serialize an object to bytes, and then write it to a file:

python
import pickle
d = dict(name='Bob', age=20, score=88)
pickle.dumps(d)  # Serialize the object to bytes

This produces a bytes object such as the following (the exact bytes depend on the Python version and the pickle protocol in use):

plaintext
b'\x80\x03}q\x00(X\x03\x00\x00\x00ageq\x01K\x14X\x05\x00\x00\x00scoreq\x02KXX\x04\x00\x00\x00nameq\x03X\x03\x00\x00\x00Bobq\x04u.'

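The pickle.dumps() function serializes an object into bytes, which can then be written to a file. It works for most built-in types and for instances of ordinary classes, but not for every object: open files and generators, for example, cannot be pickled. Alternatively, we can use the pickle.dump() function to serialize the object directly into a file-like object: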

python
with open('dump.txt', 'wb') as f:
    pickle.dump(d, f)

If we open dump.txt in a text editor, we see what looks like a jumble of random characters. This is binary data: Python's internal encoding of the object.

Deserializing an Object

When we want to read the object back from disk into memory, we can either read the content into bytes and deserialize it with the pickle.loads() function, or call the pickle.load() function directly on a file-like object. Here is how we can deserialize the object we just saved:

python
with open('dump.txt', 'rb') as f:
    d = pickle.load(f)

Now, d will contain the original dictionary:

python
print(d)  # Output on Python 3.7+: {'name': 'Bob', 'age': 20, 'score': 88}

It's important to note that this variable and the original variable are completely unrelated objects; they simply contain the same content.
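
The round trip described above can be checked in a few lines. This is a minimal sketch: the temporary directory and the name `restored` are illustrative choices and are not part of the original example.

```python
import os
import pickle
import tempfile

d = dict(name='Bob', age=20, score=88)

# Write the pickle into a throwaway directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'dump.txt')
    with open(path, 'wb') as f:
        pickle.dump(d, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == d)  # True: same content
print(restored is d)  # False: a separate object

# Changing the restored copy leaves the original untouched.
restored['name'] = 'Bill'
print(d['name'])  # Bob
```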

Limitations of Pickle

A limitation of pickle, as with the language-specific serialization mechanisms of many other languages, is that it is usable only within Python. Moreover, pickles written by one version of Python may not be readable by another. Therefore, you should use pickle only for unimportant data, where it does not matter if deserialization fails. In addition, unpickling can execute arbitrary code, so never unpickle data from a source you do not trust.
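
One concrete source of version incompatibility is the pickle protocol number. The sketch below, assuming a Python 3 interpreter, shows how the optional `protocol` argument selects an older format that older interpreters can still read. The variable names are illustrative.

```python
import pickle

d = dict(name='Bob', age=20, score=88)

# The newest protocol this interpreter supports, and the one it uses by default.
print(pickle.HIGHEST_PROTOCOL, pickle.DEFAULT_PROTOCOL)

# Protocol 2 is understood by Python 2.3 and later, including Python 3.
old_format = pickle.dumps(d, protocol=2)
new_format = pickle.dumps(d, protocol=pickle.HIGHEST_PROTOCOL)

# From protocol 2 onward, the stream starts with 0x80 followed by the protocol number.
print(old_format[:2])  # b'\x80\x02'

# Both byte strings decode to the same dictionary on this interpreter.
print(pickle.loads(old_format) == pickle.loads(new_format))  # True
```

An interpreter that predates the protocol used to write a pickle cannot read it, which is one reason to pin the protocol explicitly when files are shared between versions.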

JSON

If we need to pass objects between different programming languages, we must serialize them into a standard format such as XML. A better approach is usually JSON: a JSON document is plain text that virtually every language can parse, and it is easy to store on disk or send over a network. JSON is not only a standard format but is also typically more compact and faster to parse than XML, and JavaScript in a web page can read it directly.

The mapping between JSON and Python's built-in data types is as follows:

| JSON Type | Python Type |
| --- | --- |
| {} | dict |
| [] | list |
| "string" | str |
| 1234.56 | int or float |
| true/false | True/False |
| null | None |
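
The mapping can be observed directly by decoding a document that contains one value of each JSON type. This is a small sketch; the key names in `text` are made up for illustration.

```python
import json

# One value of each JSON type; the keys are arbitrary labels.
text = ('{"obj": {"k": "v"}, "arr": [1, 2], "s": "string", '
        '"i": 1234, "f": 1234.56, "t": true, "n": null}')
data = json.loads(text)

# Print the Python type that each JSON value was decoded into.
for key, value in data.items():
    print(key, type(value).__name__)
# obj dict
# arr list
# s str
# i int
# f float
# t bool
# n NoneType
```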

The built-in json module in Python provides robust conversion from Python objects to JSON format. Let’s see how to convert a Python object into JSON:

python
import json
d = dict(name='Bob', age=20, score=88)
json_str = json.dumps(d)  # Serialize to JSON string
print(json_str)  # Output: {"name": "Bob", "age": 20, "score": 88}

The dumps() function returns a str containing standard JSON. Similarly, the dump() function writes the JSON directly to a file-like object.

To deserialize JSON back into Python objects, use loads() or the corresponding load() function. The former deserializes a JSON string, while the latter reads from a file-like object and deserializes its content:

python
json_str = '{"age": 20, "score": 88, "name": "Bob"}'
print(json.loads(json_str))  # Output: {'age': 20, 'score': 88, 'name': 'Bob'}

Since JSON specifies that the encoding is UTF-8, we can always correctly convert between Python's str and JSON strings.
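
The file-based pair, dump() and load(), works the same way as the string-based pair. The sketch below uses an in-memory `io.StringIO` buffer as a stand-in for a real file, which is an assumption made only to keep the example self-contained.

```python
import io
import json

d = dict(name='Bob', age=20, score=88)

# StringIO behaves like a text file opened for reading and writing.
buffer = io.StringIO()
json.dump(d, buffer)   # write JSON text into the file-like object

buffer.seek(0)         # rewind before reading back
loaded = json.load(buffer)

print(loaded == d)  # True
```

With a real file, the same calls apply to the object returned by open() in text mode.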

Advanced JSON

Python's dict objects can be directly serialized into JSON's {}, but many times we prefer to represent objects using classes. For example, let's define a Student class and serialize it:

python
import json

class Student(object):
    def __init__(self, name, age, score):
        self.name = name
        self.age = age
        self.score = score

s = Student('Bob', 20, 88)
print(json.dumps(s))  # This will raise TypeError

Running this code will result in a TypeError:

TypeError: Object of type Student is not JSON serializable

The error occurs because the Student object is not a JSON-serializable object.

This seems unreasonable, considering that class instances should also be serializable to JSON!

Don't worry; if we look closely at the parameter list of dumps(), we can see that it accepts several optional parameters in addition to the required obj parameter. The Python json documentation describes them all.

One of these optional parameters, default, allows us to convert any object into a JSON-serializable object. We can write a conversion function specifically for the Student class:

python
def student2dict(std):
    return {
        'name': std.name,
        'age': std.age,
        'score': std.score
    }

Now, we can serialize the Student instance:

python
print(json.dumps(s, default=student2dict))  # Output: {"name": "Bob", "age": 20, "score": 88}

However, if we encounter an instance of a Teacher class next time, we will still be unable to serialize it to JSON. To make it more general, we can convert any class instance to a dictionary:

python
print(json.dumps(s, default=lambda obj: obj.__dict__))

Typically, class instances have a __dict__ attribute, which is a dictionary that stores instance variables. There are a few exceptions, such as classes that define __slots__.
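
For classes that define `__slots__`, the lambda above fails because there is no `__dict__` to return. One possible workaround is sketched below. The helper `to_jsonable` and the `Point` class are hypothetical names introduced for illustration, and the helper only reads slots declared on the instance's own class.

```python
import json

class Student:
    def __init__(self, name, age, score):
        self.name = name
        self.age = age
        self.score = score

class Point:
    # Instances of this class have no __dict__.
    __slots__ = ('x', 'y')

    def __init__(self, x, y):
        self.x = x
        self.y = y

def to_jsonable(obj):
    """Hypothetical helper to pass as the default= argument of json.dumps()."""
    if hasattr(obj, '__dict__'):
        return vars(obj)
    if hasattr(obj, '__slots__'):
        return {name: getattr(obj, name) for name in obj.__slots__}
    raise TypeError(f'Object of type {type(obj).__name__} is not JSON serializable')

student_json = json.dumps(Student('Bob', 20, 88), default=to_jsonable)
point_json = json.dumps(Point(1, 2), default=to_jsonable)
print(student_json)  # {"name": "Bob", "age": 20, "score": 88}
print(point_json)    # {"x": 1, "y": 2}
```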

Similarly, to deserialize JSON back into an instance of the Student class, we pass an object_hook function to loads(). The loads() function first converts the JSON string into a dictionary and then calls object_hook to convert that dictionary into a Student instance:

python
def dict2student(d):
    return Student(d['name'], d['age'], d['score'])

Here’s the output when deserializing:

python
json_str = '{"age": 20, "score": 88, "name": "Bob"}'
print(json.loads(json_str, object_hook=dict2student))  # Output: <__main__.Student object at 0x10cd3c190>

The printed output is an instance of the Student class.
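
Putting both directions together gives a full round trip. One detail worth knowing is that loads() calls object_hook for every JSON object it decodes, including nested ones, so the sketch below uses a more defensive variant of dict2student that converts only dictionaries with the expected keys. That key check is an addition for illustration and is not part of the original function.

```python
import json

class Student:
    def __init__(self, name, age, score):
        self.name = name
        self.age = age
        self.score = score

def student2dict(std):
    return {'name': std.name, 'age': std.age, 'score': std.score}

def dict2student(d):
    # Convert only dictionaries that look like a Student; pass others through.
    if {'name', 'age', 'score'}.issubset(d):
        return Student(d['name'], d['age'], d['score'])
    return d

s = Student('Bob', 20, 88)
json_str = json.dumps(s, default=student2dict)
restored = json.loads(json_str, object_hook=dict2student)
print(type(restored).__name__, restored.name, restored.age, restored.score)
# Student Bob 20 88

# The hook also runs on the nested object, while the outer one stays a dict.
nested = json.loads(
    '{"class": "A", "monitor": {"name": "Bob", "age": 20, "score": 88}}',
    object_hook=dict2student,
)
print(type(nested).__name__, type(nested['monitor']).__name__)  # dict Student
```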

Exercise

When performing JSON serialization on Chinese characters, observe the effect of the ensure_ascii parameter in json.dumps():

python
import json

obj = dict(name='小明', age=20)
s = json.dumps(obj, ensure_ascii=True)
print(s)  # Check the output
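
As a starting point for the exercise, the two settings can be compared side by side. This is only a sketch; the names `escaped` and `readable` are illustrative.

```python
import json

obj = dict(name='小明', age=20)

# ensure_ascii=True (the default) escapes every non-ASCII character.
escaped = json.dumps(obj, ensure_ascii=True)
# ensure_ascii=False keeps the characters as they are.
readable = json.dumps(obj, ensure_ascii=False)

print(escaped)   # {"name": "\u5c0f\u660e", "age": 20}
print(readable)  # {"name": "小明", "age": 20}

# Both strings are valid JSON and decode to the same object.
print(json.loads(escaped) == json.loads(readable) == obj)  # True
```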

Summary

The Python-specific serialization module is pickle, but if we want to make serialization more universal and compliant with web standards, we can use the json module.

The dumps() and loads() functions in the json module are excellent examples of well-defined interfaces. When using them, we only need to provide one required parameter. However, when the default serialization or deserialization mechanism does not meet our needs, we can pass additional parameters to customize the rules, achieving both simplicity and extensibility in the interface.
