
Memory

Today we will step into a new module—Memory. This is an extremely important topic in LLM application development that we haven't discussed before.

Most LLM applications involve multiple rounds of interaction with users. If the LLM cannot remember the earlier conversation history, it cannot give coherent, meaningful responses, much like having "dementia," and the user experience suffers significantly.

The Memory module is designed to address this issue; it functions like the human hippocampus, responsible for storing and recalling key information to ensure smooth and accurate conversations.

In today’s session, we will first briefly discuss the basic concepts and principles of the Memory module. Then, through a simple LLM example, you will experience the power of the Memory module firsthand. Finally, we will dive into the source code to learn how traditional chains and LCEL chains in LangChain use and implement Memory.

Basic Principles of the Memory Module

How can we enable the LLM to respond based on conversation history? From our earlier experience building RAG applications, a natural idea is to manipulate prompt construction: send the past conversation content to the LLM together with the question, so that it can reference the history when answering.

For example, the following prompt template:

python
prompt_template = """
The <history> tag below contains the historical conversation records between AI and Human. Reference the historical context to answer the user's question.
Historical conversation:
<history>
{chat_history}
</history>
Human: {question}
AI:"""

In this prompt template, we have left a placeholder for the historical conversation content. Before sending it to the LLM for execution, we query the relevant historical dialogue and fill it into the prompt. After the LLM responds, we find a way to store this question and answer for future reference.

Yes, the basic principle of Memory is that before LLM execution, we query historical records, and after LLM execution, we store the current dialogue records.
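
Expressed as code, this query-then-store loop looks roughly like the sketch below. Note that load_history, save_turn, and llm are hypothetical placeholders for whatever query algorithm, storage backend, and model you use, and prompt_template is the template string defined above.

python
# A minimal sketch of the query-then-store loop (hypothetical helpers).
def chat(question: str) -> str:
    history = load_history()                    # 1. query history before calling the LLM
    prompt = prompt_template.format(chat_history=history, question=question)
    answer = llm.invoke(prompt)                 # 2. call the model with the history in the prompt
    save_turn(question, answer)                 # 3. store the new turn for future queries
    return answer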

[Figure: memory1.webp]

Although wiring Memory into an LLM call is not complex, designing a Memory system that is both functional and accurate is far from trivial.

The First Question: How to Query?

Current large language models have token limits, so if we merge all conversation content into the prompt each time, as the number of interactions increases, the prompt will exceed the model’s context length limit.

Even if the number of tokens supported by the model is sufficient, including all conversation content can lead to skyrocketing conversation costs. Moreover, the LLM may struggle to extract the most relevant historical content related to the current dialogue, resulting in a decline in answer quality.

Thus, how to query the context related to the current question from historical conversations—the query algorithm—will directly impact the quality of the model's responses.
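
One simple query strategy, for example, is a sliding window that keeps only the most recent k turns. The sketch below is illustrative and assumes the history is stored as a list of (question, answer) pairs:

python
from typing import List, Tuple

# Illustrative window-based query: keep only the most recent k turns.
def load_recent_history(turns: List[Tuple[str, str]], k: int = 5) -> str:
    recent = turns[-k:]
    return "\n".join(f"Human: {q}\nAI: {a}" for q, a in recent)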

The Second Question: How to Store?

Storing conversation records is relatively straightforward. LangChain integrates with a variety of databases, so we can use the classes and methods it encapsulates to save memory to different backends; see LangChain's memory integrations documentation for the specific options.

Of course, which storage method to use depends largely on the data structures and algorithm used for querying. For example, if we use vector similarity or a knowledge graph as the query mechanism, we need to choose correspondingly different databases for storage.

Whether to store the entire conversation content also depends on the design of the actual system. For instance, to avoid wasting storage space, we may choose to retain only the data from the past month.

Sometimes, we don’t store the original conversation records. Based on the accuracy of the query and storage space considerations, we can merge relevant conversations before storage.
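
As a concrete example, LangChain's community package ships chat-history classes backed by different stores. Here is a hedged sketch using SQLChatMessageHistory with a local SQLite file; the file name and session ID are illustrative values.

python
from langchain_community.chat_message_histories import SQLChatMessageHistory

# Persist the messages of one session into a local SQLite file (illustrative values).
history = SQLChatMessageHistory(
    session_id="demo-session",
    connection_string="sqlite:///chat_history.db",
)
history.add_user_message("Hello, I am Jack.")
history.add_ai_message("Hello, Jack. How can I assist you?")
print(history.messages)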

LangChain provides different storage methods and Memory components with query algorithms, allowing us to use them out of the box. We will explain the implementation principles and usage of these components in the next section.

Now, let’s experience the transformation of LLM after adding Memory through a simple example.

An Example to Experience the Power of Memory

The first step in “enhancing” the LLM with Memory is to instantiate a Memory object. Here, we choose ConversationBufferMemory, which is a relatively simple Memory implementation in LangChain.

python
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(memory_key="chat_history")

The memory_key is assigned the name of the placeholder variable representing the historical conversation in the prompt template, as follows:

python
from langchain.prompts import PromptTemplate

template_str = """
You are a chatbot having a conversation with a human.
Previous conversation:
{chat_history}
Human: {question}
AI:"""

prompt_template = PromptTemplate.from_template(template_str)

Next, we create a model instance and build the chain using LLMChain, passing the ConversationBufferMemory object created above into the memory parameter of the LLMChain constructor.

python
from langchain_openai import ChatOpenAI
from langchain.chains import LLMChain

llm = ChatOpenAI(temperature=0)
memory_chain = LLMChain(
    llm=llm,
    prompt=prompt_template,
    verbose=True,  # verbose=True prints the formatted prompt and other execution details
    memory=memory,
)

Now we have an LLM application chain with memory functionality. Let’s try asking a question:

python
memory_chain.predict(question="Hello, I am Jack.")
> Entering new LLMChain chain...
Prompt after formatting:

You are a chatbot having a conversation with a human.
Previous conversation:

Human: Hello, I am Jack.
AI:

> Finished chain.
Hello, Jack. I am a chatbot. How can I assist you?

Now let's ask, “Do you remember what my name is?”:

python
memory_chain.predict(question="Do you remember what my name is?")
> Entering new LLMChain chain...
Prompt after formatting:

You are a chatbot having a conversation with a human.
Previous conversation:
Human: Hello, I am Jack.
AI: Hello, Jack. I am a chatbot. How can I assist you?
Human: Do you remember what my name is?
AI:

> Finished chain.
Of course, you just told me your name is Jack. Do you have any other questions?

We can see that in the subsequent dialogue, the prompts sent to the LLM included the historical conversation content, and the LLM indeed provided satisfactory responses.
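
At any point, you can also inspect what ConversationBufferMemory has accumulated; the printed value below is illustrative of the buffer format.

python
# Illustrative: the buffer now holds both turns under the "chat_history" key.
print(memory.load_memory_variables({}))
# {'chat_history': 'Human: Hello, I am Jack.\nAI: Hello, Jack. I am a chatbot. How can I assist you?\nHuman: Do you remember what my name is?\nAI: Of course, you just told me your name is Jack. Do you have any other questions?'}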

Implementation of the Memory Module in LangChain

The two most fundamental and important actions in the Memory module are querying and storing. In this section, we will analyze the source code to see how LangChain integrates these two actions into the construction of chains.

As we learned earlier, there are two types of chains in LangChain: traditional chains and LCEL. The way these two types link to the Memory module differs.

Memory Implementation in Traditional Chains

In traditional chains, all implementation chain classes inherit from Chain, and the execution path generally follows: XXXChain.predict/XXXChain(...) -> Chain.__call__ -> Chain.invoke.

python
class LLMChain(Chain):
    ...
    def predict(self, callbacks: Callbacks = None, **kwargs: Any) -> str:
        return self(kwargs, callbacks=callbacks)[self.output_key]
        
class Chain(RunnableSerializable[Dict[str, Any], Dict[str, Any]], ABC):
    ...
    def __call__(...) -> Dict[str, Any]:
        ...
        return self.invoke(...)

This means that ultimately, it will call the invoke method of the Chain.

python
class Chain(RunnableSerializable[Dict[str, Any], Dict[str, Any]], ABC):
    def invoke(...):
        ...
        inputs = self.prep_inputs(input)
        ...
        # Call the specific Chain's _call interface to perform the actual LLM call
        outputs = self._call(inputs)
        ...
        final_outputs: Dict[str, Any] = self.prep_outputs(...)
        ...
        return final_outputs

From the implementation of invoke, we can see that before calling the model, it first calls the prep_inputs method to preprocess the inputs. After the model responds, it calls prep_outputs to post-process the outputs before returning them to the user.

Memory's querying and storing are handled in these two places. Let’s first look at prep_inputs:

python
class Chain(RunnableSerializable[Dict[str, Any], Dict[str, Any]], ABC):
    ...
    memory: Optional[BaseMemory] = None
    ...
    def prep_inputs(self, inputs: Union[Dict[str, Any], Any]) -> Dict[str, str]:
        ...
        if self.memory is not None:
            external_context = self.memory.load_memory_variables(inputs)
            inputs = dict(inputs, **external_context)
        ...
        return inputs

Do you remember the Memory object ConversationBufferMemory that was passed to LLMChain in the previous example? This object is assigned to the Chain.memory variable.

If Chain.memory is set, prep_inputs calls the Memory object's load_memory_variables method to retrieve the history, and the retrieved records are merged into the inputs used to render the final prompt.

Having understood the path for fetching historical records, let’s see how prep_outputs stores the current dialogue.

python
class Chain(RunnableSerializable[Dict[str, Any], Dict[str, Any]], ABC):
    ...
    memory: Optional[BaseMemory] = None
    ...
    def prep_outputs(
        self,
        inputs: Dict[str, str],
        outputs: Dict[str, str],
        return_only_outputs: bool = False,
    ) -> Dict[str, str]:
        ...
        if self.memory is not None:
            self.memory.save_context(inputs, outputs)
        ...

In prep_outputs, the current user input and model response are passed to the save_context method of the Chain.memory variable for storage.
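
You can mimic this call on a standalone ConversationBufferMemory object to see the effect. The "question" key matches our prompt variable, "text" is LLMChain's default output key, and the printed dict is illustrative.

python
# Illustrative: calling save_context directly mirrors what prep_outputs does.
scratch_memory = ConversationBufferMemory(memory_key="chat_history")
scratch_memory.save_context(
    {"question": "Hello, I am Jack."},
    {"text": "Hello, Jack. How can I assist you?"},
)
print(scratch_memory.load_memory_variables({}))
# {'chat_history': 'Human: Hello, I am Jack.\nAI: Hello, Jack. How can I assist you?'}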

Now, let’s focus on the Chain.memory variable, which is an instance of BaseMemory. BaseMemory is an interface designed by LangChain for Memory, containing necessary abstract methods such as load_memory_variables and save_context:

python
class BaseMemory(Serializable, ABC):
    @abstractmethod
    def load_memory_variables(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
        """Return key-value pairs given the text input to the chain."""
    @abstractmethod
    def save_context(self, inputs: Dict[str, Any], outputs: Dict[str, str]) -> None:
        """Save the context of this chain run to memory."""
    ...

Any specific Memory class can seamlessly integrate into LangChain's chains as long as it implements these methods.
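
As an example, here is a minimal sketch of a custom Memory class; it is illustrative rather than a LangChain component, it keeps the whole history in one string, and the "question" and "text" keys assume the LLMChain example above.

python
from typing import Any, Dict, List
from langchain_core.memory import BaseMemory

class SimpleStringMemory(BaseMemory):
    """Illustrative Memory: keep the whole conversation in a single string."""
    buffer: str = ""

    @property
    def memory_variables(self) -> List[str]:
        return ["chat_history"]

    def load_memory_variables(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
        # Queried by prep_inputs before the LLM call.
        return {"chat_history": self.buffer}

    def save_context(self, inputs: Dict[str, Any], outputs: Dict[str, str]) -> None:
        # Called by prep_outputs after the LLM responds.
        self.buffer += f"\nHuman: {inputs['question']}\nAI: {outputs['text']}"

    def clear(self) -> None:
        self.buffer = ""

Passing memory=SimpleStringMemory() to the LLMChain built earlier should behave much like ConversationBufferMemory, since prep_inputs and prep_outputs rely only on these interface methods.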

[Figure: memory2.webp]

Memory Implementation in LCEL Chains

LCEL is a chain-building method recommended by LangChain. Although many Memory components are still developed using the traditional chain integration method (implementing the BaseMemory interface) as of LangChain version 0.1.13, the framework code for integrating Memory with LCEL has already been implemented.

Let's rewrite the previous Memory example using the LCEL approach to quickly understand how to add Memory functionality to an LCEL chain.

The prompt template and model instantiation are the same as before, so I'll directly copy the code for clarity:

python
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

template_str = """
You are a chatbot having a conversation with a human.
Previous conversation:
{chat_history}
Human: {question}
AI:"""
prompt_template = PromptTemplate.from_template(template_str)
llm = ChatOpenAI(temperature=0)

Next, we build a chain using LCEL syntax:

python
memory_chain = prompt_template | llm

Now, we need to implement a function to retrieve the conversation history:

python
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.chat_history import BaseChatMessageHistory

store = {}
def get_session_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]

This function takes a session_id string (the key LangChain expects by default) and returns an instance of a BaseChatMessageHistory subclass, representing the historical records retrieved for that session ID.

Finally, we instantiate a RunnableWithMessageHistory object using the memory_chain and get_session_history:

python
from langchain_core.runnables.history import RunnableWithMessageHistory

with_message_history = RunnableWithMessageHistory(
    memory_chain,
    get_session_history,
    input_messages_key="question",
    history_messages_key="chat_history",
)

Here, input_messages_key specifies the variable name for user input in the prompt template, while history_messages_key specifies the variable name for historical records.

To execute the chain, specify the session ID:

python
with_message_history.invoke(
    {"question": "你好,我是jack"},
    config={"configurable": {"session_id": "abc123"}},
)

with_message_history.invoke(
    {"question": "我的名字叫什么?"},
    config={"configurable": {"session_id": "abc123"}},
)

At first glance, it may seem like there’s no memory storage code present. However, the key lies in the BaseChatMessageHistory returned by get_session_history.

BaseChatMessageHistory is designed by LangChain to store historical dialogues, functioning similarly to BaseMemory in traditional chains. It provides an abstract method add_message for adding historical records.

python
class BaseChatMessageHistory(ABC):
    ...
    messages: List[BaseMessage]
    ...
    def add_message(self, message: BaseMessage) -> None:
        """Add a message object to the store."""
        ...

For instance, the ChatMessageHistory class inherits from BaseChatMessageHistory, and its add_message method simply adds the current message to the messages array.

python
class ChatMessageHistory(BaseChatMessageHistory, BaseModel):
    ...
    def add_message(self, message: BaseMessage) -> None:
        """Add a self-created message to the store"""
        self.messages.append(message)
    ...

Thus, after the model responds, the BaseChatMessageHistory returned by get_session_history can call add_message to store the current dialogue.
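
You can verify this by inspecting the store after the two invocations above; the output shown is illustrative.

python
# Illustrative: the history for session "abc123" now holds all four messages.
for message in store["abc123"].messages:
    print(type(message).__name__, message.content)
# HumanMessage Hello, I am Jack.
# AIMessage ...
# HumanMessage What is my name?
# AIMessage ...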

Next, let's focus on RunnableWithMessageHistory, which extends RunnableBindingBase and wraps another runnable to manage its chat message history.

We can trace its execution starting from the invoke call. Invoking with_message_history dispatches to RunnableBindingBase.invoke:

python
class RunnableBindingBase(RunnableSerializable[Input, Output]):
    def invoke(...) -> Output:
        return self.bound.invoke(
            input,
            self._merge_configs(config),
            **{**self.kwargs, **kwargs},
        )

Here, the first point of interest is self._merge_configs(config), where config is the passed {"configurable": {"session_id": "abc123"}}. The _merge_configs method of RunnableWithMessageHistory is then called:

python
class RunnableWithMessageHistory(RunnableBindingBase):
    # Note: history_factory_config is set in __init__ and defaults to a single
    # ConfigurableFieldSpec(id="session_id", name="Session ID",
    #                       description="Unique identifier for a session.", ...)
    def _merge_configs(self, *configs: Optional[RunnableConfig]) -> RunnableConfig:
        config = super()._merge_configs(*configs)
        expected_keys = [field_spec.id for field_spec in self.history_factory_config]
        configurable = config.get("configurable", {})
        ...
        message_history = self.get_session_history(
            **{key: configurable[key] for key in expected_keys}
        )
        config["configurable"]["message_history"] = message_history
        return config

In _merge_configs, the get_session_history function retrieves the corresponding BaseChatMessageHistory object, which is then passed down.

Looking back at RunnableBindingBase's invoke method, it ultimately calls the invoke method of the bound variable. The assignment of bound in RunnableWithMessageHistory is as follows:

python
class RunnableWithMessageHistory(RunnableBindingBase):
    def __init__(
        self,
        runnable,
        get_session_history,
        ...
        input_messages_key: Optional[str] = None,
        ...
        history_messages_key: Optional[str] = None,
        ...
    ) -> None:
        history_chain: Runnable = RunnableLambda(
            self._enter_history, self._aenter_history
        ).with_config(run_name="load_history")
        messages_key = history_messages_key or input_messages_key
        if messages_key:
            history_chain = RunnablePassthrough.assign(
                **{messages_key: history_chain}
            ).with_config(run_name="insert_history")
        bound = (
            history_chain | runnable.with_listeners(on_end=self._exit_history)
        ).with_config(run_name="RunnableWithMessageHistory")

Here, a new LCEL chain is created by adding pre-processing (_enter_history) and post-processing (_exit_history) to the incoming LCEL chain, and it is assigned to the bound variable.
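
Conceptually, for a fixed session, the bound chain behaves roughly like the hand-written sketch below. This is a simplification: the real implementation resolves the session from the config at call time and persists the new messages through the on_end listener.

python
from langchain_core.runnables import RunnableLambda, RunnablePassthrough

# Rough conceptual equivalent for the fixed session "abc123" (illustrative only).
load_history = RunnableLambda(lambda _: store["abc123"].messages.copy())
bound_like = RunnablePassthrough.assign(chat_history=load_history) | memory_chain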

[Figure: memory3.webp]

Now, let's see what _enter_history and _exit_history do:

python
class RunnableWithMessageHistory(RunnableBindingBase):
    def _enter_history(self, input: Any, config: RunnableConfig) -> List[BaseMessage]:
        hist = config["configurable"]["message_history"]
        ...
        return hist.messages.copy()
    
    def _exit_history(self, run: Run, config: RunnableConfig) -> None:
        hist = config["configurable"]["message_history"]
        ...
        # (elided) input_messages and output_messages are extracted from the
        # run's recorded inputs and outputs
        for m in input_messages + output_messages:
            hist.add_message(m)

In _enter_history, it simply copies the contents of BaseChatMessageHistory.messages. In _exit_history, it calls BaseChatMessageHistory.add_message to store both the user and model response messages.

At this point, you might wonder how to add more complex query logic, such as a custom Memory component that stores history as vectors and retrieves it by vector similarity. One approach is Python's property decorator, which exposes a method's return value as an attribute: define a method named messages, mark it with @property, and every access to hist.messages will run your query logic.

python
class VectorMessageHistory(BaseChatMessageHistory, BaseModel):
    @property
    def messages(self) -> List[BaseMessage]:
        """Use vector similarity to query related dialogue records"""
        ...

In my opinion, it would be more intuitive if BaseChatMessageHistory provided a get_messages abstract method for developers to inherit and implement.

Summary

The Memory module is a crucial topic in LLM application development. The principle behind integrating Memory is straightforward: query historical records before executing the LLM to merge them into the prompt, and store the current dialogue records after execution.

The key to designing a Memory system lies in the data structures and algorithms used for querying and storing historical records. LangChain offers a variety of Memory components suited to different use cases. As of version 0.1.13, most Memory components are still designed for traditional chains, but future versions are expected to introduce more components for LCEL chains.

The integration methods for Memory in traditional and LCEL chains in LangChain differ. This article has provided a detailed source code analysis, and following this walkthrough should deepen your understanding and help you adapt more quickly to future iterations of the framework.
