Retrieval Part 2
In the previous section, we introduced the MultiQueryRetriever and SelfQueryRetriever, which preprocess the query before actual retrieval. Today, we'll explore several other retrievers that operate at different stages of the retrieval process.
ContextualCompressionRetriever
The ContextualCompressionRetriever addresses the issue where too many documents are retrieved, or the documents are too large, resulting in only a small portion of the returned documents being relevant to the query. Passing a large number of irrelevant documents to an LLM increases the cost and lowers the quality of the response.
Example
python
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

texts = [
    "LLM is a deep learning model trained on a large corpus of text data, capturing the complexity and diversity of language. By learning, LLM understands user inputs and generates coherent and accurate text responses. This model usually has billions or even trillions of parameters, enabling it to handle complex language tasks. LLM can act as a chatbot, providing 24/7 customer support, answering frequently asked questions, and improving service efficiency; programmers can use LLM to assist in writing and optimizing code, thereby enhancing development efficiency. As technology advances, LLM will be applied in more fields, bringing convenience to work and life. At the same time, developers need to consider ethical and social responsibility issues to ensure healthy technological development.",
    "The weather is nice today."
]
vectorstore = Chroma.from_texts(texts, OpenAIEmbeddings())
# Perform semantic similarity retrieval for the top 2 documents
base_retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 2})
original_docs = base_retriever.get_relevant_documents("How can LLM assist programmers?")
print("original_docs: ", original_docs)
Output:
plaintext
original_docs: [
Document(page_content='LLM is a deep learning model trained on a large corpus of text data... technological development.'),
Document(page_content='The weather is nice today.')
]
Since we specified returning 2 documents, even an unrelated document was included.
ContextualCompressionRetriever solves this by "trimming" the retrieved documents, keeping only the content closely related to the query.
Implementation of _get_relevant_documents
python
class ContextualCompressionRetriever(BaseRetriever):
    # Compressor for trimming documents after retrieval
    base_compressor: BaseDocumentCompressor
    # Retriever used for actual retrieval
    base_retriever: BaseRetriever
    ...

    def _get_relevant_documents(
        self,
        query: str,
        ...
    ) -> List[Document]:
        docs = self.base_retriever.get_relevant_documents(query, ...)
        if docs:
            compressed_docs = self.base_compressor.compress_documents(
                docs, query, ...
            )
            return list(compressed_docs)
        else:
            return []
The code first retrieves a list of documents with the base retriever, then calls the compressor's compress_documents method to obtain a trimmed document list. Understanding ContextualCompressionRetriever therefore comes down to understanding the various BaseDocumentCompressor implementations.
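Before looking at LangChain's built-in compressors, here is a minimal sketch of a custom compressor, assuming the compress_documents signature shown above (exact import paths can vary by LangChain version). The class name KeywordFilterCompressor and its purely keyword-based filtering logic are illustrative, not part of LangChain:
python
from typing import Optional, Sequence

from langchain.retrievers.document_compressors.base import BaseDocumentCompressor
from langchain_core.callbacks import Callbacks
from langchain_core.documents import Document


class KeywordFilterCompressor(BaseDocumentCompressor):
    """Hypothetical compressor: keep only documents containing a query keyword."""

    def compress_documents(
        self,
        documents: Sequence[Document],
        query: str,
        callbacks: Optional[Callbacks] = None,
    ) -> Sequence[Document]:
        keywords = query.lower().split()
        # Keep a document only if at least one query keyword appears in it
        return [
            doc
            for doc in documents
            if any(kw in doc.page_content.lower() for kw in keywords)
        ]
Such a compressor can be passed as base_compressor to ContextualCompressionRetriever in exactly the same way as the built-in compressors introduced below.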
LLMChainFilter
LLMChainFilter is a basic document compressor that traverses the retrieved document list, using an LLM to determine which document segments are relevant to the query.
Prompt used:
plaintext
prompt_template = """Given the following question and context, return YES if the context is relevant to the question and NO if it isn't.
> Question: {question}
> Context:
>>>
{context}
>>>
> Relevant (YES / NO):"""
Example Usage
python
from langchain.retrievers import ContextualCompressionRetriever
from langchain_openai import OpenAI
from langchain.retrievers.document_compressors import LLMChainFilter

llm = OpenAI(temperature=0)
compressor = LLMChainFilter.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=base_retriever
)
docs = compression_retriever.get_relevant_documents("How can LLM assist programmers?")
print(docs)
Output:
plaintext
[Document(page_content='LLM is a deep learning model trained on a large corpus of text data... technological development.')]
Here, irrelevant documents are filtered out.
LLMChainExtractor
When the original retriever returns a large document chunk, LLMChainFilter may not be effective since the entire chunk is related to the query. In this case, LLMChainExtractor can be used to extract the relevant parts from the document chunk without summarizing, retaining the original text.
Prompt used:
plaintext
prompt_template = """Given the following question and context, extract any part of the context *AS IS* that is relevant to answer the question. If none of the context is relevant return {no_output_str}.
Remember, *DO NOT* edit the extracted parts of the context.
> Question: {{question}}
> Context:
>>>
{{context}}
>>>
Extracted relevant parts:"""
The prompt emphasizes that the extracted text must be kept exactly as it appears in the original, so LLMChainExtractor effectively performs a second round of segmentation and retrieval on top of the vector search results.
Example Usage
python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import OpenAI
llm = OpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=base_retriever
)
docs = compression_retriever.get_relevant_documents("How can LLM assist programmers?")
print(docs)
Output:
plaintext
[Document(page_content='LLM is a deep learning model trained on a large corpus of text data... ensuring healthy technological development.')]
The LLMChainExtractor not only filters out irrelevant document chunks but also trims the remaining chunks to keep only the parts relevant to the query.
LongContextReorder
Related research has found that the performance of models in handling contextual information is related to the position of relevant information within the context: when relevant information appears at the beginning or end of the input context, the performance is typically highest; however, when the model needs to retrieve relevant information from the middle of the context, performance significantly declines.
Therefore, after querying a list of relevant documents from the vector database, it can be reordered and then passed to the LLM, placing the most relevant documents at the beginning and end, while placing the least relevant ones in the middle, which can effectively improve the response quality of the LLM.
LangChain provides LongContextReorder
to support the reordering of document lists. Below is the implementation code for reordering:
python
class LongContextReorder(BaseDocumentTransformer, BaseModel):
    def transform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> Sequence[Document]:
        """Reorders documents."""
        return _litm_reordering(list(documents))


def _litm_reordering(documents: List[Document]) -> List[Document]:
    documents.reverse()
    reordered_result = []
    for i, value in enumerate(documents):
        if i % 2 == 1:
            reordered_result.append(value)
        else:
            reordered_result.insert(0, value)
    return reordered_result
In the _litm_reordering function above (litm is short for "lost in the middle"), the document list is first reversed so that the least relevant documents come first; each document is then alternately inserted at the front or appended to the back of a new list. Because the most relevant documents are processed last, they end up at the two ends of the new list, while the least relevant ones land in the middle.
Below is the actual code effect:
python
from langchain_core.documents import Document
from langchain_community.document_transformers import (
LongContextReorder,
)
# Simulate the original retriever's returned document list
docs = [Document(page_content="A"), Document(page_content="B"), Document(page_content="C"),
Document(page_content="D"), Document(page_content="E"), Document(page_content="F"),
Document(page_content="G")]
reordering = LongContextReorder()
reordered_docs = reordering.transform_documents(docs)
print(reordered_docs)
Output:
[Document(page_content='A'), Document(page_content='C'), Document(page_content='E'), Document(page_content='G'), Document(page_content='F'), Document(page_content='D'), Document(page_content='B')]
EnsembleRetriever
Reciprocal Rank Fusion (RRF) is an algorithm for combining document rankings from multiple information retrieval systems. Using a simple scoring formula to re-rank documents, RRF has been shown in multiple experiments to outperform any single system as well as other learning-to-rank methods.
The principle of RRF is simple. Suppose we have three information retrieval systems A, B, and C, which return different document ranking results for the same query task:
Ranking from System A:
- Document 1
- Document 2
- Document 3
- Document 4
Ranking from System B:
- Document 3
- Document 1
- Document 4
- Document 2
Ranking from System C:
- Document 2
- Document 4
- Document 1
- Document 3
Now, we will use the RRF method to compute the combined rankings of these documents.
First, we need to calculate an RRF score for each document. The RRF formula is RRFscore(d) = Σ 1/(k + r(d)), where the sum runs over all retrieval systems, k is a constant (set to 60 in the paper), and r(d) is the rank of document d in a given system's result list.
RRF score for Document 1:
- In System A, the ranking of Document 1 is 1, so r(d) = 1 ;
- In System B, the ranking of Document 1 is 2, so r(d) = 2 ;
- In System C, the ranking of Document 1 is 3, so r(d) = 3 .
According to the RRF formula, the RRF score for Document 1 is:
RRFscore(Document 1) = 1/(60+1) + 1/(60+2) + 1/(60+3) ≈ 0.0164 + 0.0161 + 0.0159 ≈ 0.0484
Similarly, the RRF scores for the other documents are:
- RRFscore(Document 2) = 1/62 + 1/64 + 1/61 ≈ 0.0481
- RRFscore(Document 3) = 1/63 + 1/61 + 1/64 ≈ 0.0479
- RRFscore(Document 4) = 1/64 + 1/63 + 1/62 ≈ 0.0476
Thus, the final ranking of the documents is as follows:
- Document 1
- Document 2
- Document 3
- Document 4
This combined ranking reflects the overall assessment of document relevance by all three systems. Through RRF, we can see a more balanced and possibly more accurate ranking result that combines the strengths of different systems while reducing the biases that may exist in a single system.
Based on the theory of RRF, LangChain has designed EnsembleRetriever
, which uses multiple retrievers to query and obtain multiple sets of document lists, and then applies the RRF algorithm to obtain a more accurate document ranking.
Next, we will modify an example from the official website to use BM25 relevance search and vector semantic similarity retrieval to observe the effectiveness of EnsembleRetriever
.
Here, we briefly explain the BM25 search algorithm. It is a keyword-based ranking function that scores a document's relevance to a query by combining factors such as term frequency (how often the keyword appears in the document), inverse document frequency (how rare the keyword is across the whole document collection), and document-length normalization.
To use BM25, we need to first install the dependency package rank_bm25
:
bash
pip install rank_bm25
python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
doc_list_1 = [
    "I like apples",
    "I like oranges",
    "Apples and oranges are fruits",
]
bm25_retriever = BM25Retriever.from_texts(
    doc_list_1, metadatas=[{"source": 1}] * len(doc_list_1)
)
bm25_retriever.k = 2

doc_list_2 = [
    "I like apples",
    "You like apples",
    "You like oranges",
]
vectorstore = Chroma.from_texts(
    doc_list_2, OpenAIEmbeddings(), metadatas=[{"source": 2}] * len(doc_list_2)
)
chroma_retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, chroma_retriever], weights=[0.5, 0.5]
)
docs = ensemble_retriever.get_relevant_documents("apples")
print(docs)
Output:
[Document(page_content='I like apples', metadata={'source': 2}), Document(page_content='Apples and oranges are fruits', metadata={'source': 1}), Document(page_content='You like apples', metadata={'source': 2})]
When initializing EnsembleRetriever
, it is important to note that LangChain introduces weights, allowing each retriever to be assigned a weight. By default, each retriever has the same weight.
With weights assigned, the formula for calculating the final score for each document is as follows:
RRFscore(Document) = weight(Retriever A) * 1/(k + rank(Retriever A)) + weight(Retriever B) * 1/(k + rank(Retriever B)) + ...
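The fusion step itself is easy to reproduce outside of LangChain. Below is a minimal sketch of weighted RRF over lists of document IDs (the function weighted_rrf and the doc1...doc4 identifiers are illustrative, not LangChain code, although EnsembleRetriever follows the same weighted formula), using the three rankings from the earlier example with equal weights:
python
from collections import defaultdict
from typing import Dict, List


def weighted_rrf(rankings: List[List[str]], weights: List[float], k: int = 60) -> List[str]:
    """Fuse ranked lists of document IDs with weighted Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Higher fused score = higher final ranking
    return sorted(scores, key=scores.get, reverse=True)


rankings = [
    ["doc1", "doc2", "doc3", "doc4"],  # System A
    ["doc3", "doc1", "doc4", "doc2"],  # System B
    ["doc2", "doc4", "doc1", "doc3"],  # System C
]
print(weighted_rrf(rankings, weights=[1.0, 1.0, 1.0]))
# ['doc1', 'doc2', 'doc3', 'doc4']
With equal weights this reproduces the ranking computed by hand earlier; adjusting the weights biases the fused ranking toward the retriever you trust more.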
MultiVectorRetriever
First, consider this question: Does having larger embedded document blocks make it easier to find results that match user queries?
Actually, it does not. Embedding models produce vectors of a fixed dimensionality; larger documents contain more content, and when that content is compressed into a fixed-size vector, the vector may fail to capture all the information in the document block, leading to missed search results.
What if we make the embedded documents smaller?
This also presents issues. Smaller documents may accurately reflect their contents, but they often contain less information, resulting in potentially incomplete answers.
Thus, there is an inherent tension between the size of the embedded document blocks and the quality of the search results.
Embedding overly large document blocks can lose semantic information and cause retrieval misses, while embedding overly small blocks preserves the semantics but carries so little information that the final answer may be incomplete.
Using a hierarchical strategy, separating the document blocks used for embedding from those returned during retrieval, can effectively resolve this issue.
To clarify, suppose we have document block A, which we believe contains complete information for LLM processing. However, embedding document block A directly may result in lost semantic information.
In this case, we can use a transformation method to split document block A into multiple smaller documents, store their embeddings, and during vector database queries, retrieve the corresponding larger document A based on the returned smaller documents.
This ensures both the semantic integrity of the embeddings and the accuracy of the final retrieval results.
To implement this strategy, LangChain provides MultiVectorRetriever
. Let's look at the implementation of this retriever:
python
class MultiVectorRetriever(BaseRetriever):
    vectorstore: VectorStore
    docstore: BaseStore[str, Document]
    id_key: str = "doc_id"
    ...

    def _get_relevant_documents(self, query: str, ...) -> List[Document]:
        ...
        sub_docs = self.vectorstore.similarity_search(query, ...)
        ...
        ids = []
        for d in sub_docs:
            if self.id_key in d.metadata and d.metadata[self.id_key] not in ids:
                ids.append(d.metadata[self.id_key])
        docs = self.docstore.mget(ids)
        return [d for d in docs if d is not None]
- vectorstore: The vector storage object used for embedding documents and querying from the vector database.
- docstore: BaseStore is an abstract class designed for storing and managing data. Here, it's used to store the original larger document blocks.
In _get_relevant_documents
, the process first queries the vectorstore for a list of sub-documents, and then retrieves the corresponding larger documents from the docstore based on the IDs found in the metadata of the sub-documents.
Now, let's focus on common methods for splitting larger documents into smaller ones.
Using Splitters for Document Segmentation
This is the most intuitive and understandable approach. The goal is to capture semantics as closely as possible during embedding while conveying more contextual information during retrieval.
LangChain has specifically designed ParentDocumentRetriever
for this processing method. Below, we demonstrate it using MultiVectorRetriever
:
python
from langchain.retrievers import MultiVectorRetriever
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryByteStore
import uuid

# Original document list
docs = [
    Document(page_content="LangChain is a framework for developing applications powered by large language models (LLMs)"),
    Document(page_content="Build your applications using LangChain's open-source building blocks and components. Hit the ground running using third-party integrations and Templates. Use LangSmith to inspect, monitor and evaluate your chains, so that you can continuously optimize and deploy with confidence. Turn any chain into an API with LangServe.")
]
# Store for the original document list
store = InMemoryByteStore()
vectorstore = Chroma(embedding_function=OpenAIEmbeddings())
# Specify the key in the metadata for identifying the corresponding larger document
id_key = "doc_id"
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
# Assign an ID to each original document
doc_ids = [str(uuid.uuid4()) for _ in docs]
# Split original documents into smaller documents
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
sub_docs = []
for i, doc in enumerate(docs):
    _id = doc_ids[i]
    _sub_docs = child_text_splitter.split_documents([doc])
    for _doc in _sub_docs:
        # Store the corresponding larger document's ID in the sub-document metadata
        _doc.metadata[id_key] = _id
    sub_docs.extend(_sub_docs)
# Store sub-documents' embeddings in the vector database
retriever.vectorstore.add_documents(sub_docs)
# Store original documents in docstore
retriever.docstore.mset(list(zip(doc_ids, docs)))

print(retriever.vectorstore.similarity_search("LangServe"))
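Querying the vector store directly, as in the last line above, returns the small sub-documents. To get the original, larger documents back, query the retriever itself; per the _get_relevant_documents implementation shown earlier, it looks up the parent documents in the docstore via the doc_id metadata:
python
# Returns the original (larger) documents looked up from the docstore
print(retriever.get_relevant_documents("LangServe"))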
Summarizing Document Block Contents
Sometimes, summarizing the content can effectively distill the essence of a document block, ensuring that the embedded semantics are preserved. We can first use an LLM to generate a summary for each original document, then embed and store these summaries for retrieval, while the original documents remain in the docstore.
The implementation is similar to the above, and interested readers can refer to the official examples.
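For reference, here is a minimal sketch of the summary variant, assuming a fresh MultiVectorRetriever set up exactly like the one above (reusing docs, doc_ids, id_key, and retriever). The prompt wording and the use of ChatOpenAI are assumptions, so treat this as an outline rather than the official implementation:
python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Chain that produces a short summary for each original document
summarize_chain = (
    ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | ChatOpenAI(temperature=0)
    | StrOutputParser()
)

summary_docs = [
    Document(
        page_content=summarize_chain.invoke({"doc": doc.page_content}),
        # Link each summary back to its original document
        metadata={id_key: doc_ids[i]},
    )
    for i, doc in enumerate(docs)
]

# Embed the summaries; retrieval still returns the originals from the docstore
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))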
Generating Hypothetical Queries
Finally, another interesting method involves generating hypothetical questions for each document block using an LLM. A sample prompt might look like this:
Generate a list of exactly 3 hypothetical questions that the below document could be used to answer:\n\n{doc}
The generated hypothetical questions become our sub-document blocks, which we can then embed and store.
In this model, the retrieval pattern shifts from question -> answer to question -> question.
The implementation is straightforward, and you can refer to the official examples for guidance.
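As a rough sketch (again reusing docs, doc_ids, id_key, and retriever from the MultiVectorRetriever example above, and using the prompt shown earlier; splitting the model output by line is a simplification compared with the structured-output approach in the official example):
python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

question_chain = (
    ChatPromptTemplate.from_template(
        "Generate a list of exactly 3 hypothetical questions that the below "
        "document could be used to answer:\n\n{doc}"
    )
    | ChatOpenAI(temperature=0)
    | StrOutputParser()
)

question_docs = []
for i, doc in enumerate(docs):
    # The model returns the 3 questions as plain text; split them per line
    questions = question_chain.invoke({"doc": doc.page_content}).split("\n")
    question_docs.extend(
        Document(page_content=q, metadata={id_key: doc_ids[i]})
        for q in questions
        if q.strip()
    )

# Embed the hypothetical questions; the docstore still returns the full documents
retriever.vectorstore.add_documents(question_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))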
Summary
In today's lesson, we continued from last week and introduced several advanced document retrievers:
- ContextualCompressionRetriever: after a list of documents is returned from the vector store, it performs secondary filtering to retain only highly relevant content. Different filtering logic can be implemented by inheriting from BaseDocumentCompressor.
- LongContextReorder: reorders the retrieved document list so that the most relevant documents sit at both ends and the least relevant ones in the middle, enhancing LLM response quality.
- EnsembleRetriever: combines the results of multiple retrievers using the RRF algorithm to obtain a more accurate document ranking.
- MultiVectorRetriever: splits larger documents into smaller ones for embedding and storage, then retrieves the larger documents based on matches against the smaller ones, ensuring both semantic integrity and accurate results. Common splitting methods include further segmenting documents with splitters, summarizing document blocks, and generating hypothetical queries.