
Document Loading

In the previous section, we learned about function calling, which lets LLMs integrate with external systems. Today, we introduce RAG (Retrieval Augmented Generation), a technique for enhancing the interaction between LLMs and private-domain data.

RAG supplies external data to the LLM so that it can generate more accurate, context-aware answers.

The Need for RAG

We have repeatedly mentioned the limitations of LLMs: they cannot generate accurate answers for knowledge outside of their training data and may even hallucinate by providing incorrect responses.

At the same time, training on massive datasets has already given LLMs an "intelligent brain" with strong reasoning abilities; what they lack is the relevant knowledge itself.

RAG compensates for this by retrieving external data before calling the LLM, providing the retrieved knowledge to the LLM to fill in gaps in the question's context and domain knowledge, thereby fully utilizing the LLM's reasoning capabilities.

This is akin to an open-book exam where students can consult reference materials for answers. It tests their reasoning abilities rather than their knowledge in specific areas, which is also the core idea behind RAG.

Moreover, decoupling domain knowledge from LLM training allows LLMs to retain their reasoning capabilities while easily adapting to continuously updated knowledge. Compared to the traditional approach of fine-tuning models to fit specific domain knowledge, RAG is simpler and more convenient.

A Complete RAG System Workflow

A RAG system consists of the following five steps:

  1. Document Loading: Retrieve data from specified external storage.
  2. Splitting: Divide large documents into smaller chunks.
  3. Embedding: Use an embedding model to vectorize the data chunks.
  4. Vector Storage: Store the vectorized data in a vector database.
  5. Retrieval: Query the vector database with the user's question to retrieve the most relevant document chunks.

LangChain provides various components and tools for all these steps, enabling you to build a RAG system that meets your specific requirements easily.
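
To see how these steps fit together, here is a minimal end-to-end sketch using common LangChain components. The specific splitter, embedding model, and vector store chosen here are illustrative, and assume the langchain-text-splitters, langchain-openai, and faiss-cpu packages are installed:

python
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

docs = PyPDFLoader("pdf_loader_demo.pdf").load()       # 1. Document Loading
chunks = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=50
).split_documents(docs)                                # 2. Splitting
vectorstore = FAISS.from_documents(                    # 3. Embedding
    chunks, OpenAIEmbeddings()                         # 4. Vector Storage
)
related = vectorstore.similarity_search("What is ReAct?")  # 5. Retrieval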

Today, we'll start with Document Loading.


Introduction to Document Loaders

External data from users is diverse, featuring:

  • Variety of Data Sources: Local, online, or database sources.
  • Different Data Formats: Formats like PDF, HTML, JSON, TXT, and Markdown.
  • Various Operations: Different access and reading methods depending on the source and format.

To efficiently load and process document data, LangChain has designed a unified interface (BaseLoader) to load and parse documents.

python
class BaseLoader(ABC):
    ...
    @abstractmethod
    def load(self) -> List[Document]:
        ...
    ...

Based on this, LangChain provides different DocumentLoader components for various document data types. As seen in the code above, any DocumentLoader inheriting from BaseLoader must implement the load method, which returns an array of Document objects.

python
class Document(Serializable):
    page_content: str
    metadata: dict = Field(default_factory=dict)
    ...

In the Document class, two important attributes are:

  • page_content: Represents the document content.
  • metadata: Metadata describing the document. Different data types may carry different metadata fields, but there is always a source field indicating the document's origin.

Here is an example of a Document object for an HTML file:

plaintext
Document(page_content='My First Heading\n\nMy first paragraph.', lookup_str='', metadata={'source': 'example_data/fake-content.html'}, lookup_index=0)

The DocumentLoader uses the load method to load external data into an array of Document objects, converting it into a data structure that LangChain can understand, enabling seamless integration with other components.
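
Writing your own loader only requires subclassing BaseLoader and returning Document objects from load. Here is a minimal sketch for plain-text files; the TxtLoader class is hypothetical and assumes a recent langchain-core package layout:

python
from typing import List
from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document

class TxtLoader(BaseLoader):
    def __init__(self, file_path: str):
        self.file_path = file_path

    def load(self) -> List[Document]:
        with open(self.file_path, encoding="utf-8") as f:
            text = f.read()
        # The source field lets downstream components trace the origin
        return [Document(page_content=text, metadata={"source": self.file_path})]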

LangChain implements many document loaders; the full list is available in the LangChain documentation and the langchain_community package. Today, we'll only introduce some of the more common ones.

PyPDFLoader

PDF is one of the most common data formats, and LangChain provides extensive support for loading and extracting content from PDF files. PyPDFLoader is one of the commonly used PDF loaders based on the pypdf library, which reads PDF files page by page. Each loaded Document object represents one page of the PDF, with the metadata field containing information about the page number.

Here’s a demonstration using a PDF paper about ReAct. You can download it locally for testing.

To use PyPDFLoader, you first need to install the pypdf library:

bash
pip install pypdf

Loading a PDF

Here’s how you can load a PDF file using PyPDFLoader:

python
from langchain_community.document_loaders import PyPDFLoader

# Initialize the loader with the PDF file path
loader = PyPDFLoader("pdf_loader_demo.pdf")

# Load the PDF pages
pages = loader.load()

# Display the number of pages loaded
print(len(pages)) # Output: 33, matching the number of pages in the PDF

# Display the content of the first page
print(pages[0])

Output:

plaintext
Document(page_content='Published as a conference...', metadata={'source': 'pdf_loader_demo.pdf', 'page': 0})

Using this loading method, you can easily retrieve data based on the page number.
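
For large PDFs, you may not want to materialize every page at once. LangChain loaders also expose a lazy_load method that yields Document objects one at a time; a minimal sketch:

python
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("pdf_loader_demo.pdf")

# Iterate pages lazily instead of loading all 33 into memory at once
for page in loader.lazy_load():
    if page.metadata["page"] == 2:  # zero-based page numbers, as seen above
        print(page.page_content[:100])
        break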

Extracting Images

By default, PyPDFLoader does not parse the content of images in the PDF. To extract text from images, set the extract_images parameter to True. This feature relies on the rapidocr-onnxruntime library, so you need to install it beforehand:

bash
pip install rapidocr-onnxruntime

Here's how to load a PDF with image extraction enabled:

python
from langchain_community.document_loaders import PyPDFLoader

# Initialize the loader with image extraction enabled
loader = PyPDFLoader("pdf_loader_demo.pdf", extract_images=True)

# Load the PDF pages
pages = loader.load()

# Display the content of the second page
print(pages[1].page_content)

Output:

plaintext
Published as a conference paper at ICLR 2023
Type Definition ReAct CoT
SuccessTrue positive Correct reasoning trace and facts 94% 86%
False positive Hallucinated reasoning trace or facts 6% 14%
FailureReasoning error Wrong reasoning trace (including failing to recover from repetitive steps) 47% 16%
Search result error Search return empty or does not contain useful information 23% -
Hallucination Hallucinated reasoning trace or facts 0% 56%
Label ambiguity Right prediction but did not match the label precisely 29% 28%
Table 2: Types of success and failure modes of ReAct and CoT on HotpotQA, as well as their percentages in randomly selected examples studied by human.
...


After setting extract_images to True, the content from tables and images in the original PDF is successfully extracted.

Other PDF Loaders

Apart from PyPDFLoader, LangChain offers several other document loaders for PDFs, such as:

  • MathpixPDFLoader: Specialized for parsing mathematical formulas.
  • PyMuPDFLoader: Offers faster parsing speed.

For more information about these loaders, you can check out LangChain's documentation.
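
Since all of these loaders share the BaseLoader interface, swapping one for another is usually a one-line change. For example, assuming the pymupdf package is installed:

python
from langchain_community.document_loaders import PyMuPDFLoader

# Same interface as PyPDFLoader, backed by the faster PyMuPDF library
pages = PyMuPDFLoader("pdf_loader_demo.pdf").load()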

JSONLoader

JSON is one of the most commonly used data formats in daily development. LangChain leverages the jq library to create the JSONLoader, allowing flexible parsing and extraction of target JSON content using powerful JQ expressions.

To use JSONLoader, you first need to install the jq library:

bash
pip install jq

JQ provides a powerful query language designed specifically for working with JSON structures. With the jq_schema parameter, you can use JQ expressions to parse and extract data from JSON files.

Here’s an example demonstrating how to use JSONLoader:

Given a JSON file (example.json):

json
[
    {
        "id": 1,
        "name": "Zhang Wei",
        "email": "zhangwei@example.com",
        "age": 28,
        "city": "Beijing"
    },
    {
        "id": 2,
        "name": "Zhao Xiaodao",
        "email": "zhaoxiaodao@example.com",
        "age": 26,
        "city": "Shanghai"
    },
    {
        "id": 3,
        "name": "Li Lei",
        "email": "lilei@example.com",
        "age": 32,
        "city": "Shenzhen"
    }
]

To load the email addresses from this file:

python
from langchain_community.document_loaders import JSONLoader

# Initialize the loader with the file path and jq schema
loader = JSONLoader(
    file_path='example.json',
    jq_schema='.[].email'
)

# Load the data
data = loader.load()

# Display the loaded data
print(data)

Output:

plaintext
[Document(page_content='zhangwei@example.com', metadata={'source': 'example.json', 'seq_num': 1}),
 Document(page_content='zhaoxiaodao@example.com', metadata={'source': 'example.json', 'seq_num': 2}),
 Document(page_content='lilei@example.com', metadata={'source': 'example.json', 'seq_num': 3})]
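
jq_schema can also select entire objects. In that case, JSONLoader's content_key parameter picks the field used as page_content, and a metadata_func can copy other fields into the metadata. A sketch based on the same example.json:

python
from langchain_community.document_loaders import JSONLoader

def metadata_func(record: dict, metadata: dict) -> dict:
    # Copy extra fields from each JSON object into the Document metadata
    metadata["name"] = record.get("name")
    metadata["city"] = record.get("city")
    return metadata

loader = JSONLoader(
    file_path='example.json',
    jq_schema='.[]',          # select each object in the array
    content_key='email',      # use the email field as page_content
    metadata_func=metadata_func,
)
print(loader.load()[0].metadata)  # source and seq_num plus name and city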

UnstructuredFileLoader

Unlike the previous document loaders that parse specific file formats, UnstructuredFileLoader automatically detects the file type provided. It uses the unstructured library, which analyzes the content of the file and attempts to segment it into different elements for extraction.

UnstructuredFileLoader supports two extraction modes:

  1. single (default): Converts the entire document into a single Document object.
  2. elements: Converts each paragraph of the document into separate Document objects.

For better extraction results, it’s common to use subclasses of UnstructuredFileLoader, such as UnstructuredPDFLoader for PDFs, UnstructuredMarkdownLoader for Markdown files, or UnstructuredHTMLLoader for HTML files.

UnstructuredFileLoader and its subclasses may require several dependencies. Here are the necessary libraries:

bash
pip install unstructured
pip install pdf2image
pip install pdfminer.six
pip install pillow_heif
pip install unstructured_inference
pip install pytesseract
pip install pikepdf
# For macOS, install poppler
brew install poppler

Usage Example

Let's use the previous PDF document as an example to demonstrate the effect of the "unstructured" approach with UnstructuredPDFLoader.

Single Mode

python
from langchain_community.document_loaders import UnstructuredPDFLoader

# Default is single mode
loader = UnstructuredPDFLoader("pdf_loader_demo.pdf")

# Load the document
pages = loader.load()

# Display the number of Document objects
print(len(pages)) # Output: 1 (the entire PDF becomes a single Document)

Elements Mode

python
from langchain_community.document_loaders import UnstructuredPDFLoader

# Use elements mode
loader = UnstructuredPDFLoader("pdf_loader_demo.pdf", mode="elements")

# Load the document
pages = loader.load()

# Display content of the 11th element
print(pages[10].page_content)

Output:

plaintext
While large language models (LLMs) have demonstrated impressive performance across
...
only one or two in-context examples.

Using the elements mode, each paragraph or section of the document is treated as a separate Document object, providing more granular control over the extracted content.
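
In elements mode, each Document's metadata typically also carries a category field (such as Title or NarrativeText) assigned by the unstructured library, which makes it easy to filter by element type:

python
# Keep only title elements (assumes the category metadata field
# populated by the unstructured library)
titles = [p.page_content for p in pages if p.metadata.get("category") == "Title"]
print(titles[:3])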


DirectoryLoader

When working with multiple documents stored in a directory, manually loading each one can be time-consuming. To streamline this process, LangChain offers the DirectoryLoader, which allows loading all documents within a specified directory.

Parameters

DirectoryLoader supports several optional parameters:

  • loader_cls: By default, DirectoryLoader uses UnstructuredFileLoader to extract file contents. You can specify a different document loader through this parameter.
  • glob: Controls which types of files in the directory are loaded. For example, to load only .md files, use glob="*.md"; to load only PDF files, use glob="*.pdf".
  • use_multithreading: By default, a single thread is used to load all files in the directory. Setting use_multithreading to True enables multithreaded loading, which can increase loading efficiency.

Here’s an example of using DirectoryLoader to load PDF files from the src directory:

python
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader

# Specify the directory and the loader class
loader = DirectoryLoader("src/", glob="*.pdf", loader_cls=PyPDFLoader)

# Load the data
data = loader.load()

# Display the number of loaded documents
print(len(data)) # Output: 33

In this example, we load PDF files from the src directory using PyPDFLoader to extract the document data.
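
Combining the parameters above, a variant that also picks up PDFs in subdirectories and loads them in parallel might look like this:

python
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader

loader = DirectoryLoader(
    "src/",
    glob="**/*.pdf",           # recursive glob: include subdirectories
    loader_cls=PyPDFLoader,
    use_multithreading=True,   # load files on multiple threads
)
data = loader.load()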

WebBaseLoader

WebBaseLoader is a class designed for loading and parsing web pages. It fetches HTML content over HTTP and parses it with the BeautifulSoup library.

BeautifulSoup is a Python library for parsing HTML and XML documents. It constructs a tree structure from the HTML file and provides easy-to-use methods for locating page elements, making it convenient to extract and manipulate data.

Here’s how to use WebBaseLoader to load content from a webpage:

python
import bs4
from langchain_community.document_loaders import WebBaseLoader

# Initialize the loader with the URL and parsing conditions
loader = WebBaseLoader(
    web_path="https://www.gov.cn/jrzg/2013-10/25/content_2515601.htm",
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(class_=("p1")))
)

# Load the documents
docs = loader.load()

In the example, we first inspect the page's element structure and find that the content is located in elements with class="p1". We then create a filter and assign it to bs_kwargs to extract only the content of those elements.

SeleniumURLLoader

While WebBaseLoader is effective for extracting static web content, it struggles with dynamic pages that require browser rendering. When a dynamic page is requested, the server returns JavaScript files and an incomplete or even empty HTML skeleton; the browser must then execute the JavaScript to render the final page. WebBaseLoader, however, parses the returned HTML immediately and can therefore miss the required data.

Additionally, WebBaseLoader cannot access pages that require login.

To address these limitations, LangChain provides SeleniumURLLoader, which is based on the selenium library. Selenium is a browser automation library (widely used for testing) that can drive a real browser instance to execute JavaScript, render web pages, and simulate real user behavior, allowing data to be extracted once the page is fully rendered.

Limitations of SeleniumURLLoader

The current implementation of SeleniumURLLoader is quite basic and may not always deliver the desired results. Here's the load method implementation:

python
class SeleniumURLLoader(BaseLoader):
    ...
    def load(self) -> List[Document]:
        from unstructured.partition.html import partition_html
        docs: List[Document] = list()
        driver = self._get_driver()

        for url in self.urls:
            try:
                driver.get(url)
                page_content = driver.page_source
                elements = partition_html(text=page_content)
                text = "\n\n".join([str(el) for el in elements])
                metadata = self._build_metadata(url, driver)
                docs.append(Document(page_content=text, metadata=metadata))
            except Exception as e:
                if self.continue_on_failure:
                    logger.error(f"Error fetching or processing {url}, exception: {e}")
                else:
                    raise e

        driver.quit()
        return docs

The load method uses driver.get to fetch the content immediately, without waiting for the dynamic resources to load. This approach has two drawbacks:

  1. Rendering dynamic resources takes time, and fetching the content right after driver.get may not capture the complete data.
  2. It lacks support for complex scenarios, such as simulating user actions or specifying certain elements, which are core capabilities of Selenium.

Optimized Approach: NewSeleniumURLLoader

To improve upon the limitations of SeleniumURLLoader, here’s a modified version called NewSeleniumURLLoader:

python
class NewSeleniumURLLoader(BaseLoader):
    # Similar implementation as SeleniumURLLoader
    ...
    def __init__(..., handler: Optional[Callable[[WebDriver, str], str]] = None):
        ...
        self.handler = handler or self._default_handler

    # Default handler retrieves the content immediately after calling driver.get
    def _default_handler(self, driver: WebDriver, url: str) -> str:
        driver.get(url)
        return driver.page_source

    def load(self) -> List[Document]:
        from unstructured.partition.html import partition_html
        docs: List[Document] = list()
        driver = self._get_driver()

        for url in self.urls:
            try:
                # Use the custom handler to load page content
                page_content = self.handler(driver, url)
                # The following logic remains the same as SeleniumURLLoader
                elements = partition_html(text=page_content)
                text = "\n\n".join([str(el) for el in elements])
                metadata = self._build_metadata(url, driver)
                docs.append(Document(page_content=text, metadata=metadata))
            except Exception as e:
                if self.continue_on_failure:
                    logger.error(f"Error fetching or processing {url}, exception: {e}")
                else:
                    raise e

        driver.quit()
        return docs

The root issue with SeleniumURLLoader is that the page-fetching logic is coupled to the logic that converts content into Document objects. By accepting a custom handler, NewSeleniumURLLoader lets users decide how page content is loaded, as shown in the sketch below.
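
For example, a handler can use Selenium's WebDriverWait to block until a dynamically rendered element appears before reading the page source. A sketch (the CSS selector and URL are hypothetical):

python
from selenium.webdriver.common.by import By
from selenium.webdriver.remote.webdriver import WebDriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def wait_for_content(driver: WebDriver, url: str) -> str:
    driver.get(url)
    # Wait up to 10 seconds for the target element to be rendered
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "#main-content"))
    )
    return driver.page_source

loader = NewSeleniumURLLoader(urls=["https://example.com"], handler=wait_for_content)
docs = loader.load()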

Conclusion

To overcome the limitation that large language models (LLMs) cannot generate accurate answers about knowledge outside their training data, RAG (Retrieval-Augmented Generation) supplies external data through document loading, text splitting, embedding, vector storage, and retrieval. LangChain offers a variety of document loaders to handle different data sources and formats.

Today’s discussion covered document loaders for loading PDF files (PyPDFLoader), parsing JSON data (JSONLoader), extracting unstructured files (UnstructuredFileLoader), loading multiple documents from a directory (DirectoryLoader), and fetching web resources (WebBaseLoader and SeleniumURLLoader). Learning to choose the right loader is crucial for building a robust RAG system.
