Document Loading
In the previous section, we learned about function calling, which enhances an LLM's ability to integrate with external systems. Today, we introduce RAG (Retrieval-Augmented Generation), a technique for enhancing how LLMs work with private-domain data.
RAG provides external data to the LLM so that it can generate more accurate, context-aware answers.
The Need for RAG
We have repeatedly mentioned the limitations of LLMs: they cannot generate accurate answers for knowledge outside of their training data and may even hallucinate by providing incorrect responses.
That said, training on massive datasets has already given LLMs an "intelligent brain"; what they lack is only the relevant knowledge.
RAG compensates for this by retrieving external data before calling the LLM, providing the retrieved knowledge to the LLM to fill in gaps in the question's context and domain knowledge, thereby fully utilizing the LLM's reasoning capabilities.
This is akin to an open-book exam where students can consult reference materials for answers. It tests their reasoning abilities rather than their knowledge in specific areas, which is also the core idea behind RAG.
Moreover, decoupling domain knowledge from LLM training allows LLMs to retain their reasoning capabilities while easily adapting to continuously updated knowledge. Compared to the traditional approach of fine-tuning models to fit specific domain knowledge, RAG is simpler and more convenient.
A Complete RAG System Workflow
A RAG system consists of the following five steps:
- Document Loading: Retrieve data from specified external storage.
- Splitting: Divide large documents into smaller chunks.
- Embedding: Use an embedding model to vectorize the data chunks.
- Vector Storage: Store the vectorized data in a vector database.
- Retrieval: Query the vector database with the question to obtain relevant document knowledge.
LangChain provides various components and tools for all these steps, enabling you to build a RAG system that meets your specific requirements easily.
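To see how the five steps chain together, here is a minimal end-to-end sketch. The file path, chunk sizes, and the choice of OpenAI embeddings with a FAISS store are illustrative assumptions (they require the `langchain-openai` and `faiss-cpu` packages); each step is covered in detail in upcoming sections.

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# 1. Document Loading: read the external data
docs = PyPDFLoader("knowledge_base.pdf").load()

# 2. Splitting: divide large documents into smaller chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

# 3 & 4. Embedding + Vector Storage: vectorize the chunks and index them
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())

# 5. Retrieval: fetch the chunks most relevant to a question
retriever = vectorstore.as_retriever()
relevant_docs = retriever.invoke("What is ReAct?")
```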
Today, we'll start with Document Loading.
Introduction to Document Loaders
External data from users is diverse, featuring:
- Variety of Data Sources: Local, online, or database sources.
- Different Data Formats: Formats like PDF, HTML, JSON, TXT, and Markdown.
- Various Operations: Different access and reading methods depending on the source and format.
To efficiently load and process document data, LangChain has designed a unified interface (`BaseLoader`) to load and parse documents.
```python
class BaseLoader(ABC):
    ...

    @abstractmethod
    def load(self) -> List[Document]:
        ...

    ...
```
Based on this, LangChain provides different `DocumentLoader` components for various document data types. As seen in the code above, any `DocumentLoader` inheriting from `BaseLoader` must implement the `load` method, which returns a list of `Document` objects.
```python
class Document(Serializable):
    page_content: str
    metadata: dict = Field(default_factory=dict)
    ...
```
The `Document` class has two important attributes:
- `page_content`: The document content.
- `metadata`: The metadata of the data. Different data types may carry different metadata, but there is always a `source` field indicating the document's origin.
Here is an example of a `Document` object for an HTML file:

```plaintext
Document(page_content='My First Heading\n\nMy first paragraph.', lookup_str='', metadata={'source': 'example_data/fake-content.html'}, lookup_index=0)
```
The `DocumentLoader` uses the `load` method to load external data into a list of `Document` objects, converting it into a data structure that LangChain can understand and enabling seamless integration with other components.
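To make the contract concrete, here is a minimal, entirely hypothetical loader that subclasses `BaseLoader`, wraps an in-memory string, and returns it as a `Document`; real loaders differ mainly in where `load` gets its data:

```python
from typing import List

from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document

# A toy loader illustrating the BaseLoader contract: load() returns Documents.
class StringLoader(BaseLoader):
    def __init__(self, text: str):
        self.text = text

    def load(self) -> List[Document]:
        return [Document(page_content=self.text, metadata={"source": "in-memory"})]

docs = StringLoader("My First Heading\n\nMy first paragraph.").load()
print(docs[0].page_content)
```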
LangChain implements many document loaders; the full list is available in the LangChain documentation. Today, we'll only introduce some of the more common ones.
PyPDFLoader
PDF is one of the most common data formats, and LangChain provides extensive support for loading and extracting content from PDF files. PyPDFLoader is one of the commonly used PDF loaders. Based on the `pypdf` library, it reads PDF files page by page: each loaded `Document` object represents one page of the PDF, with the `metadata` field recording the page number.
Here’s a demonstration using a PDF paper about ReAct. You can download it locally for testing.
To use PyPDFLoader, you first need to install the `pypdf` library:

```bash
pip install pypdf
```
Loading a PDF
Here’s how you can load a PDF file using PyPDFLoader:
```python
from langchain_community.document_loaders import PyPDFLoader

# Initialize the loader with the PDF file path
loader = PyPDFLoader("pdf_loader_demo.pdf")

# Load the PDF pages
pages = loader.load()

# Display the number of pages loaded
print(len(pages))  # Output: 33, matching the number of pages in the PDF

# Display the content of the first page
print(pages[0])
```
Output:
```plaintext
Document(page_content='Published as a conference...', metadata={'source': 'pdf_loader_demo.pdf', 'page': 0})
```
Using this loading method, you can easily retrieve data based on the page number.
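For example, here is a quick sketch (reusing `pages` from the snippet above) that pulls out a single page via its metadata:

```python
# Page numbers in metadata are 0-based, so page 5 of the PDF has page == 4.
page_five = next(doc for doc in pages if doc.metadata["page"] == 4)
print(page_five.page_content[:200])
```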
Extracting Images
By default, PyPDFLoader does not parse the content of images in the PDF. To extract text from images, set the `extract_images` parameter to `True`. This feature relies on the `rapidocr-onnxruntime` library, so you need to install it beforehand:

```bash
pip install rapidocr-onnxruntime
```
Here's how to load a PDF with image extraction enabled:
```python
from langchain_community.document_loaders import PyPDFLoader

# Initialize the loader with image extraction enabled
loader = PyPDFLoader("pdf_loader_demo.pdf", extract_images=True)

# Load the PDF pages
pages = loader.load()

# Display the content of the second page
print(pages[1].page_content)
```
Output:
```plaintext
Published as a conference paper at ICLR 2023
Type Definition ReAct CoT
SuccessTrue positive Correct reasoning trace and facts 94% 86%
False positive Hallucinated reasoning trace or facts 6% 14%
FailureReasoning error Wrong reasoning trace (including failing to recover from repetitive steps) 47% 16%
Search result error Search return empty or does not contain useful information 23% -
Hallucination Hallucinated reasoning trace or facts 0% 56%
Label ambiguity Right prediction but did not match the label precisely 29% 28%
Table 2: Types of success and failure modes of ReAct and CoT on HotpotQA, as well as their percentages in randomly selected examples studied by human.
...
```
After setting `extract_images` to `True`, the content from tables and images in the original PDF is successfully extracted.
Other PDF Loaders
Apart from PyPDFLoader, LangChain offers several other document loaders for PDFs, such as:
- MathpixPDFLoader: Specialized for parsing mathematical formulas.
- PyMuPDFLoader: Offers faster parsing speed.
For more information about these loaders, you can check out LangChain's documentation.
JSONLoader
JSON is one of the most commonly used data formats in daily development. LangChain leverages the `jq` library to build the `JSONLoader`, allowing flexible parsing and extraction of target JSON content using powerful JQ expressions.
To use JSONLoader, you first need to install the `jq` library:

```bash
pip install jq
```
JQ provides a powerful query language designed specifically for working with JSON structures. Through the `jq_schema` parameter, you can use JQ expressions to parse and extract data from JSON files.
Here’s an example demonstrating how to use JSONLoader:
Given a JSON file (`example.json`):
```json
[
  {
    "id": 1,
    "name": "张伟",
    "email": "zhangwei@example.com",
    "age": 28,
    "city": "北京"
  },
  {
    "id": 2,
    "name": "赵小刀",
    "email": "zhaoxiaodao@example.com",
    "age": 26,
    "city": "上海"
  },
  {
    "id": 3,
    "name": "李雷",
    "email": "lilei@example.com",
    "age": 32,
    "city": "深圳"
  }
]
```
To load the email addresses from this file:
```python
from langchain_community.document_loaders import JSONLoader

# Initialize the loader with the file path and jq schema
loader = JSONLoader(
    file_path='example.json',
    jq_schema='.[].email'
)

# Load the data
data = loader.load()

# Display the loaded data
print(data)
```
Output:
```plaintext
[Document(page_content='zhangwei@example.com', metadata={'source': 'example.json', 'seq_num': 1}),
 Document(page_content='zhaoxiaodao@example.com', metadata={'source': 'example.json', 'seq_num': 2}),
 Document(page_content='lilei@example.com', metadata={'source': 'example.json', 'seq_num': 3})]
```
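`JSONLoader` can also promote record fields into metadata. The sketch below runs against the same `example.json` and uses the `content_key` and `metadata_func` parameters; the choice of fields here is purely illustrative:

```python
from langchain_community.document_loaders import JSONLoader

# Copy selected record fields into each Document's metadata.
def metadata_func(record: dict, metadata: dict) -> dict:
    metadata["name"] = record.get("name")
    metadata["city"] = record.get("city")
    return metadata

loader = JSONLoader(
    file_path="example.json",
    jq_schema=".[]",              # iterate over every record
    content_key="email",          # use the email field as page_content
    metadata_func=metadata_func,
)
data = loader.load()
print(data[0].metadata)  # includes source, seq_num, name, and city
```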
UnstructuredFileLoader
Unlike the previous document loaders, which parse specific file formats, `UnstructuredFileLoader` automatically detects the type of the file provided. It uses the `unstructured` library, which analyzes the file's content and attempts to segment it into different elements for extraction.
`UnstructuredFileLoader` supports two extraction modes:
- single (default): Converts the entire document into a single `Document` object.
- elements: Converts each paragraph of the document into a separate `Document` object.
For better extraction results, it's common to use subclasses of `UnstructuredFileLoader`, such as `UnstructuredPDFLoader` for PDFs, `UnstructuredMarkdownLoader` for Markdown files, or `UnstructuredHTMLLoader` for HTML files.
`UnstructuredFileLoader` and its subclasses may require several dependencies. Here are the necessary libraries:
```bash
pip install unstructured
pip install pdf2image
pip install pdfminer.six
pip install pillow_heif
pip install unstructured_inference
pip install pytesseract
pip install pikepdf

# For macOS, install poppler
brew install poppler
```
Usage Example
Let's use the previous PDF document as an example to demonstrate the effect of the "unstructured" approach with `UnstructuredPDFLoader`.
Single Mode
```python
from langchain_community.document_loaders import UnstructuredPDFLoader

# Default is single mode
loader = UnstructuredPDFLoader("pdf_loader_demo.pdf")

# Load the document
pages = loader.load()

# Display the number of Document objects (the whole PDF becomes one)
print(len(pages))  # Output: 1
```
Elements Mode
```python
from langchain_community.document_loaders import UnstructuredPDFLoader

# Use elements mode
loader = UnstructuredPDFLoader("pdf_loader_demo.pdf", mode="elements")

# Load the document
pages = loader.load()

# Display content of the 11th element
print(pages[10].page_content)
```
Output:
```plaintext
While large language models (LLMs) have demonstrated impressive performance across
...
only one or two in-context examples.
```
Using the elements mode, each paragraph or section of the document is treated as a separate `Document` object, providing more granular control over the extracted content.
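In elements mode, the `unstructured` library also records each element's type in the metadata under `category` (for example `Title` or `NarrativeText`), which makes filtering straightforward; a small sketch reusing `pages` from above:

```python
# Keep only title elements, using the category tag that unstructured adds to metadata.
titles = [p for p in pages if p.metadata.get("category") == "Title"]
for t in titles[:5]:
    print(t.page_content)
```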
DirectoryLoader
When working with multiple documents stored in a directory, manually loading each one can be time-consuming. To streamline this process, LangChain offers the `DirectoryLoader`, which loads all documents within a specified directory.
Parameters
`DirectoryLoader` supports several optional parameters:
- `loader_cls`: By default, `DirectoryLoader` uses `UnstructuredFileLoader` to extract file contents. You can specify a different document loader through this parameter.
- `glob`: Controls which types of files in the directory are loaded. For example, to load only `.md` files, use `glob="*.md"`; to load only PDF files, use `glob="*.pdf"`.
- `use_multithreading`: By default, a single thread loads all files in the directory. Setting `use_multithreading` to `True` enables multithreaded loading, which can improve loading efficiency.
Here’s an example of using `DirectoryLoader` to load PDF files from the `src` directory:
```python
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader

# Specify the directory and the loader class
loader = DirectoryLoader("src/", glob="*.pdf", loader_cls=PyPDFLoader)

# Load the data
data = loader.load()

# Display the number of loaded documents
print(len(data))  # Output: 33
```
In this example, we load PDF files from the `src` directory using `PyPDFLoader` to extract the document data.
WebBaseLoader
`WebBaseLoader` is a class designed for loading and parsing web pages. It uses the `urllib` library to fetch HTML content and the `BeautifulSoup` library to parse it.
`BeautifulSoup` is a Python library for parsing HTML and XML documents. It builds a tree structure from the HTML and provides easy-to-use methods for locating page elements, making it convenient to extract and manipulate data.
Here’s how to use `WebBaseLoader` to load content from a webpage:
```python
import bs4
from langchain_community.document_loaders import WebBaseLoader

# Initialize the loader with the URL and parsing conditions
loader = WebBaseLoader(
    web_path="https://www.gov.cn/jrzg/2013-10/25/content_2515601.htm",
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(class_=("p1")))
)

# Load the documents
docs = loader.load()
```
In the example, we first inspect the page's element structure and find that the content lives in elements with `class="p1"`. We then create a filter and pass it via `bs_kwargs` to extract only the content of those elements.
SeleniumURLLoader
While `WebBaseLoader` is effective for extracting static web content, it struggles with dynamic pages that require browser rendering. When a dynamic page is requested, the server returns JavaScript files and a possibly incomplete or empty HTML structure; the browser then executes the JavaScript to render the page. `WebBaseLoader`, however, parses the returned HTML immediately, potentially missing the required data.
Additionally, `WebBaseLoader` cannot access pages that require login.
To address these limitations, LangChain provides `SeleniumURLLoader`, which is based on the `selenium` library. Selenium is a browser-automation testing library that can spin up a browser instance to execute JavaScript, render web pages, and simulate real user behavior, allowing data to be extracted once the page content is fully rendered.
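Basic usage mirrors the other loaders; a minimal sketch (the URL is a placeholder):

```python
from langchain_community.document_loaders import SeleniumURLLoader

# One browser instance fetches each URL in turn.
loader = SeleniumURLLoader(urls=["https://example.com"])
docs = loader.load()
print(docs[0].metadata)  # includes the source URL
```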
Limitations of SeleniumURLLoader
The current implementation of `SeleniumURLLoader` is quite basic and may not always deliver the desired results. Here's the `load` method implementation:
```python
class SeleniumURLLoader(BaseLoader):
    ...
    def load(self) -> List[Document]:
        from unstructured.partition.html import partition_html

        docs: List[Document] = list()
        driver = self._get_driver()

        for url in self.urls:
            try:
                driver.get(url)
                page_content = driver.page_source
                elements = partition_html(text=page_content)
                text = "\n\n".join([str(el) for el in elements])
                metadata = self._build_metadata(url, driver)
                docs.append(Document(page_content=text, metadata=metadata))
            except Exception as e:
                if self.continue_on_failure:
                    logger.error(f"Error fetching or processing {url}, exception: {e}")
                else:
                    raise e

        driver.quit()
        return docs
```
The `load` method uses `driver.get` to fetch the content immediately, without waiting for dynamic resources to load. This approach has two drawbacks:
- Rendering dynamic resources takes time, and fetching the content right after `driver.get` may not capture the complete data.
- It lacks support for complex scenarios, such as simulating user actions or waiting for specific elements to appear, which are core capabilities of Selenium.
Optimized Approach: NewSeleniumURLLoader
To improve upon the limitations of `SeleniumURLLoader`, here's a modified version called `NewSeleniumURLLoader`:
```python
class NewSeleniumURLLoader(BaseLoader):
    # Similar implementation as SeleniumURLLoader
    ...

    def __init__(..., handler: Callable[[WebDriver, str], str] = None):
        ...
        self.handler = handler or self._default_handler

    # Default handler retrieves the content immediately after calling driver.get
    def _default_handler(self, driver: WebDriver, url: str) -> str:
        driver.get(url)
        return driver.page_source

    def load(self) -> List[Document]:
        from unstructured.partition.html import partition_html

        docs: List[Document] = list()
        driver = self._get_driver()

        for url in self.urls:
            try:
                # Use the custom handler to load page content
                page_content = self.handler(driver, url)

                # The following logic remains the same as SeleniumURLLoader
                elements = partition_html(text=page_content)
                text = "\n\n".join([str(el) for el in elements])
                metadata = self._build_metadata(url, driver)
                docs.append(Document(page_content=text, metadata=metadata))
            except Exception as e:
                if self.continue_on_failure:
                    logger.error(f"Error fetching or processing {url}, exception: {e}")
                else:
                    raise e

        driver.quit()
        return docs
```
The issue with `SeleniumURLLoader` is the coupling between retrieving page content and converting it into `Document` objects. By supplying a custom handler, users can define exactly how the page content is loaded.
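For instance, a handler can use Selenium's explicit waits to block until a dynamically rendered element appears before grabbing the page source. Here's a sketch against the `NewSeleniumURLLoader` above; the URL and CSS selector are placeholders:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.remote.webdriver import WebDriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Wait up to 10 seconds for the dynamic content to render before reading the page.
def wait_for_content(driver: WebDriver, url: str) -> str:
    driver.get(url)
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".article-body"))
    )
    return driver.page_source

loader = NewSeleniumURLLoader(
    urls=["https://example.com/dynamic-page"],
    handler=wait_for_content,
)
docs = loader.load()
```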
Conclusion
To overcome LLMs' inability to generate accurate answers about knowledge outside their training data, RAG (Retrieval-Augmented Generation) supplies external data through document loading, text splitting, embedding, vector storage, and retrieval. LangChain offers a variety of document loaders to handle different data sources and formats.
Today’s discussion covered document loaders for loading PDF files (`PyPDFLoader`), parsing JSON data (`JSONLoader`), extracting unstructured files (`UnstructuredFileLoader`), loading multiple documents from a directory (`DirectoryLoader`), and fetching web resources (`WebBaseLoader` and `SeleniumURLLoader`). Learning to choose the right loader is crucial for building a robust RAG system.