Document Segmentation
We will look at how segmentation improves the efficiency and accuracy of RAG (Retrieval-Augmented Generation) applications from several angles: why segmentation is needed, what segmentation strategies exist, and how to choose among them.
Why Segmentation is Necessary
Document segmentation refers to the process of dividing loaded documents into smaller chunks.
We actually encountered this in the previous lesson: PyPDFLoader splits the loaded data into document chunks by page, and the elements mode of UnstructuredFileLoader chunks the data by element.
Let's review the entire RAG process. If we do not consider document segmentation, the loaded documents will first be vectorized using an embedding model, stored in a vector database, and then retrieved to be submitted along with the user's question to the LLM (Large Language Model). From this perspective, the process seems straightforward. So, why is chunking still necessary?
Document Embedding
Document vectorization involves extracting data features, compressing and reducing dimensions, and converting them into a set of numerical arrays or matrices in vector form. The larger the document, the more information is lost after vectorization, which can severely impact subsequent retrieval effectiveness. Additionally, different embedding models perform with varying efficiency on different sizes of document chunks. For instance, sentence embedding models perform better when handling individual sentences, while OpenAI's text-embedding-ada-002 embedding model performs better on chunks sized at 256 or 512 tokens.
Storing in Vector Databases
Large document chunks also pose a challenge for vector databases: they increase storage space and computational resource requirements while reducing data retrieval performance and efficiency.
Retrieval
When retrieving document data based on user questions, if a single document is too large, the relevance of the retrieval results will be lower. For example, a 1000-word document may only briefly mention programming in 50 words. If the user inputs a question related to programming, the entire large document may be returned, which reduces the relevance of the document. The relevance of the document directly affects the accuracy of the output results from the LLM.
LLM Invocation
Currently, LLMs limit the number of tokens that can be sent in a single request, and an entire document may exceed that limit. However, as LLMs evolve, these limits will likely loosen. For example, the Chinese large model Kimi was recently released with support for 2 million Chinese characters of context, roughly 32 times that of GPT-4 Turbo!
Thus, document segmentation is essential when building RAG applications. The goal of segmentation is to ensure that the data chunks are small enough while maintaining the semantic relevance of the document chunks.
These two objectives can be contradictory. If the chunks are too small, it may lead to the loss of contextual information, resulting in incomplete semantics of the document chunks. For instance: “Xiaoming likes Xiaohong, and he also likes Xiaoqing.” If this document is divided into two chunks: “Xiaoming likes Xiaohong” and “he also likes Xiaoqing,” when asked the question, “Who does Xiaoming like?”, only “Xiaoming likes Xiaohong” can be retrieved, losing the information about his other love, “Xiaoqing.”
Therefore, document segmentation needs to be considered with trade-offs based on the application scenario. LangChain provides a series of document segmentation methods for us to choose from. Below, we will explain some common strategies in detail.
Common Segmentation Strategies in LangChain
The text splitters in LangChain are located in the `langchain_text_splitters` package, which we need to install manually:

```bash
pip install langchain_text_splitters
```
Most document splitters inherit from `TextSplitter`. Let's take a look at some key parts of the `TextSplitter` code:
```python
class TextSplitter(BaseDocumentTransformer, ABC):
    # Initialize the splitter
    def __init__(
        self,
        # Size of each document chunk after segmentation
        chunk_size: int = 4000,
        # Number of overlapping characters between two chunks
        chunk_overlap: int = 200,
        ...
    ) -> None:
        ...

    @abstractmethod
    def split_text(self, text: str) -> List[str]:
        """Split text into multiple components."""

    # Split multiple texts into a list of documents based on a strategy
    def create_documents(
        self, texts: List[str], metadatas: Optional[List[dict]] = None
    ) -> List[Document]:
        _metadatas = metadatas or [{}] * len(texts)
        documents = []
        for i, text in enumerate(texts):
            ...
            # Call split_text to divide large texts into smaller texts
            for chunk in self.split_text(text):
                metadata = copy.deepcopy(_metadatas[i])
                ...
                # Wrap into a Document object
                new_doc = Document(page_content=chunk, metadata=metadata)
                documents.append(new_doc)
        return documents

    # Split multiple documents into a new list of documents based on a strategy
    def split_documents(self, documents: Iterable[Document]) -> List[Document]:
        texts, metadatas = [], []
        for doc in documents:
            texts.append(doc.page_content)
            metadatas.append(doc.metadata)
        return self.create_documents(texts, metadatas=metadatas)

    # Merge smaller splits into chunks, respecting chunk_size and chunk_overlap
    def _merge_splits(self, splits: Iterable[str], separator: str) -> List[str]:
        ...
```
`create_documents` and `split_documents` are the two methods we use most often in actual development; both produce a list of documents. `split_documents` first extracts the text content (`page_content`) and metadata from the documents passed in, then calls `create_documents` to re-split them.
The `create_documents` method ultimately calls `split_text` to do the actual splitting. `TextSplitter` itself does not implement `split_text`; each concrete splitter must implement the splitting logic according to its own strategy.
Next, let's look at the two parameters passed when initializing `TextSplitter`: `chunk_size` and `chunk_overlap`.
- `chunk_size`: limits the size of each document chunk after splitting.
- `chunk_overlap`: the maximum number of overlapping characters between two document chunks.
`chunk_overlap` helps preserve the coherence and completeness of chunk semantics. Taking the earlier example, “Xiaoming likes Xiaohong, and he also likes Xiaoqing,” if we split it into the two chunks “Xiaoming likes Xiaohong” and “he also likes Xiaoqing” and set `chunk_overlap > 0`, the two chunks may be merged back into one: “Xiaoming likes Xiaohong, and he also likes Xiaoqing.” This preserves the semantic integrity of the document.
The merging logic is provided by `TextSplitter` through `_merge_splits`. This method attempts to merge the provided list of splits based on `chunk_size` and `chunk_overlap`, according to the following rules:
- If a chunk is larger than `chunk_size`, do not merge it.
- If a chunk is between `chunk_overlap` and `chunk_size` in size:
  - If the chunk plus the following chunk exceeds `chunk_size`, do not merge.
  - If the chunk plus the following chunk is within `chunk_size`, merge them.
- If a chunk is smaller than `chunk_overlap`:
  - If the preceding chunk plus this chunk exceeds `chunk_size`, do not merge.
  - If the preceding chunk plus this chunk is within `chunk_size`, merge it into the preceding chunk.
From the illustration above, we can see that `_merge_splits` merges smaller pieces (usually produced by splitting on a specific delimiter) into chunks that stay within `chunk_size` as much as possible (special cases may still exceed `chunk_size`, as seen in the last image, where the BC chunk exceeds the limit).
At the same time, particularly small pieces (smaller than `chunk_overlap`) may also be merged into adjacent chunks, producing some overlap and maintaining semantic continuity.
It is important to note that `_merge_splits` takes the length of the delimiter into account when merging. Assuming the delimiter in the illustration is `\n\n`, if document chunk A has a length of 60, that includes the 2-character delimiter, so the actual content length is only 58.
`_merge_splits` is optional; each splitter can decide whether to call it in its `split_text` method.
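To tie these pieces together, here is a minimal sketch of a custom splitter. The class name `SimpleDelimiterSplitter` and the sample text are made up for illustration; it simply splits on a fixed delimiter, lets `_merge_splits` handle `chunk_size` and `chunk_overlap`, and reuses `create_documents` from the base class:

```python
from typing import List

from langchain_text_splitters import TextSplitter


class SimpleDelimiterSplitter(TextSplitter):
    """Hypothetical splitter: split on a fixed delimiter, then merge with _merge_splits."""

    def __init__(self, delimiter: str = "\n\n", **kwargs):
        super().__init__(**kwargs)
        self._delimiter = delimiter

    def split_text(self, text: str) -> List[str]:
        # Split on the delimiter and drop empty pieces
        splits = [s for s in text.split(self._delimiter) if s]
        # Hand the pieces to _merge_splits so they respect chunk_size / chunk_overlap
        return self._merge_splits(splits, self._delimiter)


splitter = SimpleDelimiterSplitter(chunk_size=50, chunk_overlap=10)
docs = splitter.create_documents(
    ["First paragraph.\n\nSecond paragraph.\n\nThird paragraph."],
    metadatas=[{"source": "example.txt"}],
)
for doc in docs:
    print(doc.page_content, doc.metadata)
```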
With that, we have covered the basic principles and architecture of `TextSplitter` in LangChain. Next, we will explore several common document splitters in LangChain, from simple to complex.
Character Text Segmentation
Character-based segmentation is the simplest strategy: we specify a delimiter and split the text on it. In LangChain, this strategy is implemented by `CharacterTextSplitter`.
```python
class CharacterTextSplitter(TextSplitter):
    def __init__(
        self,
        ...
        # Specify the delimiter
        separator: str = "\n\n",
    ) -> None:
        ...

    def split_text(self, text: str) -> List[str]:
        # Get the delimiter
        separator = (
            self._separator if self._is_separator_regex else re.escape(self._separator)
        )
        # Split the text by the delimiter
        splits = _split_text_with_regex(text, separator, self._keep_separator)
        _separator = "" if self._keep_separator else self._separator
        # Call TextSplitter._merge_splits to merge the split text
        return self._merge_splits(splits, _separator)
```
`CharacterTextSplitter` defaults to splitting on double line breaks (`\n\n`). Let's look at a concrete example. To prevent `_merge_splits` from merging chunks, we set `chunk_size` and `chunk_overlap` to very small values.
```python
from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=1,
    chunk_overlap=0
)

text = '666666\n\n333\n\n22'
print(splitter.split_text(text))
"""
['666666', '333', '22']
"""
```
Sentence Segmentation
As mentioned earlier, some embedding models are specifically optimized for individual sentence embeddings. Therefore, in some cases, we may want to split data by sentences.
NLTK and spaCy are two popular natural language processing libraries in Python; both provide rich tools and functionality for analyzing and processing text data. LangChain builds on them to provide `NLTKTextSplitter` and `SpacyTextSplitter`.
NLTKTextSplitter
NLTK is one of the earliest NLP libraries in Python and offers tools for various language processing tasks, including tokenization, part-of-speech tagging, named entity recognition, and parsing. LangChain uses the sentence tokenizer from the NLTK library to implement text segmentation by sentences.
```python
class NLTKTextSplitter(TextSplitter):
    """Splitting text using NLTK package."""

    def __init__(
        self, separator: str = "\n\n", language: str = "english", **kwargs: Any
    ) -> None:
        ...
        from nltk.tokenize import sent_tokenize

        self._tokenizer = sent_tokenize
        ...
        self._separator = separator
        self._language = language

    def split_text(self, text: str) -> List[str]:
        """Split incoming text and return chunks."""
        # Use the tokenizer to split sentences
        splits = self._tokenizer(text, language=self._language)
        # Call TextSplitter._merge_splits to merge the split sentences
        return self._merge_splits(splits, self._separator)
```
`NLTKTextSplitter` also calls `TextSplitter._merge_splits` to merge the sentences after splitting.
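One practical note: NLTK's `sent_tokenize` relies on the Punkt tokenizer data, which is not bundled with the library. If you hit a `LookupError` when running the example below, download the data first (a one-time step; on newer NLTK versions the resource may be named `punkt_tab`):

```python
import nltk

# One-time download of the Punkt sentence tokenizer data
nltk.download("punkt")
```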
```python
from langchain_text_splitters import NLTKTextSplitter

splitter = NLTKTextSplitter(
    chunk_size=1,
    chunk_overlap=0
)

text = 'This is a test sentence for testing NLTKTextSplitter! It will be splitted to several sub sentences, let see how it works.'
print(splitter.split_text(text))
"""
['This is a test sentence for testing NLTKTextSplitter!', 'It will be splitted to several sub sentences, let see how it works.']
"""
```
As we can see, `NLTKTextSplitter` automatically distinguishes between different punctuation marks and splits sentences at terminal punctuation (question marks, periods, exclamation marks, and so on).
SpacyTextSplitter
NLTK can be slow when processing large texts, as it is geared more toward teaching and research scenarios.
spaCy is a Python library implemented in Cython, which makes it faster and optimized for production environments. It uses newer algorithms and pre-trained models to provide higher segmentation accuracy, along with effective parallel processing and better memory management, making it suitable for handling large datasets.
Before using `SpacyTextSplitter`, we need to install the spaCy library and the required model in advance. `SpacyTextSplitter` uses the `en_core_web_sm` model by default. Below is how to install the model in a normal Python environment and when managing the project with pdm:
Normal Python environment:

```bash
python -m spacy download en_core_web_sm
```

Using PDM:

```bash
pdm add https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1.tar.gz
```
The above URL can be replaced with the actual version number of the model needed. You can find the correct version number and download link on the official spaCy model release page.
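To confirm the model is installed before wiring it into the splitter, a quick sanity check (the sample sentence is arbitrary):

```python
import spacy

# Raises OSError if en_core_web_sm is not installed
nlp = spacy.load("en_core_web_sm")
print([token.text for token in nlp("This is a quick sanity check.")])
```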
After downloading the model, using `SpacyTextSplitter` is straightforward:
```python
from langchain_text_splitters import SpacyTextSplitter

splitter = SpacyTextSplitter(
    chunk_size=1,
    chunk_overlap=0
)

text = 'This is a test sentence for testing NLTKTextSplitter! It will be splitted to several sub sentences, let see how it works.'
print(splitter.split_text(text))
"""
['This is a test sentence for testing NLTKTextSplitter!', 'It will be splitted to several sub sentences, let see how it works.']
"""
```
Recursive Character Text Segmentation
The problem with `CharacterTextSplitter` is that it only accepts a single delimiter, which may produce document chunks significantly larger than the intended `chunk_size`. `RecursiveCharacterTextSplitter` addresses this by letting us specify a list of delimiters. After splitting the document with the first delimiter, if the resulting chunks are still larger than expected, it recursively applies the remaining delimiters until the chunks reach the desired size or all delimiters have been tried.
The default delimiter list for `RecursiveCharacterTextSplitter` is `["\n\n", "\n", " ", ""]`.
Here is an example of using `RecursiveCharacterTextSplitter`:
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=5,
    chunk_overlap=1
)

# The following text is a test for the RecursiveCharacterTextSplitter
text = "This is a test text for the RecursiveCharacterTextSplitter. It is a long text with many words."
print(splitter.split_text(text))
"""
['This', 'is a', 'test', 'text', 'for', 'the', 'Recu', 'ursiv', 'veCha', 'aract', 'terTe', 'extSp', 'plitt', 'ter.', 'It', 'is a', 'long', 'text', 'with', 'many', 'word', 'ds.']
"""
```
As we can see, the final default delimiter `""` guarantees that we get chunks of the desired size. However, it can cut words apart and produce chunks without clear semantics, so whether to rely on it depends on the specific context.
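For comparison, here is the same text with a more realistic `chunk_size`. Because the word-level separator `" "` already produces pieces below the limit, the `""` fallback is never reached and chunks end on word boundaries (output omitted, as it may vary slightly between versions):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=40,
    chunk_overlap=0
)

text = "This is a test text for the RecursiveCharacterTextSplitter. It is a long text with many words."
# Chunks now break on whitespace instead of in the middle of words
print(splitter.split_text(text))
```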
The ability of `RecursiveCharacterTextSplitter` to split documents recursively with multiple delimiters gives it a wide range of applications, including splitting code files in various programming languages.
LangChain lists the languages it supports for splitting in `langchain_text_splitters.Language`.
```python
from langchain_text_splitters import Language

print([e.value for e in Language])
"""
['cpp', 'go', 'java', 'kotlin', 'js', 'ts', 'php', 'proto', 'python', 'rst', 'ruby', 'rust', 'scala', 'swift', 'markdown', 'latex', 'html', 'sol', 'csharp', 'cobol', 'c', 'lua', 'perl']
"""
```
LangChain has predefined different delimiter lists for various languages.
```python
from langchain_text_splitters import (
    Language,
    RecursiveCharacterTextSplitter,
)

print(RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON))
"""
['\nclass ', '\ndef ', '\n\tdef ', '\n\n', '\n', ' ', '']
"""
```
We can create a language-specific splitter with `RecursiveCharacterTextSplitter.from_language`, which retrieves the delimiter list for the specified language and instantiates a `RecursiveCharacterTextSplitter`.
```python
class RecursiveCharacterTextSplitter(TextSplitter):
    @classmethod
    def from_language(
        cls, language: Language, **kwargs: Any
    ) -> RecursiveCharacterTextSplitter:
        separators = cls.get_separators_for_language(language)
        return cls(separators=separators, is_separator_regex=True, **kwargs)
```
Let’s look at the segmentation effect for Python code:
```python
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

code = """
def hello_world():
    print("Hello World!")

if __name__ == '__main__':
    hello_world()
"""

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
python_chunks = python_splitter.split_text(code)
print(python_chunks)
"""
['def hello_world():\n    print("Hello World!")', "if __name__ == '__main__':\n    hello_world()"]
"""
```
This example shows how `RecursiveCharacterTextSplitter` splits Python code into manageable chunks while respecting the structure of the language.
Semantic Segmentation
The splitting strategies above share a notable drawback: they do not consider the semantic completeness and coherence of the text. While `NLTKTextSplitter` and `SpacyTextSplitter` can accurately segment documents into sentences, they do so without regard to semantic relationships. For example, consider the text “Xiaoming likes Xiaohong. He also likes Xiaoqing.” Both `NLTKTextSplitter` and `SpacyTextSplitter` will split this into two document chunks because of the period, despite the strong semantic connection between the sentences; treating them as a single chunk would be more reasonable.
Other splitting strategies simply divide the text based on predetermined delimiters, which further exacerbates this issue.
SemanticChunker
Is there a method that can group semantically related content into a single document chunk? LangChain implements `SemanticChunker`, which first splits the entire text into sentences, then uses an embedding model to compute a vector for each sentence and compares the cosine distances between adjacent sentences to judge their semantic similarity and decide whether to merge them.
Sentence Segmentation
In `SemanticChunker`, the input text is first split into sentences on the delimiters `.`, `?`, and `!`.
It's important to note that this implementation does not consider Chinese punctuation marks. Therefore, special attention must be given to modifying these delimiters when processing Chinese text.
```python
class SemanticChunker(BaseDocumentTransformer):
    def split_text(
        self,
        text: str,
    ) -> List[str]:
        # Splitting the essay on '.', '?', and '!'
        single_sentences_list = re.split(r"(?<=[.?!])\s+", text)
        ...
```
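As a rough illustration of the kind of adjustment needed for Chinese text, here is a standalone regex sketch (not a `SemanticChunker` option; if your version of the class exposes the sentence-splitting regex as a parameter, pass a pattern like this instead):

```python
import re

# Split on both Chinese and English terminal punctuation.
# \s* instead of \s+, because Chinese text has no space after punctuation.
text = "小明喜欢小红。他也喜欢小青！Is that ok? Yes."
sentences = [s for s in re.split(r"(?<=[。！？.?!])\s*", text) if s]
print(sentences)
# ['小明喜欢小红。', '他也喜欢小青！', 'Is that ok?', 'Yes.']
```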
Initial Sentence Merging
`SemanticChunker` merges adjacent sentences to maintain continuity, similar to the `chunk_overlap` discussed earlier. It uses a configurable parameter called `buffer_size` for this, with a default value of `1`. For example, if the first step of segmentation produces chunks A, B, C, and D, a `buffer_size` of `1` combines them into AB, ABC, BCD, and CD. In other words, each sentence chunk is combined with the preceding and following `buffer_size` sentence chunks.
```python
def combine_sentences(sentences: List[dict], buffer_size: int = 1) -> List[dict]:
    for i in range(len(sentences)):
        combined_sentence = ""
        for j in range(i - buffer_size, i):
            if j >= 0:
                combined_sentence += sentences[j]["sentence"] + " "
        combined_sentence += sentences[i]["sentence"]
        for j in range(i + 1, i + 1 + buffer_size):
            if j < len(sentences):
                combined_sentence += " " + sentences[j]["sentence"]
        sentences[i]["combined_sentence"] = combined_sentence
    return sentences
```
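A quick illustration of what this produces with `buffer_size=1`, using the `combine_sentences` helper above on some made-up sentences:

```python
sentences = [{"sentence": s} for s in ["A.", "B.", "C.", "D."]]
combined = combine_sentences(sentences, buffer_size=1)
print([s["combined_sentence"] for s in combined])
# ['A. B.', 'A. B. C.', 'B. C. D.', 'C. D.']
```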
Calculating Sentence Embeddings
Next, the embeddings for each chunk are computed:
```python
embeddings = self.embeddings.embed_documents(
    [x["combined_sentence"] for x in sentences]
)

for i, sentence in enumerate(sentences):
    sentence["combined_sentence_embedding"] = embeddings[i]
```
Calculating Cosine Distances
The cosine distance between each chunk and the next is calculated and recorded:
```python
def calculate_cosine_distances(sentences: List[dict]) -> Tuple[List[float], List[dict]]:
    """Calculate cosine distances between sentences."""
    distances = []
    for i in range(len(sentences) - 1):
        embedding_current = sentences[i]["combined_sentence_embedding"]
        embedding_next = sentences[i + 1]["combined_sentence_embedding"]
        similarity = cosine_similarity([embedding_current], [embedding_next])[0][0]
        distance = 1 - similarity
        distances.append(distance)
        sentences[i]["distance_to_next"] = distance
    return distances, sentences
```
Now we have an array of cosine distances between the chunks.
```python
print(distances)
"""
[0.08081114249044896, 0.02726339916925502, 0.04722227403602797]
"""
```
Visualizing this data can provide clearer insights into the relationships:
[Graph taken from the implementation article]
Setting Breakpoints
A larger cosine distance indicates that the document chunks are semantically less related. We can set a threshold value to traverse through the sentence chunks, merging them as long as the cosine distance to the next sentence is below the threshold, until it exceeds this limit.
[Graph taken from the implementation article]
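Here is a minimal sketch of that grouping step, assuming we already have the per-pair `distances` from above and a `threshold` (this mirrors the idea, not the library's exact code; the numbers are made up):

```python
sentences = ["S0", "S1", "S2", "S3"]   # sentence chunks
distances = [0.08, 0.03, 0.05]         # distance between each pair of neighbors
threshold = 0.06                       # e.g. produced by one of the methods below

chunks, current = [], [sentences[0]]
for i, distance in enumerate(distances):
    if distance > threshold:
        # Distance too large: close the current chunk and start a new one
        chunks.append(" ".join(current))
        current = []
    current.append(sentences[i + 1])
chunks.append(" ".join(current))

print(chunks)
# ['S0', 'S1 S2 S3']
```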
Threshold Calculation
`SemanticChunker` currently supports three methods for calculating the breakpoint threshold: Percentile, Standard Deviation, and Interquartile.
```python
BREAKPOINT_DEFAULTS: Dict[BreakpointThresholdType, float] = {
    "percentile": 95,
    "standard_deviation": 3,
    "interquartile": 1.5,
}
```
```python
class SemanticChunker(BaseDocumentTransformer):
    def __init__(
        self,
        ...
        breakpoint_threshold_type: BreakpointThresholdType = "percentile",
        breakpoint_threshold_amount: Optional[float] = None,
        ...
    ):
        self.breakpoint_threshold_type = breakpoint_threshold_type
        if breakpoint_threshold_amount is None:
            self.breakpoint_threshold_amount = BREAKPOINT_DEFAULTS[
                breakpoint_threshold_type
            ]
        else:
            self.breakpoint_threshold_amount = breakpoint_threshold_amount

    # Calculate breakpoint threshold
    def _calculate_breakpoint_threshold(self, distances: List[float]) -> float:
        if self.breakpoint_threshold_type == "percentile":
            return cast(
                float,
                np.percentile(distances, self.breakpoint_threshold_amount),
            )
        elif self.breakpoint_threshold_type == "standard_deviation":
            return cast(
                float,
                np.mean(distances)
                + self.breakpoint_threshold_amount * np.std(distances),
            )
        elif self.breakpoint_threshold_type == "interquartile":
            q1, q3 = np.percentile(distances, [25, 75])
            iqr = q3 - q1
            return np.mean(distances) + self.breakpoint_threshold_amount * iqr
```
The default calculation method for `SemanticChunker` is Percentile. You can change it at initialization via `breakpoint_threshold_type`.
Each calculation method has a coefficient, `breakpoint_threshold_amount`, which affects the final threshold value. Here's a brief explanation of each method:
- Percentile: the proportion of observations that fall below a given value. For example, the 50th percentile (the median) means half of the observations fall below it. The default threshold amount is 95, so the threshold is the 95th percentile of the distances.
- Standard Deviation: measures how far values in a dataset deviate from the mean; a higher standard deviation means a wider spread, a lower one a more concentrated distribution. The threshold is the mean of the distances plus `breakpoint_threshold_amount` times their standard deviation.
- Interquartile: divides a dataset into four equal parts, determined by the 25th percentile (q1), the 50th percentile (q2), and the 75th percentile (q3). The interquartile range (IQR) is q3 - q1, which describes the spread of the central 50% of the data. In `SemanticChunker`, the threshold is the mean of the distances plus `breakpoint_threshold_amount` times the IQR.
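As a worked example, here is how each method turns the small `distances` array from earlier into a threshold (values rounded; purely illustrative):

```python
import numpy as np

distances = [0.08081114249044896, 0.02726339916925502, 0.04722227403602797]

# percentile (default): the 95th percentile of the distances
print(np.percentile(distances, 95))                 # ≈ 0.0775

# standard_deviation: mean + 3 * standard deviation
print(np.mean(distances) + 3 * np.std(distances))   # ≈ 0.1181

# interquartile: mean + 1.5 * (q3 - q1)
q1, q3 = np.percentile(distances, [25, 75])
print(np.mean(distances) + 1.5 * (q3 - q1))         # ≈ 0.0919
```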
Practical Example
It's important to note that the effectiveness of this semantic segmentation has not been extensively validated in practical scenarios, so LangChain currently ships `SemanticChunker` in the `langchain_experimental` package.
Let's demonstrate `SemanticChunker` using a passage from the LangChain official website:
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

text = """
LangChain is a framework for developing applications powered by language models. It enables applications that:
Are context-aware: connect a language model to sources of context (prompt instructions, few shot examples, content to ground its response in, etc.)
Reason: rely on a language model to reason (about how to answer based on provided context, what actions to take, etc.)
This framework consists of several parts.
LangChain Libraries: The Python and JavaScript libraries. Contains interfaces and integrations for a myriad of components, a basic run time for combining these components into chains and agents, and off-the-shelf implementations of chains and agents.
LangChain Templates: A collection of easily deployable reference architectures for a wide variety of tasks.
LangServe: A library for deploying LangChain chains as a REST API.
LangSmith: A developer platform that lets you debug, test, evaluate, and monitor chains built on any LLM framework and seamlessly integrates with LangChain.
"""

text_splitter = SemanticChunker(embeddings=OpenAIEmbeddings())
chunks = text_splitter.split_text(text)
print(len(chunks))  # 2
print(chunks)
"""
['\nLangChain is a framework for developing applications powered by language models. It enables applications that:\n\nAre context-aware: connect a language model to sources of context (prompt instructions, few shot examples, content to ground its response in, etc.)\nReason: rely on a language model to reason (about how to answer based on provided context, what actions to take, etc.)\nThis framework consists of several parts. LangChain Libraries: The Python and JavaScript libraries.',
 'Contains interfaces and integrations for a myriad of components, a basic run time for combining these components into chains and agents, and off-the-shelf implementations of chains and agents. LangChain Templates: A collection of easily deployable reference architectures for a wide variety of tasks. LangServe: A library for deploying LangChain chains as a REST API. LangSmith: A developer platform that lets you debug, test, evaluate, and monitor chains built on any LLM framework and seamlessly integrates with LangChain. ']
"""
```
By adjusting `breakpoint_threshold_type` and `breakpoint_threshold_amount` at initialization, you can get different splitting results.
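For instance, a hedged sketch reusing the same `text` and embeddings as above (the specific amount is arbitrary, and the resulting chunks depend on your embedding model):

```python
text_splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings(),
    breakpoint_threshold_type="standard_deviation",
    breakpoint_threshold_amount=1.0,
)
print(len(text_splitter.split_text(text)))
```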
Considerations for Splitting Strategies
There is no one-size-fits-all solution for splitting strategies; different strategies are suitable for various scenarios. It’s crucial to choose the appropriate strategy based on your specific business context. When considering splitting strategies, several aspects should be evaluated. Here are some key points to keep in mind:
- What is the type of content in the original document? Is it a book, an article, or a chat message?
- Which embedding model is being used? Different models perform differently with different chunk sizes.
- What do the expected user queries look like? Are they short phrases or complex long sentences?
- What are the token limits of the LLM being used, and how does it perform at different token counts?
Addressing these questions can help us balance performance and accuracy in selecting an appropriate splitting strategy. It’s important to note that determining a splitting strategy is also an iterative process that requires continuous adjustments and testing to achieve the desired results.
Conclusion
In LangChain, document splitting is a critical step that influences the retrieval efficiency and accuracy of RAG applications. By employing a suitable splitting strategy, we can maintain the semantic relevance of document chunks while ensuring they remain sufficiently small, thus enhancing overall application performance.
Today, we discussed several common splitters implemented in LangChain. It’s essential not only to understand how to use them but also to comprehend their underlying principles. This understanding will provide greater confidence when choosing or developing our own document splitters.
Formulating a splitting strategy requires a comprehensive consideration of various factors, including the characteristics of the document content, the embedding model being used, user query habits, and the limitations of the LLM model. This ensures that the final chosen strategy meets the application's needs. Additionally, the selection and adjustment of document splitting strategies are iterative processes that necessitate ongoing testing and optimization to achieve the best results.