GraphRAG

chunk

Chunk a piece of text into smaller pieces.

Usage

verb: text_chunk
args:
    column: <column name> # The name of the column containing the text to chunk. This can be either a column of text or a column of list[tuple[doc_id, str]]
    to: <column name> # The name of the column to output the chunks to
    strategy: <strategy config> # The strategy to use to chunk the text, see below for more details
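
The two accepted shapes of the input column can be illustrated with a small sketch. This is not the verb's actual implementation: the `normalize_cell` helper, the type aliases, and the choice of `None` as the document id for plain text are all assumptions made for illustration.

```python
from typing import Optional, Union

# Hypothetical type aliases for this sketch; not taken from the GraphRAG source.
DocText = tuple[Optional[str], str]
CellValue = Union[str, list[tuple[str, str]]]


def normalize_cell(value: CellValue) -> list[DocText]:
    """Return a list of (doc_id, text) pairs for either accepted input shape."""
    if isinstance(value, str):
        # Plain text carries no document id, so None is used here (an assumption).
        return [(None, value)]
    # Already a list of (doc_id, text) pairs; copy into plain tuples.
    return [(doc_id, text) for doc_id, text in value]
```

Normalizing both shapes up front would let a chunking strategy work on a single input form, whichever kind of column it is given.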

Strategies

The text_chunk verb delegates the chunking to a strategy. The strategy is a config object whose type field selects the chunking method. The following strategies are available:

tokens

This strategy uses the [tokens] library to chunk a piece of text by token count.

Note: In the future, this will likely be renamed to something more generic, like "openai_tokens".

The strategy config is as follows:

strategy:
    type: tokens
    chunk_size: 1000 # Optional. The chunk size to use. Default: 1000
    chunk_overlap: 300 # Optional. The chunk overlap to use. Default: 300
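
How chunk_size and chunk_overlap interact can be sketched as a sliding window over a token sequence. This is a minimal illustration, not the strategy's actual code: the function names are hypothetical, whitespace splitting stands in for a real tokenizer, and the handling of the final short chunk is an assumption.

```python
def chunk_tokens(tokens: list, chunk_size: int = 1000, chunk_overlap: int = 300) -> list[list]:
    """Split a token sequence into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    # Each new window starts this many tokens after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the sequence.
        if start + chunk_size >= len(tokens):
            break
    return chunks


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 300) -> list[str]:
    """Chunk text using whitespace-separated words as stand-in tokens (an assumption)."""
    tokens = text.split()
    return [" ".join(chunk) for chunk in chunk_tokens(tokens, chunk_size, chunk_overlap)]
```

With the defaults, each window would start 700 tokens after the previous one, so consecutive chunks share 300 tokens. For example, `chunk_text("a b c d e f", chunk_size=4, chunk_overlap=2)` returns `["a b c d", "c d e f"]`.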

sentence

This strategy uses the nltk library to chunk a piece of text into sentences. The strategy config is as follows:

strategy:
    type: sentence
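
The effect of sentence chunking can be approximated with a short sketch. This is a rough regex-based stand-in, not nltk's tokenizer and not the strategy's actual code: the `split_sentences` name is hypothetical, and the pattern will mis-split cases such as abbreviations ("Dr. Smith") that a trained sentence tokenizer is designed to handle.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split text after '.', '!' or '?' when followed by whitespace (a crude approximation)."""
    # The lookbehind keeps the terminating punctuation attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop the empty string that re.split returns for empty input.
    return [part for part in parts if part]
```

For example, `split_sentences("Hello world. How are you? Fine!")` returns `["Hello world.", "How are you?", "Fine!"]`, with each sentence becoming one chunk.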

Code

text_chunk.py