chunk
Chunk a piece of text into smaller pieces.
Usage
verb: text_chunk
args:
    column: <column name> # The name of the column containing the text to chunk. This can be either a column of text or a column of list[tuple[doc_id, str]]
    to: <column name> # The name of the column to output the chunks to
    strategy: <strategy config> # The strategy to use to chunk the text, see below for more details
Strategies
The text_chunk verb delegates the chunking to a strategy. The strategy is configured as an object whose type field selects which strategy to use. The following strategies are available:
tokens
This strategy uses the [tokens] library to chunk a piece of text.
Note: In the future, this strategy will likely be renamed to something more generic, like "openai_tokens".
The strategy config is as follows:
strategy:
    type: tokens
    chunk_size: 1000 # Optional. The chunk size to use, default: 1000
    chunk_overlap: 300 # Optional. The chunk overlap to use, default: 300
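To illustrate how chunk_size and chunk_overlap interact, here is a minimal sketch of overlapping-window chunking. It is not the verb's actual implementation: the function names chunk_tokens and chunk_text are hypothetical, and whitespace splitting is used as a stand-in for whatever tokenizer the strategy really uses.

```python
def chunk_tokens(tokens, chunk_size=1000, chunk_overlap=300):
    """Split a token sequence into windows of chunk_size tokens,
    where consecutive windows share chunk_overlap tokens.

    Hypothetical helper for illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    # Each new window starts this many tokens after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input,
        # so no trailing chunk made only of overlap is emitted.
        if start + chunk_size >= len(tokens):
            break
    return chunks


def chunk_text(text, chunk_size=1000, chunk_overlap=300):
    """Chunk a string, using whitespace-separated words as stand-in tokens
    (an assumption; the real strategy's tokenizer may differ)."""
    words = text.split()
    return [" ".join(c) for c in chunk_tokens(words, chunk_size, chunk_overlap)]
```

In this sketch, the default values make each chunk start 700 tokens after the previous one (1000 - 300), so the last 300 tokens of one chunk are repeated at the start of the next.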
sentence
This strategy uses the nltk library to chunk a piece of text into sentences. The strategy config is as follows:
strategy:
    type: sentence
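As a rough illustration of what sentence-level chunking produces, the sketch below splits text after sentence-ending punctuation. It deliberately uses a naive regular expression as a stand-in, not nltk: the function name chunk_sentences is hypothetical, and nltk's sentence tokenizer is designed to handle cases, such as abbreviations, that this simple rule would split incorrectly.

```python
import re

# Split on whitespace that follows ".", "!" or "?".
# This is a simplification, not the rule nltk applies.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def chunk_sentences(text):
    """Return the sentences of text as a list of strings, one chunk each.

    Hypothetical helper for illustration only.
    """
    return [s for s in _SENTENCE_END.split(text.strip()) if s]
```

Unlike the tokens strategy, each chunk here is one sentence, so chunk lengths vary with the text and there is no overlap between chunks.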