# Default Configuration Mode (using YAML/JSON)

The default configuration mode may be configured by using a `settings.yml` or `settings.json` file in the data project root. If a `.env` file is present alongside this config file, it will be loaded, and the environment variables defined therein will be available for token replacement in your configuration document using `${ENV_VAR}` syntax. We initialize with YAML by default in `graphrag init`, but you may use the equivalent JSON form if preferred.

Many of these config values have defaults. Rather than replicate them here, please refer to the [constants in the code](https://github.com/microsoft/graphrag/blob/main/graphrag/config/defaults.py) directly.

For example:

```
# .env
GRAPHRAG_API_KEY=some_api_key

# settings.yml
llm:
  api_key: ${GRAPHRAG_API_KEY}
```

# Config Sections

## Language Model Setup

### models

This is a dict of model configurations. The dict key is used to reference this configuration elsewhere when a model instance is desired. In this way, you can specify as many different models as you need, and reference them individually in the workflow steps.

For example:

```yml
models:
  default_chat_model:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_chat
    model: gpt-4o
    model_supports_json: true
  default_embedding_model:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding
    model: text-embedding-ada-002
```


#### Fields

- `api_key` **str** - The OpenAI API key to use.
- `auth_type` **api_key|azure_managed_identity** - Indicates how you want to authenticate requests.
- `type` **openai_chat|azure_openai_chat|openai_embedding|azure_openai_embedding|mock_chat|mock_embeddings** - The type of LLM to use.
- `model` **str** - The model name.
- `encoding_model` **str** - The text encoding model to use. The default is to use the encoding model aligned with the language model (i.e., it is retrieved from tiktoken if unset).
- `api_base` **str** - The API base URL to use.
- `api_version` **str** - The API version.
- `deployment_name` **str** - The deployment name to use (Azure).
- `organization` **str** - The client organization.
- `proxy` **str** - The proxy URL to use.
- `audience` **str** - (Azure OpenAI only) The URI of the target Azure resource/service for which a managed identity token is requested. Used if `api_key` is not defined. Default=`https://cognitiveservices.azure.com/.default`
- `model_supports_json` **bool** - Whether the model supports JSON-mode output.
- `request_timeout` **float** - The per-request timeout.
- `tokens_per_minute` **int** - Set a leaky-bucket throttle on tokens per minute.
- `requests_per_minute` **int** - Set a leaky-bucket throttle on requests per minute.
- `retry_strategy` **str** - The retry strategy to use. "native" is the default and uses the strategy built into the OpenAI SDK. Other allowable values include "exponential_backoff", "random_wait", and "incremental_wait".
- `max_retries` **int** - The maximum number of retries to use.
- `max_retry_wait` **float** - The maximum backoff time.
- `concurrent_requests` **int** - The number of open requests to allow at once.
- `async_mode` **asyncio|threaded** - The async mode to use.
- `responses` **list[str]** - If this model type is mock, this is a list of response strings to return.
- `n` **int** - The number of completions to generate.
- `max_tokens` **int** - The maximum number of output tokens. Not valid for o-series models.
- `temperature` **float** - The temperature to use. Not valid for o-series models.
- `top_p` **float** - The top-p value to use. Not valid for o-series models.
- `frequency_penalty` **float** - The frequency penalty for token generation. Not valid for o-series models.
- `presence_penalty` **float** - The presence penalty for token generation. Not valid for o-series models.
- `max_completion_tokens` **int** - The maximum number of tokens to consume for a chat completion. Must be large enough to include an unknown amount of "reasoning" by the model. o-series models only.
- `reasoning_effort` **low|medium|high** - The amount of "thought" for the model to expend reasoning about a response. o-series models only.

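
For Azure OpenAI, a chat model entry might look like the following minimal sketch; the endpoint, API version, deployment name, and throttle values are illustrative placeholders rather than defaults.

```yml
models:
  default_chat_model:
    type: azure_openai_chat
    auth_type: azure_managed_identity  # or api_key with api_key set
    api_base: https://<your-resource>.openai.azure.com  # placeholder endpoint
    api_version: "2024-02-15-preview"                   # illustrative API version
    deployment_name: gpt-4o-deployment                  # illustrative deployment name
    model: gpt-4o
    model_supports_json: true
    concurrent_requests: 25        # illustrative throttling values
    tokens_per_minute: 50000
    requests_per_minute: 500
```
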
## Input Files and Chunking

### input

Our pipeline can ingest .csv, .txt, or .json data from an input location. See the [inputs page](../index/inputs.md) for more details and examples.

#### Fields

- `storage` **StorageConfig**
  - `type` **file|blob|cosmosdb** - The storage type to use. Default=`file`
  - `base_dir` **str** - The base directory to read input from, relative to the root.
  - `connection_string` **str** - (blob/cosmosdb only) The Azure Storage connection string.
  - `container_name` **str** - (blob/cosmosdb only) The Azure Storage container name.
  - `storage_account_blob_url` **str** - (blob only) The storage account blob URL to use.
  - `cosmosdb_account_blob_url` **str** - (cosmosdb only) The CosmosDB account blob URL to use.
- `file_type` **text|csv|json** - The type of input data to load. Default is `text`
- `encoding` **str** - The encoding of the input file. Default is `utf-8`
- `file_pattern` **str** - A regex to match input files. Default is `.*\.csv$`, `.*\.txt$`, or `.*\.json$` depending on the specified `file_type`, but you can customize it if needed.
- `file_filter` **dict** - Key/value pairs to filter. Default is None.
- `text_column` **str** - (CSV/JSON only) The text column name. If unset, we expect a column named `text`.
- `title_column` **str** - (CSV/JSON only) The title column name; the filename will be used if unset.
- `metadata` **list[str]** - (CSV/JSON only) Additional document attribute fields to keep.

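
As an illustration, a CSV input read from a local `input` folder might be configured with a sketch like this; the column and metadata names are examples, not defaults.

```yml
input:
  storage:
    type: file
    base_dir: "input"        # read documents from <root>/input
  file_type: csv
  file_pattern: '.*\.csv$'
  text_column: "body"        # illustrative column names
  title_column: "headline"
  metadata: ["author", "date"]
```
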
### chunks

These settings configure how we parse documents into text chunks. This is necessary because very large documents may not fit into a single context window, and chunk size can be used to modulate graph extraction accuracy. Also note the `metadata` setting in the input document config, which will replicate document metadata into each chunk.

#### Fields

- `size` **int** - The max chunk size in tokens.
- `overlap` **int** - The chunk overlap in tokens.
- `group_by_columns` **list[str]** - Group documents by these fields before chunking.
- `strategy` **tokens|sentences** - How to chunk the text.
- `encoding_model` **str** - The text encoding model to use for splitting on token boundaries.
- `prepend_metadata` **bool** - Determines if metadata values should be added at the beginning of each chunk. Default=`False`.
- `chunk_size_includes_metadata` **bool** - Specifies whether the chunk size calculation should include metadata tokens. Default=`False`.

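
For example, a token-based chunking setup might look like the following sketch; the sizes shown are illustrative, not the shipped defaults.

```yml
chunks:
  size: 1200            # max tokens per chunk (illustrative)
  overlap: 100          # token overlap between chunks (illustrative)
  strategy: tokens
  prepend_metadata: false
```
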
## Outputs and Storage

### output

This section controls the storage mechanism used by the pipeline to export output tables.

#### Fields

- `type` **file|memory|blob|cosmosdb** - The storage type to use. Default=`file`
- `base_dir` **str** - The base directory to write output artifacts to, relative to the root.
- `connection_string` **str** - (blob/cosmosdb only) The Azure Storage connection string.
- `container_name` **str** - (blob/cosmosdb only) The Azure Storage container name.
- `storage_account_blob_url` **str** - (blob only) The storage account blob URL to use.
- `cosmosdb_account_blob_url` **str** - (cosmosdb only) The CosmosDB account blob URL to use.

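
For instance, redirecting output from the default local files to Azure Blob Storage might look like this sketch; the container name and account URL are placeholders.

```yml
output:
  type: blob
  base_dir: "output"
  container_name: my-graphrag-output                                  # placeholder
  storage_account_blob_url: https://<account>.blob.core.windows.net   # placeholder
```
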
### update_index_output

This section defines a secondary storage location used when running incremental indexing, so your original outputs are preserved.

#### Fields

- `type` **file|memory|blob|cosmosdb** - The storage type to use. Default=`file`
- `base_dir` **str** - The base directory to write output artifacts to, relative to the root.
- `connection_string` **str** - (blob/cosmosdb only) The Azure Storage connection string.
- `container_name` **str** - (blob/cosmosdb only) The Azure Storage container name.
- `storage_account_blob_url` **str** - (blob only) The storage account blob URL to use.
- `cosmosdb_account_blob_url` **str** - (cosmosdb only) The CosmosDB account blob URL to use.

### cache

This section controls the cache mechanism used by the pipeline. This is used to cache LLM invocation results for faster performance when re-running the indexing process.

#### Fields

- `type` **file|memory|blob|cosmosdb** - The storage type to use. Default=`file`
- `base_dir` **str** - The base directory to write the cache to, relative to the root.
- `connection_string` **str** - (blob/cosmosdb only) The Azure Storage connection string.
- `container_name` **str** - (blob/cosmosdb only) The Azure Storage container name.
- `storage_account_blob_url` **str** - (blob only) The storage account blob URL to use.
- `cosmosdb_account_blob_url` **str** - (cosmosdb only) The CosmosDB account blob URL to use.

### reporting

This section controls the reporting mechanism used by the pipeline, for common events and error messages. The default is to write reports to a file in the output directory. However, you can also choose to write reports to an Azure Blob Storage container.

#### Fields

- `type` **file|blob** - The reporting type to use. Default=`file`
- `base_dir` **str** - The base directory to write reports to, relative to the root.
- `connection_string` **str** - (blob only) The Azure Storage connection string.
- `container_name` **str** - (blob only) The Azure Storage container name.
- `storage_account_blob_url` **str** - The storage account blob URL to use.

### vector_store

Where to put all vectors for the system. Configured for lancedb by default. This is a dict, with the key used to identify individual store parameters (e.g., for text embedding).

#### Fields

- `type` **lancedb|azure_ai_search|cosmosdb** - Type of vector store. Default=`lancedb`
- `db_uri` **str** (only for lancedb) - The database URI. Default=`storage.base_dir/lancedb`
- `url` **str** (only for AI Search) - The AI Search endpoint.
- `api_key` **str** (optional - only for AI Search) - The AI Search API key to use.
- `audience` **str** (only for AI Search) - Audience for the managed identity token if managed identity authentication is used.
- `container_name` **str** - The name of a vector container. This stores all indexes (tables) for a given dataset ingest. Default=`default`
- `database_name` **str** - (cosmosdb only) Name of the database.
- `overwrite` **bool** (only used at index creation time) - Overwrite the collection if it exists. Default=`True`

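
A minimal lancedb sketch is shown below; the dict key is illustrative and must match the `vector_store_id` referenced by `embed_text`, and the `db_uri` path is only an example.

```yml
vector_store:
  default_vector_store:        # dict key referenced by embed_text.vector_store_id (illustrative)
    type: lancedb
    db_uri: output/lancedb     # illustrative path
    container_name: default
    overwrite: true
```
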
## Workflow Configurations

These settings control each individual workflow as it executes.

### workflows

**list[str]** - This is a list of workflow names to run, in order. GraphRAG has built-in pipelines to configure this, but you can run exactly and only what you want by specifying the list here. This is useful if you have done part of the processing yourself.

### embed_text

By default, the GraphRAG indexer will only export the embeddings required for our query methods. However, the model has embeddings defined for all plaintext fields, and these can be customized by setting the `target` and `names` fields.

Supported embedding names are:

- `text_unit.text`
- `document.text`
- `entity.title`
- `entity.description`
- `relationship.description`
- `community.title`
- `community.summary`
- `community.full_content`

#### Fields

- `model_id` **str** - Name of the model definition to use for text embedding.
- `vector_store_id` **str** - Name of the vector store definition to write to.
- `batch_size` **int** - The maximum batch size to use.
- `batch_max_tokens` **int** - The maximum number of tokens per batch.
- `names` **list[str]** - List of the embedding names to run (must be in the supported list).

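
For example, to embed fields beyond the query defaults, a sketch might look like this; the extra names must come from the supported list above, and the ids assume the model and vector store keys defined earlier.

```yml
embed_text:
  model_id: default_embedding_model    # key from the models dict (illustrative)
  vector_store_id: default_vector_store
  names:
    - text_unit.text
    - entity.description
    - community.full_content
```
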
### extract_graph

Tune the language model-based graph extraction process.

#### Fields

- `model_id` **str** - Name of the model definition to use for API calls.
- `prompt` **str** - The prompt file to use.
- `entity_types` **list[str]** - The entity types to identify.
- `max_gleanings` **int** - The maximum number of gleaning cycles to use.

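
A sketch of a tuned extraction configuration; the prompt path and entity types are illustrative choices, not prescribed values.

```yml
extract_graph:
  model_id: default_chat_model
  prompt: "prompts/extract_graph.txt"   # illustrative prompt path
  entity_types: [organization, person, geo, event]
  max_gleanings: 1
```
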
### summarize_descriptions

#### Fields

- `model_id` **str** - Name of the model definition to use for API calls.
- `prompt` **str** - The prompt file to use.
- `max_length` **int** - The maximum number of output tokens per summarization.
- `max_input_length` **int** - The maximum number of tokens to collect for summarization (this will limit how many descriptions you send to be summarized for a given entity or relationship).

### extract_graph_nlp

Defines settings for NLP-based graph extraction methods.

#### Fields

- `normalize_edge_weights` **bool** - Whether to normalize the edge weights during graph construction. Default=`True`.
- `text_analyzer` **dict** - Parameters for the NLP model.
  - `extractor_type` **regex_english|syntactic_parser|cfg** - Default=`regex_english`.
  - `model_name` **str** - Name of the NLP model (for SpaCy-based models).
  - `max_word_length` **int** - Longest word to allow. Default=`15`.
  - `word_delimiter` **str** - Delimiter to split words. Default=' '.
  - `include_named_entities` **bool** - Whether to include named entities in noun phrases. Default=`True`.
  - `exclude_nouns` **list[str] | None** - List of nouns to exclude. If `None`, we use an internal stopword list.
  - `exclude_entity_tags` **list[str]** - List of entity tags to ignore.
  - `exclude_pos_tags` **list[str]** - List of part-of-speech tags to ignore.
  - `noun_phrase_tags` **list[str]** - List of noun phrase tags to ignore.
  - `noun_phrase_grammars` **dict[str, str]** - Noun phrase grammars for the model (cfg-only).

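
A minimal sketch using the regex-based extractor; the values shown are illustrative.

```yml
extract_graph_nlp:
  normalize_edge_weights: true
  text_analyzer:
    extractor_type: regex_english
    max_word_length: 15          # illustrative
    include_named_entities: true
```
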
### prune_graph

Parameters for manual graph pruning. This can be used to optimize the modularity of your graph clusters, by removing overly-connected or rare nodes.

#### Fields

- `min_node_freq` **int** - The minimum node frequency to allow.
- `max_node_freq_std` **float | None** - The maximum standard deviation of node frequency to allow.
- `min_node_degree` **int** - The minimum node degree to allow.
- `max_node_degree_std` **float | None** - The maximum standard deviation of node degree to allow.
- `min_edge_weight_pct` **float** - The minimum edge weight percentile to allow.
- `remove_ego_nodes` **bool** - Remove ego nodes.
- `lcc_only` **bool** - Only use the largest connected component.

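
For example, a conservative pruning pass might look like the following sketch; the thresholds are illustrative and depend on your corpus.

```yml
prune_graph:
  min_node_freq: 2          # drop nodes mentioned only once (illustrative)
  min_node_degree: 1
  min_edge_weight_pct: 40   # keep edges above the 40th weight percentile (illustrative)
  remove_ego_nodes: false
  lcc_only: false
```
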
### cluster_graph

These are the settings used for Leiden hierarchical clustering of the graph to create communities.

#### Fields

- `max_cluster_size` **int** - The maximum cluster size to export.
- `use_lcc` **bool** - Whether to only use the largest connected component.
- `seed` **int** - A randomization seed to provide if consistent run-to-run results are desired. We do provide a default in order to guarantee clustering stability.

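
A sketch of the clustering settings; the cluster size and seed shown here are illustrative.

```yml
cluster_graph:
  max_cluster_size: 10   # illustrative
  use_lcc: true
  seed: 42               # fix the seed for reproducible communities (illustrative)
```
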
### extract_claims

#### Fields

- `enabled` **bool** - Whether to enable claim extraction. Off by default, because claim prompts really need user tuning.
- `model_id` **str** - Name of the model definition to use for API calls.
- `prompt` **str** - The prompt file to use.
- `description` **str** - Describes the types of claims we want to extract.
- `max_gleanings` **int** - The maximum number of gleaning cycles to use.

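
A sketch for enabling claim extraction; the description string and prompt path are illustrative and should be tuned to your domain.

```yml
extract_claims:
  enabled: true
  model_id: default_chat_model
  prompt: "prompts/extract_claims.txt"   # illustrative prompt path
  description: "Any claims or facts that could be relevant to information discovery."  # illustrative
  max_gleanings: 1
```
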
### community_reports

#### Fields

- `model_id` **str** - Name of the model definition to use for API calls.
- `prompt` **str** - The prompt file to use.
- `max_length` **int** - The maximum number of output tokens per report.
- `max_input_length` **int** - The maximum number of input tokens to use when generating reports.

### embed_graph

We use node2vec to embed the graph. This is primarily used for visualization, so it is not turned on by default.

#### Fields

- `enabled` **bool** - Whether to enable graph embeddings.
- `dimensions` **int** - Number of vector dimensions to produce.
- `num_walks` **int** - The node2vec number of walks.
- `walk_length` **int** - The node2vec walk length.
- `window_size` **int** - The node2vec window size.
- `iterations` **int** - The node2vec number of iterations.
- `random_seed` **int** - The node2vec random seed.
- `strategy` **dict** - Fully override the embed graph strategy.

### umap

Indicates whether we should run UMAP dimensionality reduction. This is used to provide an x/y coordinate to each graph node, suitable for visualization. If this is not enabled, nodes will receive a 0/0 x/y coordinate. If this is enabled, you *must* enable graph embedding as well.

#### Fields

- `enabled` **bool** - Whether to enable UMAP layouts.

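
Because UMAP layouts require node embeddings, the two settings are typically enabled together, as in this sketch; the node2vec parameters are illustrative.

```yml
embed_graph:
  enabled: true
  dimensions: 1536   # illustrative node2vec parameters
  num_walks: 10
  walk_length: 40
  window_size: 2
  iterations: 3
  random_seed: 86
umap:
  enabled: true
```
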
### snapshots

#### Fields

- `embeddings` **bool** - Export embeddings snapshots to parquet.
- `graphml` **bool** - Export graph snapshots to GraphML.

## Query

### local_search

#### Fields

- `chat_model_id` **str** - Name of the model definition to use for Chat Completion calls.
- `embedding_model_id` **str** - Name of the model definition to use for Embedding calls.
- `prompt` **str** - The prompt file to use.
- `text_unit_prop` **float** - The text unit proportion.
- `community_prop` **float** - The community proportion.
- `conversation_history_max_turns` **int** - The maximum number of conversation history turns.
- `top_k_entities` **int** - The top k mapped entities.
- `top_k_relationships` **int** - The top k mapped relationships.
- `max_context_tokens` **int** - The maximum number of tokens to use when building the request context.

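
A sketch of a local search configuration; the proportions and limits are illustrative, not the shipped defaults.

```yml
local_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  text_unit_prop: 0.5        # illustrative proportions
  community_prop: 0.1
  top_k_entities: 10
  top_k_relationships: 10
  max_context_tokens: 12000  # illustrative limit
```
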
### global_search

#### Fields

- `chat_model_id` **str** - Name of the model definition to use for Chat Completion calls.
- `map_prompt` **str** - The global search mapper prompt file to use.
- `reduce_prompt` **str** - The global search reducer prompt file to use.
- `knowledge_prompt` **str** - The global search general knowledge prompt file to use.
- `max_context_tokens` **int** - The maximum context size to create, in tokens.
- `data_max_tokens` **int** - The maximum number of tokens to use when constructing the final response from the reduced responses.
- `map_max_length` **int** - The maximum length to request for map responses, in words.
- `reduce_max_length` **int** - The maximum length to request for reduce responses, in words.
- `dynamic_search_threshold` **int** - Rating threshold to include a community report.
- `dynamic_search_keep_parent` **bool** - Keep the parent community if any of the child communities are relevant.
- `dynamic_search_num_repeats` **int** - Number of times to rate the same community report.
- `dynamic_search_use_summary` **bool** - Use the community summary instead of the full content.
- `dynamic_search_max_level` **int** - The maximum level of the community hierarchy to consider if none of the processed communities are relevant.

### drift_search

#### Fields

- `chat_model_id` **str** - Name of the model definition to use for Chat Completion calls.
- `embedding_model_id` **str** - Name of the model definition to use for Embedding calls.
- `prompt` **str** - The prompt file to use.
- `reduce_prompt` **str** - The reducer prompt file to use.
- `data_max_tokens` **int** - The maximum number of data tokens for the LLM context.
- `reduce_max_tokens` **int** - The maximum number of tokens for the reduce phase. Only used for non-o-series models.
- `reduce_max_completion_tokens` **int** - The maximum number of tokens for the reduce phase. Only used for o-series models.
- `concurrency` **int** - The number of concurrent requests.
- `drift_k_followups` **int** - The number of top global results to retrieve.
- `primer_folds` **int** - The number of folds for search priming.
- `primer_llm_max_tokens` **int** - The maximum number of tokens for the LLM in the primer.
- `n_depth` **int** - The number of drift search steps to take.
- `local_search_text_unit_prop` **float** - The proportion of search dedicated to text units.
- `local_search_community_prop` **float** - The proportion of search dedicated to community properties.
- `local_search_top_k_mapped_entities` **int** - The number of top K entities to map during local search.
- `local_search_top_k_relationships` **int** - The number of top K relationships to map during local search.
- `local_search_max_data_tokens` **int** - The maximum context size in tokens for local search.
- `local_search_temperature` **float** - The temperature to use for token generation in local search.
- `local_search_top_p` **float** - The top-p value to use for token generation in local search.
- `local_search_n` **int** - The number of completions to generate in local search.
- `local_search_llm_max_gen_tokens` **int** - The maximum number of generated tokens for the LLM in local search. Only used for non-o-series models.
- `local_search_llm_max_gen_completion_tokens` **int** - The maximum number of generated tokens for the LLM in local search. Only used for o-series models.

### basic_search

#### Fields

- `chat_model_id` **str** - Name of the model definition to use for Chat Completion calls.
- `embedding_model_id` **str** - Name of the model definition to use for Embedding calls.
- `prompt` **str** - The prompt file to use.
- `k` **int | None** - Number of text units to retrieve from the vector store for context building.