mirror of
https://github.com/microsoft/graphrag.git
synced 2026-01-14 00:57:23 +08:00
V3 docs and cleanup (#2100)
Some checks are pending
Python CI / python-ci (ubuntu-latest, 3.10) (push) Waiting to run
Python CI / python-ci (ubuntu-latest, 3.11) (push) Waiting to run
Python CI / python-ci (windows-latest, 3.10) (push) Waiting to run
Python CI / python-ci (windows-latest, 3.11) (push) Waiting to run
Python Integration Tests / python-ci (ubuntu-latest, 3.10) (push) Waiting to run
Python Integration Tests / python-ci (windows-latest, 3.10) (push) Waiting to run
Python Notebook Tests / python-ci (ubuntu-latest, 3.10) (push) Waiting to run
Python Notebook Tests / python-ci (windows-latest, 3.10) (push) Waiting to run
Python Smoke Tests / python-ci (ubuntu-latest, 3.10) (push) Waiting to run
Python Smoke Tests / python-ci (windows-latest, 3.10) (push) Waiting to run
* Remove community contrib notebooks
* Add migration notebook and breaking changes page edits
* Update/polish docs
* Make model instance name configurable
* Add vector schema updates to v3 migration notebook
* Spellcheck
* Bump smoke test runtimes
This commit is contained in:
parent
b732445535
commit
5ec49fd39c
@@ -12,6 +12,35 @@ There are five surface areas that may be impacted on any given release. They are

> TL;DR: Always run `graphrag init --path [path] --force` between minor version bumps to ensure you have the latest config format. Run the provided migration notebook between major version bumps if you want to avoid re-indexing prior datasets. Note that this will overwrite your configuration and prompts, so back up if necessary.

# v3

Run the [migration notebook](./docs/examples_notebooks/index_migration_to_v3.ipynb) to convert older tables to the v3 format. Our main goal with v3 was to slim down the core library, minimizing long-term maintenance of features that are either largely unused or have long been out of scope.

## Data Model

We made minimal data model changes for v3. The primary breaking change removes a rarely-used document-grouping capability: the `text_units` table previously had a `document_ids` column holding a list, where it should have been a `document_id` column holding a single entry. v3 fixes that, and the migration notebook applies the change so you don't need to re-index.

Most of the other changes remove fields that are no longer used or are out of scope. For example, we removed the UMAP step that generated x/y coordinates for the entities - new indexes will not produce these columns, but they won't hurt anything if they remain in your existing tables.

## API

We have removed the multi-search variant from each search method in the API.

## Config

We made several changes to the configuration model. The best way forward is to re-run `init`, which we always recommend for minor and major version bumps.

This is a summary of the changes (see the hedged config sketch after this list):

- Removed fnllm as the underlying model manager, so the model types "openai_chat", "azure_openai_chat", "openai_embedding", and "azure_openai_embedding" are all invalid. Use "chat" or "embedding".
- fnllm also had an experimental rate-limiting "auto" setting, which is no longer allowed. Use `null` in your config as a default, or set explicit tpm/rpm limits.
- LiteLLM requires a `model_provider`, so add yours as appropriate. For example, if you previously used "openai_chat" for your model type, this would be "openai", and for "azure_openai_chat" this would be "azure".
- Collapsed the `vector_store` dict into a single root-level object. We no longer support multi-search, and this dict required a lot of downstream complexity for that single use case.
- Removed the `outputs` block, which was also only used for multi-search.
- Most workflows had an undocumented `strategy` config dict that allowed fine-tuning of internal settings. These fine-tunings were never used and carried associated complexity, so we removed them.
- Vector store configuration now allows a custom schema per embedded field. This removes the need for the `container_name` prefix, which caused confusion anyway. Now the default container name is simply the embedded field name - if you need something custom, add the `embeddings_schema` block and populate it as needed.
- We previously supported embedding any text field in the data model. However, we only ever use text_unit_text, entity_description, and community_full_content, so all others have been removed.
- Removed the `umap` and `embed_graph` blocks, which were only used to add x/y fields to the entities. This fixes a long-standing dependency issue with graspologic. If you need x/y positions, see the [visualization guide](https://microsoft.github.io/graphrag/visualization_guide/) for using Gephi.
- Removed file filtering from input document loading. This was essentially unused.
- Removed the groupby ability for text chunking. This was intended to allow short documents to be grouped before chunking, but it was never used and added a lot of complexity to the chunking process.
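
To make the migration concrete, here is a hedged sketch of a minimal post-migration model and vector store configuration. The model names, URI, and custom index name are illustrative assumptions, not shipped defaults:

```yaml
models:
  default_chat_model:
    type: chat                      # was "openai_chat" / "azure_openai_chat"
    model_provider: openai          # required by LiteLLM; "azure" for Azure OpenAI
    model: gpt-4o                   # illustrative model name
    api_key: ${GRAPHRAG_API_KEY}
    tokens_per_minute: null         # "auto" is no longer allowed
    requests_per_minute: null
  default_embedding_model:
    type: embedding                 # was "openai_embedding" / "azure_openai_embedding"
    model_provider: openai
    model: text-embedding-3-large
    api_key: ${GRAPHRAG_API_KEY}
vector_store:                       # now a single root-level object
  type: lancedb
  db_uri: output/lancedb            # illustrative path
  embeddings_schema:                # optional; default container name is the embedded field name
    entity.description:
      index_name: my-entity-index   # hypothetical custom name
```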

# v2

Run the [migration notebook](./docs/examples_notebooks/index_migration_to_v2.ipynb) to convert older tables to the v2 format.
@@ -31,9 +31,9 @@ To use LiteLLM one must

- Set `type` to either `chat` or `embedding`.
- Provide a `model_provider`, e.g., `openai`, `azure`, `gemini`, etc.
- Set the `model` to one supported by the `model_provider`'s API.
- - Provide a `deployment_name` if using `azure` as the `model_provider`.
+ - Provide a `deployment_name` if using `azure` as the `model_provider` and your deployment name differs from the model name.

- See [Detailed Configuration](yaml.md) for more details on configuration. [View LiteLLm basic usage](https://docs.litellm.ai/docs/#basic-usage) for details on how models are called (The `model_provider` is the portion prior to `/` while the `model` is the portion following the `/`).
+ See [Detailed Configuration](yaml.md) for more details on configuration. [View LiteLLM basic usage](https://docs.litellm.ai/docs/#basic-usage) for details on how models are called (the `model_provider` is the portion prior to `/`, while the `model` is the portion following the `/`).
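
For instance, a model entry like the following hedged sketch would cause LiteLLM to call `openai/gpt-4o-mini` (the model name is an illustrative assumption):

```yaml
models:
  default_chat_model:
    type: chat
    model_provider: openai   # the portion before the "/"
    model: gpt-4o-mini       # the portion after the "/"
```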

## Model Selection Considerations
@@ -8,4 +8,3 @@ The default configuration mode is the simplest way to get started with the Graph

- [Init command](init.md) (recommended first step)
- [Edit settings.yaml for deeper control](yaml.md)
- [Purely using environment variables](env_vars.md) (not recommended)
@@ -11,7 +11,7 @@ For example:

GRAPHRAG_API_KEY=some_api_key

# settings.yml
llm:
  default_chat_model:
    api_key: ${GRAPHRAG_API_KEY}
```
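
As a sketch, the same `${VAR}` interpolation should work for other string values in the file as well; the second variable here is hypothetical:

```yaml
llm:
  default_chat_model:
    api_key: ${GRAPHRAG_API_KEY}
    api_base: ${MY_API_BASE}   # hypothetical environment variable
```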
@@ -44,12 +44,12 @@ models:

- `api_key` **str** - The OpenAI API key to use.
- `auth_type` **api_key|azure_managed_identity** - Indicate how you want to authenticate requests.
- `type` **chat|embedding|mock_chat|mock_embeddings** - The type of LLM to use.
- - `model_provider` **str|None** - The model provider to use, e.g., openai, azure, anthropic, etc. Required when `type == chat|embedding`. When `type == chat|embedding`, [LiteLLM](https://docs.litellm.ai/) is used under the hood which has support for calling 100+ models. [View LiteLLm basic usage](https://docs.litellm.ai/docs/#basic-usage) for details on how models are called (The `model_provider` is the portion prior to `/` while the `model` is the portion following the `/`). [View Language Model Selection](models.md) for more details and examples on using LiteLLM.
+ - `model_provider` **str|None** - The model provider to use, e.g., openai, azure, anthropic, etc. [LiteLLM](https://docs.litellm.ai/) is used under the hood, which has support for calling 100+ models. [View LiteLLM basic usage](https://docs.litellm.ai/docs/#basic-usage) for details on how models are called (the `model_provider` is the portion prior to `/`, while the `model` is the portion following the `/`). [View Language Model Selection](models.md) for more details and examples on using LiteLLM.
- `model` **str** - The model name.
- `encoding_model` **str** - The text encoding model to use. Default is to use the encoding model aligned with the language model (i.e., it is retrieved from tiktoken if unset).
- `api_base` **str** - The API base URL to use.
- `api_version` **str** - The API version.
- - `deployment_name` **str** - The deployment name to use (Azure).
+ - `deployment_name` **str** - The deployment name to use if your model is hosted on Azure. Note that if your deployment name on Azure matches the model name, this is unnecessary.
- `organization` **str** - The client organization.
- `proxy` **str** - The proxy URL to use.
- `audience` **str** - (Azure OpenAI only) The URI of the target Azure resource/service for which a managed identity token is requested. Used if `api_key` is not defined. Default=`https://cognitiveservices.azure.com/.default`
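
Putting the Azure-specific fields together, a managed-identity entry might look like this hedged sketch (the endpoint, deployment name, and API version are hypothetical):

```yaml
models:
  default_chat_model:
    type: chat
    model_provider: azure
    model: gpt-4o
    deployment_name: my-gpt-4o-deployment    # only needed if it differs from the model name
    api_base: https://my-instance.openai.azure.com
    api_version: "2024-10-21"
    auth_type: azure_managed_identity        # token requested for the default audience when api_key is unset
```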
@@ -57,7 +57,7 @@ models:

- `request_timeout` **float** - The per-request timeout.
- `tokens_per_minute` **int** - Set a leaky-bucket throttle on tokens per minute.
- `requests_per_minute` **int** - Set a leaky-bucket throttle on requests per minute.
- - `retry_strategy` **str** - Retry strategy to use, "native" is the default and uses the strategy built into the OpenAI SDK. Other allowable values include "exponential_backoff", "random_wait", and "incremental_wait".
+ - `retry_strategy` **str** - Retry strategy to use; "exponential_backoff" is the default. Other allowable values include "native", "random_wait", and "incremental_wait".
- `max_retries` **int** - The maximum number of retries to use.
- `max_retry_wait` **float** - The maximum backoff time.
- `concurrent_requests` **int** - The number of open requests to allow at once.
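
As a hedged sketch, a model entry throttled to an assumed quota might combine these fields as follows (the limits are illustrative, not recommendations):

```yaml
models:
  default_chat_model:
    # leaky-bucket throttles: requests wait until budget is available
    tokens_per_minute: 150000
    requests_per_minute: 500
    retry_strategy: exponential_backoff   # the default
    max_retries: 10
    max_retry_wait: 60.0
    concurrent_requests: 25
```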
@@ -201,7 +201,7 @@ Supported embeddings names are:

#### Fields

- `model_id` **str** - Name of the model definition to use for text embedding.
- `vector_store_id` **str** - Name of the vector store definition to write to.
- `model_instance_name` **str** - Name of the model singleton instance. Default is "text_embedding". This primarily affects the cache storage partitioning.
- `batch_size` **int** - The maximum batch size to use.
- `batch_max_tokens` **int** - The maximum number of tokens per batch.
- `names` **list[str]** - List of the embeddings names to run (must be in the supported list).
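
For reference, a hedged sketch of an `embed_text` block using the shipped defaults shown elsewhere in this diff (batch_size 16, batch_max_tokens 8191); the vector store definition name is hypothetical:

```yaml
embed_text:
  model_id: default_embedding_model
  vector_store_id: default_vector_store   # hypothetical name
  batch_size: 16
  batch_max_tokens: 8191
```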
@@ -213,6 +213,7 @@ Tune the language model-based graph extraction process.

#### Fields

- `model_id` **str** - Name of the model definition to use for API calls.
- `model_instance_name` **str** - Name of the model singleton instance. Default is "extract_graph". This primarily affects the cache storage partitioning.
- `prompt` **str** - The prompt file to use.
- `entity_types` **list[str]** - The entity types to identify.
- `max_gleanings` **int** - The maximum number of gleaning cycles to use.
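
A hedged sketch of these fields in settings.yaml (the prompt path and entity types are illustrative assumptions):

```yaml
extract_graph:
  model_id: default_chat_model
  model_instance_name: extract_graph   # partitions the LLM cache for this workflow
  prompt: prompts/extract_graph.txt
  entity_types: [organization, person, geo, event]
  max_gleanings: 1
```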
@@ -222,6 +223,7 @@ Tune the language model-based graph extraction process.

#### Fields

- `model_id` **str** - Name of the model definition to use for API calls.
- `model_instance_name` **str** - Name of the model singleton instance. Default is "summarize_descriptions". This primarily affects the cache storage partitioning.
- `prompt` **str** - The prompt file to use.
- `max_length` **int** - The maximum number of output tokens per summarization.
- `max_input_length` **int** - The maximum number of tokens to collect for summarization (this limits how many descriptions are sent to be summarized for a given entity or relationship).
@@ -275,6 +277,7 @@ These are the settings used for Leiden hierarchical clustering of the graph to c

- `enabled` **bool** - Whether to enable claim extraction. Off by default, because claim prompts really need user tuning.
- `model_id` **str** - Name of the model definition to use for API calls.
- `model_instance_name` **str** - Name of the model singleton instance. Default is "extract_claims". This primarily affects the cache storage partitioning.
- `prompt` **str** - The prompt file to use.
- `description` **str** - Describes the types of claims we want to extract.
- `max_gleanings` **int** - The maximum number of gleaning cycles to use.
@@ -284,6 +287,7 @@ These are the settings used for Leiden hierarchical clustering of the graph to c

#### Fields

- `model_id` **str** - Name of the model definition to use for API calls.
- `model_instance_name` **str** - Name of the model singleton instance. Default is "community_reporting". This primarily affects the cache storage partitioning.
- `prompt` **str** - The prompt file to use.
- `max_length` **int** - The maximum number of output tokens per report.
- `max_input_length` **int** - The maximum number of input tokens to use when generating reports.
175 docs/examples_notebooks/index_migration_to_v3.ipynb Normal file
@@ -0,0 +1,175 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Copyright (c) 2024 Microsoft Corporation.\n",
"# Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Index Migration (v2 to v3)\n",
"\n",
"This notebook is used to maintain data model parity with older indexes for version 3.0 of GraphRAG. If you have a pre-3.0 index and need to migrate without re-running the entire pipeline, you can use this notebook to update only the pieces necessary for alignment. If you have a pre-2.0 index, please run the v2 migration notebook first!\n",
"\n",
"NOTE: we recommend regenerating your settings.yaml with the latest version of GraphRAG using `graphrag init`. Copy your LLM settings into it before running this notebook. This ensures your config is aligned with the latest version for the migration.\n",
"\n",
"This notebook will also update your settings.yaml to ensure compatibility with our newer vector store collection naming scheme, in order to avoid re-ingesting.\n",
"\n",
"WARNING: This will overwrite your parquet files; you may want to make a backup!"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# This is the directory that has your settings.yaml\n",
"PROJECT_DIRECTORY = \"/Users/naevans/graphrag/working/migration\""
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"\n",
"from graphrag.config.load_config import load_config\n",
"from graphrag.storage.factory import StorageFactory\n",
"\n",
"config = load_config(Path(PROJECT_DIRECTORY))\n",
"storage_config = config.output.model_dump()\n",
"storage = StorageFactory().create_storage(\n",
"    storage_type=storage_config[\"type\"],\n",
"    kwargs=storage_config,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"def remove_columns(df, columns):\n",
"    \"\"\"Remove columns from a DataFrame, suppressing errors.\"\"\"\n",
"    df.drop(labels=columns, axis=1, errors=\"ignore\", inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"from graphrag.utils.storage import (\n",
"    load_table_from_storage,\n",
"    write_table_to_storage,\n",
")\n",
"\n",
"text_units = await load_table_from_storage(\"text_units\", storage)\n",
"\n",
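"# v3 stores a single document_id per text unit; take the first entry of the old\n",
"# document_ids list (it only held multiple ids if document grouping was used)\n",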
"text_units[\"document_id\"] = text_units[\"document_ids\"].apply(lambda ids: ids[0])\n",
|
||||
"remove_columns(text_units, [\"document_ids\"])\n",
|
||||
"\n",
|
||||
"await write_table_to_storage(text_units, \"text_units\", storage)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Update settings.yaml\n",
|
||||
"This next section will attempt to insert index names for each vector index using our new schema structure. It depends on most things being default. If you have already customized your vector store schema it may not be necessary.\n",
|
||||
"\n",
|
||||
"The primary goal is to align v2 indexes using our old default naming schema with the new customizability. If don't need this done or you have a more complicated config, comment it out and update your config manually to ensure each index name is set.\n",
|
||||
"\n",
|
||||
"Old default index names:\n",
|
||||
"- default-text_unit-text\n",
|
||||
"- default-entity-description\n",
|
||||
"- default-community-full_content\n",
|
||||
"\n",
|
||||
"v3 versions are:\n",
|
||||
"- text_unit_text\n",
|
||||
"- entity_description\n",
|
||||
"- community_full_content\n",
|
||||
"\n",
|
||||
"Therefore, with a v2 index we will explicitly set the old index names so it connects correctly.\n",
|
||||
"\n",
|
||||
"NOTE: we are also setting the default vector_size for each index, under the assumption that you are using a prior default with 1536 dimensions. Our new default of text-embedding-3-large has 3072 dimensions, which will be populated as the default if unset. Again, if you have a more complicated situation you may want to manually configure this.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import yaml\n",
|
||||
"\n",
|
||||
"EMBEDDING_DIMENSIONS = 1536\n",
|
||||
"\n",
|
||||
"settings = Path(PROJECT_DIRECTORY) / \"settings.yaml\"\n",
|
||||
"with Path.open(settings) as f:\n",
|
||||
" conf = yaml.safe_load(f)\n",
|
||||
"\n",
|
||||
"vector_store = conf.get(\"vector_store\", {})\n",
|
||||
"container_name = vector_store.get(\"container_name\", \"default\")\n",
|
||||
"embeddings_schema = vector_store.get(\"embeddings_schema\", {})\n",
|
||||
"text_unit_schema = embeddings_schema.get(\"text_unit.text\", {})\n",
|
||||
"if \"index_name\" not in text_unit_schema:\n",
|
||||
" text_unit_schema[\"index_name\"] = f\"{container_name}-text_unit-text\"\n",
|
||||
"if \"vector_size\" not in text_unit_schema:\n",
|
||||
" text_unit_schema[\"vector_size\"] = EMBEDDING_DIMENSIONS\n",
|
||||
"embeddings_schema[\"text_unit.text\"] = text_unit_schema\n",
|
||||
"entity_schema = embeddings_schema.get(\"entity.description\", {})\n",
|
||||
"if \"index_name\" not in entity_schema:\n",
|
||||
" entity_schema[\"index_name\"] = f\"{container_name}-entity-description\"\n",
|
||||
"if \"vector_size\" not in entity_schema:\n",
|
||||
" entity_schema[\"vector_size\"] = EMBEDDING_DIMENSIONS\n",
|
||||
"embeddings_schema[\"entity.description\"] = entity_schema\n",
|
||||
"community_schema = embeddings_schema.get(\"community.full_content\", {})\n",
|
||||
"if \"index_name\" not in community_schema:\n",
|
||||
" community_schema[\"index_name\"] = f\"{container_name}-community-full_content\"\n",
|
||||
"if \"vector_size\" not in community_schema:\n",
|
||||
" community_schema[\"vector_size\"] = EMBEDDING_DIMENSIONS\n",
|
||||
"embeddings_schema[\"community.full_content\"] = community_schema\n",
|
||||
"vector_store[\"embeddings_schema\"] = embeddings_schema\n",
|
||||
"conf[\"vector_store\"] = vector_store\n",
|
||||
"\n",
|
||||
"with Path.open(settings, \"w\") as f:\n",
|
||||
" yaml.safe_dump(conf, f)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "graphrag",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.12.10"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
|
||||
@@ -1,5 +0,0 @@

## Disclaimer

This folder contains community contributed notebooks that are not officially supported by the GraphRAG team. The notebooks are provided as-is and are not guaranteed to work with the latest version of GraphRAG. If you have any questions or issues, please reach out to the author of the notebook directly.

For more information on how to contribute to the GraphRAG project, please refer to the [contribution guidelines](https://github.com/microsoft/graphrag/blob/main/CONTRIBUTING.md)
File diff suppressed because it is too large
@@ -1,523 +0,0 @@

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Visualizing the knowledge graph with `yfiles-jupyter-graphs`\n",
"\n",
"This notebook is a partial copy of [local_search.ipynb](../../local_search.ipynb) that shows how to use `yfiles-jupyter-graphs` to add interactive graph visualizations of the parquet files and how to visualize the result context of `graphrag` queries (see at the end of this notebook)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Copyright (c) 2024 Microsoft Corporation.\n",
"# Licensed under the MIT License."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"import pandas as pd\n",
"import tiktoken\n",
"from graphrag.query.llm.oai.chat_openai import ChatOpenAI\n",
"from graphrag.query.llm.oai.embedding import OpenAIEmbedding\n",
"from graphrag.query.llm.oai.typing import OpenaiApiType\n",
"\n",
"from graphrag.query.context_builder.entity_extraction import EntityVectorStoreKey\n",
"from graphrag.query.indexer_adapters import (\n",
"    read_indexer_covariates,\n",
"    read_indexer_entities,\n",
"    read_indexer_relationships,\n",
"    read_indexer_reports,\n",
"    read_indexer_text_units,\n",
")\n",
"from graphrag.query.structured_search.local_search.mixed_context import (\n",
"    LocalSearchMixedContext,\n",
")\n",
"from graphrag.query.structured_search.local_search.search import LocalSearch\n",
"from graphrag.vector_stores.lancedb import LanceDBVectorStore"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Local Search Example\n",
"\n",
"Local search method generates answers by combining relevant data from the AI-extracted knowledge-graph with text chunks of the raw documents. This method is suitable for questions that require an understanding of specific entities mentioned in the documents (e.g. What are the healing properties of chamomile?)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load text units and graph data tables as context for local search\n",
"\n",
"- In this test we first load indexing outputs from parquet files to dataframes, then convert these dataframes into collections of data objects aligning with the knowledge model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load tables to dataframes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"INPUT_DIR = \"../../inputs/operation dulce\"\n",
"LANCEDB_URI = f\"{INPUT_DIR}/lancedb\"\n",
"\n",
"COMMUNITY_REPORT_TABLE = \"community_reports\"\n",
"COMMUNITY_TABLE = \"communities\"\n",
"ENTITY_TABLE = \"entities\"\n",
"RELATIONSHIP_TABLE = \"relationships\"\n",
"COVARIATE_TABLE = \"covariates\"\n",
"TEXT_UNIT_TABLE = \"text_units\"\n",
"COMMUNITY_LEVEL = 2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Read entities"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# read nodes table to get community and degree data\n",
"entity_df = pd.read_parquet(f\"{INPUT_DIR}/{ENTITY_TABLE}.parquet\")\n",
"community_df = pd.read_parquet(f\"{INPUT_DIR}/{COMMUNITY_TABLE}.parquet\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Read relationships"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"relationship_df = pd.read_parquet(f\"{INPUT_DIR}/{RELATIONSHIP_TABLE}.parquet\")\n",
"relationships = read_indexer_relationships(relationship_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Visualizing nodes and relationships with `yfiles-jupyter-graphs`\n",
"\n",
"`yfiles-jupyter-graphs` is a graph visualization extension that provides interactive and customizable visualizations for structured node and relationship data.\n",
"\n",
"In this case, we use it to provide an interactive visualization for the knowledge graph of the [local_search.ipynb](../../local_search.ipynb) sample by passing node and relationship lists converted from the given parquet files. The requirements for the input data is an `id` attribute for the nodes and `start`/`end` properties for the relationships that correspond to the node ids. Additional attributes can be added in the `properties` of each node/relationship dict:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install yfiles_jupyter_graphs --quiet\n",
"from yfiles_jupyter_graphs import GraphWidget\n",
"\n",
"\n",
"# converts the entities dataframe to a list of dicts for yfiles-jupyter-graphs\n",
"def convert_entities_to_dicts(df):\n",
"    \"\"\"Convert the entities dataframe to a list of dicts for yfiles-jupyter-graphs.\"\"\"\n",
"    nodes_dict = {}\n",
"    for _, row in df.iterrows():\n",
"        # Create a dictionary for each row and collect unique nodes\n",
"        node_id = row[\"title\"]\n",
"        if node_id not in nodes_dict:\n",
"            nodes_dict[node_id] = {\n",
"                \"id\": node_id,\n",
"                \"properties\": row.to_dict(),\n",
"            }\n",
"    return list(nodes_dict.values())\n",
"\n",
"\n",
"# converts the relationships dataframe to a list of dicts for yfiles-jupyter-graphs\n",
"def convert_relationships_to_dicts(df):\n",
"    \"\"\"Convert the relationships dataframe to a list of dicts for yfiles-jupyter-graphs.\"\"\"\n",
"    relationships = []\n",
"    for _, row in df.iterrows():\n",
"        # Create a dictionary for each row\n",
"        relationships.append({\n",
"            \"start\": row[\"source\"],\n",
"            \"end\": row[\"target\"],\n",
"            \"properties\": row.to_dict(),\n",
"        })\n",
"    return relationships\n",
"\n",
"\n",
"w = GraphWidget()\n",
"w.directed = True\n",
"w.nodes = convert_entities_to_dicts(entity_df)\n",
"w.edges = convert_relationships_to_dicts(relationship_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Configure data-driven visualization\n",
"\n",
"The additional properties can be used to configure the visualization for different use cases."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# show title on the node\n",
"w.node_label_mapping = \"title\"\n",
"\n",
"\n",
"# map community to a color\n",
"def community_to_color(community):\n",
"    \"\"\"Map a community to a color.\"\"\"\n",
"    colors = [\n",
"        \"crimson\",\n",
"        \"darkorange\",\n",
"        \"indigo\",\n",
"        \"cornflowerblue\",\n",
"        \"cyan\",\n",
"        \"teal\",\n",
"        \"green\",\n",
"    ]\n",
"    return (\n",
"        colors[int(community) % len(colors)] if community is not None else \"lightgray\"\n",
"    )\n",
"\n",
"\n",
"def edge_to_source_community(edge):\n",
"    \"\"\"Get the community of the source node of an edge.\"\"\"\n",
"    source_node = next(\n",
"        (entry for entry in w.nodes if entry[\"properties\"][\"title\"] == edge[\"start\"]),\n",
"        None,\n",
"    )\n",
"    source_node_community = source_node[\"properties\"][\"community\"]\n",
"    return source_node_community if source_node_community is not None else None\n",
"\n",
"\n",
"w.node_color_mapping = lambda node: community_to_color(node[\"properties\"][\"community\"])\n",
"w.edge_color_mapping = lambda edge: community_to_color(edge_to_source_community(edge))\n",
"# map size data to a reasonable factor\n",
"w.node_scale_factor_mapping = lambda node: 0.5 + node[\"properties\"][\"size\"] * 1.5 / 20\n",
"# use weight for edge thickness\n",
"w.edge_thickness_factor_mapping = \"weight\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Automatic layouts\n",
"\n",
"The widget provides different automatic layouts that serve different purposes: `Circular`, `Hierarchic`, `Organic (interactiv or static)`, `Orthogonal`, `Radial`, `Tree`, `Geo-spatial`.\n",
"\n",
"For the knowledge graph, this sample uses the `Circular` layout, though `Hierarchic` or `Organic` are also suitable choices."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Use the circular layout for this visualization. For larger graphs, the default organic layout is often preferrable.\n",
"w.circular_layout()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Display the graph"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"display(w)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Visualizing the result context of `graphrag` queries\n",
"\n",
"The result context of `graphrag` queries allow to inspect the context graph of the request. This data can similarly be visualized as graph with `yfiles-jupyter-graphs`.\n",
"\n",
"## Making the request\n",
"\n",
"The following cell recreates the sample queries from [local_search.ipynb](../../local_search.ipynb)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# setup (see also ../../local_search.ipynb)\n",
"entities = read_indexer_entities(entity_df, community_df, COMMUNITY_LEVEL)\n",
"\n",
"description_embedding_store = LanceDBVectorStore(\n",
"    collection_name=\"default-entity-description\",\n",
")\n",
"description_embedding_store.connect(db_uri=LANCEDB_URI)\n",
"covariate_df = pd.read_parquet(f\"{INPUT_DIR}/{COVARIATE_TABLE}.parquet\")\n",
"claims = read_indexer_covariates(covariate_df)\n",
"covariates = {\"claims\": claims}\n",
"report_df = pd.read_parquet(f\"{INPUT_DIR}/{COMMUNITY_REPORT_TABLE}.parquet\")\n",
"reports = read_indexer_reports(report_df, community_df, COMMUNITY_LEVEL)\n",
"text_unit_df = pd.read_parquet(f\"{INPUT_DIR}/{TEXT_UNIT_TABLE}.parquet\")\n",
"text_units = read_indexer_text_units(text_unit_df)\n",
"\n",
"api_key = os.environ[\"GRAPHRAG_API_KEY\"]\n",
"llm_model = os.environ[\"GRAPHRAG_LLM_MODEL\"]\n",
"embedding_model = os.environ[\"GRAPHRAG_EMBEDDING_MODEL\"]\n",
"\n",
"llm = ChatOpenAI(\n",
"    api_key=api_key,\n",
"    model=llm_model,\n",
"    api_type=OpenaiApiType.OpenAI,  # OpenaiApiType.OpenAI or OpenaiApiType.AzureOpenAI\n",
"    max_retries=20,\n",
")\n",
"\n",
"token_encoder = tiktoken.get_encoding(\"cl100k_base\")\n",
"\n",
"text_embedder = OpenAIEmbedding(\n",
"    api_key=api_key,\n",
"    api_base=None,\n",
"    api_type=OpenaiApiType.OpenAI,\n",
"    model=embedding_model,\n",
"    deployment_name=embedding_model,\n",
"    max_retries=20,\n",
")\n",
"\n",
"context_builder = LocalSearchMixedContext(\n",
"    community_reports=reports,\n",
"    text_units=text_units,\n",
"    entities=entities,\n",
"    relationships=relationships,\n",
"    covariates=covariates,\n",
"    entity_text_embeddings=description_embedding_store,\n",
"    embedding_vectorstore_key=EntityVectorStoreKey.ID,  # if the vectorstore uses entity title as ids, set this to EntityVectorStoreKey.TITLE\n",
"    text_embedder=text_embedder,\n",
"    token_encoder=token_encoder,\n",
")\n",
"\n",
"local_context_params = {\n",
"    \"text_unit_prop\": 0.5,\n",
"    \"community_prop\": 0.1,\n",
"    \"conversation_history_max_turns\": 5,\n",
"    \"conversation_history_user_turns_only\": True,\n",
"    \"top_k_mapped_entities\": 10,\n",
"    \"top_k_relationships\": 10,\n",
"    \"include_entity_rank\": True,\n",
"    \"include_relationship_weight\": True,\n",
"    \"include_community_rank\": False,\n",
"    \"return_candidate_context\": False,\n",
"    \"embedding_vectorstore_key\": EntityVectorStoreKey.ID,  # set this to EntityVectorStoreKey.TITLE if the vectorstore uses entity title as ids\n",
"    \"max_tokens\": 12_000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 5000)\n",
"}\n",
"\n",
"llm_params = {\n",
"    \"max_tokens\": 2_000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 1000=1500)\n",
"    \"temperature\": 0.0,\n",
"}\n",
"\n",
"search_engine = LocalSearch(\n",
"    llm=llm,\n",
"    context_builder=context_builder,\n",
"    token_encoder=token_encoder,\n",
"    llm_params=llm_params,\n",
"    context_builder_params=local_context_params,\n",
"    response_type=\"multiple paragraphs\",  # free form text describing the response type and format, can be anything, e.g. prioritized list, single paragraph, multiple paragraphs, multiple-page report\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Run local search on sample queries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"result = await search_engine.search(\"Tell me about Agent Mercer\")\n",
"print(result.response)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"question = \"Tell me about Dr. Jordan Hayes\"\n",
"result = await search_engine.search(question)\n",
"print(result.response)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Inspecting the context data used to generate the response"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"result.context_data[\"entities\"].head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"result.context_data[\"relationships\"].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualizing the result context as graph"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
"Helper function to visualize the result context with `yfiles-jupyter-graphs`.\n",
"\n",
"The dataframes are converted into supported nodes and relationships lists and then passed to yfiles-jupyter-graphs.\n",
"Additionally, some values are mapped to visualization properties.\n",
"\"\"\"\n",
"\n",
"\n",
"def show_graph(result):\n",
"    \"\"\"Visualize the result context with yfiles-jupyter-graphs.\"\"\"\n",
"    from yfiles_jupyter_graphs import GraphWidget\n",
"\n",
"    if (\n",
"        \"entities\" not in result.context_data\n",
"        or \"relationships\" not in result.context_data\n",
"    ):\n",
"        msg = \"The passed results do not contain 'entities' or 'relationships'\"\n",
"        raise ValueError(msg)\n",
"\n",
"    # converts the entities dataframe to a list of dicts for yfiles-jupyter-graphs\n",
"    def convert_entities_to_dicts(df):\n",
"        \"\"\"Convert the entities dataframe to a list of dicts for yfiles-jupyter-graphs.\"\"\"\n",
"        nodes_dict = {}\n",
"        for _, row in df.iterrows():\n",
"            # Create a dictionary for each row and collect unique nodes\n",
"            node_id = row[\"entity\"]\n",
"            if node_id not in nodes_dict:\n",
"                nodes_dict[node_id] = {\n",
"                    \"id\": node_id,\n",
"                    \"properties\": row.to_dict(),\n",
"                }\n",
"        return list(nodes_dict.values())\n",
"\n",
"    # converts the relationships dataframe to a list of dicts for yfiles-jupyter-graphs\n",
"    def convert_relationships_to_dicts(df):\n",
"        \"\"\"Convert the relationships dataframe to a list of dicts for yfiles-jupyter-graphs.\"\"\"\n",
"        relationships = []\n",
"        for _, row in df.iterrows():\n",
"            # Create a dictionary for each row\n",
"            relationships.append({\n",
"                \"start\": row[\"source\"],\n",
"                \"end\": row[\"target\"],\n",
"                \"properties\": row.to_dict(),\n",
"            })\n",
"        return relationships\n",
"\n",
"    w = GraphWidget()\n",
"    # use the converted data to visualize the graph\n",
"    w.nodes = convert_entities_to_dicts(result.context_data[\"entities\"])\n",
"    w.edges = convert_relationships_to_dicts(result.context_data[\"relationships\"])\n",
"    w.directed = True\n",
"    # show title on the node\n",
"    w.node_label_mapping = \"entity\"\n",
"    # use weight for edge thickness\n",
"    w.edge_thickness_factor_mapping = \"weight\"\n",
"    display(w)\n",
"\n",
"\n",
"show_graph(result)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.0"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
Binary file not shown.
@@ -1 +0,0 @@
(binary LanceDB table schema removed: fields id, text, vector fixed_size_list:float:1536, title)
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
@@ -125,6 +125,7 @@ class CommunityReportDefaults:

    max_length: int = 2000
    max_input_length: int = 8000
    model_id: str = DEFAULT_CHAT_MODEL_ID
    model_instance_name: str = "community_reporting"


@dataclass

@@ -161,6 +162,7 @@ class EmbedTextDefaults:

    """Default values for embedding text."""

    model_id: str = DEFAULT_EMBEDDING_MODEL_ID
    model_instance_name: str = "text_embedding"
    batch_size: int = 16
    batch_max_tokens: int = 8191
    names: list[str] = field(default_factory=lambda: default_embeddings)

@@ -179,6 +181,7 @@ class ExtractClaimsDefaults:

    max_gleanings: int = 1
    strategy: None = None
    model_id: str = DEFAULT_CHAT_MODEL_ID
    model_instance_name: str = "extract_claims"


@dataclass

@@ -192,6 +195,7 @@ class ExtractGraphDefaults:

    max_gleanings: int = 1
    strategy: None = None
    model_id: str = DEFAULT_CHAT_MODEL_ID
    model_instance_name: str = "extract_graph"


@dataclass

@@ -382,6 +386,7 @@ class SummarizeDescriptionsDefaults:

    max_input_tokens: int = 4_000
    strategy: None = None
    model_id: str = DEFAULT_CHAT_MODEL_ID
    model_instance_name: str = "summarize_descriptions"


@dataclass
@@ -30,6 +30,10 @@ class CommunityReportsConfig(BaseModel):

        description="The model ID to use for community reports.",
        default=graphrag_config_defaults.community_reports.model_id,
    )
    model_instance_name: str = Field(
        description="The model singleton instance name. This primarily affects the cache storage partitioning.",
        default=graphrag_config_defaults.community_reports.model_instance_name,
    )
    graph_prompt: str | None = Field(
        description="The community report extraction prompt to use for graph-based summarization.",
        default=graphrag_config_defaults.community_reports.graph_prompt,

@@ -15,6 +15,10 @@ class EmbedTextConfig(BaseModel):

        description="The model ID to use for text embeddings.",
        default=graphrag_config_defaults.embed_text.model_id,
    )
    model_instance_name: str = Field(
        description="The model singleton instance name. This primarily affects the cache storage partitioning.",
        default=graphrag_config_defaults.embed_text.model_instance_name,
    )
    batch_size: int = Field(
        description="The batch size to use.",
        default=graphrag_config_defaults.embed_text.batch_size,

@@ -30,6 +30,10 @@ class ExtractClaimsConfig(BaseModel):

        description="The model ID to use for claim extraction.",
        default=graphrag_config_defaults.extract_claims.model_id,
    )
    model_instance_name: str = Field(
        description="The model singleton instance name. This primarily affects the cache storage partitioning.",
        default=graphrag_config_defaults.extract_claims.model_instance_name,
    )
    prompt: str | None = Field(
        description="The claim extraction prompt to use.",
        default=graphrag_config_defaults.extract_claims.prompt,

@@ -26,6 +26,10 @@ class ExtractGraphConfig(BaseModel):

        description="The model ID to use for text embeddings.",
        default=graphrag_config_defaults.extract_graph.model_id,
    )
    model_instance_name: str = Field(
        description="The model singleton instance name. This primarily affects the cache storage partitioning.",
        default=graphrag_config_defaults.extract_graph.model_instance_name,
    )
    prompt: str | None = Field(
        description="The entity extraction prompt to use.",
        default=graphrag_config_defaults.extract_graph.prompt,

@@ -26,6 +26,10 @@ class SummarizeDescriptionsConfig(BaseModel):

        description="The model ID to use for summarization.",
        default=graphrag_config_defaults.summarize_descriptions.model_id,
    )
    model_instance_name: str = Field(
        description="The model singleton instance name. This primarily affects the cache storage partitioning.",
        default=graphrag_config_defaults.summarize_descriptions.model_instance_name,
    )
    prompt: str | None = Field(
        description="The description summarization prompt to use.",
        default=graphrag_config_defaults.summarize_descriptions.prompt,
@@ -58,7 +58,7 @@ async def run_workflow(

    prompts = config.community_reports.resolved_prompts(config.root_dir)

    model = ModelManager().get_or_create_chat_model(
-        name="community_reporting",
+        name=config.community_reports.model_instance_name,
        model_type=model_config.type,
        config=model_config,
        callbacks=context.callbacks,

@@ -47,7 +47,7 @@ async def run_workflow(

    model_config = config.get_language_model_config(config.community_reports.model_id)
    model = ModelManager().get_or_create_chat_model(
-        name="community_reporting",
+        name=config.community_reports.model_instance_name,
        model_type=model_config.type,
        config=model_config,
        callbacks=context.callbacks,

@@ -38,7 +38,7 @@ async def run_workflow(

    model_config = config.get_language_model_config(config.extract_claims.model_id)

    model = ModelManager().get_or_create_chat_model(
-        name="extract_claims",
+        name=config.extract_claims.model_instance_name,
        model_type=model_config.type,
        config=model_config,
        callbacks=context.callbacks,

@@ -38,7 +38,7 @@ async def run_workflow(

    )
    extraction_prompts = config.extract_graph.resolved_prompts(config.root_dir)
    extraction_model = ModelManager().get_or_create_chat_model(
-        name="extract_graph",
+        name=config.extract_graph.model_instance_name,
        model_type=extraction_model_config.type,
        config=extraction_model_config,
        cache=context.cache,

@@ -51,7 +51,7 @@ async def run_workflow(

        config.root_dir
    )
    summarization_model = ModelManager().get_or_create_chat_model(
-        name="summarize_descriptions",
+        name=config.summarize_descriptions.model_instance_name,
        model_type=summarization_model_config.type,
        config=summarization_model_config,
        cache=context.cache,

@@ -77,7 +77,7 @@ async def run_workflow(

    model_config = config.get_language_model_config(config.embed_text.model_id)

    model = ModelManager().get_or_create_embedding_model(
-        name="text_embedding",
+        name=config.embed_text.model_instance_name,
        model_type=model_config.type,
        config=model_config,
        callbacks=context.callbacks,
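
The pattern across these workflow diffs: the singleton name passed to `ModelManager` now comes from config rather than a hard-coded literal. A hedged Python sketch of the implied semantics, where the final assertion is an assumption about "get_or_create" behavior rather than a documented guarantee:

```python
# Hypothetical illustration, not code from this commit.
manager = ModelManager()
first = manager.get_or_create_chat_model(
    name=config.extract_graph.model_instance_name,  # e.g. "extract_graph"
    model_type=model_config.type,
    config=model_config,
)
second = manager.get_or_create_chat_model(
    name=config.extract_graph.model_instance_name,
    model_type=model_config.type,
    config=model_config,
)
assert first is second  # assumed: same instance (and cache partition) for the same name
```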
@@ -246,7 +246,7 @@ convention = "numpy"

# https://github.com/microsoft/pyright/blob/9f81564a4685ff5c55edd3959f9b39030f590b2f/docs/configuration.md#sample-pyprojecttoml-file
[tool.pyright]
- include = ["graphrag", "tests", "examples_notebooks"]
+ include = ["graphrag", "tests"]
exclude = ["**/node_modules", "**/__pycache__"]

[tool.pytest.ini_options]
2 tests/fixtures/min-csv/config.json vendored

@@ -54,7 +54,7 @@

        "period",
        "size"
      ],
-       "max_runtime": 300,
+       "max_runtime": 360,
      "expected_artifacts": ["community_reports.parquet"]
    },
    "create_final_text_units": {
4 tests/fixtures/text/config.json vendored

@@ -40,7 +40,7 @@

        "end_date",
        "source_text"
      ],
-       "max_runtime": 300,
+       "max_runtime": 360,
      "expected_artifacts": ["covariates.parquet"]
    },
    "create_communities": {

@@ -67,7 +67,7 @@

        "period",
        "size"
      ],
-       "max_runtime": 300,
+       "max_runtime": 360,
      "expected_artifacts": ["community_reports.parquet"]
    },
    "create_final_text_units": {