diff --git a/breaking-changes.md b/breaking-changes.md
index 4c0505d6..b07aa6fc 100644
--- a/breaking-changes.md
+++ b/breaking-changes.md
@@ -12,6 +12,35 @@ There are five surface areas that may be impacted on any given release. They are
> TL;DR: Always run `graphrag init --path [path] --force` between minor version bumps to ensure you have the latest config format. Run the provided migration notebook between major version bumps if you want to avoid re-indexing prior datasets. Note that this will overwrite your configuration and prompts, so backup if necessary.
+# v3
+Run the [migration notebook](./docs/examples_notebooks/index_migration_to_v3.ipynb) to convert older tables to the v3 format. Our main goals with v3 were to slim down the core library to minimize long-term maintenance of features that are either largely unused or should have been out of scope for a long time anyway.
+
+## Data Model
+We made minimal data model changes that will affect your index for v3. The primary breaking change is that we removed a rarely-used document-grouping capability that resulted in the `text_units` table having a `document_ids` column with a list instead of a single entry in a column called `document_id`. v3 fixes that, and the migration notebook applies the change so you don't need to re-index.
+
+Most of the other changes we made are removal of fields that are no longer used or are out of scope. For example, we removed the UMAP step that generates x/y coordinates for the entities - new indexes will not produce these columns, but they won't hurt anything if they are in your existing tables.
+
+## API
+We have removed the multi-search variant from each search method in the API.
+
+## Config
+
+We did make several changes to the configuration model. The best way forward is to re-run `init`, which we always recommend for minor and major version bumps.
+
+This is a summary of changes:
+- Removed fnllm as underlying model manager, so the model types "openai_chat", "azure_openai_chat", "openai_embedding", and "azure_openai_embedding" are all invalid. Use "chat" or "embedding".
+- fnllm also had an experimental rate limiting "auto" setting, which is no longer allowed. Use `null` in your config as a default, or set explicit limits to tpm/rpm.
+- LiteLLM does require a model_provider, so add yours as appropriate. For example, if you previously used "openai_chat" for your model type, this would be "openai", and for "azure_openai_chat" this would be "azure".
+- Collapsed the `vector_store` dict into a single root-level object. This is because we no longer support multi-search, and this dict required a lot of downstream complexity for that single use case.
+- Removed the `outputs` block that was also only used for multi-search.
+- Most workflows had an undocumented `strategy` config dict that allowed fine tuning of internal settings. This fine tuning was never used and added complexity, so we removed the dict.
+- Vector store configuration now allows a custom schema per embedded field. This removes the need for the `container_name` prefix, which caused confusion anyway. Now, the default container name will simply be the embedded field name - if you need something custom, add the `embeddings_schema` block and populate as needed.
+- We previously supported the ability to embed any text field in the data model. However, we only ever use text_unit_text, entity_description, and community_full_content, so all others have been removed.
+- Removed the `umap` and `embed_graph` blocks which were only used to add x/y fields to the entities. This fixed a long-standing dependency issue with graspologic. If you need x/y positions, see the [visualization guide](https://microsoft.github.io/graphrag/visualization_guide/) for using gephi.
+- Removed file filtering from input document loading. This was essentially unused.
+- Removed the groupby ability for text chunking. This was intended to allow short documents to be grouped before chunking, but is never used and added a bunch of complexity to the chunking process.
+
+
# v2
Run the [migration notebook](./docs/examples_notebooks/index_migration_to_v2.ipynb) to convert older tables to the v2 format.
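Review note on the v3 Config summary in breaking-changes.md above: the model type and rate limit bullets amount to a small mapping, which the following minimal sketch illustrates. It assumes a model entry from settings.yaml has already been parsed into a plain dict. `migrate_model_entry` and `LEGACY_TYPES` are hypothetical names, not part of GraphRAG, the model name `gpt-4o` is only a placeholder, and the two embedding rows are inferred by analogy with the chat examples given in the summary.

```python
# Hypothetical helper, not part of GraphRAG: rewrites one v2-style model entry
# (already parsed from settings.yaml into a dict) into the v3 shape.
LEGACY_TYPES = {
    "openai_chat": ("chat", "openai"),
    "azure_openai_chat": ("chat", "azure"),
    # Assumption: the embedding types map the same way as the chat types.
    "openai_embedding": ("embedding", "openai"),
    "azure_openai_embedding": ("embedding", "azure"),
}


def migrate_model_entry(entry: dict) -> dict:
    """Return a copy of `entry` with v3 `type`, `model_provider`, and rate limits."""
    migrated = dict(entry)
    legacy_type = migrated.get("type")
    if legacy_type in LEGACY_TYPES:
        new_type, provider = LEGACY_TYPES[legacy_type]
        migrated["type"] = new_type
        # Keep an explicit provider if one is already set.
        migrated.setdefault("model_provider", provider)
    # "auto" rate limiting was an fnllm feature; None (YAML null) is the v3 default.
    for key in ("tokens_per_minute", "requests_per_minute"):
        if migrated.get(key) == "auto":
            migrated[key] = None
    return migrated


old = {"type": "azure_openai_chat", "model": "gpt-4o", "tokens_per_minute": "auto"}
print(migrate_model_entry(old))
```

Re-running `graphrag init` remains the recommended path; a helper like this would only be useful for checking hand-edited configs against the new rules.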
diff --git a/docs/config/models.md b/docs/config/models.md
index 686304a5..676cb4b2 100644
--- a/docs/config/models.md
+++ b/docs/config/models.md
@@ -31,9 +31,9 @@ To use LiteLLM one must
- Set `type` to either `chat` or `embedding`.
- Provide a `model_provider`, e.g., `openai`, `azure`, `gemini`, etc.
- Set the `model` to a one supported by the `model_provider`'s API.
-- Provide a `deployment_name` if using `azure` as the `model_provider`.
+- Provide a `deployment_name` if using `azure` as the `model_provider` and your deployment name differs from the model name.
-See [Detailed Configuration](yaml.md) for more details on configuration. [View LiteLLm basic usage](https://docs.litellm.ai/docs/#basic-usage) for details on how models are called (The `model_provider` is the portion prior to `/` while the `model` is the portion following the `/`).
+See [Detailed Configuration](yaml.md) for more details on configuration. [View LiteLLM basic usage](https://docs.litellm.ai/docs/#basic-usage) for details on how models are called (The `model_provider` is the portion prior to `/` while the `model` is the portion following the `/`).
## Model Selection Considerations
diff --git a/docs/config/overview.md b/docs/config/overview.md
index 025f5d71..278d939c 100644
--- a/docs/config/overview.md
+++ b/docs/config/overview.md
@@ -8,4 +8,3 @@ The default configuration mode is the simplest way to get started with the Graph
- [Init command](init.md) (recommended first step)
- [Edit settings.yaml for deeper control](yaml.md)
-- [Purely using environment variables](env_vars.md) (not recommended)
diff --git a/docs/config/yaml.md b/docs/config/yaml.md
index 059d99ea..ae4d4279 100644
--- a/docs/config/yaml.md
+++ b/docs/config/yaml.md
@@ -11,7 +11,7 @@ For example:
GRAPHRAG_API_KEY=some_api_key
# settings.yml
-llm:
+default_chat_model:
api_key: ${GRAPHRAG_API_KEY}
```
@@ -44,12 +44,12 @@ models:
- `api_key` **str** - The OpenAI API key to use.
- `auth_type` **api_key|azure_managed_identity** - Indicate how you want to authenticate requests.
- `type` **chat**|**embedding**|mock_chat|mock_embeddings** - The type of LLM to use.
-- `model_provider` **str|None** - The model provider to use, e.g., openai, azure, anthropic, etc. Required when `type == chat|embedding`. When `type == chat|embedding`, [LiteLLM](https://docs.litellm.ai/) is used under the hood which has support for calling 100+ models. [View LiteLLm basic usage](https://docs.litellm.ai/docs/#basic-usage) for details on how models are called (The `model_provider` is the portion prior to `/` while the `model` is the portion following the `/`). [View Language Model Selection](models.md) for more details and examples on using LiteLLM.
+- `model_provider` **str|None** - The model provider to use, e.g., openai, azure, anthropic, etc. [LiteLLM](https://docs.litellm.ai/) is used under the hood which has support for calling 100+ models. [View LiteLLM basic usage](https://docs.litellm.ai/docs/#basic-usage) for details on how models are called (The `model_provider` is the portion prior to `/` while the `model` is the portion following the `/`). [View Language Model Selection](models.md) for more details and examples on using LiteLLM.
- `model` **str** - The model name.
- `encoding_model` **str** - The text encoding model to use. Default is to use the encoding model aligned with the language model (i.e., it is retrieved from tiktoken if unset).
- `api_base` **str** - The API base url to use.
- `api_version` **str** - The API version.
-- `deployment_name` **str** - The deployment name to use (Azure).
+- `deployment_name` **str** - The deployment name to use if your model is hosted on Azure. Note that if your deployment name on Azure matches the model name, this is unnecessary.
- `organization` **str** - The client organization.
- `proxy` **str** - The proxy URL to use.
- `audience` **str** - (Azure OpenAI only) The URI of the target Azure resource/service for which a managed identity token is requested. Used if `api_key` is not defined. Default=`https://cognitiveservices.azure.com/.default`
@@ -57,7 +57,7 @@ models:
- `request_timeout` **float** - The per-request timeout.
- `tokens_per_minute` **int** - Set a leaky-bucket throttle on tokens-per-minute.
- `requests_per_minute` **int** - Set a leaky-bucket throttle on requests-per-minute.
-- `retry_strategy` **str** - Retry strategy to use, "native" is the default and uses the strategy built into the OpenAI SDK. Other allowable values include "exponential_backoff", "random_wait", and "incremental_wait".
+- `retry_strategy` **str** - Retry strategy to use, "exponential_backoff" is the default. Other allowable values include "native", "random_wait", and "incremental_wait".
- `max_retries` **int** - The maximum number of retries to use.
- `max_retry_wait` **float** - The maximum backoff time.
- `concurrent_requests` **int** The number of open requests to allow at once.
@@ -201,7 +201,7 @@ Supported embeddings names are:
#### Fields
- `model_id` **str** - Name of the model definition to use for text embedding.
-- `vector_store_id` **str** - Name of vector store definition to write to.
+- `model_instance_name` **str** - Name of the model singleton instance. Default is "text_embedding". This primarily affects the cache storage partitioning.
- `batch_size` **int** - The maximum batch size to use.
- `batch_max_tokens` **int** - The maximum batch # of tokens.
- `names` **list[str]** - List of the embeddings names to run (must be in supported list).
@@ -213,6 +213,7 @@ Tune the language model-based graph extraction process.
#### Fields
- `model_id` **str** - Name of the model definition to use for API calls.
+- `model_instance_name` **str** - Name of the model singleton instance. Default is "extract_graph". This primarily affects the cache storage partitioning.
- `prompt` **str** - The prompt file to use.
- `entity_types` **list[str]** - The entity types to identify.
- `max_gleanings` **int** - The maximum number of gleaning cycles to use.
@@ -222,6 +223,7 @@ Tune the language model-based graph extraction process.
#### Fields
- `model_id` **str** - Name of the model definition to use for API calls.
+- `model_instance_name` **str** - Name of the model singleton instance. Default is "summarize_descriptions". This primarily affects the cache storage partitioning.
- `prompt` **str** - The prompt file to use.
- `max_length` **int** - The maximum number of output tokens per summarization.
- `max_input_length` **int** - The maximum number of tokens to collect for summarization (this will limit how many descriptions you send to be summarized for a given entity or relationship).
@@ -275,6 +277,7 @@ These are the settings used for Leiden hierarchical clustering of the graph to c
- `enabled` **bool** - Whether to enable claim extraction. Off by default, because claim prompts really need user tuning.
- `model_id` **str** - Name of the model definition to use for API calls.
+- `model_instance_name` **str** - Name of the model singleton instance. Default is "extract_claims". This primarily affects the cache storage partitioning.
- `prompt` **str** - The prompt file to use.
- `description` **str** - Describes the types of claims we want to extract.
- `max_gleanings` **int** - The maximum number of gleaning cycles to use.
@@ -284,6 +287,7 @@ These are the settings used for Leiden hierarchical clustering of the graph to c
#### Fields
- `model_id` **str** - Name of the model definition to use for API calls.
+- `model_instance_name` **str** - Name of the model singleton instance. Default is "community_reporting". This primarily affects the cache storage partitioning.
- `prompt` **str** - The prompt file to use.
- `max_length` **int** - The maximum number of output tokens per report.
- `max_input_length` **int** - The maximum number of input tokens to use when generating reports.
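Review note on the `${GRAPHRAG_API_KEY}` example in docs/config/yaml.md above: the token replacement it describes can be illustrated with the standard library's `string.Template`. This is a sketch of the general technique, not a claim about how GraphRAG loads settings internally. `expand_env` is a hypothetical name, and the settings text is a simplified fragment.

```python
import os
from string import Template
from typing import Mapping, Optional


def expand_env(raw_settings: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Replace ${NAME} tokens with values from `env` (defaults to os.environ)."""
    values = os.environ if env is None else env
    # substitute() raises KeyError for an undefined variable,
    # so a missing key fails loudly instead of yielding an empty string.
    return Template(raw_settings).substitute(values)


raw = "default_chat_model:\n  api_key: ${GRAPHRAG_API_KEY}\n"
print(expand_env(raw, {"GRAPHRAG_API_KEY": "some_api_key"}))
```

With this approach a literal `$` in the settings text has to be written as `$$`, otherwise `Template` rejects it as an invalid placeholder.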
diff --git a/docs/examples_notebooks/index_migration_to_v3.ipynb b/docs/examples_notebooks/index_migration_to_v3.ipynb
new file mode 100644
index 00000000..2b76133e
--- /dev/null
+++ b/docs/examples_notebooks/index_migration_to_v3.ipynb
@@ -0,0 +1,175 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Copyright (c) 2024 Microsoft Corporation.\n",
+ "# Licensed under the MIT License."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Index Migration (v2 to v3)\n",
+ "\n",
+ "This notebook is used to maintain data model parity with older indexes for version 3.0 of GraphRAG. If you have a pre-3.0 index and need to migrate without re-running the entire pipeline, you can use this notebook to only update the pieces necessary for alignment. If you have a pre-2.0 index, please run the v2 migration notebook first!\n",
+ "\n",
+ "NOTE: we recommend regenerating your settings.yaml with the latest version of GraphRAG using `graphrag init`. Copy your LLM settings into it before running this notebook. This ensures your config is aligned with the latest version for the migration.\n",
+ "\n",
+ "This notebook will also update your settings.yaml to ensure compatibility with our newer vector store collection naming scheme in order to avoid re-ingesting.\n",
+ "\n",
+ "WARNING: This will overwrite your parquet files, so you may want to make a backup first!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# This is the directory that has your settings.yaml\n",
+ "PROJECT_DIRECTORY = \"/Users/naevans/graphrag/working/migration\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from pathlib import Path\n",
+ "\n",
+ "from graphrag.config.load_config import load_config\n",
+ "from graphrag.storage.factory import StorageFactory\n",
+ "\n",
+ "config = load_config(Path(PROJECT_DIRECTORY))\n",
+ "storage_config = config.output.model_dump()\n",
+ "storage = StorageFactory().create_storage(\n",
+ " storage_type=storage_config[\"type\"],\n",
+ " kwargs=storage_config,\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def remove_columns(df, columns):\n",
+ " \"\"\"Remove columns from a DataFrame, suppressing errors.\"\"\"\n",
+ " df.drop(labels=columns, axis=1, errors=\"ignore\", inplace=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from graphrag.utils.storage import (\n",
+ " load_table_from_storage,\n",
+ " write_table_to_storage,\n",
+ ")\n",
+ "\n",
+ "text_units = await load_table_from_storage(\"text_units\", storage)\n",
+ "\n",
+ "text_units[\"document_id\"] = text_units[\"document_ids\"].apply(lambda ids: ids[0])\n",
+ "remove_columns(text_units, [\"document_ids\"])\n",
+ "\n",
+ "await write_table_to_storage(text_units, \"text_units\", storage)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Update settings.yaml\n",
+ "This next section will attempt to insert index names for each vector index using our new schema structure. It depends on most things being default. If you have already customized your vector store schema it may not be necessary.\n",
+ "\n",
+ "The primary goal is to align v2 indexes using our old default naming schema with the new customizability. If you don't need this done or you have a more complicated config, comment it out and update your config manually to ensure each index name is set.\n",
+ "\n",
+ "Old default index names:\n",
+ "- default-text_unit-text\n",
+ "- default-entity-description\n",
+ "- default-community-full_content\n",
+ "\n",
+ "v3 versions are:\n",
+ "- text_unit_text\n",
+ "- entity_description\n",
+ "- community_full_content\n",
+ "\n",
+ "Therefore, with a v2 index we will explicitly set the old index names so it connects correctly.\n",
+ "\n",
+ "NOTE: we are also setting the default vector_size for each index, under the assumption that you are using a prior default with 1536 dimensions. Our new default of text-embedding-3-large has 3072 dimensions, which will be populated as the default if unset. Again, if you have a more complicated situation you may want to manually configure this.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import yaml\n",
+ "\n",
+ "EMBEDDING_DIMENSIONS = 1536\n",
+ "\n",
+ "settings = Path(PROJECT_DIRECTORY) / \"settings.yaml\"\n",
+ "with Path.open(settings) as f:\n",
+ " conf = yaml.safe_load(f)\n",
+ "\n",
+ "vector_store = conf.get(\"vector_store\", {})\n",
+ "container_name = vector_store.get(\"container_name\", \"default\")\n",
+ "embeddings_schema = vector_store.get(\"embeddings_schema\", {})\n",
+ "text_unit_schema = embeddings_schema.get(\"text_unit.text\", {})\n",
+ "if \"index_name\" not in text_unit_schema:\n",
+ " text_unit_schema[\"index_name\"] = f\"{container_name}-text_unit-text\"\n",
+ "if \"vector_size\" not in text_unit_schema:\n",
+ " text_unit_schema[\"vector_size\"] = EMBEDDING_DIMENSIONS\n",
+ "embeddings_schema[\"text_unit.text\"] = text_unit_schema\n",
+ "entity_schema = embeddings_schema.get(\"entity.description\", {})\n",
+ "if \"index_name\" not in entity_schema:\n",
+ " entity_schema[\"index_name\"] = f\"{container_name}-entity-description\"\n",
+ "if \"vector_size\" not in entity_schema:\n",
+ " entity_schema[\"vector_size\"] = EMBEDDING_DIMENSIONS\n",
+ "embeddings_schema[\"entity.description\"] = entity_schema\n",
+ "community_schema = embeddings_schema.get(\"community.full_content\", {})\n",
+ "if \"index_name\" not in community_schema:\n",
+ " community_schema[\"index_name\"] = f\"{container_name}-community-full_content\"\n",
+ "if \"vector_size\" not in community_schema:\n",
+ " community_schema[\"vector_size\"] = EMBEDDING_DIMENSIONS\n",
+ "embeddings_schema[\"community.full_content\"] = community_schema\n",
+ "vector_store[\"embeddings_schema\"] = embeddings_schema\n",
+ "conf[\"vector_store\"] = vector_store\n",
+ "\n",
+ "with Path.open(settings, \"w\") as f:\n",
+ " yaml.safe_dump(conf, f)"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "graphrag",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.12.10"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
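Review note on the final cell of index_migration_to_v3.ipynb above: the three near-identical schema blocks could be collapsed into one loop. The sketch below works on an already-parsed settings dict and returns a copy, so it can be tried without touching settings.yaml. `pin_v2_index_names` and `FIELD_SUFFIXES` are hypothetical names; the field names, the `default` container name, and the 1536-dimension assumption come from the notebook, while the `lancedb` store type is only a placeholder.

```python
import copy

EMBEDDING_DIMENSIONS = 1536  # same assumption as the notebook: a 1536-dimension v2 default

# Embedded field name -> suffix used by the old default index names.
FIELD_SUFFIXES = {
    "text_unit.text": "text_unit-text",
    "entity.description": "entity-description",
    "community.full_content": "community-full_content",
}


def pin_v2_index_names(conf: dict, dimensions: int = EMBEDDING_DIMENSIONS) -> dict:
    """Return a copy of `conf` with v2 index names and vector sizes filled in."""
    updated = copy.deepcopy(conf)
    vector_store = updated.setdefault("vector_store", {})
    container_name = vector_store.get("container_name", "default")
    schemas = vector_store.setdefault("embeddings_schema", {})
    for field, suffix in FIELD_SUFFIXES.items():
        schema = schemas.setdefault(field, {})
        # setdefault leaves any value that was already customized untouched.
        schema.setdefault("index_name", f"{container_name}-{suffix}")
        schema.setdefault("vector_size", dimensions)
    return updated


conf = {"vector_store": {"type": "lancedb"}}
print(pin_v2_index_names(conf)["vector_store"]["embeddings_schema"]["entity.description"])
```

Returning a copy instead of mutating in place makes it easy to diff the result against the original config before writing anything back to disk.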
diff --git a/examples_notebooks/community_contrib/README.md b/examples_notebooks/community_contrib/README.md
deleted file mode 100644
index 0915dc1a..00000000
--- a/examples_notebooks/community_contrib/README.md
+++ /dev/null
@@ -1,5 +0,0 @@
-## Disclaimer
-
-This folder contains community contributed notebooks that are not officially supported by the GraphRAG team. The notebooks are provided as-is and are not guaranteed to work with the latest version of GraphRAG. If you have any questions or issues, please reach out to the author of the notebook directly.
-
-For more information on how to contribute to the GraphRAG project, please refer to the [contribution guidelines](https://github.com/microsoft/graphrag/blob/main/CONTRIBUTING.md)
diff --git a/examples_notebooks/community_contrib/neo4j/graphrag_import_neo4j_cypher.ipynb b/examples_notebooks/community_contrib/neo4j/graphrag_import_neo4j_cypher.ipynb
deleted file mode 100644
index a8b30a81..00000000
--- a/examples_notebooks/community_contrib/neo4j/graphrag_import_neo4j_cypher.ipynb
+++ /dev/null
@@ -1,1215 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "b4fea928",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Copyright (c) 2024 Microsoft Corporation.\n",
- "# Licensed under the MIT License."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "0c4bc9ba",
- "metadata": {},
- "source": [
- "# Neo4j Import of GraphRAG Result Parquet files\n",
- "\n",
- "This notebook imports the results of the GraphRAG indexing process into the Neo4j Graph database for further processing, analysis or visualization. \n",
- "\n",
- "You can also build your own GenAI applications using Neo4j and a number of RAG strategies with LangChain, LlamaIndex, Haystack, and many other frameworks.\n",
- "See: https://neo4j.com/labs/genai-ecosystem\n",
- "\n",
- "Here is what the end result looks like:\n",
- "\n",
- ""
- ]
- },
- {
- "cell_type": "markdown",
- "id": "3924e246",
- "metadata": {},
- "source": [
- "## How does it work?\n",
- "\n",
- "The notebook loads the parquet files from the `output` folder of your indexing process and loads them into Pandas dataframes.\n",
- "It then uses a batching approach to send a slice of the data into Neo4j to create nodes and relationships and add relevant properties. The id-arrays on most entities are turned into relationships. \n",
- "\n",
- "All operations use MERGE, so they are idempotent, and you can run the script multiple times.\n",
- "\n",
- "If you need to clean out the database, you can run the following statement\n",
- "\n",
- "```cypher\n",
- "MATCH (n)\n",
- "CALL { WITH n DETACH DELETE n } IN TRANSACTIONS OF 25000 ROWS;\n",
- "```"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 59,
- "id": "adca1803",
- "metadata": {},
- "outputs": [],
- "source": [
- "GRAPHRAG_FOLDER = \"PATH_TO_OUTPUT/artifacts\""
- ]
- },
- {
- "cell_type": "markdown",
- "id": "7fb27b941602401d91542211134fc71a",
- "metadata": {},
- "source": [
- "### Depedendencies\n",
- "\n",
- "We only need Pandas and the neo4j Python driver with the rust extension for faster network transport."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "b57beec0",
- "metadata": {},
- "outputs": [],
- "source": [
- "%pip install --quiet pandas neo4j-rust-ext"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 61,
- "id": "3eeee95f-e4f2-4052-94fb-a5dc8ab542ae",
- "metadata": {},
- "outputs": [],
- "source": [
- "import time\n",
- "\n",
- "import pandas as pd\n",
- "from neo4j import GraphDatabase"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "307dd2f4",
- "metadata": {},
- "source": [
- "## Neo4j Installation\n",
- "\n",
- "You can create a free instance of Neo4j [online](https://console.neo4j.io). You get a credentials file that you can use for the connection credentials. You can also get an instance in any of the cloud marketplaces.\n",
- "\n",
- "If you want to install Neo4j locally either use [Neo4j Desktop](https://neo4j.com/download) or \n",
- "the official Docker image: `docker run -e NEO4J_AUTH=neo4j/password -p 7687:7687 -p 7474:7474 neo4j` "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 62,
- "id": "b6c15443-4acb-4f91-88ea-4e08abaa4c29",
- "metadata": {},
- "outputs": [],
- "source": [
- "NEO4J_URI = \"neo4j://localhost\" # or neo4j+s://xxxx.databases.neo4j.io\n",
- "NEO4J_USERNAME = \"neo4j\"\n",
- "NEO4J_PASSWORD = \"\" # your password\n",
- "NEO4J_DATABASE = \"neo4j\"\n",
- "\n",
- "# Create a Neo4j driver\n",
- "driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "70f37ab6",
- "metadata": {},
- "source": [
- "## Batched Import\n",
- "\n",
- "The batched import function takes a Cypher insert statement (needs to use the variable `value` for the row) and a dataframe to import.\n",
- "It will send by default 1k rows at a time as query parameter to the database to be inserted."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 63,
- "id": "d787bf7b-ac9b-4bfb-b140-a50a3fd205c5",
- "metadata": {},
- "outputs": [],
- "source": [
- "def batched_import(statement, df, batch_size=1000):\n",
- " \"\"\"\n",
- " Import a dataframe into Neo4j using a batched approach.\n",
- "\n",
- " Parameters: statement is the Cypher query to execute, df is the dataframe to import, and batch_size is the number of rows to import in each batch.\n",
- " \"\"\"\n",
- " total = len(df)\n",
- " start_s = time.time()\n",
- " for start in range(0, total, batch_size):\n",
- " batch = df.iloc[start : min(start + batch_size, total)]\n",
- " result = driver.execute_query(\n",
- " \"UNWIND $rows AS value \" + statement,\n",
- " rows=batch.to_dict(\"records\"),\n",
- " database_=NEO4J_DATABASE,\n",
- " )\n",
- " print(result.summary.counters)\n",
- " print(f\"{total} rows in {time.time() - start_s} s.\")\n",
- " return total"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "0fb45f42",
- "metadata": {},
- "source": [
- "## Indexes and Constraints\n",
- "\n",
- "Indexes in Neo4j are only used to find the starting points for graph queries, e.g. quickly finding two nodes to connect.\n",
- "Constraints exist to avoid duplicates, we create them mostly on id's of Entity types.\n",
- "\n",
- "We use some Types as markers with two underscores before and after to distinguish them from the actual entity types.\n",
- "\n",
- "The default relationship type here is `RELATED` but we could also infer a real relationship-type from the description or the types of the start and end-nodes.\n",
- "\n",
- "* `__Entity__`\n",
- "* `__Document__`\n",
- "* `__Chunk__`\n",
- "* `__Community__`\n",
- "* `__Covariate__`"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 64,
- "id": "ed7f212e-9148-424c-adc6-d81db9f8e5a5",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "\n",
- "create constraint chunk_id if not exists for (c:__Chunk__) require c.id is unique\n",
- "\n",
- "create constraint document_id if not exists for (d:__Document__) require d.id is unique\n",
- "\n",
- "create constraint entity_id if not exists for (c:__Community__) require c.community is unique\n",
- "\n",
- "create constraint entity_id if not exists for (e:__Entity__) require e.id is unique\n",
- "\n",
- "create constraint entity_title if not exists for (e:__Entity__) require e.name is unique\n",
- "\n",
- "create constraint entity_title if not exists for (e:__Covariate__) require e.title is unique\n",
- "\n",
- "create constraint related_id if not exists for ()-[rel:RELATED]->() require rel.id is unique\n"
- ]
- }
- ],
- "source": [
- "# create constraints, idempotent operation\n",
- "\n",
- "statements = [\n",
- " \"\\ncreate constraint chunk_id if not exists for (c:__Chunk__) require c.id is unique\",\n",
- " \"\\ncreate constraint document_id if not exists for (d:__Document__) require d.id is unique\",\n",
- " \"\\ncreate constraint entity_id if not exists for (c:__Community__) require c.community is unique\",\n",
- " \"\\ncreate constraint entity_id if not exists for (e:__Entity__) require e.id is unique\",\n",
- " \"\\ncreate constraint entity_title if not exists for (e:__Entity__) require e.name is unique\",\n",
- " \"\\ncreate constraint entity_title if not exists for (e:__Covariate__) require e.title is unique\",\n",
- " \"\\ncreate constraint related_id if not exists for ()-[rel:RELATED]->() require rel.id is unique\",\n",
- " \"\\n\",\n",
- "]\n",
- "\n",
- "for statement in statements:\n",
- " if len((statement or \"\").strip()) > 0:\n",
- " print(statement)\n",
- " driver.execute_query(statement)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "beea073b",
- "metadata": {},
- "source": [
- "## Import Process\n",
- "\n",
- "### Importing the Documents\n",
- "\n",
- "We're loading the parquet file for the documents and create nodes with their ids and add the title property.\n",
- "We don't need to store text_unit_ids as we can create the relationships and the text content is also contained in the chunks."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 65,
- "id": "1ba023e7",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<table>\n",
- " <thead>\n",
- " <tr>\n",
- " <th></th>\n",
- " <th>id</th>\n",
- " <th>title</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>c305886e4aa2f6efcf64b57762777055</td>\n",
- " <td>book.txt</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " id title\n",
- "0 c305886e4aa2f6efcf64b57762777055 book.txt"
- ]
- },
- "execution_count": 65,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "doc_df = pd.read_parquet(\n",
- " f\"{GRAPHRAG_FOLDER}/documents.parquet\", columns=[\"id\", \"title\"]\n",
- ")\n",
- "doc_df.head(2)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 66,
- "id": "96391c15",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "{'_contains_updates': True, 'labels_added': 1, 'nodes_created': 1, 'properties_set': 2}\n",
- "1 rows in 0.05211496353149414 s.\n"
- ]
- },
- {
- "data": {
- "text/plain": [
- "1"
- ]
- },
- "execution_count": 66,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Import documents\n",
- "statement = \"\"\"\n",
- "MERGE (d:__Document__ {id:value.id})\n",
- "SET d += value {.title}\n",
- "\"\"\"\n",
- "\n",
- "batched_import(statement, doc_df)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "f97bbadb",
- "metadata": {},
- "source": [
- "### Loading Text Units\n",
- "\n",
- "We load the text units, create a node per id and set the text and number of tokens.\n",
- "Then we connect them to the documents that we created before."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 67,
- "id": "0d825626",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<table>\n",
- " <thead>\n",
- " <tr>\n",
- " <th></th>\n",
- " <th>id</th>\n",
- " <th>text</th>\n",
- " <th>n_tokens</th>\n",
- " <th>document_ids</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>0</th>\n",
- " <td>680dd6d2a970a49082fa4f34bf63a34e</td>\n",
- " <td>The Project Gutenberg eBook of A Christmas Ca...</td>\n",
- " <td>300</td>\n",
- " <td>[c305886e4aa2f6efcf64b57762777055]</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>1</th>\n",
- " <td>95f1f8f5bdbf0bee3a2c6f2f4a4907f6</td>\n",
- " <td>THE PROJECT GUTENBERG EBOOK A CHRISTMAS CAROL...</td>\n",
- " <td>300</td>\n",
- " <td>[c305886e4aa2f6efcf64b57762777055]</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " id \\\n",
- "0 680dd6d2a970a49082fa4f34bf63a34e \n",
- "1 95f1f8f5bdbf0bee3a2c6f2f4a4907f6 \n",
- "\n",
- " text n_tokens \\\n",
- "0 The Project Gutenberg eBook of A Christmas Ca... 300 \n",
- "1 THE PROJECT GUTENBERG EBOOK A CHRISTMAS CAROL... 300 \n",
- "\n",
- " document_ids \n",
- "0 [c305886e4aa2f6efcf64b57762777055] \n",
- "1 [c305886e4aa2f6efcf64b57762777055] "
- ]
- },
- "execution_count": 67,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "text_df = pd.read_parquet(\n",
- " f\"{GRAPHRAG_FOLDER}/text_units.parquet\",\n",
- " columns=[\"id\", \"text\", \"n_tokens\", \"document_ids\"],\n",
- ")\n",
- "text_df.head(2)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 68,
- "id": "ffd3d380-8710-46f5-b90a-04ed8482192c",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "{'_contains_updates': True, 'relationships_created': 231, 'properties_set': 462}\n",
- "231 rows in 0.05993008613586426 s.\n"
- ]
- },
- {
- "data": {
- "text/plain": [
- "231"
- ]
- },
- "execution_count": 68,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "statement = \"\"\"\n",
- "MERGE (c:__Chunk__ {id:value.id})\n",
- "SET c += value {.text, .n_tokens}\n",
- "WITH c, value\n",
- "UNWIND value.document_ids AS document\n",
- "MATCH (d:__Document__ {id:document})\n",
- "MERGE (c)-[:PART_OF]->(d)\n",
- "\"\"\"\n",
- "\n",
- "batched_import(statement, text_df)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "f01b2094",
- "metadata": {},
- "source": [
- "### Loading Nodes\n",
- "\n",
- "For the nodes we store id, name, description, embedding (if available), human readable id."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 78,
- "id": "2392f9e9",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- "  <thead>\n",
- "    <tr style=\"text-align: right;\">\n",
- "      <th></th>\n",
- "      <th>name</th>\n",
- "      <th>type</th>\n",
- "      <th>description</th>\n",
- "      <th>human_readable_id</th>\n",
- "      <th>id</th>\n",
- "      <th>description_embedding</th>\n",
- "      <th>text_unit_ids</th>\n",
- "    </tr>\n",
- "  </thead>\n",
- "  <tbody>\n",
- "    <tr>\n",
- "      <th>0</th>\n",
- "      <td>\"PROJECT GUTENBERG\"</td>\n",
- "      <td>\"ORGANIZATION\"</td>\n",
- "      <td>Project Gutenberg is a pioneering organization...</td>\n",
- "      <td>0</td>\n",
- "      <td>b45241d70f0e43fca764df95b2b81f77</td>\n",
- "      <td>[-0.020793898031115532, 0.02951139025390148, 0...</td>\n",
- "      <td>[01e84646075b255eab0a34d872336a89, 10bab8e9773...</td>\n",
- "    </tr>\n",
- "    <tr>\n",
- "      <th>1</th>\n",
- "      <td>\"UNITED STATES\"</td>\n",
- "      <td>\"GEO\"</td>\n",
- "      <td>The United States is prominently recognized fo...</td>\n",
- "      <td>1</td>\n",
- "      <td>4119fd06010c494caa07f439b333f4c5</td>\n",
- "      <td>[-0.009704762138426304, 0.013335365802049637, ...</td>\n",
- "      <td>[01e84646075b255eab0a34d872336a89, 28f242c4515...</td>\n",
- "    </tr>\n",
- "  </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " name type \\\n",
- "0 \"PROJECT GUTENBERG\" \"ORGANIZATION\" \n",
- "1 \"UNITED STATES\" \"GEO\" \n",
- "\n",
- " description human_readable_id \\\n",
- "0 Project Gutenberg is a pioneering organization... 0 \n",
- "1 The United States is prominently recognized fo... 1 \n",
- "\n",
- " id \\\n",
- "0 b45241d70f0e43fca764df95b2b81f77 \n",
- "1 4119fd06010c494caa07f439b333f4c5 \n",
- "\n",
- " description_embedding \\\n",
- "0 [-0.020793898031115532, 0.02951139025390148, 0... \n",
- "1 [-0.009704762138426304, 0.013335365802049637, ... \n",
- "\n",
- " text_unit_ids \n",
- "0 [01e84646075b255eab0a34d872336a89, 10bab8e9773... \n",
- "1 [01e84646075b255eab0a34d872336a89, 28f242c4515... "
- ]
- },
- "execution_count": 78,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "entity_df = pd.read_parquet(\n",
- " f\"{GRAPHRAG_FOLDER}/entities.parquet\",\n",
- " columns=[\n",
- " \"name\",\n",
- " \"type\",\n",
- " \"description\",\n",
- " \"human_readable_id\",\n",
- " \"id\",\n",
- " \"description_embedding\",\n",
- " \"text_unit_ids\",\n",
- " ],\n",
- ")\n",
- "entity_df.head(2)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 81,
- "id": "1d038114-0714-48ee-a48a-c421cd539661",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "{'_contains_updates': True, 'properties_set': 831}\n",
- "277 rows in 0.6978070735931396 s.\n"
- ]
- },
- {
- "data": {
- "text/plain": [
- "277"
- ]
- },
- "execution_count": 81,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "entity_statement = \"\"\"\n",
- "MERGE (e:__Entity__ {id:value.id})\n",
- "SET e += value {.human_readable_id, .description, name:replace(value.name,'\"','')}\n",
- "WITH e, value\n",
- "CALL db.create.setNodeVectorProperty(e, \"description_embedding\", value.description_embedding)\n",
- "CALL apoc.create.addLabels(e, case when coalesce(value.type,\"\") = \"\" then [] else [apoc.text.upperCamelCase(replace(value.type,'\"',''))] end) yield node\n",
- "UNWIND value.text_unit_ids AS text_unit\n",
- "MATCH (c:__Chunk__ {id:text_unit})\n",
- "MERGE (c)-[:HAS_ENTITY]->(e)\n",
- "\"\"\"\n",
- "\n",
- "batched_import(entity_statement, entity_df)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "018d4f87",
- "metadata": {},
- "source": [
- "### Import Relationships\n",
- "\n",
- "For the relationships, we find the source and target nodes by name, using the base `__Entity__` type.\n",
- "After creating the `RELATED` relationships, we set the description and the other properties as attributes."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 71,
- "id": "b347a047",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- "  <thead>\n",
- "    <tr style=\"text-align: right;\">\n",
- "      <th></th>\n",
- "      <th>source</th>\n",
- "      <th>target</th>\n",
- "      <th>id</th>\n",
- "      <th>rank</th>\n",
- "      <th>weight</th>\n",
- "      <th>human_readable_id</th>\n",
- "      <th>description</th>\n",
- "      <th>text_unit_ids</th>\n",
- "    </tr>\n",
- "  </thead>\n",
- "  <tbody>\n",
- "    <tr>\n",
- "      <th>0</th>\n",
- "      <td>\"PROJECT GUTENBERG\"</td>\n",
- "      <td>\"A CHRISTMAS CAROL\"</td>\n",
- "      <td>b84d71ed9c3b45819eb3205fd28e13a0</td>\n",
- "      <td>20</td>\n",
- "      <td>1.0</td>\n",
- "      <td>0</td>\n",
- "      <td>\"Project Gutenberg is responsible for releasin...</td>\n",
- "      <td>[680dd6d2a970a49082fa4f34bf63a34e]</td>\n",
- "    </tr>\n",
- "    <tr>\n",
- "      <th>1</th>\n",
- "      <td>\"PROJECT GUTENBERG\"</td>\n",
- "      <td>\"SUZANNE SHELL\"</td>\n",
- "      <td>b0b464bc92a541e48547fe9738378dab</td>\n",
- "      <td>15</td>\n",
- "      <td>1.0</td>\n",
- "      <td>1</td>\n",
- "      <td>\"Suzanne Shell produced the eBook version of '...</td>\n",
- "      <td>[680dd6d2a970a49082fa4f34bf63a34e]</td>\n",
- "    </tr>\n",
- "  </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " source target id \\\n",
- "0 \"PROJECT GUTENBERG\" \"A CHRISTMAS CAROL\" b84d71ed9c3b45819eb3205fd28e13a0 \n",
- "1 \"PROJECT GUTENBERG\" \"SUZANNE SHELL\" b0b464bc92a541e48547fe9738378dab \n",
- "\n",
- " rank weight human_readable_id \\\n",
- "0 20 1.0 0 \n",
- "1 15 1.0 1 \n",
- "\n",
- " description \\\n",
- "0 \"Project Gutenberg is responsible for releasin... \n",
- "1 \"Suzanne Shell produced the eBook version of '... \n",
- "\n",
- " text_unit_ids \n",
- "0 [680dd6d2a970a49082fa4f34bf63a34e] \n",
- "1 [680dd6d2a970a49082fa4f34bf63a34e] "
- ]
- },
- "execution_count": 71,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "rel_df = pd.read_parquet(\n",
- " f\"{GRAPHRAG_FOLDER}/relationships.parquet\",\n",
- " columns=[\n",
- " \"source\",\n",
- " \"target\",\n",
- " \"id\",\n",
- " \"rank\",\n",
- " \"weight\",\n",
- " \"human_readable_id\",\n",
- " \"description\",\n",
- " \"text_unit_ids\",\n",
- " ],\n",
- ")\n",
- "rel_df.head(2)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 72,
- "id": "27900c01-89e1-4dec-9d5c-c07317c68baf",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "{'_contains_updates': True, 'properties_set': 1710}\n",
- "342 rows in 0.14740705490112305 s.\n"
- ]
- },
- {
- "data": {
- "text/plain": [
- "342"
- ]
- },
- "execution_count": 72,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "rel_statement = \"\"\"\n",
- " MATCH (source:__Entity__ {name:replace(value.source,'\"','')})\n",
- " MATCH (target:__Entity__ {name:replace(value.target,'\"','')})\n",
- " // not necessary to merge on id as there is only one relationship per pair\n",
- " MERGE (source)-[rel:RELATED {id: value.id}]->(target)\n",
- " SET rel += value {.rank, .weight, .human_readable_id, .description, .text_unit_ids}\n",
- " RETURN count(*) as createdRels\n",
- "\"\"\"\n",
- "\n",
- "batched_import(rel_statement, rel_df)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "e6365dd7",
- "metadata": {},
- "source": [
- "### Importing Communities\n",
- "\n",
- "For communities we import their id, title, level.\n",
- "We connect the `__Community__` nodes to the start and end nodes of the relationships they refer to.\n",
- "\n",
- "Connecting them to the chunks they originate from is optional, as the entities are already connected to the chunks."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 73,
- "id": "c2fab66c",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- "  <thead>\n",
- "    <tr style=\"text-align: right;\">\n",
- "      <th></th>\n",
- "      <th>id</th>\n",
- "      <th>level</th>\n",
- "      <th>title</th>\n",
- "      <th>text_unit_ids</th>\n",
- "      <th>relationship_ids</th>\n",
- "    </tr>\n",
- "  </thead>\n",
- "  <tbody>\n",
- "    <tr>\n",
- "      <th>0</th>\n",
- "      <td>2</td>\n",
- "      <td>0</td>\n",
- "      <td>Community 2</td>\n",
- "      <td>[0546d296a4d3bb0486bd0c94c01dc9be,0d6bc6e701a0...</td>\n",
- "      <td>[ba481175ee1d4329bf07757a30abd3a1, 8d8da35190b...</td>\n",
- "    </tr>\n",
- "    <tr>\n",
- "      <th>1</th>\n",
- "      <td>4</td>\n",
- "      <td>0</td>\n",
- "      <td>Community 4</td>\n",
- "      <td>[054bdcba0a3690b43609d9226a47f84d,3a450ed2b7fb...</td>\n",
- "      <td>[929f30875e1744b49e7b416eaf5a790c, 4920fda0318...</td>\n",
- "    </tr>\n",
- "  </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " id level title text_unit_ids \\\n",
- "0 2 0 Community 2 [0546d296a4d3bb0486bd0c94c01dc9be,0d6bc6e701a0... \n",
- "1 4 0 Community 4 [054bdcba0a3690b43609d9226a47f84d,3a450ed2b7fb... \n",
- "\n",
- " relationship_ids \n",
- "0 [ba481175ee1d4329bf07757a30abd3a1, 8d8da35190b... \n",
- "1 [929f30875e1744b49e7b416eaf5a790c, 4920fda0318... "
- ]
- },
- "execution_count": 73,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "community_df = pd.read_parquet(\n",
- " f\"{GRAPHRAG_FOLDER}/communities.parquet\",\n",
- " columns=[\"id\", \"level\", \"title\", \"text_unit_ids\", \"relationship_ids\"],\n",
- ")\n",
- "\n",
- "community_df.head(2)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 74,
- "id": "1351f7e3",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "{'_contains_updates': True, 'properties_set': 94}\n",
- "47 rows in 0.07877922058105469 s.\n"
- ]
- },
- {
- "data": {
- "text/plain": [
- "47"
- ]
- },
- "execution_count": 74,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "statement = \"\"\"\n",
- "MERGE (c:__Community__ {community:value.id})\n",
- "SET c += value {.level, .title}\n",
- "/*\n",
- "UNWIND value.text_unit_ids as text_unit_id\n",
- "MATCH (t:__Chunk__ {id:text_unit_id})\n",
- "MERGE (c)-[:HAS_CHUNK]->(t)\n",
- "WITH distinct c, value\n",
- "*/\n",
- "WITH *\n",
- "UNWIND value.relationship_ids as rel_id\n",
- "MATCH (start:__Entity__)-[:RELATED {id:rel_id}]->(end:__Entity__)\n",
- "MERGE (start)-[:IN_COMMUNITY]->(c)\n",
- "MERGE (end)-[:IN_COMMUNITY]->(c)\n",
- "RETURN count(distinct c) as createdCommunities\n",
- "\"\"\"\n",
- "\n",
- "batched_import(statement, community_df)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "dd9adf50",
- "metadata": {},
- "source": [
- "### Importing Community Reports\n",
- "\n",
- "For the community reports, we create a node for each community, set the id, community, level, title, summary, rank, and rank_explanation, and connect them to the entities they are about.\n",
- "For the findings, we create the findings in the context of the communities."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 75,
- "id": "1be9e7a9-69ee-406b-bce5-95a9c41ecffe",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- "  <thead>\n",
- "    <tr style=\"text-align: right;\">\n",
- "      <th></th>\n",
- "      <th>id</th>\n",
- "      <th>community</th>\n",
- "      <th>level</th>\n",
- "      <th>title</th>\n",
- "      <th>summary</th>\n",
- "      <th>findings</th>\n",
- "      <th>rank</th>\n",
- "      <th>rank_explanation</th>\n",
- "      <th>full_content</th>\n",
- "    </tr>\n",
- "  </thead>\n",
- "  <tbody>\n",
- "    <tr>\n",
- "      <th>0</th>\n",
- "      <td>e7822326-4da8-4954-afa9-be7f4f5791a5</td>\n",
- "      <td>42</td>\n",
- "      <td>2</td>\n",
- "      <td>Scrooge's Supernatural Encounters: Marley's Gh...</td>\n",
- "      <td>This report delves into the pivotal supernatur...</td>\n",
- "      <td>[{'explanation': 'Marley's Ghost plays a cruci...</td>\n",
- "      <td>8.0</td>\n",
- "      <td>The impact severity rating is high due to the ...</td>\n",
- "      <td># Scrooge's Supernatural Encounters: Marley's ...</td>\n",
- "    </tr>\n",
- "    <tr>\n",
- "      <th>1</th>\n",
- "      <td>8a5afac1-99ef-4f01-a1b1-f044ce392ff9</td>\n",
- "      <td>43</td>\n",
- "      <td>2</td>\n",
- "      <td>The Ghost's Influence on Scrooge's Transformation</td>\n",
- "      <td>This report delves into the pivotal role of 'T...</td>\n",
- "      <td>[{'explanation': 'The Ghost, identified at tim...</td>\n",
- "      <td>8.5</td>\n",
- "      <td>The impact severity rating is high due to the ...</td>\n",
- "      <td># The Ghost's Influence on Scrooge's Transform...</td>\n",
- "    </tr>\n",
- "  </tbody>\n",
- "</table>\n",
- "</div>"
- ],
- "text/plain": [
- " id community level \\\n",
- "0 e7822326-4da8-4954-afa9-be7f4f5791a5 42 2 \n",
- "1 8a5afac1-99ef-4f01-a1b1-f044ce392ff9 43 2 \n",
- "\n",
- " title \\\n",
- "0 Scrooge's Supernatural Encounters: Marley's Gh... \n",
- "1 The Ghost's Influence on Scrooge's Transformation \n",
- "\n",
- " summary \\\n",
- "0 This report delves into the pivotal supernatur... \n",
- "1 This report delves into the pivotal role of 'T... \n",
- "\n",
- " findings rank \\\n",
- "0 [{'explanation': 'Marley's Ghost plays a cruci... 8.0 \n",
- "1 [{'explanation': 'The Ghost, identified at tim... 8.5 \n",
- "\n",
- " rank_explanation \\\n",
- "0 The impact severity rating is high due to the ... \n",
- "1 The impact severity rating is high due to the ... \n",
- "\n",
- " full_content \n",
- "0 # Scrooge's Supernatural Encounters: Marley's ... \n",
- "1 # The Ghost's Influence on Scrooge's Transform... "
- ]
- },
- "execution_count": 75,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "community_report_df = pd.read_parquet(\n",
- " f\"{GRAPHRAG_FOLDER}/community_reports.parquet\",\n",
- " columns=[\n",
- " \"id\",\n",
- " \"community\",\n",
- " \"level\",\n",
- " \"title\",\n",
- " \"summary\",\n",
- " \"findings\",\n",
- " \"rank\",\n",
- " \"rank_explanation\",\n",
- " \"full_content\",\n",
- " ],\n",
- ")\n",
- "community_report_df.head(2)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 76,
- "id": "5c6ed591-f98c-4403-9fde-8d4cb4c01cca",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "{'_contains_updates': True, 'properties_set': 729}\n",
- "47 rows in 0.02472519874572754 s.\n"
- ]
- },
- {
- "data": {
- "text/plain": [
- "47"
- ]
- },
- "execution_count": 76,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Import communities\n",
- "community_statement = \"\"\"\n",
- "MERGE (c:__Community__ {community:value.community})\n",
- "SET c += value {.level, .title, .rank, .rank_explanation, .full_content, .summary}\n",
- "WITH c, value\n",
- "UNWIND range(0, size(value.findings)-1) AS finding_idx\n",
- "WITH c, value, finding_idx, value.findings[finding_idx] as finding\n",
- "MERGE (c)-[:HAS_FINDING]->(f:Finding {id:finding_idx})\n",
- "SET f += finding\n",
- "\"\"\"\n",
- "batched_import(community_statement, community_report_df)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "50a1a24a",
- "metadata": {},
- "source": [
- "### Importing Covariates\n",
- "\n",
- "Covariates are, for instance, claims on entities; we connect them to the chunks they originate from."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "523bed92-d12c-4fc4-aa44-6c62321b36bc",
- "metadata": {},
- "outputs": [],
- "source": [
- "cov_df = pd.read_parquet(f\"{GRAPHRAG_FOLDER}/covariates.parquet\")\n",
- "# columns=[\"id\", \"text_unit_id\"] could be passed above to read fewer columns\n",
- "cov_df.head(2)\n",
- "# Subject ids do not match entity ids"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "3e064234-5fce-448e-8bb4-ab2f35699049",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "{'_contains_updates': True, 'labels_added': 89, 'relationships_created': 89, 'nodes_created': 89, 'properties_set': 1061}\n",
- "89 rows in 0.13370895385742188 s.\n"
- ]
- },
- {
- "data": {
- "text/plain": [
- "89"
- ]
- },
- "execution_count": 16,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Import covariates\n",
- "cov_statement = \"\"\"\n",
- "MERGE (c:__Covariate__ {id:value.id})\n",
- "SET c += apoc.map.clean(value, [\"text_unit_id\", \"document_ids\", \"n_tokens\"], [NULL, \"\"])\n",
- "WITH c, value\n",
- "MATCH (ch:__Chunk__ {id: value.text_unit_id})\n",
- "MERGE (ch)-[:HAS_COVARIATE]->(c)\n",
- "\"\"\"\n",
- "batched_import(cov_statement, cov_df)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "00340bae",
- "metadata": {},
- "source": [
- "### Visualize your data\n",
- "\n",
- "You can now open Neo4j on Aura; you will need to log in with either SSO or your credentials.\n",
- "\n",
- "Alternatively, open https://workspace-preview.neo4j.io and connect to your local instance. Remember that the URI is `neo4j://localhost`, with `neo4j` as the username and `password` as the password.\n",
- "\n",
- "In \"Explore\", you can start from visual graph patterns and then expand further.\n",
- "\n",
- "In \"Query\", you can open the left sidebar and explore by clicking on the nodes and relationships.\n",
- "You can also use the co-pilot to generate Cypher queries for you. Here are some examples.\n",
- "\n",
- "#### Show a few `__Entity__` nodes and their relationships (Entity Graph)\n",
- "\n",
- "```cypher\n",
- "MATCH path = (:__Entity__)-[:RELATED]->(:__Entity__)\n",
- "RETURN path LIMIT 200\n",
- "```\n",
- "\n",
- "#### Show the Chunks and the Document (Lexical Graph)\n",
- "\n",
- "```cypher\n",
- "MATCH (d:__Document__) WITH d LIMIT 1\n",
- "MATCH path = (d)<-[:PART_OF]-(c:__Chunk__)\n",
- "RETURN path LIMIT 100\n",
- "```\n",
- "\n",
- "#### Show a Community and its Entities\n",
- "\n",
- "```cypher\n",
- "MATCH (c:__Community__) WITH c LIMIT 1\n",
- "MATCH path = (c)<-[:IN_COMMUNITY]-()-[:RELATED]-(:__Entity__)\n",
- "RETURN path LIMIT 100\n",
- "```\n",
- "\n",
- "#### Show everything\n",
- "\n",
- "```cypher\n",
- "MATCH (d:__Document__) WITH d LIMIT 1\n",
- "MATCH path = (d)<-[:PART_OF]-(:__Chunk__)-[:HAS_ENTITY]->()-[:RELATED]-()-[:IN_COMMUNITY]->()\n",
- "RETURN path LIMIT 250\n",
- "```\n",
- "\n",
- "We showed the visualization of this last query at the beginning."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "a0aa8529",
- "metadata": {},
- "source": [
- "If you have questions, feel free to reach out in the GraphRAG discord server: \n",
- "https://discord.gg/graphrag"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3 (ipykernel)",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.11.8"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
diff --git a/examples_notebooks/community_contrib/yfiles-jupyter-graphs/graph-visualization.ipynb b/examples_notebooks/community_contrib/yfiles-jupyter-graphs/graph-visualization.ipynb
deleted file mode 100644
index fb53287b..00000000
--- a/examples_notebooks/community_contrib/yfiles-jupyter-graphs/graph-visualization.ipynb
+++ /dev/null
@@ -1,523 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Visualizing the knowledge graph with `yfiles-jupyter-graphs`\n",
- "\n",
- "This notebook is a partial copy of [local_search.ipynb](../../local_search.ipynb) that shows how to use `yfiles-jupyter-graphs` to add interactive graph visualizations of the parquet files and how to visualize the result context of `graphrag` queries (see at the end of this notebook)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Copyright (c) 2024 Microsoft Corporation.\n",
- "# Licensed under the MIT License."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import os\n",
- "\n",
- "import pandas as pd\n",
- "import tiktoken\n",
- "from graphrag.query.llm.oai.chat_openai import ChatOpenAI\n",
- "from graphrag.query.llm.oai.embedding import OpenAIEmbedding\n",
- "from graphrag.query.llm.oai.typing import OpenaiApiType\n",
- "\n",
- "from graphrag.query.context_builder.entity_extraction import EntityVectorStoreKey\n",
- "from graphrag.query.indexer_adapters import (\n",
- " read_indexer_covariates,\n",
- " read_indexer_entities,\n",
- " read_indexer_relationships,\n",
- " read_indexer_reports,\n",
- " read_indexer_text_units,\n",
- ")\n",
- "from graphrag.query.structured_search.local_search.mixed_context import (\n",
- " LocalSearchMixedContext,\n",
- ")\n",
- "from graphrag.query.structured_search.local_search.search import LocalSearch\n",
- "from graphrag.vector_stores.lancedb import LanceDBVectorStore"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Local Search Example\n",
- "\n",
- "Local search method generates answers by combining relevant data from the AI-extracted knowledge-graph with text chunks of the raw documents. This method is suitable for questions that require an understanding of specific entities mentioned in the documents (e.g. What are the healing properties of chamomile?)."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Load text units and graph data tables as context for local search\n",
- "\n",
- "- In this test we first load indexing outputs from parquet files to dataframes, then convert these dataframes into collections of data objects aligning with the knowledge model."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Load tables to dataframes"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "INPUT_DIR = \"../../inputs/operation dulce\"\n",
- "LANCEDB_URI = f\"{INPUT_DIR}/lancedb\"\n",
- "\n",
- "COMMUNITY_REPORT_TABLE = \"community_reports\"\n",
- "COMMUNITY_TABLE = \"communities\"\n",
- "ENTITY_TABLE = \"entities\"\n",
- "RELATIONSHIP_TABLE = \"relationships\"\n",
- "COVARIATE_TABLE = \"covariates\"\n",
- "TEXT_UNIT_TABLE = \"text_units\"\n",
- "COMMUNITY_LEVEL = 2"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### Read entities"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# read nodes table to get community and degree data\n",
- "entity_df = pd.read_parquet(f\"{INPUT_DIR}/{ENTITY_TABLE}.parquet\")\n",
- "community_df = pd.read_parquet(f\"{INPUT_DIR}/{COMMUNITY_TABLE}.parquet\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### Read relationships"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "relationship_df = pd.read_parquet(f\"{INPUT_DIR}/{RELATIONSHIP_TABLE}.parquet\")\n",
- "relationships = read_indexer_relationships(relationship_df)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Visualizing nodes and relationships with `yfiles-jupyter-graphs`\n",
- "\n",
- "`yfiles-jupyter-graphs` is a graph visualization extension that provides interactive and customizable visualizations for structured node and relationship data.\n",
- "\n",
- "In this case, we use it to provide an interactive visualization for the knowledge graph of the [local_search.ipynb](../../local_search.ipynb) sample by passing node and relationship lists converted from the given parquet files. The input data requires an `id` attribute for the nodes and `start`/`end` properties for the relationships that correspond to the node ids. Additional attributes can be added in the `properties` of each node/relationship dict:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "%pip install yfiles_jupyter_graphs --quiet\n",
- "from yfiles_jupyter_graphs import GraphWidget\n",
- "\n",
- "\n",
- "# converts the entities dataframe to a list of dicts for yfiles-jupyter-graphs\n",
- "def convert_entities_to_dicts(df):\n",
- "    \"\"\"Convert the entities dataframe to a list of dicts for yfiles-jupyter-graphs.\"\"\"\n",
- "    nodes_dict = {}\n",
- "    for _, row in df.iterrows():\n",
- "        # Create a dictionary for each row and collect unique nodes\n",
- "        node_id = row[\"title\"]\n",
- "        if node_id not in nodes_dict:\n",
- "            nodes_dict[node_id] = {\n",
- "                \"id\": node_id,\n",
- "                \"properties\": row.to_dict(),\n",
- "            }\n",
- "    return list(nodes_dict.values())\n",
- "\n",
- "\n",
- "# converts the relationships dataframe to a list of dicts for yfiles-jupyter-graphs\n",
- "def convert_relationships_to_dicts(df):\n",
- "    \"\"\"Convert the relationships dataframe to a list of dicts for yfiles-jupyter-graphs.\"\"\"\n",
- "    relationships = []\n",
- "    for _, row in df.iterrows():\n",
- "        # Create a dictionary for each row\n",
- "        relationships.append({\n",
- "            \"start\": row[\"source\"],\n",
- "            \"end\": row[\"target\"],\n",
- "            \"properties\": row.to_dict(),\n",
- "        })\n",
- "    return relationships\n",
- "\n",
- "\n",
- "w = GraphWidget()\n",
- "w.directed = True\n",
- "w.nodes = convert_entities_to_dicts(entity_df)\n",
- "w.edges = convert_relationships_to_dicts(relationship_df)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Configure data-driven visualization\n",
- "\n",
- "The additional properties can be used to configure the visualization for different use cases."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# show title on the node\n",
- "w.node_label_mapping = \"title\"\n",
- "\n",
- "\n",
- "# map community to a color\n",
- "def community_to_color(community):\n",
- "    \"\"\"Map a community to a color.\"\"\"\n",
- "    colors = [\n",
- "        \"crimson\",\n",
- "        \"darkorange\",\n",
- "        \"indigo\",\n",
- "        \"cornflowerblue\",\n",
- "        \"cyan\",\n",
- "        \"teal\",\n",
- "        \"green\",\n",
- "    ]\n",
- "    return (\n",
- "        colors[int(community) % len(colors)] if community is not None else \"lightgray\"\n",
- "    )\n",
- "\n",
- "\n",
- "def edge_to_source_community(edge):\n",
- "    \"\"\"Get the community of the source node of an edge.\"\"\"\n",
- "    source_node = next(\n",
- "        (entry for entry in w.nodes if entry[\"properties\"][\"title\"] == edge[\"start\"]),\n",
- "        None,\n",
- "    )\n",
- "    # An edge may reference a node that is not in the widget; treat that as no community\n",
- "    return source_node[\"properties\"][\"community\"] if source_node is not None else None\n",
- "\n",
- "\n",
- "w.node_color_mapping = lambda node: community_to_color(node[\"properties\"][\"community\"])\n",
- "w.edge_color_mapping = lambda edge: community_to_color(edge_to_source_community(edge))\n",
- "# map size data to a reasonable factor\n",
- "w.node_scale_factor_mapping = lambda node: 0.5 + node[\"properties\"][\"size\"] * 1.5 / 20\n",
- "# use weight for edge thickness\n",
- "w.edge_thickness_factor_mapping = \"weight\""
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Automatic layouts\n",
- "\n",
- "The widget provides different automatic layouts that serve different purposes: `Circular`, `Hierarchic`, `Organic (interactive or static)`, `Orthogonal`, `Radial`, `Tree`, `Geo-spatial`.\n",
- "\n",
- "For the knowledge graph, this sample uses the `Circular` layout, though `Hierarchic` or `Organic` are also suitable choices."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Use the circular layout for this visualization. For larger graphs, the default organic layout is often preferable.\n",
- "w.circular_layout()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Display the graph"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "display(w)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Visualizing the result context of `graphrag` queries\n",
- "\n",
- "The result context of `graphrag` queries lets you inspect the context graph of the request. This data can similarly be visualized as a graph with `yfiles-jupyter-graphs`.\n",
- "\n",
- "## Making the request\n",
- "\n",
- "The following cell recreates the sample queries from [local_search.ipynb](../../local_search.ipynb)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# setup (see also ../../local_search.ipynb)\n",
- "entities = read_indexer_entities(entity_df, community_df, COMMUNITY_LEVEL)\n",
- "\n",
- "description_embedding_store = LanceDBVectorStore(\n",
- " collection_name=\"default-entity-description\",\n",
- ")\n",
- "description_embedding_store.connect(db_uri=LANCEDB_URI)\n",
- "covariate_df = pd.read_parquet(f\"{INPUT_DIR}/{COVARIATE_TABLE}.parquet\")\n",
- "claims = read_indexer_covariates(covariate_df)\n",
- "covariates = {\"claims\": claims}\n",
- "report_df = pd.read_parquet(f\"{INPUT_DIR}/{COMMUNITY_REPORT_TABLE}.parquet\")\n",
- "reports = read_indexer_reports(report_df, community_df, COMMUNITY_LEVEL)\n",
- "text_unit_df = pd.read_parquet(f\"{INPUT_DIR}/{TEXT_UNIT_TABLE}.parquet\")\n",
- "text_units = read_indexer_text_units(text_unit_df)\n",
- "\n",
- "api_key = os.environ[\"GRAPHRAG_API_KEY\"]\n",
- "llm_model = os.environ[\"GRAPHRAG_LLM_MODEL\"]\n",
- "embedding_model = os.environ[\"GRAPHRAG_EMBEDDING_MODEL\"]\n",
- "\n",
- "llm = ChatOpenAI(\n",
- " api_key=api_key,\n",
- " model=llm_model,\n",
- " api_type=OpenaiApiType.OpenAI, # OpenaiApiType.OpenAI or OpenaiApiType.AzureOpenAI\n",
- " max_retries=20,\n",
- ")\n",
- "\n",
- "token_encoder = tiktoken.get_encoding(\"cl100k_base\")\n",
- "\n",
- "text_embedder = OpenAIEmbedding(\n",
- " api_key=api_key,\n",
- " api_base=None,\n",
- " api_type=OpenaiApiType.OpenAI,\n",
- " model=embedding_model,\n",
- " deployment_name=embedding_model,\n",
- " max_retries=20,\n",
- ")\n",
- "\n",
- "context_builder = LocalSearchMixedContext(\n",
- " community_reports=reports,\n",
- " text_units=text_units,\n",
- " entities=entities,\n",
- " relationships=relationships,\n",
- " covariates=covariates,\n",
- " entity_text_embeddings=description_embedding_store,\n",
- " embedding_vectorstore_key=EntityVectorStoreKey.ID, # if the vectorstore uses entity title as ids, set this to EntityVectorStoreKey.TITLE\n",
- " text_embedder=text_embedder,\n",
- " token_encoder=token_encoder,\n",
- ")\n",
- "\n",
- "local_context_params = {\n",
- " \"text_unit_prop\": 0.5,\n",
- " \"community_prop\": 0.1,\n",
- " \"conversation_history_max_turns\": 5,\n",
- " \"conversation_history_user_turns_only\": True,\n",
- " \"top_k_mapped_entities\": 10,\n",
- " \"top_k_relationships\": 10,\n",
- " \"include_entity_rank\": True,\n",
- " \"include_relationship_weight\": True,\n",
- " \"include_community_rank\": False,\n",
- " \"return_candidate_context\": False,\n",
- " \"embedding_vectorstore_key\": EntityVectorStoreKey.ID, # set this to EntityVectorStoreKey.TITLE if the vectorstore uses entity title as ids\n",
- " \"max_tokens\": 12_000, # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 5000)\n",
- "}\n",
- "\n",
- "llm_params = {\n",
- "    \"max_tokens\": 2_000,  # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 1000-1500)\n",
- " \"temperature\": 0.0,\n",
- "}\n",
- "\n",
- "search_engine = LocalSearch(\n",
- " llm=llm,\n",
- " context_builder=context_builder,\n",
- " token_encoder=token_encoder,\n",
- " llm_params=llm_params,\n",
- " context_builder_params=local_context_params,\n",
- " response_type=\"multiple paragraphs\", # free form text describing the response type and format, can be anything, e.g. prioritized list, single paragraph, multiple paragraphs, multiple-page report\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Run local search on sample queries"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "result = await search_engine.search(\"Tell me about Agent Mercer\")\n",
- "print(result.response)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "question = \"Tell me about Dr. Jordan Hayes\"\n",
- "result = await search_engine.search(question)\n",
- "print(result.response)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Inspecting the context data used to generate the response"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "result.context_data[\"entities\"].head()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "result.context_data[\"relationships\"].head()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Visualizing the result context as graph"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "\"\"\"\n",
- "Helper function to visualize the result context with `yfiles-jupyter-graphs`.\n",
- "\n",
- "The dataframes are converted into supported nodes and relationships lists and then passed to yfiles-jupyter-graphs.\n",
- "Additionally, some values are mapped to visualization properties.\n",
- "\"\"\"\n",
- "\n",
- "\n",
- "def show_graph(result):\n",
- " \"\"\"Visualize the result context with yfiles-jupyter-graphs.\"\"\"\n",
- " from yfiles_jupyter_graphs import GraphWidget\n",
- "\n",
- " if (\n",
- " \"entities\" not in result.context_data\n",
- " or \"relationships\" not in result.context_data\n",
- " ):\n",
- " msg = \"The passed results do not contain 'entities' or 'relationships'\"\n",
- " raise ValueError(msg)\n",
- "\n",
- " # converts the entities dataframe to a list of dicts for yfiles-jupyter-graphs\n",
- " def convert_entities_to_dicts(df):\n",
- " \"\"\"Convert the entities dataframe to a list of dicts for yfiles-jupyter-graphs.\"\"\"\n",
- " nodes_dict = {}\n",
- " for _, row in df.iterrows():\n",
- " # Create a dictionary for each row and collect unique nodes\n",
- " node_id = row[\"entity\"]\n",
- " if node_id not in nodes_dict:\n",
- " nodes_dict[node_id] = {\n",
- " \"id\": node_id,\n",
- " \"properties\": row.to_dict(),\n",
- " }\n",
- " return list(nodes_dict.values())\n",
- "\n",
- " # converts the relationships dataframe to a list of dicts for yfiles-jupyter-graphs\n",
- " def convert_relationships_to_dicts(df):\n",
- " \"\"\"Convert the relationships dataframe to a list of dicts for yfiles-jupyter-graphs.\"\"\"\n",
- " relationships = []\n",
- " for _, row in df.iterrows():\n",
- " # Create a dictionary for each row\n",
- " relationships.append({\n",
- " \"start\": row[\"source\"],\n",
- " \"end\": row[\"target\"],\n",
- " \"properties\": row.to_dict(),\n",
- " })\n",
- " return relationships\n",
- "\n",
- " w = GraphWidget()\n",
- " # use the converted data to visualize the graph\n",
- " w.nodes = convert_entities_to_dicts(result.context_data[\"entities\"])\n",
- " w.edges = convert_relationships_to_dicts(result.context_data[\"relationships\"])\n",
- " w.directed = True\n",
- " # show title on the node\n",
- " w.node_label_mapping = \"entity\"\n",
- " # use weight for edge thickness\n",
- " w.edge_thickness_factor_mapping = \"weight\"\n",
- " display(w)\n",
- "\n",
- "\n",
- "show_graph(result)"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3 (ipykernel)",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.10.0"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
diff --git a/examples_notebooks/inputs/operation dulce/lancedb/entity_description_embeddings.lance/_latest.manifest b/examples_notebooks/inputs/operation dulce/lancedb/entity_description_embeddings.lance/_latest.manifest
deleted file mode 100644
index b41640ff..00000000
Binary files a/examples_notebooks/inputs/operation dulce/lancedb/entity_description_embeddings.lance/_latest.manifest and /dev/null differ
diff --git a/examples_notebooks/inputs/operation dulce/lancedb/entity_description_embeddings.lance/_transactions/0-498c6e24-dd0a-42b9-8f7e-5e3d2ab258b0.txn b/examples_notebooks/inputs/operation dulce/lancedb/entity_description_embeddings.lance/_transactions/0-498c6e24-dd0a-42b9-8f7e-5e3d2ab258b0.txn
deleted file mode 100644
index deb4cffe..00000000
Binary files a/examples_notebooks/inputs/operation dulce/lancedb/entity_description_embeddings.lance/_transactions/0-498c6e24-dd0a-42b9-8f7e-5e3d2ab258b0.txn and /dev/null differ
diff --git a/examples_notebooks/inputs/operation dulce/lancedb/entity_description_embeddings.lance/_transactions/1-bf5aa024-a229-461f-8d78-699841a302fe.txn b/examples_notebooks/inputs/operation dulce/lancedb/entity_description_embeddings.lance/_transactions/1-bf5aa024-a229-461f-8d78-699841a302fe.txn
deleted file mode 100644
index ba0b9ee5..00000000
Binary files a/examples_notebooks/inputs/operation dulce/lancedb/entity_description_embeddings.lance/_transactions/1-bf5aa024-a229-461f-8d78-699841a302fe.txn and /dev/null differ
diff --git a/examples_notebooks/inputs/operation dulce/lancedb/entity_description_embeddings.lance/_versions/1.manifest b/examples_notebooks/inputs/operation dulce/lancedb/entity_description_embeddings.lance/_versions/1.manifest
deleted file mode 100644
index 6566b33f..00000000
Binary files a/examples_notebooks/inputs/operation dulce/lancedb/entity_description_embeddings.lance/_versions/1.manifest and /dev/null differ
diff --git a/examples_notebooks/inputs/operation dulce/lancedb/entity_description_embeddings.lance/_versions/2.manifest b/examples_notebooks/inputs/operation dulce/lancedb/entity_description_embeddings.lance/_versions/2.manifest
deleted file mode 100644
index b41640ff..00000000
Binary files a/examples_notebooks/inputs/operation dulce/lancedb/entity_description_embeddings.lance/_versions/2.manifest and /dev/null differ
diff --git a/examples_notebooks/inputs/operation dulce/lancedb/entity_description_embeddings.lance/data/fe64774f-5412-4c9c-8dea-f6ed55c81119.lance b/examples_notebooks/inputs/operation dulce/lancedb/entity_description_embeddings.lance/data/fe64774f-5412-4c9c-8dea-f6ed55c81119.lance
deleted file mode 100644
index a324ab99..00000000
Binary files a/examples_notebooks/inputs/operation dulce/lancedb/entity_description_embeddings.lance/data/fe64774f-5412-4c9c-8dea-f6ed55c81119.lance and /dev/null differ
diff --git a/graphrag/config/defaults.py b/graphrag/config/defaults.py
index e81c5ac3..6f36169a 100644
--- a/graphrag/config/defaults.py
+++ b/graphrag/config/defaults.py
@@ -125,6 +125,7 @@ class CommunityReportDefaults:
max_length: int = 2000
max_input_length: int = 8000
model_id: str = DEFAULT_CHAT_MODEL_ID
+ model_instance_name: str = "community_reporting"
@dataclass
@@ -161,6 +162,7 @@ class EmbedTextDefaults:
"""Default values for embedding text."""
model_id: str = DEFAULT_EMBEDDING_MODEL_ID
+ model_instance_name: str = "text_embedding"
batch_size: int = 16
batch_max_tokens: int = 8191
names: list[str] = field(default_factory=lambda: default_embeddings)
@@ -179,6 +181,7 @@ class ExtractClaimsDefaults:
max_gleanings: int = 1
strategy: None = None
model_id: str = DEFAULT_CHAT_MODEL_ID
+ model_instance_name: str = "extract_claims"
@dataclass
@@ -192,6 +195,7 @@ class ExtractGraphDefaults:
max_gleanings: int = 1
strategy: None = None
model_id: str = DEFAULT_CHAT_MODEL_ID
+ model_instance_name: str = "extract_graph"
@dataclass
@@ -382,6 +386,7 @@ class SummarizeDescriptionsDefaults:
max_input_tokens: int = 4_000
strategy: None = None
model_id: str = DEFAULT_CHAT_MODEL_ID
+ model_instance_name: str = "summarize_descriptions"
@dataclass
diff --git a/graphrag/config/models/community_reports_config.py b/graphrag/config/models/community_reports_config.py
index 5369dea2..1257124b 100644
--- a/graphrag/config/models/community_reports_config.py
+++ b/graphrag/config/models/community_reports_config.py
@@ -30,6 +30,10 @@ class CommunityReportsConfig(BaseModel):
description="The model ID to use for community reports.",
default=graphrag_config_defaults.community_reports.model_id,
)
+ model_instance_name: str = Field(
+ description="The model singleton instance name. This primarily affects the cache storage partitioning.",
+ default=graphrag_config_defaults.community_reports.model_instance_name,
+ )
graph_prompt: str | None = Field(
description="The community report extraction prompt to use for graph-based summarization.",
default=graphrag_config_defaults.community_reports.graph_prompt,
diff --git a/graphrag/config/models/embed_text_config.py b/graphrag/config/models/embed_text_config.py
index 5e596381..c33409d2 100644
--- a/graphrag/config/models/embed_text_config.py
+++ b/graphrag/config/models/embed_text_config.py
@@ -15,6 +15,10 @@ class EmbedTextConfig(BaseModel):
description="The model ID to use for text embeddings.",
default=graphrag_config_defaults.embed_text.model_id,
)
+ model_instance_name: str = Field(
+ description="The model singleton instance name. This primarily affects the cache storage partitioning.",
+ default=graphrag_config_defaults.embed_text.model_instance_name,
+ )
batch_size: int = Field(
description="The batch size to use.",
default=graphrag_config_defaults.embed_text.batch_size,
diff --git a/graphrag/config/models/extract_claims_config.py b/graphrag/config/models/extract_claims_config.py
index 78fe9267..77a633b0 100644
--- a/graphrag/config/models/extract_claims_config.py
+++ b/graphrag/config/models/extract_claims_config.py
@@ -30,6 +30,10 @@ class ExtractClaimsConfig(BaseModel):
description="The model ID to use for claim extraction.",
default=graphrag_config_defaults.extract_claims.model_id,
)
+ model_instance_name: str = Field(
+ description="The model singleton instance name. This primarily affects the cache storage partitioning.",
+ default=graphrag_config_defaults.extract_claims.model_instance_name,
+ )
prompt: str | None = Field(
description="The claim extraction prompt to use.",
default=graphrag_config_defaults.extract_claims.prompt,
diff --git a/graphrag/config/models/extract_graph_config.py b/graphrag/config/models/extract_graph_config.py
index b8dfce3e..8a61585e 100644
--- a/graphrag/config/models/extract_graph_config.py
+++ b/graphrag/config/models/extract_graph_config.py
@@ -26,6 +26,10 @@ class ExtractGraphConfig(BaseModel):
description="The model ID to use for text embeddings.",
default=graphrag_config_defaults.extract_graph.model_id,
)
+ model_instance_name: str = Field(
+ description="The model singleton instance name. This primarily affects the cache storage partitioning.",
+ default=graphrag_config_defaults.extract_graph.model_instance_name,
+ )
prompt: str | None = Field(
description="The entity extraction prompt to use.",
default=graphrag_config_defaults.extract_graph.prompt,
diff --git a/graphrag/config/models/summarize_descriptions_config.py b/graphrag/config/models/summarize_descriptions_config.py
index 3ab1fdae..3414db71 100644
--- a/graphrag/config/models/summarize_descriptions_config.py
+++ b/graphrag/config/models/summarize_descriptions_config.py
@@ -26,6 +26,10 @@ class SummarizeDescriptionsConfig(BaseModel):
description="The model ID to use for summarization.",
default=graphrag_config_defaults.summarize_descriptions.model_id,
)
+ model_instance_name: str = Field(
+ description="The model singleton instance name. This primarily affects the cache storage partitioning.",
+ default=graphrag_config_defaults.summarize_descriptions.model_instance_name,
+ )
prompt: str | None = Field(
description="The description summarization prompt to use.",
default=graphrag_config_defaults.summarize_descriptions.prompt,
diff --git a/graphrag/index/workflows/create_community_reports.py b/graphrag/index/workflows/create_community_reports.py
index e9f19533..0415cb3b 100644
--- a/graphrag/index/workflows/create_community_reports.py
+++ b/graphrag/index/workflows/create_community_reports.py
@@ -58,7 +58,7 @@ async def run_workflow(
prompts = config.community_reports.resolved_prompts(config.root_dir)
model = ModelManager().get_or_create_chat_model(
- name="community_reporting",
+ name=config.community_reports.model_instance_name,
model_type=model_config.type,
config=model_config,
callbacks=context.callbacks,
diff --git a/graphrag/index/workflows/create_community_reports_text.py b/graphrag/index/workflows/create_community_reports_text.py
index 80b23689..94d79ca5 100644
--- a/graphrag/index/workflows/create_community_reports_text.py
+++ b/graphrag/index/workflows/create_community_reports_text.py
@@ -47,7 +47,7 @@ async def run_workflow(
model_config = config.get_language_model_config(config.community_reports.model_id)
model = ModelManager().get_or_create_chat_model(
- name="community_reporting",
+ name=config.community_reports.model_instance_name,
model_type=model_config.type,
config=model_config,
callbacks=context.callbacks,
diff --git a/graphrag/index/workflows/extract_covariates.py b/graphrag/index/workflows/extract_covariates.py
index 76889a56..52da0745 100644
--- a/graphrag/index/workflows/extract_covariates.py
+++ b/graphrag/index/workflows/extract_covariates.py
@@ -38,7 +38,7 @@ async def run_workflow(
model_config = config.get_language_model_config(config.extract_claims.model_id)
model = ModelManager().get_or_create_chat_model(
- name="extract_claims",
+ name=config.extract_claims.model_instance_name,
model_type=model_config.type,
config=model_config,
callbacks=context.callbacks,
diff --git a/graphrag/index/workflows/extract_graph.py b/graphrag/index/workflows/extract_graph.py
index ce466379..1a28a397 100644
--- a/graphrag/index/workflows/extract_graph.py
+++ b/graphrag/index/workflows/extract_graph.py
@@ -38,7 +38,7 @@ async def run_workflow(
)
extraction_prompts = config.extract_graph.resolved_prompts(config.root_dir)
extraction_model = ModelManager().get_or_create_chat_model(
- name="extract_graph",
+ name=config.extract_graph.model_instance_name,
model_type=extraction_model_config.type,
config=extraction_model_config,
cache=context.cache,
@@ -51,7 +51,7 @@ async def run_workflow(
config.root_dir
)
summarization_model = ModelManager().get_or_create_chat_model(
- name="summarize_descriptions",
+ name=config.summarize_descriptions.model_instance_name,
model_type=summarization_model_config.type,
config=summarization_model_config,
cache=context.cache,
diff --git a/graphrag/index/workflows/generate_text_embeddings.py b/graphrag/index/workflows/generate_text_embeddings.py
index 848b57d2..c15ff3e0 100644
--- a/graphrag/index/workflows/generate_text_embeddings.py
+++ b/graphrag/index/workflows/generate_text_embeddings.py
@@ -77,7 +77,7 @@ async def run_workflow(
model_config = config.get_language_model_config(config.embed_text.model_id)
model = ModelManager().get_or_create_embedding_model(
- name="text_embedding",
+ name=config.embed_text.model_instance_name,
model_type=model_config.type,
config=model_config,
callbacks=context.callbacks,
diff --git a/pyproject.toml b/pyproject.toml
index 96f01b5d..48342e06 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -246,7 +246,7 @@ convention = "numpy"
# https://github.com/microsoft/pyright/blob/9f81564a4685ff5c55edd3959f9b39030f590b2f/docs/configuration.md#sample-pyprojecttoml-file
[tool.pyright]
-include = ["graphrag", "tests", "examples_notebooks"]
+include = ["graphrag", "tests"]
exclude = ["**/node_modules", "**/__pycache__"]
[tool.pytest.ini_options]
diff --git a/tests/fixtures/min-csv/config.json b/tests/fixtures/min-csv/config.json
index 7b1b4d61..64f90a3f 100644
--- a/tests/fixtures/min-csv/config.json
+++ b/tests/fixtures/min-csv/config.json
@@ -54,7 +54,7 @@
"period",
"size"
],
- "max_runtime": 300,
+ "max_runtime": 360,
"expected_artifacts": ["community_reports.parquet"]
},
"create_final_text_units": {
diff --git a/tests/fixtures/text/config.json b/tests/fixtures/text/config.json
index 5b5738b1..f7278a23 100644
--- a/tests/fixtures/text/config.json
+++ b/tests/fixtures/text/config.json
@@ -40,7 +40,7 @@
"end_date",
"source_text"
],
- "max_runtime": 300,
+ "max_runtime": 360,
"expected_artifacts": ["covariates.parquet"]
},
"create_communities": {
@@ -67,7 +67,7 @@
"period",
"size"
],
- "max_runtime": 300,
+ "max_runtime": 360,
"expected_artifacts": ["community_reports.parquet"]
},
"create_final_text_units": {