From 397358ad3a9269917693f2e1afb301a94dc3b614 Mon Sep 17 00:00:00 2001 From: AlonsoGuevara Date: Thu, 14 Aug 2025 01:01:05 +0000 Subject: [PATCH] =?UTF-8?q?Deploying=20to=20gh-pages=20from=20@=20microsof?= =?UTF-8?q?t/graphrag@7c28c70d5c9a98074ce31512f215ac52e3ae2426=20?= =?UTF-8?q?=F0=9F=9A=80?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- developing/index.html | 49 ++-- examples_notebooks/api_overview/index.html | 8 +- examples_notebooks/drift_search/index.html | 66 ++--- examples_notebooks/global_search/index.html | 30 +-- .../index.html | 70 +++--- .../index_migration_to_v1/index.html | 25 +- examples_notebooks/local_search/index.html | 228 +++++++++--------- .../multi_index_search/index.html | 32 +-- index/overview/index.html | 3 +- search/search_index.json | 2 +- sitemap.xml.gz | Bin 127 -> 127 bytes 11 files changed, 256 insertions(+), 257 deletions(-) diff --git a/developing/index.html b/developing/index.html index 00472a2b..07ce9f3c 100644 --- a/developing/index.html +++ b/developing/index.html @@ -1545,22 +1545,26 @@ The library is Python-based. -Poetry -Instructions -Poetry is used for package management and virtualenv management in Python codebases +uv +Instructions +uv is used for package management and virtualenv management in Python codebases

Getting Started

Install Dependencies

-
# Install Python dependencies.
-poetry install
+
# (optional) create virtual environment
+uv venv --python 3.10
+source .venv/bin/activate
+
+# install python dependencies
+uv sync --extra dev
 

Execute the Indexing Engine

-
poetry run poe index <...args>
+
uv run poe index <...args>
 

Executing Queries

-
poetry run poe query <...args>
+
uv run poe query <...args>
 

Azurite

Some unit and smoke tests use Azurite to emulate Azure resources. Azurite can be started by running:

@@ -1568,36 +1572,33 @@

or by simply running azurite in the terminal if already installed globally. See the Azurite documentation for more information about how to install and use Azurite.
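If you prefer to launch the emulator from a script or test fixture rather than a separate terminal, a minimal sketch is below. It assumes azurite is installed globally (as noted above); the surrounding test logic is a placeholder.

import subprocess

# Start Azurite in the background to emulate Azure storage for local tests
# (assumes a global azurite install, as described above).
azurite = subprocess.Popen(["azurite"])

# ... run the tests that need the emulator here ...

# Shut the emulator down when finished.
azurite.terminate()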

Lifecycle Scripts

-

Our Python package utilizes Poetry to manage dependencies and poethepoet to manage build scripts.

+

Our Python package utilizes uv to manage dependencies and poethepoet to manage build scripts.

Available scripts are:

    -
  • poetry run poe index - Run the Indexing CLI
  • -
  • poetry run poe query - Run the Query CLI
  • -
  • poetry build - This invokes poetry build, which will build a wheel file and other distributable artifacts.
  • -
  • poetry run poe test - This will execute all tests.
  • -
  • poetry run poe test_unit - This will execute unit tests.
  • -
  • poetry run poe test_integration - This will execute integration tests.
  • -
  • poetry run poe test_smoke - This will execute smoke tests.
  • -
  • poetry run poe test_verbs - This will execute tests of the basic workflows.
  • -
  • poetry run poe check - This will perform a suite of static checks across the package, including:
  • +
  • uv run poe index - Run the Indexing CLI
  • +
  • uv run poe query - Run the Query CLI
  • +
  • uv build - This will build a wheel file and other distributable artifacts.
  • +
  • uv run poe test - This will execute all tests.
  • +
  • uv run poe test_unit - This will execute unit tests.
  • +
  • uv run poe test_integration - This will execute integration tests.
  • +
  • uv run poe test_smoke - This will execute smoke tests.
  • +
  • uv run poe test_verbs - This will execute tests of the basic workflows.
  • +
  • uv run poe check - This will perform a suite of static checks across the package, including:
  • formatting
  • documentation formatting
  • linting
  • security patterns
  • type-checking
  • -
  • poetry run poe fix - This will apply any available auto-fixes to the package. Usually this is just formatting fixes.
  • -
  • poetry run poe fix_unsafe - This will apply any available auto-fixes to the package, including those that may be unsafe.
  • -
  • poetry run poe format - Explicitly run the formatter across the package.
  • +
  • uv run poe fix - This will apply any available auto-fixes to the package. Usually this is just formatting fixes.
  • +
  • uv run poe fix_unsafe - This will apply any available auto-fixes to the package, including those that may be unsafe.
  • +
  • uv run poe format - Explicitly run the formatter across the package.

Troubleshooting

-

"RuntimeError: llvm-config failed executing, please point LLVM_CONFIG to the path for llvm-config" when running poetry install

+

"RuntimeError: llvm-config failed executing, please point LLVM_CONFIG to the path for llvm-config" when running uv sync

Make sure llvm-9 and llvm-9-dev are installed:

sudo apt-get install llvm-9 llvm-9-dev

and then in your bashrc, add

export LLVM_CONFIG=/usr/bin/llvm-config-9

-

"numba/_pymodule.h:6:10: fatal error: Python.h: No such file or directory" when running poetry install

-

Make sure you have python3.10-dev installed or more generally python<version>-dev

-

sudo apt-get install python3.10-dev

LLM call constantly exceeds TPM, RPM or time limits

GRAPHRAG_LLM_THREAD_COUNT and GRAPHRAG_EMBEDDING_THREAD_COUNT are both set to 50 by default. You can lower these values to reduce concurrency. Please refer to the Configuration Documents for details.
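For example, a minimal sketch of lowering both thread counts before launching an indexing run; the values and the data root are illustrative placeholders, and this assumes the variables are read from the environment when the CLI starts:

import os
import subprocess

# Reduce concurrency to stay under TPM/RPM limits (values are illustrative only).
os.environ["GRAPHRAG_LLM_THREAD_COUNT"] = "10"
os.environ["GRAPHRAG_EMBEDDING_THREAD_COUNT"] = "10"

# Launch the indexing CLI with the reduced thread counts in its environment
# ("./ragtest" is a placeholder project root).
subprocess.run(["uv", "run", "poe", "index", "--root", "./ragtest"], check=True)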

diff --git a/examples_notebooks/api_overview/index.html b/examples_notebooks/api_overview/index.html index e297921c..e64e69e2 100644 --- a/examples_notebooks/api_overview/index.html +++ b/examples_notebooks/api_overview/index.html @@ -2375,7 +2375,7 @@ response, context = await api.global_search( 4 f"{PROJECT_DIRECTORY}/output/community_reports.parquet" 5 ) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/pandas/io/parquet.py:669, in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, dtype_backend, filesystem, filters, **kwargs) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/pandas/io/parquet.py:669, in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, dtype_backend, filesystem, filters, **kwargs) 666 use_nullable_dtypes = False 667 check_dtype_backend(dtype_backend) --> 669 return impl.read( @@ -2389,7 +2389,7 @@ response, context = await api.global_search( 677 **kwargs, 678 ) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/pandas/io/parquet.py:258, in PyArrowImpl.read(self, path, columns, filters, use_nullable_dtypes, dtype_backend, storage_options, filesystem, **kwargs) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/pandas/io/parquet.py:258, in PyArrowImpl.read(self, path, columns, filters, use_nullable_dtypes, dtype_backend, storage_options, filesystem, **kwargs) 256 if manager == "array": 257 to_pandas_kwargs["split_blocks"] = True --> 258 path_or_handle, handles, filesystem = _get_path_or_handle( @@ -2405,7 +2405,7 @@ response, context = await api.global_search( (...) 270 **kwargs, 271 ) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/pandas/io/parquet.py:141, in _get_path_or_handle(path, fs, storage_options, mode, is_dir) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/pandas/io/parquet.py:141, in _get_path_or_handle(path, fs, storage_options, mode, is_dir) 131 handles = None 132 if ( 133 not fs @@ -2418,7 +2418,7 @@ response, context = await api.global_search( 144 fs = None 145 path_or_handle = handles.handle -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/pandas/io/common.py:882, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/pandas/io/common.py:882, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options) 873 handle = open( 874 handle, 875 ioargs.mode, diff --git a/examples_notebooks/drift_search/index.html b/examples_notebooks/drift_search/index.html index 7fde4308..0be99ee9 100644 --- a/examples_notebooks/drift_search/index.html +++ b/examples_notebooks/drift_search/index.html @@ -2569,38 +2569,38 @@ search = DRIFTSearch( 83 else: 84 response = await self.model(prompt, history=history, **kwargs) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/fnllm/openai/llm/openai_chat_llm.py:94, in OpenAIChatLLMImpl.__call__(self, prompt, stream, **kwargs) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/fnllm/openai/llm/openai_chat_llm.py:94, in OpenAIChatLLMImpl.__call__(self, prompt, stream, **kwargs) 91 if stream: 92 return await self._streaming_chat_llm(prompt, **kwargs) ---> 94 return await self._text_chat_llm(prompt, **kwargs) -File 
~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/fnllm/openai/services/openai_tools_parsing.py:130, in OpenAIParseToolsLLM.__call__(self, prompt, **kwargs) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/fnllm/openai/services/openai_tools_parsing.py:130, in OpenAIParseToolsLLM.__call__(self, prompt, **kwargs) 127 tools = kwargs.get("tools", []) 129 if not tools: --> 130 return await self._delegate(prompt, **kwargs) 132 completion_parameters = self._add_tools_to_parameters(kwargs, tools) 134 result = await self._delegate(prompt, **completion_parameters) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/fnllm/base/base_llm.py:144, in BaseLLM.__call__(self, prompt, **kwargs) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/fnllm/base/base_llm.py:144, in BaseLLM.__call__(self, prompt, **kwargs) 142 try: 143 prompt, kwargs = self._rewrite_input(prompt, kwargs) --> 144 return await self._decorated_target(prompt, **kwargs) 145 except BaseException as e: 146 stack_trace = traceback.format_exc() -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/fnllm/base/services/json.py:78, in JsonReceiver.decorate.<locals>.invoke(prompt, **kwargs) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/fnllm/base/services/json.py:78, in JsonReceiver.decorate.<locals>.invoke(prompt, **kwargs) 76 if kwargs.get("json_model") is not None or kwargs.get("json"): 77 return await this.invoke_json(delegate, prompt, kwargs) ---> 78 return await delegate(prompt, **kwargs) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/fnllm/base/services/rate_limiter.py:75, in RateLimiter.decorate.<locals>.invoke(prompt, **args) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/fnllm/base/services/rate_limiter.py:75, in RateLimiter.decorate.<locals>.invoke(prompt, **args) 73 async with self._limiter.use(manifest): 74 await self._events.on_limit_acquired(manifest) ---> 75 result = await delegate(prompt, **args) 76 finally: 77 await self._events.on_limit_released(manifest) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/fnllm/base/base_llm.py:126, in BaseLLM._decorator_target(self, prompt, **kwargs) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/fnllm/base/base_llm.py:126, in BaseLLM._decorator_target(self, prompt, **kwargs) 121 """Target for the decorator chain. 122 123 Leave signature alone as prompt, kwargs. 
@@ -2610,22 +2610,22 @@ search = DRIFTSearch( 127 result = LLMOutput(output=output) 128 await self._inject_usage(result) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/fnllm/openai/llm/openai_text_chat_llm.py:166, in OpenAITextChatLLMImpl._execute_llm(self, prompt, kwargs) - 163 local_model_parameters = kwargs.get("model_parameters") - 164 parameters = self._build_completion_parameters(local_model_parameters) ---> 166 raw_response = await self._client.chat.completions.with_raw_response.create( - 167 messages=cast(Iterator[ChatCompletionMessageParam], messages), - 168 **parameters, - 169 ) - 170 completion = raw_response.parse() - 171 headers = raw_response.headers +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/fnllm/openai/llm/openai_text_chat_llm.py:173, in OpenAITextChatLLMImpl._execute_llm(self, prompt, kwargs) + 170 local_model_parameters = kwargs.get("model_parameters") + 171 parameters = self._build_completion_parameters(local_model_parameters) +--> 173 raw_response = await self._client.chat.completions.with_raw_response.create( + 174 messages=cast(Iterator[ChatCompletionMessageParam], messages), + 175 **parameters, + 176 ) + 177 completion = raw_response.parse() + 178 headers = raw_response.headers -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/openai/_legacy_response.py:381, in async_to_raw_response_wrapper.<locals>.wrapped(*args, **kwargs) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/openai/_legacy_response.py:381, in async_to_raw_response_wrapper.<locals>.wrapped(*args, **kwargs) 377 extra_headers[RAW_RESPONSE_HEADER] = "true" 379 kwargs["extra_headers"] = extra_headers --> 381 return cast(LegacyAPIResponse[R], await func(*args, **kwargs)) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/openai/resources/chat/completions/completions.py:2454, in AsyncCompletions.create(self, messages, model, audio, frequency_penalty, function_call, functions, logit_bias, logprobs, max_completion_tokens, max_tokens, metadata, modalities, n, parallel_tool_calls, prediction, presence_penalty, reasoning_effort, response_format, seed, service_tier, stop, store, stream, stream_options, temperature, tool_choice, tools, top_logprobs, top_p, user, web_search_options, extra_headers, extra_query, extra_body, timeout) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/openai/resources/chat/completions/completions.py:2454, in AsyncCompletions.create(self, messages, model, audio, frequency_penalty, function_call, functions, logit_bias, logprobs, max_completion_tokens, max_tokens, metadata, modalities, n, parallel_tool_calls, prediction, presence_penalty, reasoning_effort, response_format, seed, service_tier, stop, store, stream, stream_options, temperature, tool_choice, tools, top_logprobs, top_p, user, web_search_options, extra_headers, extra_query, extra_body, timeout) 2411 @required_args(["messages", "model"], ["messages", "model", "stream"]) 2412 async def create( 2413 self, @@ -2680,23 +2680,23 @@ search = DRIFTSearch( 2499 stream_cls=AsyncStream[ChatCompletionChunk], 2500 ) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/openai/_base_client.py:1784, in AsyncAPIClient.post(self, path, cast_to, body, files, options, stream, stream_cls) - 1770 async def post( - 1771 self, - 1772 path: str, - (...) 
1779 stream_cls: type[_AsyncStreamT] | None = None, - 1780 ) -> ResponseT | _AsyncStreamT: - 1781 opts = FinalRequestOptions.construct( - 1782 method="post", url=path, json_data=body, files=await async_to_httpx_files(files), **options - 1783 ) --> 1784 return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/openai/_base_client.py:1791, in AsyncAPIClient.post(self, path, cast_to, body, files, options, stream, stream_cls) + 1777 async def post( + 1778 self, + 1779 path: str, + (...) 1786 stream_cls: type[_AsyncStreamT] | None = None, + 1787 ) -> ResponseT | _AsyncStreamT: + 1788 opts = FinalRequestOptions.construct( + 1789 method="post", url=path, json_data=body, files=await async_to_httpx_files(files), **options + 1790 ) +-> 1791 return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/openai/_base_client.py:1584, in AsyncAPIClient.request(self, cast_to, options, stream, stream_cls) - 1581 await err.response.aread() - 1583 log.debug("Re-raising status error") --> 1584 raise self._make_status_error_from_response(err.response) from None - 1586 break - 1588 assert response is not None, "could not resolve response (should never happen)" +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/openai/_base_client.py:1591, in AsyncAPIClient.request(self, cast_to, options, stream, stream_cls) + 1588 await err.response.aread() + 1590 log.debug("Re-raising status error") +-> 1591 raise self._make_status_error_from_response(err.response) from None + 1593 break + 1595 assert response is not None, "could not resolve response (should never happen)" AuthenticationError: Error code: 401 - {'error': {'message': 'Incorrect API key provided: sk-proj-********************************************************************************************************************************************************zWYA. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}
diff --git a/examples_notebooks/global_search/index.html b/examples_notebooks/global_search/index.html index eb5bdfe5..2eb3b0a9 100644 --- a/examples_notebooks/global_search/index.html +++ b/examples_notebooks/global_search/index.html @@ -2706,7 +2706,7 @@ print(result.response)
---> 77 return await this.invoke_json(delegate, prompt, kwargs) 78 return await delegate(prompt, **kwargs) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/fnllm/base/services/json.py:96, in JsonReceiver.invoke_json(self, delegate, prompt, kwargs) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/fnllm/base/services/json.py:96, in JsonReceiver.invoke_json(self, delegate, prompt, kwargs) 94 if attempt > 0: 95 kwargs["bust_cache"] = True ---> 96 return await self.try_receive_json(delegate, prompt, kwargs) 97 except FailedToGenerateValidJsonError as e: 98 error = e -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/fnllm/base/services/json.py:162, in LooseModeJsonReceiver.try_receive_json(self, delegate, prompt, kwargs) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/fnllm/base/services/json.py:162, in LooseModeJsonReceiver.try_receive_json(self, delegate, prompt, kwargs) 159 """Invoke the JSON decorator.""" 160 json_model = kwargs.get("json_model") --> 162 result = await delegate(prompt, **kwargs) 163 json_string = self._marshaler.extract_json_string(result) 164 try: -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/fnllm/base/services/rate_limiter.py:75, in RateLimiter.decorate.<locals>.invoke(prompt, **args) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/fnllm/base/services/rate_limiter.py:75, in RateLimiter.decorate.<locals>.invoke(prompt, **args) 73 async with self._limiter.use(manifest): 74 await self._events.on_limit_acquired(manifest) ---> 75 result = await delegate(prompt, **args) 76 finally: 77 await self._events.on_limit_released(manifest) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/fnllm/base/base_llm.py:126, in BaseLLM._decorator_target(self, prompt, **kwargs) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/fnllm/base/base_llm.py:126, in BaseLLM._decorator_target(self, prompt, **kwargs) 121 """Target for the decorator chain. 122 123 Leave signature alone as prompt, kwargs. 
@@ -2753,22 +2753,22 @@ print(result.response) 127 result = LLMOutput(output=output) 128 await self._inject_usage(result) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/fnllm/openai/llm/openai_text_chat_llm.py:166, in OpenAITextChatLLMImpl._execute_llm(self, prompt, kwargs) - 163 local_model_parameters = kwargs.get("model_parameters") - 164 parameters = self._build_completion_parameters(local_model_parameters) ---> 166 raw_response = await self._client.chat.completions.with_raw_response.create( - 167 messages=cast(Iterator[ChatCompletionMessageParam], messages), - 168 **parameters, - 169 ) - 170 completion = raw_response.parse() - 171 headers = raw_response.headers +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/fnllm/openai/llm/openai_text_chat_llm.py:173, in OpenAITextChatLLMImpl._execute_llm(self, prompt, kwargs) + 170 local_model_parameters = kwargs.get("model_parameters") + 171 parameters = self._build_completion_parameters(local_model_parameters) +--> 173 raw_response = await self._client.chat.completions.with_raw_response.create( + 174 messages=cast(Iterator[ChatCompletionMessageParam], messages), + 175 **parameters, + 176 ) + 177 completion = raw_response.parse() + 178 headers = raw_response.headers -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/openai/_legacy_response.py:381, in async_to_raw_response_wrapper.<locals>.wrapped(*args, **kwargs) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/openai/_legacy_response.py:381, in async_to_raw_response_wrapper.<locals>.wrapped(*args, **kwargs) 377 extra_headers[RAW_RESPONSE_HEADER] = "true" 379 kwargs["extra_headers"] = extra_headers --> 381 return cast(LegacyAPIResponse[R], await func(*args, **kwargs)) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/openai/resources/chat/completions/completions.py:2454, in AsyncCompletions.create(self, messages, model, audio, frequency_penalty, function_call, functions, logit_bias, logprobs, max_completion_tokens, max_tokens, metadata, modalities, n, parallel_tool_calls, prediction, presence_penalty, reasoning_effort, response_format, seed, service_tier, stop, store, stream, stream_options, temperature, tool_choice, tools, top_logprobs, top_p, user, web_search_options, extra_headers, extra_query, extra_body, timeout) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/openai/resources/chat/completions/completions.py:2454, in AsyncCompletions.create(self, messages, model, audio, frequency_penalty, function_call, functions, logit_bias, logprobs, max_completion_tokens, max_tokens, metadata, modalities, n, parallel_tool_calls, prediction, presence_penalty, reasoning_effort, response_format, seed, service_tier, stop, store, stream, stream_options, temperature, tool_choice, tools, top_logprobs, top_p, user, web_search_options, extra_headers, extra_query, extra_body, timeout) 2411 @required_args(["messages", "model"], ["messages", "model", "stream"]) 2412 async def create( 2413 self, @@ -2823,23 +2823,23 @@ print(result.response) 2499 stream_cls=AsyncStream[ChatCompletionChunk], 2500 ) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/openai/_base_client.py:1784, in AsyncAPIClient.post(self, path, cast_to, body, files, options, stream, stream_cls) - 1770 async def post( - 1771 self, - 1772 path: str, - (...) 
1779 stream_cls: type[_AsyncStreamT] | None = None, - 1780 ) -> ResponseT | _AsyncStreamT: - 1781 opts = FinalRequestOptions.construct( - 1782 method="post", url=path, json_data=body, files=await async_to_httpx_files(files), **options - 1783 ) --> 1784 return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/openai/_base_client.py:1791, in AsyncAPIClient.post(self, path, cast_to, body, files, options, stream, stream_cls) + 1777 async def post( + 1778 self, + 1779 path: str, + (...) 1786 stream_cls: type[_AsyncStreamT] | None = None, + 1787 ) -> ResponseT | _AsyncStreamT: + 1788 opts = FinalRequestOptions.construct( + 1789 method="post", url=path, json_data=body, files=await async_to_httpx_files(files), **options + 1790 ) +-> 1791 return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/openai/_base_client.py:1584, in AsyncAPIClient.request(self, cast_to, options, stream, stream_cls) - 1581 await err.response.aread() - 1583 log.debug("Re-raising status error") --> 1584 raise self._make_status_error_from_response(err.response) from None - 1586 break - 1588 assert response is not None, "could not resolve response (should never happen)" +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/openai/_base_client.py:1591, in AsyncAPIClient.request(self, cast_to, options, stream, stream_cls) + 1588 await err.response.aread() + 1590 log.debug("Re-raising status error") +-> 1591 raise self._make_status_error_from_response(err.response) from None + 1593 break + 1595 assert response is not None, "could not resolve response (should never happen)" AuthenticationError: Error code: 401 - {'error': {'message': 'Incorrect API key provided: sk-proj-********************************************************************************************************************************************************zWYA. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}} diff --git a/examples_notebooks/index_migration_to_v1/index.html b/examples_notebooks/index_migration_to_v1/index.html index f1f010db..0d0e92c3 100644 --- a/examples_notebooks/index_migration_to_v1/index.html +++ b/examples_notebooks/index_migration_to_v1/index.html @@ -2485,10 +2485,11 @@ await write_table_to_storage( -
from graphrag.cache.factory import CacheFactory
+
from graphrag.index.flows.generate_text_embeddings import generate_text_embeddings
+
+from graphrag.cache.factory import CacheFactory
 from graphrag.callbacks.noop_workflow_callbacks import NoopWorkflowCallbacks
 from graphrag.config.embeddings import get_embedded_fields, get_embedding_settings
-from graphrag.index.flows.generate_text_embeddings import generate_text_embeddings
 
 # We only need to re-run the embeddings workflow, to ensure that embeddings for all required search fields are in place
 # We'll construct the context and run this function flow directly to avoid everything else
@@ -2518,10 +2519,11 @@ await write_table_to_storage(
     snapshot_embeddings_enabled=False,
 )
 
-from graphrag.cache.factory import CacheFactory
+from graphrag.index.flows.generate_text_embeddings import generate_text_embeddings
+
+from graphrag.cache.factory import CacheFactory
 from graphrag.callbacks.noop_workflow_callbacks import NoopWorkflowCallbacks
 from graphrag.config.embeddings import get_embedded_fields, get_embedding_settings
-from graphrag.index.flows.generate_text_embeddings import generate_text_embeddings

 # We only need to re-run the embeddings workflow, to ensure that embeddings for all required search fields are in place
 # We'll construct the context and run this function flow directly to avoid everything else
@@ -2563,16 +2565,13 @@ await generate_text_embeddings(
diff --git a/examples_notebooks/local_search/index.html b/examples_notebooks/local_search/index.html index 3975d9e5..f042961d 100644 --- a/examples_notebooks/local_search/index.html +++ b/examples_notebooks/local_search/index.html @@ -3567,21 +3567,21 @@ print(result.response) 208 if response.output.embeddings is None: 209 msg = "No embeddings found in response" -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/fnllm/base/base_llm.py:144, in BaseLLM.__call__(self, prompt, **kwargs) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/fnllm/base/base_llm.py:144, in BaseLLM.__call__(self, prompt, **kwargs) 142 try: 143 prompt, kwargs = self._rewrite_input(prompt, kwargs) --> 144 return await self._decorated_target(prompt, **kwargs) 145 except BaseException as e: 146 stack_trace = traceback.format_exc() -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/fnllm/base/services/rate_limiter.py:75, in RateLimiter.decorate.<locals>.invoke(prompt, **args) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/fnllm/base/services/rate_limiter.py:75, in RateLimiter.decorate.<locals>.invoke(prompt, **args) 73 async with self._limiter.use(manifest): 74 await self._events.on_limit_acquired(manifest) ---> 75 result = await delegate(prompt, **args) 76 finally: 77 await self._events.on_limit_released(manifest) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/fnllm/base/base_llm.py:126, in BaseLLM._decorator_target(self, prompt, **kwargs) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/fnllm/base/base_llm.py:126, in BaseLLM._decorator_target(self, prompt, **kwargs) 121 """Target for the decorator chain. 122 123 Leave signature alone as prompt, kwargs. 
@@ -3591,7 +3591,7 @@ print(result.response) 127 result = LLMOutput(output=output) 128 await self._inject_usage(result) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/fnllm/openai/llm/openai_embeddings_llm.py:126, in OpenAIEmbeddingsLLMImpl._execute_llm(self, prompt, kwargs) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/fnllm/openai/llm/openai_embeddings_llm.py:126, in OpenAIEmbeddingsLLMImpl._execute_llm(self, prompt, kwargs) 121 local_model_parameters = kwargs.get("model_parameters") 122 embeddings_parameters = self._build_embeddings_parameters( 123 local_model_parameters @@ -3603,46 +3603,46 @@ print(result.response) 130 result = result_raw.parse() 131 headers = result_raw.headers -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/openai/_legacy_response.py:381, in async_to_raw_response_wrapper.<locals>.wrapped(*args, **kwargs) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/openai/_legacy_response.py:381, in async_to_raw_response_wrapper.<locals>.wrapped(*args, **kwargs) 377 extra_headers[RAW_RESPONSE_HEADER] = "true" 379 kwargs["extra_headers"] = extra_headers --> 381 return cast(LegacyAPIResponse[R], await func(*args, **kwargs)) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/openai/resources/embeddings.py:245, in AsyncEmbeddings.create(self, input, model, dimensions, encoding_format, user, extra_headers, extra_query, extra_body, timeout) - 239 embedding.embedding = np.frombuffer( # type: ignore[no-untyped-call] - 240 base64.b64decode(data), dtype="float32" - 241 ).tolist() - 243 return obj ---> 245 return await self._post( - 246 "/embeddings", - 247 body=maybe_transform(params, embedding_create_params.EmbeddingCreateParams), - 248 options=make_request_options( - 249 extra_headers=extra_headers, - 250 extra_query=extra_query, - 251 extra_body=extra_body, - 252 timeout=timeout, - 253 post_parser=parser, - 254 ), - 255 cast_to=CreateEmbeddingResponse, - 256 ) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/openai/resources/embeddings.py:251, in AsyncEmbeddings.create(self, input, model, dimensions, encoding_format, user, extra_headers, extra_query, extra_body, timeout) + 245 embedding.embedding = np.frombuffer( # type: ignore[no-untyped-call] + 246 base64.b64decode(data), dtype="float32" + 247 ).tolist() + 249 return obj +--> 251 return await self._post( + 252 "/embeddings", + 253 body=maybe_transform(params, embedding_create_params.EmbeddingCreateParams), + 254 options=make_request_options( + 255 extra_headers=extra_headers, + 256 extra_query=extra_query, + 257 extra_body=extra_body, + 258 timeout=timeout, + 259 post_parser=parser, + 260 ), + 261 cast_to=CreateEmbeddingResponse, + 262 ) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/openai/_base_client.py:1784, in AsyncAPIClient.post(self, path, cast_to, body, files, options, stream, stream_cls) - 1770 async def post( - 1771 self, - 1772 path: str, - (...) 
1779 stream_cls: type[_AsyncStreamT] | None = None, - 1780 ) -> ResponseT | _AsyncStreamT: - 1781 opts = FinalRequestOptions.construct( - 1782 method="post", url=path, json_data=body, files=await async_to_httpx_files(files), **options - 1783 ) --> 1784 return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/openai/_base_client.py:1791, in AsyncAPIClient.post(self, path, cast_to, body, files, options, stream, stream_cls) + 1777 async def post( + 1778 self, + 1779 path: str, + (...) 1786 stream_cls: type[_AsyncStreamT] | None = None, + 1787 ) -> ResponseT | _AsyncStreamT: + 1788 opts = FinalRequestOptions.construct( + 1789 method="post", url=path, json_data=body, files=await async_to_httpx_files(files), **options + 1790 ) +-> 1791 return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/openai/_base_client.py:1584, in AsyncAPIClient.request(self, cast_to, options, stream, stream_cls) - 1581 await err.response.aread() - 1583 log.debug("Re-raising status error") --> 1584 raise self._make_status_error_from_response(err.response) from None - 1586 break - 1588 assert response is not None, "could not resolve response (should never happen)" +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/openai/_base_client.py:1591, in AsyncAPIClient.request(self, cast_to, options, stream, stream_cls) + 1588 await err.response.aread() + 1590 log.debug("Re-raising status error") +-> 1591 raise self._make_status_error_from_response(err.response) from None + 1593 break + 1595 assert response is not None, "could not resolve response (should never happen)" AuthenticationError: Error code: 401 - {'error': {'message': 'Incorrect API key provided: sk-proj-********************************************************************************************************************************************************zWYA. 
You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}} @@ -3805,21 +3805,21 @@ print(result.response) 208 if response.output.embeddings is None: 209 msg = "No embeddings found in response" -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/fnllm/base/base_llm.py:144, in BaseLLM.__call__(self, prompt, **kwargs) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/fnllm/base/base_llm.py:144, in BaseLLM.__call__(self, prompt, **kwargs) 142 try: 143 prompt, kwargs = self._rewrite_input(prompt, kwargs) --> 144 return await self._decorated_target(prompt, **kwargs) 145 except BaseException as e: 146 stack_trace = traceback.format_exc() -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/fnllm/base/services/rate_limiter.py:75, in RateLimiter.decorate.<locals>.invoke(prompt, **args) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/fnllm/base/services/rate_limiter.py:75, in RateLimiter.decorate.<locals>.invoke(prompt, **args) 73 async with self._limiter.use(manifest): 74 await self._events.on_limit_acquired(manifest) ---> 75 result = await delegate(prompt, **args) 76 finally: 77 await self._events.on_limit_released(manifest) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/fnllm/base/base_llm.py:126, in BaseLLM._decorator_target(self, prompt, **kwargs) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/fnllm/base/base_llm.py:126, in BaseLLM._decorator_target(self, prompt, **kwargs) 121 """Target for the decorator chain. 122 123 Leave signature alone as prompt, kwargs. @@ -3829,7 +3829,7 @@ print(result.response) 127 result = LLMOutput(output=output) 128 await self._inject_usage(result) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/fnllm/openai/llm/openai_embeddings_llm.py:126, in OpenAIEmbeddingsLLMImpl._execute_llm(self, prompt, kwargs) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/fnllm/openai/llm/openai_embeddings_llm.py:126, in OpenAIEmbeddingsLLMImpl._execute_llm(self, prompt, kwargs) 121 local_model_parameters = kwargs.get("model_parameters") 122 embeddings_parameters = self._build_embeddings_parameters( 123 local_model_parameters @@ -3841,46 +3841,46 @@ print(result.response) 130 result = result_raw.parse() 131 headers = result_raw.headers -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/openai/_legacy_response.py:381, in async_to_raw_response_wrapper.<locals>.wrapped(*args, **kwargs) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/openai/_legacy_response.py:381, in async_to_raw_response_wrapper.<locals>.wrapped(*args, **kwargs) 377 extra_headers[RAW_RESPONSE_HEADER] = "true" 379 kwargs["extra_headers"] = extra_headers --> 381 return cast(LegacyAPIResponse[R], await func(*args, **kwargs)) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/openai/resources/embeddings.py:245, in AsyncEmbeddings.create(self, input, model, dimensions, encoding_format, user, extra_headers, extra_query, extra_body, timeout) - 239 embedding.embedding = np.frombuffer( # type: ignore[no-untyped-call] - 240 base64.b64decode(data), dtype="float32" - 241 ).tolist() - 243 return obj ---> 245 return await self._post( - 246 "/embeddings", - 247 body=maybe_transform(params, 
embedding_create_params.EmbeddingCreateParams), - 248 options=make_request_options( - 249 extra_headers=extra_headers, - 250 extra_query=extra_query, - 251 extra_body=extra_body, - 252 timeout=timeout, - 253 post_parser=parser, - 254 ), - 255 cast_to=CreateEmbeddingResponse, - 256 ) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/openai/resources/embeddings.py:251, in AsyncEmbeddings.create(self, input, model, dimensions, encoding_format, user, extra_headers, extra_query, extra_body, timeout) + 245 embedding.embedding = np.frombuffer( # type: ignore[no-untyped-call] + 246 base64.b64decode(data), dtype="float32" + 247 ).tolist() + 249 return obj +--> 251 return await self._post( + 252 "/embeddings", + 253 body=maybe_transform(params, embedding_create_params.EmbeddingCreateParams), + 254 options=make_request_options( + 255 extra_headers=extra_headers, + 256 extra_query=extra_query, + 257 extra_body=extra_body, + 258 timeout=timeout, + 259 post_parser=parser, + 260 ), + 261 cast_to=CreateEmbeddingResponse, + 262 ) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/openai/_base_client.py:1784, in AsyncAPIClient.post(self, path, cast_to, body, files, options, stream, stream_cls) - 1770 async def post( - 1771 self, - 1772 path: str, - (...) 1779 stream_cls: type[_AsyncStreamT] | None = None, - 1780 ) -> ResponseT | _AsyncStreamT: - 1781 opts = FinalRequestOptions.construct( - 1782 method="post", url=path, json_data=body, files=await async_to_httpx_files(files), **options - 1783 ) --> 1784 return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/openai/_base_client.py:1791, in AsyncAPIClient.post(self, path, cast_to, body, files, options, stream, stream_cls) + 1777 async def post( + 1778 self, + 1779 path: str, + (...) 1786 stream_cls: type[_AsyncStreamT] | None = None, + 1787 ) -> ResponseT | _AsyncStreamT: + 1788 opts = FinalRequestOptions.construct( + 1789 method="post", url=path, json_data=body, files=await async_to_httpx_files(files), **options + 1790 ) +-> 1791 return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/openai/_base_client.py:1584, in AsyncAPIClient.request(self, cast_to, options, stream, stream_cls) - 1581 await err.response.aread() - 1583 log.debug("Re-raising status error") --> 1584 raise self._make_status_error_from_response(err.response) from None - 1586 break - 1588 assert response is not None, "could not resolve response (should never happen)" +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/openai/_base_client.py:1591, in AsyncAPIClient.request(self, cast_to, options, stream, stream_cls) + 1588 await err.response.aread() + 1590 log.debug("Re-raising status error") +-> 1591 raise self._make_status_error_from_response(err.response) from None + 1593 break + 1595 assert response is not None, "could not resolve response (should never happen)" AuthenticationError: Error code: 401 - {'error': {'message': 'Incorrect API key provided: sk-proj-********************************************************************************************************************************************************zWYA. 
You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}} @@ -4365,21 +4365,21 @@ print(candidate_questions.response) 208 if response.output.embeddings is None: 209 msg = "No embeddings found in response" -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/fnllm/base/base_llm.py:144, in BaseLLM.__call__(self, prompt, **kwargs) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/fnllm/base/base_llm.py:144, in BaseLLM.__call__(self, prompt, **kwargs) 142 try: 143 prompt, kwargs = self._rewrite_input(prompt, kwargs) --> 144 return await self._decorated_target(prompt, **kwargs) 145 except BaseException as e: 146 stack_trace = traceback.format_exc() -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/fnllm/base/services/rate_limiter.py:75, in RateLimiter.decorate.<locals>.invoke(prompt, **args) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/fnllm/base/services/rate_limiter.py:75, in RateLimiter.decorate.<locals>.invoke(prompt, **args) 73 async with self._limiter.use(manifest): 74 await self._events.on_limit_acquired(manifest) ---> 75 result = await delegate(prompt, **args) 76 finally: 77 await self._events.on_limit_released(manifest) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/fnllm/base/base_llm.py:126, in BaseLLM._decorator_target(self, prompt, **kwargs) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/fnllm/base/base_llm.py:126, in BaseLLM._decorator_target(self, prompt, **kwargs) 121 """Target for the decorator chain. 122 123 Leave signature alone as prompt, kwargs. @@ -4389,7 +4389,7 @@ print(candidate_questions.response) 127 result = LLMOutput(output=output) 128 await self._inject_usage(result) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/fnllm/openai/llm/openai_embeddings_llm.py:126, in OpenAIEmbeddingsLLMImpl._execute_llm(self, prompt, kwargs) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/fnllm/openai/llm/openai_embeddings_llm.py:126, in OpenAIEmbeddingsLLMImpl._execute_llm(self, prompt, kwargs) 121 local_model_parameters = kwargs.get("model_parameters") 122 embeddings_parameters = self._build_embeddings_parameters( 123 local_model_parameters @@ -4401,46 +4401,46 @@ print(candidate_questions.response) 130 result = result_raw.parse() 131 headers = result_raw.headers -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/openai/_legacy_response.py:381, in async_to_raw_response_wrapper.<locals>.wrapped(*args, **kwargs) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/openai/_legacy_response.py:381, in async_to_raw_response_wrapper.<locals>.wrapped(*args, **kwargs) 377 extra_headers[RAW_RESPONSE_HEADER] = "true" 379 kwargs["extra_headers"] = extra_headers --> 381 return cast(LegacyAPIResponse[R], await func(*args, **kwargs)) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/openai/resources/embeddings.py:245, in AsyncEmbeddings.create(self, input, model, dimensions, encoding_format, user, extra_headers, extra_query, extra_body, timeout) - 239 embedding.embedding = np.frombuffer( # type: ignore[no-untyped-call] - 240 base64.b64decode(data), dtype="float32" - 241 ).tolist() - 243 return obj ---> 245 return await self._post( - 246 "/embeddings", - 247 
body=maybe_transform(params, embedding_create_params.EmbeddingCreateParams), - 248 options=make_request_options( - 249 extra_headers=extra_headers, - 250 extra_query=extra_query, - 251 extra_body=extra_body, - 252 timeout=timeout, - 253 post_parser=parser, - 254 ), - 255 cast_to=CreateEmbeddingResponse, - 256 ) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/openai/resources/embeddings.py:251, in AsyncEmbeddings.create(self, input, model, dimensions, encoding_format, user, extra_headers, extra_query, extra_body, timeout) + 245 embedding.embedding = np.frombuffer( # type: ignore[no-untyped-call] + 246 base64.b64decode(data), dtype="float32" + 247 ).tolist() + 249 return obj +--> 251 return await self._post( + 252 "/embeddings", + 253 body=maybe_transform(params, embedding_create_params.EmbeddingCreateParams), + 254 options=make_request_options( + 255 extra_headers=extra_headers, + 256 extra_query=extra_query, + 257 extra_body=extra_body, + 258 timeout=timeout, + 259 post_parser=parser, + 260 ), + 261 cast_to=CreateEmbeddingResponse, + 262 ) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/openai/_base_client.py:1784, in AsyncAPIClient.post(self, path, cast_to, body, files, options, stream, stream_cls) - 1770 async def post( - 1771 self, - 1772 path: str, - (...) 1779 stream_cls: type[_AsyncStreamT] | None = None, - 1780 ) -> ResponseT | _AsyncStreamT: - 1781 opts = FinalRequestOptions.construct( - 1782 method="post", url=path, json_data=body, files=await async_to_httpx_files(files), **options - 1783 ) --> 1784 return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/openai/_base_client.py:1791, in AsyncAPIClient.post(self, path, cast_to, body, files, options, stream, stream_cls) + 1777 async def post( + 1778 self, + 1779 path: str, + (...) 1786 stream_cls: type[_AsyncStreamT] | None = None, + 1787 ) -> ResponseT | _AsyncStreamT: + 1788 opts = FinalRequestOptions.construct( + 1789 method="post", url=path, json_data=body, files=await async_to_httpx_files(files), **options + 1790 ) +-> 1791 return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/openai/_base_client.py:1584, in AsyncAPIClient.request(self, cast_to, options, stream, stream_cls) - 1581 await err.response.aread() - 1583 log.debug("Re-raising status error") --> 1584 raise self._make_status_error_from_response(err.response) from None - 1586 break - 1588 assert response is not None, "could not resolve response (should never happen)" +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/openai/_base_client.py:1591, in AsyncAPIClient.request(self, cast_to, options, stream, stream_cls) + 1588 await err.response.aread() + 1590 log.debug("Re-raising status error") +-> 1591 raise self._make_status_error_from_response(err.response) from None + 1593 break + 1595 assert response is not None, "could not resolve response (should never happen)" AuthenticationError: Error code: 401 - {'error': {'message': 'Incorrect API key provided: sk-proj-********************************************************************************************************************************************************zWYA. 
You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}} diff --git a/examples_notebooks/multi_index_search/index.html b/examples_notebooks/multi_index_search/index.html index 6252d595..5016743d 100644 --- a/examples_notebooks/multi_index_search/index.html +++ b/examples_notebooks/multi_index_search/index.html @@ -2432,7 +2432,7 @@ results = await task 6 pd.read_parquet(f"inputs/{index}/community_reports.parquet") for index in indexes 7 ] -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/pandas/io/parquet.py:669, in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, dtype_backend, filesystem, filters, **kwargs) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/pandas/io/parquet.py:669, in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, dtype_backend, filesystem, filters, **kwargs) 666 use_nullable_dtypes = False 667 check_dtype_backend(dtype_backend) --> 669 return impl.read( @@ -2446,7 +2446,7 @@ results = await task 677 **kwargs, 678 ) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/pandas/io/parquet.py:258, in PyArrowImpl.read(self, path, columns, filters, use_nullable_dtypes, dtype_backend, storage_options, filesystem, **kwargs) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/pandas/io/parquet.py:258, in PyArrowImpl.read(self, path, columns, filters, use_nullable_dtypes, dtype_backend, storage_options, filesystem, **kwargs) 256 if manager == "array": 257 to_pandas_kwargs["split_blocks"] = True --> 258 path_or_handle, handles, filesystem = _get_path_or_handle( @@ -2462,7 +2462,7 @@ results = await task (...) 
270 **kwargs, 271 ) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/pandas/io/parquet.py:141, in _get_path_or_handle(path, fs, storage_options, mode, is_dir) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/pandas/io/parquet.py:141, in _get_path_or_handle(path, fs, storage_options, mode, is_dir) 131 handles = None 132 if ( 133 not fs @@ -2475,7 +2475,7 @@ results = await task 144 fs = None 145 path_or_handle = handles.handle -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/pandas/io/common.py:882, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/pandas/io/common.py:882, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options) 873 handle = open( 874 handle, 875 ioargs.mode, @@ -2775,7 +2775,7 @@ results = await task 6 pd.read_parquet(f"inputs/{index}/community_reports.parquet") for index in indexes 7 ] -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/pandas/io/parquet.py:669, in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, dtype_backend, filesystem, filters, **kwargs) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/pandas/io/parquet.py:669, in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, dtype_backend, filesystem, filters, **kwargs) 666 use_nullable_dtypes = False 667 check_dtype_backend(dtype_backend) --> 669 return impl.read( @@ -2789,7 +2789,7 @@ results = await task 677 **kwargs, 678 ) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/pandas/io/parquet.py:258, in PyArrowImpl.read(self, path, columns, filters, use_nullable_dtypes, dtype_backend, storage_options, filesystem, **kwargs) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/pandas/io/parquet.py:258, in PyArrowImpl.read(self, path, columns, filters, use_nullable_dtypes, dtype_backend, storage_options, filesystem, **kwargs) 256 if manager == "array": 257 to_pandas_kwargs["split_blocks"] = True --> 258 path_or_handle, handles, filesystem = _get_path_or_handle( @@ -2805,7 +2805,7 @@ results = await task (...) 
270 **kwargs, 271 ) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/pandas/io/parquet.py:141, in _get_path_or_handle(path, fs, storage_options, mode, is_dir) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/pandas/io/parquet.py:141, in _get_path_or_handle(path, fs, storage_options, mode, is_dir) 131 handles = None 132 if ( 133 not fs @@ -2818,7 +2818,7 @@ results = await task 144 fs = None 145 path_or_handle = handles.handle -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/pandas/io/common.py:882, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/pandas/io/common.py:882, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options) 873 handle = open( 874 handle, 875 ioargs.mode, @@ -3232,7 +3232,7 @@ results = await task 6 pd.read_parquet(f"inputs/{index}/community_reports.parquet") for index in indexes 7 ] -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/pandas/io/parquet.py:669, in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, dtype_backend, filesystem, filters, **kwargs) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/pandas/io/parquet.py:669, in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, dtype_backend, filesystem, filters, **kwargs) 666 use_nullable_dtypes = False 667 check_dtype_backend(dtype_backend) --> 669 return impl.read( @@ -3246,7 +3246,7 @@ results = await task 677 **kwargs, 678 ) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/pandas/io/parquet.py:258, in PyArrowImpl.read(self, path, columns, filters, use_nullable_dtypes, dtype_backend, storage_options, filesystem, **kwargs) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/pandas/io/parquet.py:258, in PyArrowImpl.read(self, path, columns, filters, use_nullable_dtypes, dtype_backend, storage_options, filesystem, **kwargs) 256 if manager == "array": 257 to_pandas_kwargs["split_blocks"] = True --> 258 path_or_handle, handles, filesystem = _get_path_or_handle( @@ -3262,7 +3262,7 @@ results = await task (...) 
270 **kwargs, 271 ) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/pandas/io/parquet.py:141, in _get_path_or_handle(path, fs, storage_options, mode, is_dir) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/pandas/io/parquet.py:141, in _get_path_or_handle(path, fs, storage_options, mode, is_dir) 131 handles = None 132 if ( 133 not fs @@ -3275,7 +3275,7 @@ results = await task 144 fs = None 145 path_or_handle = handles.handle -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/pandas/io/common.py:882, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/pandas/io/common.py:882, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options) 873 handle = open( 874 handle, 875 ioargs.mode, @@ -3585,7 +3585,7 @@ results = await task 9 ) 10 results = await task -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/pandas/io/parquet.py:669, in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, dtype_backend, filesystem, filters, **kwargs) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/pandas/io/parquet.py:669, in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, dtype_backend, filesystem, filters, **kwargs) 666 use_nullable_dtypes = False 667 check_dtype_backend(dtype_backend) --> 669 return impl.read( @@ -3599,7 +3599,7 @@ results = await task 677 **kwargs, 678 ) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/pandas/io/parquet.py:258, in PyArrowImpl.read(self, path, columns, filters, use_nullable_dtypes, dtype_backend, storage_options, filesystem, **kwargs) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/pandas/io/parquet.py:258, in PyArrowImpl.read(self, path, columns, filters, use_nullable_dtypes, dtype_backend, storage_options, filesystem, **kwargs) 256 if manager == "array": 257 to_pandas_kwargs["split_blocks"] = True --> 258 path_or_handle, handles, filesystem = _get_path_or_handle( @@ -3615,7 +3615,7 @@ results = await task (...) 270 **kwargs, 271 ) -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/pandas/io/parquet.py:141, in _get_path_or_handle(path, fs, storage_options, mode, is_dir) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/pandas/io/parquet.py:141, in _get_path_or_handle(path, fs, storage_options, mode, is_dir) 131 handles = None 132 if ( 133 not fs @@ -3628,7 +3628,7 @@ results = await task 144 fs = None 145 path_or_handle = handles.handle -File ~/.cache/pypoetry/virtualenvs/graphrag-F2jvqev7-py3.11/lib/python3.11/site-packages/pandas/io/common.py:882, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options) +File ~/work/graphrag/graphrag/.venv/lib/python3.11/site-packages/pandas/io/common.py:882, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options) 873 handle = open( 874 handle, 875 ioargs.mode, diff --git a/index/overview/index.html b/index/overview/index.html index 8de883ae..22289a65 100644 --- a/index/overview/index.html +++ b/index/overview/index.html @@ -1718,8 +1718,7 @@ After you have a config file you can run the pipeline using the CLI or the Python API.

Usage

CLI

-
# Via Poetry
-poetry run poe index --root <data_root> # default config mode
+
uv run poe index --root <data_root> # default config mode
 

Python API

Please see the indexing API python file for the recommended method to call directly from Python code.

diff --git a/search/search_index.json b/search/search_index.json index f70371f5..13dcac9f 100644 --- a/search/search_index.json +++ b/search/search_index.json @@ -1 +1 @@ -{"config": {"lang": ["en"], "separator": "[\\s\\-]+", "pipeline": ["stopWordFilter"]}, "docs": [{"location": "", "title": "Welcome to GraphRAG", "text": "

\ud83d\udc49 Microsoft Research Blog Post \ud83d\udc49 GraphRAG Arxiv

Figure 1: An LLM-generated knowledge graph built using GPT-4 Turbo.

GraphRAG is a structured, hierarchical approach to Retrieval Augmented Generation (RAG), as opposed to naive semantic-search approaches using plain text snippets. The GraphRAG process involves extracting a knowledge graph out of raw text, building a community hierarchy, generating summaries for these communities, and then leveraging these structures when performing RAG-based tasks.

To learn more about GraphRAG and how it can be used to enhance your language model's ability to reason about your private data, please visit the Microsoft Research Blog Post.

"}, {"location": "#get-started-with-graphrag", "title": "Get Started with GraphRAG \ud83d\ude80", "text": "

To start using GraphRAG, check out the Get Started guide. For a deeper dive into the main sub-systems, please visit the docpages for the Indexer and Query packages.

"}, {"location": "#graphrag-vs-baseline-rag", "title": "GraphRAG vs Baseline RAG \ud83d\udd0d", "text": "

Retrieval-Augmented Generation (RAG) is a technique to improve LLM outputs using real-world information. This technique is an important part of most LLM-based tools and the majority of RAG approaches use vector similarity as the search technique, which we call Baseline RAG. GraphRAG uses knowledge graphs to provide substantial improvements in question-and-answer performance when reasoning about complex information. RAG techniques have shown promise in helping LLMs to reason about private datasets - data that the LLM is not trained on and has never seen before, such as an enterprise\u2019s proprietary research, business documents, or communications. Baseline RAG was created to help solve this problem, but we observe situations where baseline RAG performs very poorly. For example:

  • Baseline RAG struggles to connect the dots. This happens when answering a question requires traversing disparate pieces of information through their shared attributes in order to provide new synthesized insights.
  • Baseline RAG performs poorly when being asked to holistically understand summarized semantic concepts over large data collections or even singular large documents.

To address this, the tech community is working to develop methods that extend and enhance RAG. Microsoft Research\u2019s new approach, GraphRAG, creates a knowledge graph based on an input corpus. This graph, along with community summaries and graph machine learning outputs, is used to augment prompts at query time. GraphRAG shows substantial improvement in answering the two classes of questions described above, demonstrating intelligence or mastery that outperforms other approaches previously applied to private datasets.

"}, {"location": "#the-graphrag-process", "title": "The GraphRAG Process \ud83e\udd16", "text": "

GraphRAG builds upon our prior research and tooling using graph machine learning. The basic steps of the GraphRAG process are as follows:

"}, {"location": "#index", "title": "Index", "text": "
  • Slice up an input corpus into a series of TextUnits, which act as analyzable units for the rest of the process, and provide fine-grained references in our outputs.
  • Extract all entities, relationships, and key claims from the TextUnits.
  • Perform a hierarchical clustering of the graph using the Leiden technique. To see this visually, check out Figure 1 above. Each circle is an entity (e.g., a person, place, or organization), with the size representing the degree of the entity, and the color representing its community.
  • Generate summaries of each community and its constituents from the bottom-up. This aids in holistic understanding of the dataset.
"}, {"location": "#query", "title": "Query", "text": "

At query time, these structures are used to provide materials for the LLM context window when answering a question. The primary query modes are:

  • Global Search for reasoning about holistic questions about the corpus by leveraging the community summaries.
  • Local Search for reasoning about specific entities by fanning-out to their neighbors and associated concepts.
  • DRIFT Search for reasoning about specific entities by fanning-out to their neighbors and associated concepts, but with the added context of community information.
"}, {"location": "#prompt-tuning", "title": "Prompt Tuning", "text": "

Using GraphRAG with your data out of the box may not yield the best possible results. We strongly recommend fine-tuning your prompts by following the Prompt Tuning Guide in our documentation.
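
As a sketch, prompt tuning can be run from the command line documented in the CLI Reference (the project root and domain values are illustrative):

graphrag prompt-tune --root ./ragtest --domain 'environmental news'\n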

"}, {"location": "#versioning", "title": "Versioning", "text": "

Please see the breaking changes document for notes on our approach to versioning the project.

Always run graphrag init --root [path] --force between minor version bumps to ensure you have the latest config format. Run the provided migration notebook between major version bumps if you want to avoid re-indexing prior datasets. Note that this will overwrite your configuration and prompts, so back them up if necessary.

"}, {"location": "blog_posts/", "title": "Microsoft Research Blog", "text": "
  • GraphRAG: Unlocking LLM discovery on narrative private data

    Published February 13, 2024

    By Jonathan Larson, Senior Principal Data Architect; Steven Truitt, Principal Program Manager

  • GraphRAG: New tool for complex data discovery now on GitHub

    Published July 2, 2024

    By Darren Edge, Senior Director; Ha Trinh, Senior Data Scientist; Steven Truitt, Principal Program Manager; Jonathan Larson, Senior Principal Data Architect

  • GraphRAG auto-tuning provides rapid adaptation to new domains

    Published September 9, 2024

    By Alonso Guevara Fern\u00e1ndez, Sr. Software Engineer; Katy Smith, Data Scientist II; Joshua Bradley, Senior Data Scientist; Darren Edge, Senior Director; Ha Trinh, Senior Data Scientist; Sarah Smith, Senior Program Manager; Ben Cutler, Senior Director; Steven Truitt, Principal Program Manager; Jonathan Larson, Senior Principal Data Architect

  • Introducing DRIFT Search: Combining global and local search methods to improve quality and efficiency

    Published October 31, 2024

    By Julian Whiting, Senior Machine Learning Engineer; Zachary Hills, Senior Software Engineer; Alonso Guevara Fern\u00e1ndez, Sr. Software Engineer; Ha Trinh, Senior Data Scientist; Adam Bradley, Managing Partner, Strategic Research; Jonathan Larson, Senior Principal Data Architect

  • GraphRAG: Improving global search via dynamic community selection

    Published November 15, 2024

    By Bryan Li, Research Intern; Ha Trinh, Senior Data Scientist; Darren Edge, Senior Director; Jonathan Larson, Senior Principal Data Architect

  • LazyGraphRAG: Setting a new standard for quality and cost

    Published November 25, 2024

    By Darren Edge, Senior Director; Ha Trinh, Senior Data Scientist; Jonathan Larson, Senior Principal Data Architect

  • Moving to GraphRAG 1.0 \u2013 Streamlining ergonomics for developers and users

    Published December 16, 2024

    By Nathan Evans, Principal Software Architect; Alonso Guevara Fern\u00e1ndez, Senior Software Engineer; Joshua Bradley, Senior Data Scientist

    "}, {"location": "cli/", "title": "CLI Reference", "text": "

    This page documents the command-line interface of the graphrag library.

    "}, {"location": "cli/#graphrag", "title": "graphrag", "text": "

    GraphRAG: A graph-based retrieval-augmented generation (RAG) system.

    Usage:

     [OPTIONS] COMMAND [ARGS]...\n

    Options:

      --install-completion  Install completion for the current shell.\n  --show-completion     Show completion for the current shell, to copy it or\n                        customize the installation.\n
    "}, {"location": "cli/#index", "title": "index", "text": "

    Build a knowledge graph index.

    Usage:

     index [OPTIONS]\n

    Options:

      -c, --config PATH               The configuration to use.\n  -r, --root PATH                 The project root directory.  \\[default: .]\n  -m, --method [standard|fast|standard-update|fast-update]\n                                  The indexing method to use.  \\[default:\n                                  standard]\n  -v, --verbose                   Run the indexing pipeline with verbose\n                                  logging\n  --memprofile                    Run the indexing pipeline with memory\n                                  profiling\n  --dry-run                       Run the indexing pipeline without executing\n                                  any steps to inspect and validate the\n                                  configuration.\n  --cache / --no-cache            Use LLM cache.  \\[default: cache]\n  --skip-validation               Skip any preflight validation. Useful when\n                                  running no LLM steps.\n  -o, --output PATH               Indexing pipeline output directory.\n                                  Overrides output.base_dir in the\n                                  configuration file.\n
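
    For example, a sketch that validates a project's configuration without executing any steps (the root path is illustrative):

     graphrag index --root ./ragtest --dry-run\n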
    "}, {"location": "cli/#init", "title": "init", "text": "

    Generate a default configuration file.

    Usage:

     init [OPTIONS]\n

    Options:

      -r, --root PATH  The project root directory.  \\[default: .]\n  -f, --force      Force initialization even if the project already exists.\n
    "}, {"location": "cli/#prompt-tune", "title": "prompt-tune", "text": "

    Generate custom graphrag prompts with your own data (i.e. auto templating).

    Usage:

     prompt-tune [OPTIONS]\n

    Options:

      -r, --root PATH                 The project root directory.  \\[default: .]\n  -c, --config PATH               The configuration to use.\n  -v, --verbose                   Run the prompt tuning pipeline with verbose\n                                  logging.\n  --domain TEXT                   The domain your input data is related to.\n                                  For example 'space science', 'microbiology',\n                                  'environmental news'. If not defined, a\n                                  domain will be inferred from the input data.\n  --selection-method [all|random|top|auto]\n                                  The text chunk selection method.  \\[default:\n                                  random]\n  --n-subset-max INTEGER          The number of text chunks to embed when\n                                  --selection-method=auto.  \\[default: 300]\n  --k INTEGER                     The maximum number of documents to select\n                                  from each centroid when --selection-\n                                  method=auto.  \\[default: 15]\n  --limit INTEGER                 The number of documents to load when\n                                  --selection-method={random,top}.  \\[default:\n                                  15]\n  --max-tokens INTEGER            The max token count for prompt generation.\n                                  \\[default: 2000]\n  --min-examples-required INTEGER\n                                  The minimum number of examples to\n                                  generate/include in the entity extraction\n                                  prompt.  \\[default: 2]\n  --chunk-size INTEGER            The size of each example text chunk.\n                                  Overrides chunks.size in the configuration\n                                  file.  \\[default: 1200]\n  --overlap INTEGER               The overlap size for chunking documents.\n                                  Overrides chunks.overlap in the\n                                  configuration file.  \\[default: 100]\n  --language TEXT                 The primary language used for inputs and\n                                  outputs in graphrag prompts.\n  --discover-entity-types / --no-discover-entity-types\n                                  Discover and extract unspecified entity\n                                  types.  \\[default: discover-entity-types]\n  -o, --output PATH               The directory to save prompts to, relative\n                                  to the project root directory.  \\[default:\n                                  prompts]\n
    "}, {"location": "cli/#query", "title": "query", "text": "

    Query a knowledge graph index.

    Usage:

     query [OPTIONS]\n

    Options:

      -m, --method [local|global|drift|basic]\n                                  The query algorithm to use.  \\[required]\n  -q, --query TEXT                The query to execute.  \\[required]\n  -c, --config PATH               The configuration to use.\n  -v, --verbose                   Run the query with verbose logging.\n  -d, --data PATH                 Index output directory (contains the parquet\n                                  files).\n  -r, --root PATH                 The project root directory.  \\[default: .]\n  --community-level INTEGER       Leiden hierarchy level from which to load\n                                  community reports. Higher values represent\n                                  smaller communities.  \\[default: 2]\n  --dynamic-community-selection / --no-dynamic-selection\n                                  Use global search with dynamic community\n                                  selection.  \\[default: no-dynamic-selection]\n  --response-type TEXT            Free-form description of the desired\n                                  response format (e.g. 'Single Sentence',\n                                  'List of 3-7 Points', etc.).  \\[default:\n                                  Multiple Paragraphs]\n  --streaming / --no-streaming    Print the response in a streaming manner.\n                                  \\[default: no-streaming]\n
    "}, {"location": "cli/#update", "title": "update", "text": "

    Update an existing knowledge graph index.

    Applies a default output configuration (if not provided by config), saving the new index to the local file system in the update_output folder.

    Usage:

     update [OPTIONS]\n

    Options:

      -c, --config PATH               The configuration to use.\n  -r, --root PATH                 The project root directory.  \\[default: .]\n  -m, --method [standard|fast|standard-update|fast-update]\n                                  The indexing method to use.  \\[default:\n                                  standard]\n  -v, --verbose                   Run the indexing pipeline with verbose\n                                  logging.\n  --memprofile                    Run the indexing pipeline with memory\n                                  profiling.\n  --cache / --no-cache            Use LLM cache.  \\[default: cache]\n  --skip-validation               Skip any preflight validation. Useful when\n                                  running no LLM steps.\n  -o, --output PATH               Indexing pipeline output directory.\n                                  Overrides output.base_dir in the\n                                  configuration file.\n
    "}, {"location": "developing/", "title": "Development Guide", "text": ""}, {"location": "developing/#requirements", "title": "Requirements", "text": "Name Installation Purpose Python 3.10-3.12 Download The library is Python-based. Poetry Instructions Poetry is used for package management and virtualenv management in Python codebases"}, {"location": "developing/#getting-started", "title": "Getting Started", "text": ""}, {"location": "developing/#install-dependencies", "title": "Install Dependencies", "text": "
    # Install Python dependencies.\npoetry install\n
    "}, {"location": "developing/#execute-the-indexing-engine", "title": "Execute the Indexing Engine", "text": "
    poetry run poe index <...args>\n
    "}, {"location": "developing/#executing-queries", "title": "Executing Queries", "text": "
    poetry run poe query <...args>\n
    "}, {"location": "developing/#azurite", "title": "Azurite", "text": "

    Some unit and smoke tests use Azurite to emulate Azure resources. This can be started by running:

    ./scripts/start-azurite.sh\n

    or by simply running azurite in the terminal if already installed globally. See the Azurite documentation for more information about how to install and use Azurite.

    "}, {"location": "developing/#lifecycle-scripts", "title": "Lifecycle Scripts", "text": "

    Our Python package utilizes Poetry to manage dependencies and poethepoet to manage build scripts.

    Available scripts are:

    • poetry run poe index - Run the Indexing CLI
    • poetry run poe query - Run the Query CLI
    • poetry build - This invokes poetry build, which will build a wheel file and other distributable artifacts.
    • poetry run poe test - This will execute all tests.
    • poetry run poe test_unit - This will execute unit tests.
    • poetry run poe test_integration - This will execute integration tests.
    • poetry run poe test_smoke - This will execute smoke tests.
    • poetry run poe test_verbs - This will execute tests of the basic workflows.
    • poetry run poe check - This will perform a suite of static checks across the package, including:
    • formatting
    • documentation formatting
    • linting
    • security patterns
    • type-checking
    • poetry run poe fix - This will apply any available auto-fixes to the package. Usually this is just formatting fixes.
    • poetry run poe fix_unsafe - This will apply any available auto-fixes to the package, including those that may be unsafe.
    • poetry run poe format - Explicitly run the formatter across the package.
    "}, {"location": "developing/#troubleshooting", "title": "Troubleshooting", "text": ""}, {"location": "developing/#runtimeerror-llvm-config-failed-executing-please-point-llvm_config-to-the-path-for-llvm-config-when-running-poetry-install", "title": "\"RuntimeError: llvm-config failed executing, please point LLVM_CONFIG to the path for llvm-config\" when running poetry install", "text": "

    Make sure llvm-9 and llvm-9-dev are installed:

    sudo apt-get install llvm-9 llvm-9-dev

    and then in your bashrc, add

    export LLVM_CONFIG=/usr/bin/llvm-config-9

    "}, {"location": "developing/#numba_pymoduleh610-fatal-error-pythonh-no-such-file-or-directory-when-running-poetry-install", "title": "\"numba/_pymodule.h:6:10: fatal error: Python.h: No such file or directory\" when running poetry install", "text": "

    Make sure you have python3.10-dev installed or, more generally, python<version>-dev.

    sudo apt-get install python3.10-dev

    "}, {"location": "developing/#llm-call-constantly-exceeds-tpm-rpm-or-time-limits", "title": "LLM call constantly exceeds TPM, RPM or time limits", "text": "

    GRAPHRAG_LLM_THREAD_COUNT and GRAPHRAG_EMBEDDING_THREAD_COUNT are both set to 50 by default. You can modify these values to reduce concurrency. Please refer to the Configuration Documents.
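
    For example, to reduce concurrency you could set lower values in your environment before indexing (a sketch; the numbers are illustrative and should match your own rate limits):

     export GRAPHRAG_LLM_THREAD_COUNT=10\nexport GRAPHRAG_EMBEDDING_THREAD_COUNT=10\n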

    "}, {"location": "get_started/", "title": "Getting Started", "text": ""}, {"location": "get_started/#requirements", "title": "Requirements", "text": "

    Python 3.10-3.12

    To get started with the GraphRAG system, you have a few options:

    \ud83d\udc49 Install from pypi. \ud83d\udc49 Use it from source

    The following is a simple end-to-end example for using the GraphRAG system, using the install from pypi option.

    It shows how to use the system to index some text, and then use the indexed data to answer questions about the documents.

    "}, {"location": "get_started/#install-graphrag", "title": "Install GraphRAG", "text": "
    pip install graphrag\n
    "}, {"location": "get_started/#running-the-indexer", "title": "Running the Indexer", "text": "

    We need to set up a data project and some initial configuration. First let's get a sample dataset ready:

    mkdir -p ./ragtest/input\n

    Get a copy of A Christmas Carol by Charles Dickens from a trusted source:

    curl https://www.gutenberg.org/cache/epub/24022/pg24022.txt -o ./ragtest/input/book.txt\n
    "}, {"location": "get_started/#set-up-your-workspace-variables", "title": "Set Up Your Workspace Variables", "text": "

    To initialize your workspace, first run the graphrag init command. Since we have already configured a directory named ./ragtest in the previous step, run the following command:

    graphrag init --root ./ragtest\n

    This will create two files: .env and settings.yaml in the ./ragtest directory.

    • .env contains the environment variables required to run the GraphRAG pipeline. If you inspect the file, you'll see a single environment variable defined, GRAPHRAG_API_KEY=<API_KEY>. Replace <API_KEY> with your own OpenAI or Azure API key.
    • settings.yaml contains the settings for the pipeline. You can modify this file to change the settings for the pipeline.
    "}, {"location": "get_started/#using-openai", "title": "Using OpenAI", "text": "

    If running in OpenAI mode, you only need to update the value of GRAPHRAG_API_KEY in the .env file with your OpenAI API key.

    "}, {"location": "get_started/#using-azure-openai", "title": "Using Azure OpenAI", "text": "

    In addition to setting your API key, Azure OpenAI users should set the variables below in the settings.yaml file. To find the appropriate sections, just search for the models: root configuration; you should see two sections, one for the default chat endpoint and one for the default embeddings endpoint. Here is an example of what to add to the chat model config:

    type: azure_openai_chat # Or azure_openai_embedding for embeddings\napi_base: https://<instance>.openai.azure.com\napi_version: 2024-02-15-preview # You can customize this for other versions\ndeployment_name: <azure_model_deployment_name>\n
    "}, {"location": "get_started/#using-managed-auth-on-azure", "title": "Using Managed Auth on Azure", "text": "

    To use managed auth, add an additional value to your model config and comment out or remove the api_key line:

    auth_type: azure_managed_identity # Default auth_type is api_key\n# api_key: ${GRAPHRAG_API_KEY}\n

    You will also need to log in with az login and select the subscription with your endpoint.

    "}, {"location": "get_started/#running-the-indexing-pipeline", "title": "Running the Indexing pipeline", "text": "

    Finally we'll run the pipeline!

    graphrag index --root ./ragtest\n

    This process will take some time to run, depending on the size of your input data, which model you're using, and the text chunk size (all of which can be configured in your settings.yaml file). Once the pipeline is complete, you should see a new folder called ./ragtest/output with a series of parquet files.
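
    Once indexing has finished, you can inspect these artifacts directly; a minimal sketch using pandas (the exact file names, such as community_reports.parquet, may differ between GraphRAG versions):

     import pandas as pd\n\nreports = pd.read_parquet(\"./ragtest/output/community_reports.parquet\")\nprint(reports.head())\n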

    "}, {"location": "get_started/#using-the-query-engine", "title": "Using the Query Engine", "text": "

    Now let's ask some questions using this dataset.

    Here is an example using Global search to ask a high-level question:

    graphrag query \\\n--root ./ragtest \\\n--method global \\\n--query \"What are the top themes in this story?\"\n

    Here is an example using Local search to ask a more specific question about a particular character:

    graphrag query \\\n--root ./ragtest \\\n--method local \\\n--query \"Who is Scrooge and what are his main relationships?\"\n
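
    DRIFT search, documented in the CLI Reference, follows the same pattern; a sketch:

     graphrag query \\\n--root ./ragtest \\\n--method drift \\\n--query \"Who is Scrooge and what are his main relationships?\"\n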

    Please refer to Query Engine docs for detailed information about how to leverage our Local and Global search mechanisms for extracting meaningful insights from data after the Indexer has wrapped up execution.

    "}, {"location": "get_started/#going-deeper", "title": "Going Deeper", "text": "
    • For more details about configuring GraphRAG, see the configuration documentation.
    • To learn more about Initialization, refer to the Initialization documentation.
    • For more details about using the CLI, refer to the CLI documentation.
    • Check out our visualization guide for a more interactive experience in debugging and exploring the knowledge graph.
    "}, {"location": "visualization_guide/", "title": "Visualizing and Debugging Your Knowledge Graph", "text": "

    The following step-by-step guide walks through the process to visualize a knowledge graph after it's been constructed by graphrag. Note that some of the settings recommended below are based on our own experience of what works well. Feel free to change and explore other settings for a better visualization experience!

    "}, {"location": "visualization_guide/#1-run-the-pipeline", "title": "1. Run the Pipeline", "text": "

    Before building an index, please review your settings.yaml configuration file and ensure that graphml snapshots are enabled.

    snapshots:\n  graphml: true\n
    (Optional) To support other visualization tools and exploration, additional parameters can be enabled that provide access to vector embeddings.
    embed_graph:\n  enabled: true # will generate node2vec embeddings for nodes\numap:\n  enabled: true # will generate UMAP embeddings for nodes, giving the entities table an x/y position to plot\n
    After running the indexing pipeline over your data, there will be an output folder (defined by the storage.base_dir setting).

    • Output Folder: Contains artifacts from the LLM\u2019s indexing pass.
    "}, {"location": "visualization_guide/#2-locate-the-knowledge-graph", "title": "2. Locate the Knowledge Graph", "text": "

    In the output folder, look for a file named graph.graphml. graphml is a standard file format supported by many visualization tools. We recommend trying Gephi.

    "}, {"location": "visualization_guide/#3-open-the-graph-in-gephi", "title": "3. Open the Graph in Gephi", "text": "
    1. Install and open Gephi
    2. Navigate to the output folder containing the various parquet files.
    3. Import the graph.graphml file into Gephi. This will result in a fairly plain view of the undirected graph nodes and edges.
    "}, {"location": "visualization_guide/#4-install-the-leiden-algorithm-plugin", "title": "4. Install the Leiden Algorithm Plugin", "text": "
    1. Go to Tools -> Plugins.
    2. Search for \"Leiden Algorithm\".
    3. Click Install and restart Gephi.
    "}, {"location": "visualization_guide/#5-run-statistics", "title": "5. Run Statistics", "text": "
    1. In the Statistics tab on the right, click Run for Average Degree and Leiden Algorithm.
    1. For the Leiden Algorithm, adjust the settings:
    2. Quality function: Modularity
    3. Resolution: 1
    "}, {"location": "visualization_guide/#6-color-the-graph-by-clusters", "title": "6. Color the Graph by Clusters", "text": "
    1. Go to the Appearance pane in the upper left side of Gephi.
    1. Select Nodes, then Partition, and click the color palette icon in the upper right.
    2. Choose Cluster from the dropdown.
    3. Click the Palette... hyperlink, then Generate....
    4. Uncheck Limit number of colors, click Generate, and then Ok.
    5. Click Apply to color the graph. This will color the graph based on the partitions discovered by Leiden.
    "}, {"location": "visualization_guide/#7-resize-nodes-by-degree-centrality", "title": "7. Resize Nodes by Degree Centrality", "text": "
    1. In the Appearance pane in the upper left, select Nodes -> Ranking
    2. Select the Sizing icon in the upper right.
    3. Choose Degree and set:
    4. Min: 10
    5. Max: 150
    6. Click Apply.
    "}, {"location": "visualization_guide/#8-layout-the-graph", "title": "8. Layout the Graph", "text": "
    1. In the Layout tab in the lower left, select OpenORD.
    1. Set Liquid and Expansion stages to 50, and everything else to 0.
    2. Click Run and monitor the progress.
    "}, {"location": "visualization_guide/#9-run-forceatlas2", "title": "9. Run ForceAtlas2", "text": "
    1. Select Force Atlas 2 in the layout options.
    1. Adjust the settings:
    2. Scaling: 15
    3. Dissuade Hubs: checked
    4. LinLog mode: unchecked
    5. Prevent Overlap: checked
    6. Click Run and wait.
    7. Press Stop when it looks like the graph nodes have settled and no longer change position significantly.
    "}, {"location": "visualization_guide/#10-add-text-labels-optional", "title": "10. Add Text Labels (Optional)", "text": "
    1. Turn on text labels in the appropriate section.
    2. Configure and resize them as needed.

    Your final graph should now be visually organized and ready for analysis!

    "}, {"location": "config/init/", "title": "Configuring GraphRAG Indexing", "text": "

    To start using GraphRAG, you must generate a configuration file. The init command is the easiest way to get started. It will create .env and settings.yaml files in the specified directory with the necessary configuration settings. It will also output the default LLM prompts used by GraphRAG.

    "}, {"location": "config/init/#usage", "title": "Usage", "text": "
    graphrag init [--root PATH] [--force, --no-force]\n
    "}, {"location": "config/init/#options", "title": "Options", "text": "
    • --root PATH - The project root directory to initialize graphrag at. Default is the current directory.
    • --force, --no-force - Optional, default is --no-force. Overwrite existing configuration and prompt files if they exist.
    "}, {"location": "config/init/#example", "title": "Example", "text": "
    graphrag init --root ./ragtest\n
    "}, {"location": "config/init/#output", "title": "Output", "text": "

    The init command will create the following files in the specified directory:

    • settings.yaml - The configuration settings file. This file contains the configuration settings for GraphRAG.
    • .env - The environment variables file. These are referenced in the settings.yaml file.
    • prompts/ - The LLM prompts folder. This contains the default prompts used by GraphRAG; you can modify them or run the Auto Prompt Tuning command to generate new prompts adapted to your data.
    "}, {"location": "config/init/#next-steps", "title": "Next Steps", "text": "

    After initializing your workspace, you can either run the Prompt Tuning command to adapt the prompts to your data or even start running the Indexing Pipeline to index your data. For more information on configuration options available, see the YAML details page.

    "}, {"location": "config/models/", "title": "Language Model Selection and Overriding", "text": "

    This page contains information on selecting a model to use and options to supply your own model for GraphRAG. Note that this is not a guide to finding the right model for your use case.

    "}, {"location": "config/models/#default-model-support", "title": "Default Model Support", "text": "

    GraphRAG was built and tested using OpenAI models, so this is the default model set we support. This is not intended to be a limiter or statement of quality or fitness for your use case, only that it's the set we are most familiar with for prompting, tuning, and debugging.

    GraphRAG also utilizes a language model wrapper library used by several projects within our team, called fnllm. fnllm provides two important functions for GraphRAG: rate limiting configuration to help us maximize throughput for large indexing jobs, and robust caching of API calls to minimize consumption on repeated indexes for testing, experimentation, or incremental ingest. fnllm uses the OpenAI Python SDK under the covers, so OpenAI-compliant endpoints are a base requirement out-of-the-box.

    "}, {"location": "config/models/#model-selection-considerations", "title": "Model Selection Considerations", "text": "

    GraphRAG has been most thoroughly tested with the gpt-4 series of models from OpenAI, including gpt-4, gpt-4-turbo, gpt-4o, and gpt-4o-mini. Our arXiv paper, for example, performed quality evaluation using gpt-4-turbo.

    Versions of GraphRAG before 2.2.0 made extensive use of max_tokens and logit_bias to control generated response length or content. The introduction of the o-series of models added new, non-compatible parameters because these models include a reasoning component that has different consumption patterns and response generation attributes than non-reasoning models. GraphRAG 2.2.0 now supports these models, but there are important differences that need to be understood before you switch.

    • Previously, GraphRAG used max_tokens to limit responses in a few locations. This is done so that we can have predictable content sizes when building downstream context windows for summarization. We have now switched from using max_tokens to use a prompted approach, which is working well in our tests. We suggest using max_tokens in your language model config only for budgetary reasons if you want to limit consumption, and not for expected response length control. We now also support the o-series equivalent max_completion_tokens, but if you use this keep in mind that there may be some unknown fixed reasoning consumption amount in addition to the response tokens, so it is not a good technique for response control.
    • Previously, GraphRAG used a combination of max_tokens and logit_bias to strictly control a binary yes/no question during gleanings. This is not possible with reasoning models, so again we have switched to a prompted approach. Our tests with gpt-4o, gpt-4o-mini, and o1 show that this works consistently, but could have issues if you have an older or smaller model.
    • The o-series models are much slower and more expensive. It may be useful to use an asymmetric approach to model use in your config: you can define as many models as you like in the models block of your settings.yaml and reference them by key for every workflow that requires a language model. You could use gpt-4o for indexing and o1 for query, for example. Experiment to find the right balance of cost, speed, and quality for your use case.
    • The o-series models contain a form of native chain-of-thought reasoning that is absent in the non-o-series models. GraphRAG's prompts sometimes contain CoT because it was an effective technique with the gpt-4* series. It may be counterproductive with the o-series, so you may want to tune or even re-write large portions of the prompt templates (particularly for graph and claim extraction).

    Example config with asymmetric model use:

    models:\n  extraction_chat_model:\n    api_key: ${GRAPHRAG_API_KEY}\n    type: openai_chat\n    auth_type: api_key\n    model: gpt-4o\n    model_supports_json: true\n  query_chat_model:\n    api_key: ${GRAPHRAG_API_KEY}\n    type: openai_chat\n    auth_type: api_key\n    model: o1\n    model_supports_json: true\n\n...\n\nextract_graph:\n  model_id: extraction_chat_model\n  prompt: \"prompts/extract_graph.txt\"\n  entity_types: [organization,person,geo,event]\n  max_gleanings: 1\n\n...\n\n\nglobal_search:\n  chat_model_id: query_chat_model\n  map_prompt: \"prompts/global_search_map_system_prompt.txt\"\n  reduce_prompt: \"prompts/global_search_reduce_system_prompt.txt\"\n  knowledge_prompt: \"prompts/global_search_knowledge_system_prompt.txt\"\n

    Another option would be to avoid using a language model at all for the graph extraction, instead using the fast indexing method that uses NLP for portions of the indexing phase in lieu of LLM APIs.
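
    As a sketch, this NLP-based approach is selected with the indexing method flag documented in the CLI Reference (the root path is illustrative):

     graphrag index --root ./ragtest --method fast\n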

    "}, {"location": "config/models/#using-non-openai-models", "title": "Using Non-OpenAI Models", "text": "

    As noted above, our primary experience and focus has been on OpenAI models, so this is what is supported out-of-the-box. Many users have requested support for additional model types, but it's out of the scope of our research to handle the many models available today. There are two approaches you can use to connect to a non-OpenAI model:

    "}, {"location": "config/models/#proxy-apis", "title": "Proxy APIs", "text": "

    Many users have used platforms such as ollama to proxy the underlying model HTTP calls to a different model provider. This seems to work reasonably well, but we frequently see issues with malformed responses (especially JSON), so if you do this please understand that your model needs to reliably return the specific response formats that GraphRAG expects. If you're having trouble with a model, you may need to try prompting to coax the format, or intercepting the response within your proxy to try and handle malformed responses.

    "}, {"location": "config/models/#model-protocol", "title": "Model Protocol", "text": "

    As of GraphRAG 2.0.0, we support model injection through the use of a standard chat and embedding Protocol and an accompanying ModelFactory that you can use to register your model implementation. This is not supported with the CLI, so you'll need to use GraphRAG as a library.

    • Our Protocol is defined here
    • Our base implementation, which wraps fnllm, is here
    • We have a simple mock implementation in our tests that you can reference here

    Once you have a model implementation, you need to register it with our ModelFactory:

    class MyCustomModel:\n    ...\n    # implementation\n\n# elsewhere...\nModelFactory.register_chat(\"my-custom-chat-model\", lambda **kwargs: MyCustomModel(**kwargs))\n

    Then in your config you can reference the type name you used:

    models:\n  default_chat_model:\n    type: my-custom-chat-model\n\n\nextract_graph:\n  model_id: default_chat_model\n  prompt: \"prompts/extract_graph.txt\"\n  entity_types: [organization,person,geo,event]\n  max_gleanings: 1\n

    Note that your custom model will be passed the same params for init and method calls that we use throughout GraphRAG. There is not currently any ability to define custom parameters, so you may need to use closure scope or a factory pattern within your implementation to get custom config values.
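
    For example, a minimal sketch of the closure approach; MyCustomModel and my_setting are hypothetical names continuing the registration example above:

     def make_my_model_factory(my_setting: str):\n    # my_setting is captured in closure scope, since GraphRAG only passes its own params\n    def factory(**kwargs):\n        return MyCustomModel(my_setting=my_setting, **kwargs)\n    return factory\n\nModelFactory.register_chat(\"my-custom-chat-model\", make_my_model_factory(\"custom-value\"))\n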

    "}, {"location": "config/overview/", "title": "Configuring GraphRAG Indexing", "text": "

    The GraphRAG system is highly configurable. This page provides an overview of the configuration options available for the GraphRAG indexing engine.

    "}, {"location": "config/overview/#default-configuration-mode", "title": "Default Configuration Mode", "text": "

    The default configuration mode is the simplest way to get started with the GraphRAG system. It is designed to work out-of-the-box with minimal configuration. The main ways to set up GraphRAG in Default Configuration mode are via:

    • Init command (recommended first step)
    • Edit settings.yaml for deeper control
    • Purely using environment variables (not recommended)
    "}, {"location": "config/yaml/", "title": "Default Configuration Mode (using YAML/JSON)", "text": "

    The default configuration mode may be configured by using a settings.yml or settings.json file in the data project root. If a .env file is present along with this config file, then it will be loaded, and the environment variables defined therein will be available for token replacements in your configuration document using ${ENV_VAR} syntax. We initialize with YML by default in graphrag init but you may use the equivalent JSON form if preferred.

    Many of these config values have defaults. Rather than replicate them here, please refer to the constants in the code directly.

    For example:

    # .env\nGRAPHRAG_API_KEY=some_api_key\n\n# settings.yml\nllm: \n  api_key: ${GRAPHRAG_API_KEY}\n
    "}, {"location": "config/yaml/#config-sections", "title": "Config Sections", "text": ""}, {"location": "config/yaml/#language-model-setup", "title": "Language Model Setup", "text": ""}, {"location": "config/yaml/#models", "title": "models", "text": "

    This is a dict of model configurations. The dict key is used to reference this configuration elsewhere when a model instance is desired. In this way, you can specify as many different models as you need, and reference them differentially in the workflow steps.

    For example:

    models:\n  default_chat_model:\n    api_key: ${GRAPHRAG_API_KEY}\n    type: openai_chat\n    model: gpt-4o\n    model_supports_json: true\n  default_embedding_model:\n    api_key: ${GRAPHRAG_API_KEY}\n    type: openai_embedding\n    model: text-embedding-ada-002\n

    "}, {"location": "config/yaml/#fields", "title": "Fields", "text": "
    • api_key str - The OpenAI API key to use.
    • auth_type api_key|azure_managed_identity - Indicate how you want to authenticate requests.
    • type openai_chat|azure_openai_chat|openai_embedding|azure_openai_embedding|mock_chat|mock_embeddings - The type of LLM to use.
    • model str - The model name.
    • encoding_model str - The text encoding model to use. Default is to use the encoding model aligned with the language model (i.e., it is retrieved from tiktoken if unset).
    • api_base str - The API base url to use.
    • api_version str - The API version.
    • deployment_name str - The deployment name to use (Azure).
    • organization str - The client organization.
    • proxy str - The proxy URL to use.
    • audience str - (Azure OpenAI only) The URI of the target Azure resource/service for which a managed identity token is requested. Used if api_key is not defined. Default=https://cognitiveservices.azure.com/.default
    • model_supports_json bool - Whether the model supports JSON-mode output.
    • request_timeout float - The per-request timeout.
    • tokens_per_minute int - Set a leaky-bucket throttle on tokens-per-minute.
    • requests_per_minute int - Set a leaky-bucket throttle on requests-per-minute.
    • retry_strategy str - Retry strategy to use, \"native\" is the default and uses the strategy built into the OpenAI SDK. Other allowable values include \"exponential_backoff\", \"random_wait\", and \"incremental_wait\".
    • max_retries int - The maximum number of retries to use.
    • max_retry_wait float - The maximum backoff time.
    • concurrent_requests int - The number of open requests to allow at once.
    • async_mode asyncio|threaded - The async mode to use. Either asyncio or threaded.
    • responses list[str] - If this model type is mock, this is a list of response strings to return.
    • n int - The number of completions to generate.
    • max_tokens int - The maximum number of output tokens. Not valid for o-series models.
    • temperature float - The temperature to use. Not valid for o-series models.
    • top_p float - The top-p value to use. Not valid for o-series models.
    • frequency_penalty float - Frequency penalty for token generation. Not valid for o-series models.
    • presence_penalty float - Presence penalty for token generation. Not valid for o-series models.
    • max_completion_tokens int - Max number of tokens to consume for chat completion. Must be large enough to include an unknown amount for \"reasoning\" by the model. o-series models only.
    • reasoning_effort low|medium|high - Amount of \"thought\" for the model to expend reasoning about a response. o-series models only.
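
    A sketch combining several of the fields above into a single chat model definition (the numeric values are illustrative, not recommendations):

     models:\n  default_chat_model:\n    api_key: ${GRAPHRAG_API_KEY}\n    type: openai_chat\n    model: gpt-4o\n    model_supports_json: true\n    tokens_per_minute: 50000\n    requests_per_minute: 500\n    concurrent_requests: 25\n    max_retries: 10\n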
    "}, {"location": "config/yaml/#input-files-and-chunking", "title": "Input Files and Chunking", "text": ""}, {"location": "config/yaml/#input", "title": "input", "text": "

    Our pipeline can ingest .csv, .txt, or .json data from an input location. See the inputs page for more details and examples.
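
    For example, a minimal sketch for a local folder of plain-text files, using the fields documented below (the base_dir value is illustrative):

     input:\n  storage:\n    type: file\n    base_dir: input\n  file_type: text\n  encoding: utf-8\n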

    "}, {"location": "config/yaml/#fields_1", "title": "Fields", "text": "
    • storage StorageConfig
    • type file|blob|cosmosdb - The storage type to use. Default=file
    • base_dir str - The base directory to write output artifacts to, relative to the root.
    • connection_string str - (blob/cosmosdb only) The Azure Storage connection string.
    • container_name str - (blob/cosmosdb only) The Azure Storage container name.
    • storage_account_blob_url str - (blob only) The storage account blob URL to use.
    • cosmosdb_account_blob_url str - (cosmosdb only) The CosmosDB account blob URL to use.
    • file_type text|csv|json - The type of input data to load. Default is text
    • encoding str - The encoding of the input file. Default is utf-8
    • file_pattern str - A regex to match input files. Default is .*\\.csv$, .*\\.txt$, or .*\\.json$ depending on the specified file_type, but you can customize it if needed.
    • file_filter dict - Key/value pairs to filter. Default is None.
    • text_column str - (CSV/JSON only) The text column name. If unset we expect a column named text.
    • title_column str - (CSV/JSON only) The title column name, filename will be used if unset.
    • metadata list[str] - (CSV/JSON only) The additional document attributes fields to keep.
    "}, {"location": "config/yaml/#chunks", "title": "chunks", "text": "

    These settings configure how we parse documents into text chunks. This is necessary because very large documents may not fit into a single context window, and graph extraction accuracy can be modulated. Also note the metadata setting in the input document config, which will replicate document metadata into each chunk.
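
    For example, a sketch using the fields below (the token counts mirror common defaults but are illustrative):

     chunks:\n  size: 1200\n  overlap: 100\n  strategy: tokens\n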

    "}, {"location": "config/yaml/#fields_2", "title": "Fields", "text": "
    • size int - The max chunk size in tokens.
    • overlap int - The chunk overlap in tokens.
    • group_by_columns list[str] - Group documents by these fields before chunking.
    • strategy str[tokens|sentences] - How to chunk the text.
    • encoding_model str - The text encoding model to use for splitting on token boundaries.
    • prepend_metadata bool - Determines if metadata values should be added at the beginning of each chunk. Default=False.
    • chunk_size_includes_metadata bool - Specifies whether the chunk size calculation should include metadata tokens. Default=False.
    "}, {"location": "config/yaml/#outputs-and-storage", "title": "Outputs and Storage", "text": ""}, {"location": "config/yaml/#output", "title": "output", "text": "

    This section controls the storage mechanism used by the pipeline for exporting output tables.
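
    For example, a minimal sketch writing artifacts to a local folder (base_dir is illustrative):

     output:\n  type: file\n  base_dir: output\n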

    "}, {"location": "config/yaml/#fields_3", "title": "Fields", "text": "
    • type file|memory|blob|cosmosdb - The storage type to use. Default=file
    • base_dir str - The base directory to write output artifacts to, relative to the root.
    • connection_string str - (blob/cosmosdb only) The Azure Storage connection string.
    • container_name str - (blob/cosmosdb only) The Azure Storage container name.
    • storage_account_blob_url str - (blob only) The storage account blob URL to use.
    • cosmosdb_account_blob_url str - (cosmosdb only) The CosmosDB account blob URL to use.
    "}, {"location": "config/yaml/#update_index_output", "title": "update_index_output", "text": "

    This section defines a secondary storage location used when running incremental indexing, so that your original outputs are preserved.

    "}, {"location": "config/yaml/#fields_4", "title": "Fields", "text": "
    • type file|memory|blob|cosmosdb - The storage type to use. Default=file
    • base_dir str - The base directory to write output artifacts to, relative to the root.
    • connection_string str - (blob/cosmosdb only) The Azure Storage connection string.
    • container_name str - (blob/cosmosdb only) The Azure Storage container name.
    • storage_account_blob_url str - (blob only) The storage account blob URL to use.
    • cosmosdb_account_blob_url str - (cosmosdb only) The CosmosDB account blob URL to use.
    "}, {"location": "config/yaml/#cache", "title": "cache", "text": "

    This section controls the cache mechanism used by the pipeline. This is used to cache LLM invocation results for faster performance when re-running the indexing process.

    "}, {"location": "config/yaml/#fields_5", "title": "Fields", "text": "
    • type file|memory|blob|cosmosdb - The storage type to use. Default=file
    • base_dir str - The base directory to write output artifacts to, relative to the root.
    • connection_string str - (blob/cosmosdb only) The Azure Storage connection string.
    • container_name str - (blob/cosmosdb only) The Azure Storage container name.
    • storage_account_blob_url str - (blob only) The storage account blob URL to use.
    • cosmosdb_account_blob_url str - (cosmosdb only) The CosmosDB account blob URL to use.
    "}, {"location": "config/yaml/#reporting", "title": "reporting", "text": "

    This section controls the reporting mechanism used by the pipeline, for common events and error messages. The default is to write reports to a file in the output directory. However, you can also choose to write reports to an Azure Blob Storage container.

    "}, {"location": "config/yaml/#fields_6", "title": "Fields", "text": "
    • type file|blob - The reporting type to use. Default=file
    • base_dir str - The base directory to write reports to, relative to the root.
    • connection_string str - (blob only) The Azure Storage connection string.
    • container_name str - (blob only) The Azure Storage container name.
    • storage_account_blob_url str - The storage account blob URL to use.
    "}, {"location": "config/yaml/#vector_store", "title": "vector_store", "text": "

    Where to put all vectors for the system. Configured for lancedb by default. This is a dict, with the key used to identify individual store parameters (e.g., for text embedding).
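
    For example, a sketch of a local lancedb store (the dict key and db_uri are illustrative, following the defaults described above):

     vector_store:\n  default_vector_store:\n    type: lancedb\n    db_uri: output/lancedb\n    container_name: default\n    overwrite: true\n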

    "}, {"location": "config/yaml/#fields_7", "title": "Fields", "text": "
    • type lancedb|azure_ai_search|cosmosdb - Type of vector store. Default=lancedb
    • db_uri str (only for lancedb) - The database uri. Default=storage.base_dir/lancedb
    • url str (only for AI Search) - AI Search endpoint
    • api_key str (optional - only for AI Search) - The AI Search api key to use.
    • audience str (only for AI Search) - Audience for managed identity token if managed identity authentication is used.
    • container_name str - The name of a vector container. This stores all indexes (tables) for a given dataset ingest. Default=default
    • database_name str - (cosmosdb only) Name of the database.
    • overwrite bool (only used at index creation time) - Overwrite the collection if it exists. Default=True
    "}, {"location": "config/yaml/#workflow-configurations", "title": "Workflow Configurations", "text": "

    These settings control each individual workflow as they execute.

    "}, {"location": "config/yaml/#workflows", "title": "workflows", "text": "

    list[str] - This is a list of workflow names to run, in order. GraphRAG has built-in pipelines to configure this, but you can run exactly and only what you want by specifying the list here. Useful if you have done part of the processing yourself.

    "}, {"location": "config/yaml/#embed_text", "title": "embed_text", "text": "

    By default, the GraphRAG indexer will only export embeddings required for our query methods. However, the model has embeddings defined for all plaintext fields, and these can be customized by setting the target and names fields.

    Supported embeddings names are:

    • text_unit.text
    • document.text
    • entity.title
    • entity.description
    • relationship.description
    • community.title
    • community.summary
    • community.full_content
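
    For example, a sketch that runs a specific set of the names above (the model and vector store keys assume the earlier configuration examples):

     embed_text:\n  model_id: default_embedding_model\n  vector_store_id: default_vector_store\n  names:\n    - text_unit.text\n    - entity.description\n    - community.full_content\n    - document.text\n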
    "}, {"location": "config/yaml/#fields_8", "title": "Fields", "text": "
    • model_id str - Name of the model definition to use for text embedding.
    • vector_store_id str - Name of vector store definition to write to.
    • batch_size int - The maximum batch size to use.
    • batch_max_tokens int - The maximum batch # of tokens.
    • names list[str] - List of the embeddings names to run (must be in supported list).
    "}, {"location": "config/yaml/#extract_graph", "title": "extract_graph", "text": "

    Tune the language model-based graph extraction process.

    "}, {"location": "config/yaml/#fields_9", "title": "Fields", "text": "
    • model_id str - Name of the model definition to use for API calls.
    • prompt str - The prompt file to use.
    • entity_types list[str] - The entity types to identify.
    • max_gleanings int - The maximum number of gleaning cycles to use.
    "}, {"location": "config/yaml/#summarize_descriptions", "title": "summarize_descriptions", "text": ""}, {"location": "config/yaml/#fields_10", "title": "Fields", "text": "
    • model_id str - Name of the model definition to use for API calls.
    • prompt str - The prompt file to use.
    • max_length int - The maximum number of output tokens per summarization.
    • max_input_length int - The maximum number of tokens to collect for summarization (this will limit how many descriptions you send to be summarized for a given entity or relationship).
    "}, {"location": "config/yaml/#extract_graph_nlp", "title": "extract_graph_nlp", "text": "

    Defines settings for NLP-based graph extraction methods.

    "}, {"location": "config/yaml/#fields_11", "title": "Fields", "text": "
    • normalize_edge_weights bool - Whether to normalize the edge weights during graph construction. Default=True.
    • text_analyzer dict - Parameters for the NLP model.
    • extractor_type regex_english|syntactic_parser|cfg - Default=regex_english.
    • model_name str - Name of NLP model (for SpaCy-based models)
    • max_word_length int - Longest word to allow. Default=15.
    • word_delimiter str - Delimiter to split words. Default ' '.
    • include_named_entities bool - Whether to include named entities in noun phrases. Default=True.
    • exclude_nouns list[str] | None - List of nouns to exclude. If None, we use an internal stopword list.
    • exclude_entity_tags list[str] - List of entity tags to ignore.
    • exclude_pos_tags list[str] - List of part-of-speech tags to ignore.
    • noun_phrase_tags list[str] - List of noun phrase tags to ignore.
    • noun_phrase_grammars dict[str, str] - Noun phrase grammars for the model (cfg-only).
    "}, {"location": "config/yaml/#prune_graph", "title": "prune_graph", "text": "

    Parameters for manual graph pruning. This can be used to optimize the modularity of your graph clusters, by removing overly-connected or rare nodes.

    "}, {"location": "config/yaml/#fields_12", "title": "Fields", "text": "
    • min_node_freq int - The minimum node frequency to allow.
    • max_node_freq_std float | None - The maximum standard deviation of node frequency to allow.
    • min_node_degree int - The minimum node degree to allow.
    • max_node_degree_std float | None - The maximum standard deviation of node degree to allow.
    • min_edge_weight_pct float - The minimum edge weight percentile to allow.
    • remove_ego_nodes bool - Remove ego nodes.
    • lcc_only bool - Only use largest connected component.
    "}, {"location": "config/yaml/#cluster_graph", "title": "cluster_graph", "text": "

    These are the settings used for Leiden hierarchical clustering of the graph to create communities.

    "}, {"location": "config/yaml/#fields_13", "title": "Fields", "text": "
    • max_cluster_size int - The maximum cluster size to export.
    • use_lcc bool - Whether to only use the largest connected component.
    • seed int - A randomization seed to provide if consistent run-to-run results are desired. We do provide a default in order to guarantee clustering stability.
    "}, {"location": "config/yaml/#extract_claims", "title": "extract_claims", "text": ""}, {"location": "config/yaml/#fields_14", "title": "Fields", "text": "
    • enabled bool - Whether to enable claim extraction. Off by default, because claim prompts really need user tuning.
    • model_id str - Name of the model definition to use for API calls.
    • prompt str - The prompt file to use.
    • description str - Describes the types of claims we want to extract.
    • max_gleanings int - The maximum number of gleaning cycles to use.
    "}, {"location": "config/yaml/#community_reports", "title": "community_reports", "text": ""}, {"location": "config/yaml/#fields_15", "title": "Fields", "text": "
    • model_id str - Name of the model definition to use for API calls.
    • prompt str - The prompt file to use.
    • max_length int - The maximum number of output tokens per report.
    • max_input_length int - The maximum number of input tokens to use when generating reports.
    "}, {"location": "config/yaml/#embed_graph", "title": "embed_graph", "text": "

    We use node2vec to embed the graph. This is primarily used for visualization, so it is not turned on by default.

    "}, {"location": "config/yaml/#fields_16", "title": "Fields", "text": "
    • enabled bool - Whether to enable graph embeddings.
    • dimensions int - Number of vector dimensions to produce.
    • num_walks int - The node2vec number of walks.
    • walk_length int - The node2vec walk length.
    • window_size int - The node2vec window size.
    • iterations int - The node2vec number of iterations.
    • random_seed int - The node2vec random seed.
    • strategy dict - Fully override the embed graph strategy.
    "}, {"location": "config/yaml/#umap", "title": "umap", "text": "

    Indicates whether we should run UMAP dimensionality reduction. This is used to provide an x/y coordinate to each graph node, suitable for visualization. If this is not enabled, nodes will receive a 0/0 x/y coordinate. If this is enabled, you must enable graph embedding as well.

    "}, {"location": "config/yaml/#fields_17", "title": "Fields", "text": "
    • enabled bool - Whether to enable UMAP layouts.
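
    Because UMAP layouts require graph embeddings, an illustrative settings.yaml sketch enables both; the node2vec values are placeholders, not recommended defaults:

    embed_graph:\n    enabled: true\n    dimensions: 1536\n    num_walks: 10\n    walk_length: 40\n    window_size: 2\n    iterations: 3\n    random_seed: 86\n\numap:\n    enabled: true\n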
    "}, {"location": "config/yaml/#snapshots", "title": "snapshots", "text": ""}, {"location": "config/yaml/#fields_18", "title": "Fields", "text": "
    • embeddings bool - Export embeddings snapshots to parquet.
    • graphml bool - Export graph snapshots to GraphML.
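
    For example, to snapshot the graph but not the embeddings (a sketch, not a recommendation):

    snapshots:\n    embeddings: false\n    graphml: true\n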
    "}, {"location": "config/yaml/#query", "title": "Query", "text": ""}, {"location": "config/yaml/#local_search", "title": "local_search", "text": ""}, {"location": "config/yaml/#fields_19", "title": "Fields", "text": "
    • chat_model_id str - Name of the model definition to use for Chat Completion calls.
    • embedding_model_id str - Name of the model definition to use for Embedding calls.
    • prompt str - The prompt file to use.
    • text_unit_prop float - The text unit proportion.
    • community_prop float - The community proportion.
    • conversation_history_max_turns int - The conversation history maximum turns.
    • top_k_entities int - The top k mapped entities.
    • top_k_relationships int - The top k mapped relationships.
    • max_context_tokens int - The maximum tokens to use building the request context.
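
    An illustrative settings.yaml sketch for local search (model names and numeric values are placeholders):

    local_search:\n    chat_model_id: default_chat_model\n    embedding_model_id: default_embedding_model\n    text_unit_prop: 0.5\n    community_prop: 0.15\n    top_k_entities: 10\n    top_k_relationships: 10\n    max_context_tokens: 12000\n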
    "}, {"location": "config/yaml/#global_search", "title": "global_search", "text": ""}, {"location": "config/yaml/#fields_20", "title": "Fields", "text": "
    • chat_model_id str - Name of the model definition to use for Chat Completion calls.
    • map_prompt str | None - The mapper prompt file to use.
    • reduce_prompt str | None - The reducer prompt file to use.
    • knowledge_prompt str | None - The general knowledge prompt file to use.
    • max_context_tokens int - The maximum context size to create, in tokens.
    • data_max_tokens int - The maximum tokens to use when constructing the final response from the reduce responses.
    • map_max_length int - The maximum length to request for map responses, in words.
    • reduce_max_length int - The maximum length to request for reduce responses, in words.
    • dynamic_search_threshold int - Rating threshold to include a community report.
    • dynamic_search_keep_parent bool - Keep parent community if any of the child communities are relevant.
    • dynamic_search_num_repeats int - Number of times to rate the same community report.
    • dynamic_search_use_summary bool - Use community summary instead of full_context.
    • dynamic_search_max_level int - The maximum level of community hierarchy to consider if none of the processed communities are relevant.
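
    An illustrative settings.yaml sketch for global search (the model name and numeric values are placeholders):

    global_search:\n    chat_model_id: default_chat_model\n    max_context_tokens: 12000\n    map_max_length: 1000\n    reduce_max_length: 2000\n    dynamic_search_threshold: 1\n    dynamic_search_keep_parent: false\n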
    "}, {"location": "config/yaml/#drift_search", "title": "drift_search", "text": ""}, {"location": "config/yaml/#fields_21", "title": "Fields", "text": "
    • chat_model_id str - Name of the model definition to use for Chat Completion calls.
    • embedding_model_id str - Name of the model definition to use for Embedding calls.
    • prompt str - The prompt file to use.
    • reduce_prompt str - The reducer prompt file to use.
    • data_max_tokens int - The data LLM maximum tokens.
    • reduce_max_tokens int - The maximum tokens for the reduce phase. Only use with non-o-series models.
    • reduce_max_completion_tokens int - The maximum tokens for the reduce phase. Only use for o-series models.
    • concurrency int - The number of concurrent requests.
    • drift_k_followups int - The number of top global results to retrieve.
    • primer_folds int - The number of folds for search priming.
    • primer_llm_max_tokens int - The maximum number of tokens for the LLM in primer.
    • n_depth int - The number of drift search steps to take.
    • local_search_text_unit_prop float - The proportion of search dedicated to text units.
    • local_search_community_prop float - The proportion of search dedicated to community properties.
    • local_search_top_k_mapped_entities int - The number of top K entities to map during local search.
    • local_search_top_k_relationships int - The number of top K relationships to map during local search.
    • local_search_max_data_tokens int - The maximum context size in tokens for local search.
    • local_search_temperature float - The temperature to use for token generation in local search.
    • local_search_top_p float - The top-p value to use for token generation in local search.
    • local_search_n int - The number of completions to generate in local search.
    • local_search_llm_max_gen_tokens int - The maximum number of generated tokens for the LLM in local search. Only use with non-o-series models.
    • local_search_llm_max_gen_completion_tokens int - The maximum number of generated tokens for the LLM in local search. Only use for o-series models.
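
    An illustrative settings.yaml sketch for DRIFT search (model names and numeric values are placeholders):

    drift_search:\n    chat_model_id: default_chat_model\n    embedding_model_id: default_embedding_model\n    drift_k_followups: 20\n    n_depth: 3\n    concurrency: 32\n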
    "}, {"location": "config/yaml/#basic_search", "title": "basic_search", "text": ""}, {"location": "config/yaml/#fields_22", "title": "Fields", "text": "
    • chat_model_id str - Name of the model definition to use for Chat Completion calls.
    • embedding_model_id str - Name of the model definition to use for Embedding calls.
    • prompt str - The prompt file to use.
    • k int | None - Number of text units to retrieve from the vector store for context building.
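
    An illustrative settings.yaml sketch for basic search (model names and the k value are placeholders):

    basic_search:\n    chat_model_id: default_chat_model\n    embedding_model_id: default_embedding_model\n    k: 10\n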
    "}, {"location": "data/operation_dulce/ABOUT/", "title": "About", "text": "

    This document (Operation Dulce) is an AI-generated science fiction novella, included here for the purposes of integration testing.

    "}, {"location": "index/byog/", "title": "Bring Your Own Graph", "text": "

    Several users have asked if they can bring their own existing graph and have it summarized for query with GraphRAG. There are many possible ways to do this, but here we'll describe a simple method that aligns with the existing GraphRAG workflows quite easily.

    To cover the basic use cases for GraphRAG query, you should have two or three tables derived from your data:

    • entities.parquet - this is the list of entities found in the dataset, which are the nodes of the graph.
    • relationships.parquet - this is the list of relationships found in the dataset, which are the edges of the graph.
    • text_units.parquet - this is the source text chunks the graph was extracted from. This is optional depending on the query method you intend to use (described later).

    The approach described here will be to run a custom GraphRAG workflow pipeline that assumes the text chunking, entity extraction, and relationship extraction have already occurred.

    "}, {"location": "index/byog/#tables", "title": "Tables", "text": ""}, {"location": "index/byog/#entities", "title": "Entities", "text": "

    See the full entities table schema. For graph summarization purposes, you only need id, title, description, and the list of text_unit_ids.

    The additional properties are used for optional graph visualization purposes.

    "}, {"location": "index/byog/#relationships", "title": "Relationships", "text": "

    See the full relationships table schema. For graph summarization purposes, you only need id, source, target, description, weight, and the list of text_unit_ids.

    Note: the weight field is important because it is used to properly compute Leiden communities!

    "}, {"location": "index/byog/#workflow-configuration", "title": "Workflow Configuration", "text": "

    GraphRAG lets you specify only the workflow steps that you need. For basic graph summarization and query, you need the following config in your settings.yaml:

    workflows: [create_communities, create_community_reports]\n

    This will result in only the minimal workflows required for GraphRAG Global Search.

    "}, {"location": "index/byog/#optional-additional-config", "title": "Optional Additional Config", "text": "

    If you would like to run Local, DRIFT, or Basic Search, you will need to include text_units and some embeddings.

    "}, {"location": "index/byog/#text-units", "title": "Text Units", "text": "

    See the full text_units table schema. Text units are chunks of your documents that are sized to ensure they fit into the context window of your model. Some search methods use these, so you may want to include them if you have them.

    "}, {"location": "index/byog/#expanded-config", "title": "Expanded Config", "text": "

    To perform the other search types above, you need some of the content to be embedded. Simply add the embeddings workflow:

    workflows: [create_communities, create_community_reports, generate_text_embeddings]\n
    "}, {"location": "index/byog/#fastgraphrag", "title": "FastGraphRAG", "text": "

    FastGraphRAG uses text_units for the community reports instead of the entity and relationship descriptions. If your graph is sourced in such a way that it does not have descriptions, this might be a useful alternative. In this case, you would update your workflows list to include the text variant of the community reports workflow:

    workflows: [create_communities, create_community_reports_text, generate_text_embeddings]\n

    This method requires that your entities and relationships tables have valid links to a list of text_unit_ids. Also note that generate_text_embeddings is still only required if you are doing searches other than Global Search.

    "}, {"location": "index/byog/#setup", "title": "Setup", "text": "

    Putting it all together:

    • output: Create an output folder and put your entities and relationships (and optionally text_units) parquet files in it.
    • Update your config as noted above to only run the workflows subset you need.
    • Run graphrag index --root <your project root>
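
    As a sketch, a minimal settings.yaml for this setup might look like the following; the output block keys and base_dir value are assumptions for illustration, so adjust them to your project layout:

    output:\n    type: file\n    base_dir: output # folder containing entities.parquet, relationships.parquet, and optionally text_units.parquet\n\nworkflows: [create_communities, create_community_reports]\n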
    "}, {"location": "index/default_dataflow/", "title": "Indexing Dataflow", "text": ""}, {"location": "index/default_dataflow/#the-graphrag-knowledge-model", "title": "The GraphRAG Knowledge Model", "text": "

    The knowledge model is a specification for data outputs that conform to our data-model definition. You can find these definitions in the python/graphrag/graphrag/model folder within the GraphRAG repository. The following entity types are provided. The fields here represent the fields that are text-embedded by default.

    • Document - An input document into the system. These represent either individual rows in a CSV or individual .txt files.
    • TextUnit - A chunk of text to analyze. The size of these chunks, their overlap, and whether they adhere to any data boundaries may be configured below. A common use case is to set CHUNK_BY_COLUMNS to id so that there is a 1-to-many relationship between documents and TextUnits instead of a many-to-many.
    • Entity - An entity extracted from a TextUnit. These represent people, places, events, or some other entity-model that you provide.
    • Relationship - A relationship between two entities.
    • Covariate - Extracted claim information, which contains statements about entities which may be time-bound.
    • Community - Once the graph of entities and relationships is built, we perform hierarchical community detection on them to create a clustering structure.
    • Community Report - The contents of each community are summarized into a generated report, useful for human reading and downstream search.
    "}, {"location": "index/default_dataflow/#the-default-configuration-workflow", "title": "The Default Configuration Workflow", "text": "

    Let's take a look at how the default-configuration workflow transforms text documents into the GraphRAG Knowledge Model. This page gives a general overview of the major steps in this process. To fully configure this workflow, check out the configuration documentation.

    ---\ntitle: Dataflow Overview\n---\nflowchart TB\n    subgraph phase1[Phase 1: Compose TextUnits]\n    documents[Documents] --> chunk[Chunk]\n    chunk --> textUnits[Text Units]\n    end\n    subgraph phase2[Phase 2: Graph Extraction]\n    textUnits --> graph_extract[Entity & Relationship Extraction]\n    graph_extract --> graph_summarize[Entity & Relationship Summarization]\n    graph_summarize --> claim_extraction[Claim Extraction]\n    claim_extraction --> graph_outputs[Graph Tables]\n    end\n    subgraph phase3[Phase 3: Graph Augmentation]\n    graph_outputs --> community_detect[Community Detection]\n    community_detect --> community_outputs[Communities Table]\n    end\n    subgraph phase4[Phase 4: Community Summarization]\n    community_outputs --> summarized_communities[Community Summarization]\n    summarized_communities --> community_report_outputs[Community Reports Table]\n    end\n    subgraph phase5[Phase 5: Document Processing]\n    documents --> link_to_text_units[Link to TextUnits]\n    textUnits --> link_to_text_units\n    link_to_text_units --> document_outputs[Documents Table]\n    end\n    subgraph phase6[Phase 6: Network Visualization]\n    graph_outputs --> graph_embed[Graph Embedding]\n    graph_embed --> umap_entities[Umap Entities]\n    umap_entities --> combine_nodes[Final Entities]\n    end\n    subgraph phase7[Phase 7: Text Embeddings]\n    textUnits --> text_embed[Text Embedding]\n    graph_outputs --> description_embed[Description Embedding]\n    community_report_outputs --> content_embed[Content Embedding]\n    end
    "}, {"location": "index/default_dataflow/#phase-1-compose-textunits", "title": "Phase 1: Compose TextUnits", "text": "

    The first phase of the default-configuration workflow is to transform input documents into TextUnits. A TextUnit is a chunk of text that is used for our graph extraction techniques. They are also used as source references by extracted knowledge items, providing breadcrumbs and provenance that trace concepts back to their original source text.

    The chunk size (counted in tokens) is user-configurable. By default this is set to 300 tokens, although we've had positive experience with 1200-token chunks using a single \"glean\" step. (A \"glean\" step is a follow-on extraction.) Larger chunks result in lower-fidelity output and less meaningful reference texts; however, using larger chunks can result in much faster processing time.

    The group-by configuration is also user-configurable. By default, we align our chunks to document boundaries, meaning that there is a strict 1-to-many relationship between Documents and TextUnits. In rare cases, this can be turned into a many-to-many relationship. This is useful when the documents are very short and we need several of them to compose a meaningful analysis unit (e.g. tweets or a chat log).

    ---\ntitle: Documents into Text Chunks\n---\nflowchart LR\n    doc1[Document 1] --> tu1[TextUnit 1]\n    doc1 --> tu2[TextUnit 2]\n    doc2[Document 2] --> tu3[TextUnit 3]\n    doc2 --> tu4[TextUnit 4]\n
    "}, {"location": "index/default_dataflow/#phase-2-graph-extraction", "title": "Phase 2: Graph Extraction", "text": "

    In this phase, we analyze each text unit and extract our graph primitives: Entities, Relationships, and Claims. Entities and Relationships are extracted at once in our entity_extract verb, and claims are extracted in our claim_extract verb. Results are then combined and passed into the following phases of the pipeline.

    ---\ntitle: Graph Extraction\n---\nflowchart LR\n    tu[TextUnit] --> ge[Graph Extraction] --> gs[Graph Summarization]\n    tu --> ce[Claim Extraction]
    "}, {"location": "index/default_dataflow/#entity-relationship-extraction", "title": "Entity & Relationship Extraction", "text": "

    In this first step of graph extraction, we process each text-unit in order to extract entities and relationships out of the raw text using the LLM. The output of this step is a subgraph-per-TextUnit containing a list of entities with a title, type, and description, and a list of relationships with a source, target, and description.

    These subgraphs are merged together - any entities with the same title and type are merged by creating an array of their descriptions. Similarly, any relationships with the same source and target are merged by creating an array of their descriptions.

    "}, {"location": "index/default_dataflow/#entity-relationship-summarization", "title": "Entity & Relationship Summarization", "text": "

    Now that we have a graph of entities and relationships, each with a list of descriptions, we can summarize these lists into a single description per entity and relationship. This is done by asking the LLM for a short summary that captures all of the distinct information from each description. This allows all of our entities and relationships to have a single concise description.

    "}, {"location": "index/default_dataflow/#claim-extraction-optional", "title": "Claim Extraction (optional)", "text": "

    Finally, as an independent workflow, we extract claims from the source TextUnits. These claims represent positive factual statements with an evaluated status and time-bounds. These get exported as a primary artifact called Covariates.

    Note: claim extraction is optional and turned off by default. This is because claim extraction generally requires prompt tuning to be useful.

    "}, {"location": "index/default_dataflow/#phase-3-graph-augmentation", "title": "Phase 3: Graph Augmentation", "text": "

    Now that we have a usable graph of entities and relationships, we want to understand their community structure. This gives us an explicit way of understanding the topological structure of our graph.

    ---\ntitle: Graph Augmentation\n---\nflowchart LR\n    cd[Leiden Hierarchical Community Detection] --> ag[Graph Tables]
    "}, {"location": "index/default_dataflow/#community-detection", "title": "Community Detection", "text": "

    In this step, we generate a hierarchy of entity communities using the Hierarchical Leiden Algorithm. This method will apply a recursive community-clustering to our graph until we reach a community-size threshold. This will allow us to understand the community structure of our graph and provide a way to navigate and summarize the graph at different levels of granularity.

    "}, {"location": "index/default_dataflow/#graph-tables", "title": "Graph Tables", "text": "

    Once our graph augmentation steps are complete, the final Entities, Relationships, and Communities tables are exported.

    "}, {"location": "index/default_dataflow/#phase-4-community-summarization", "title": "Phase 4: Community Summarization", "text": "
    ---\ntitle: Community Summarization\n---\nflowchart LR\n    sc[Generate Community Reports] --> ss[Summarize Community Reports] --> co[Community Reports Table]

    At this point, we have a functional graph of entities and relationships and a hierarchy of communities for the entities.

    Now we want to build on the communities data and generate reports for each community. This gives us a high-level understanding of the graph at several points of graph granularity. For example, if community A is the top-level community, we'll get a report about the entire graph. If the community is lower-level, we'll get a report about a local cluster.

    "}, {"location": "index/default_dataflow/#generate-community-reports", "title": "Generate Community Reports", "text": "

    In this step, we generate a summary of each community using the LLM. This will allow us to understand the distinct information contained within each community and provide a scoped understanding of the graph, from either a high-level or a low-level perspective. These reports contain an executive overview and reference the key entities, relationships, and claims within the community sub-structure.

    "}, {"location": "index/default_dataflow/#summarize-community-reports", "title": "Summarize Community Reports", "text": "

    In this step, each community report is then summarized via the LLM for shorthand use.

    "}, {"location": "index/default_dataflow/#community-reports-table", "title": "Community Reports Table", "text": "

    At this point, some bookkeeping work is performed and we export the Community Reports tables.

    "}, {"location": "index/default_dataflow/#phase-5-document-processing", "title": "Phase 5: Document Processing", "text": "

    In this phase of the workflow, we create the Documents table for the knowledge model.

    ---\ntitle: Document Processing\n---\nflowchart LR\n    aug[Augment] --> dp[Link to TextUnits] --> dg[Documents Table]
    "}, {"location": "index/default_dataflow/#augment-with-columns-csv-only", "title": "Augment with Columns (CSV Only)", "text": "

    If the workflow is operating on CSV data, you may configure your workflow to add additional fields to the Documents output. These fields should exist on the incoming CSV tables. Details about configuring this can be found in the configuration documentation.

    "}, {"location": "index/default_dataflow/#link-to-textunits", "title": "Link to TextUnits", "text": "

    In this step, we link each document to the text-units that were created in the first phase. This allows us to understand which documents are related to which text-units and vice-versa.

    "}, {"location": "index/default_dataflow/#documents-table", "title": "Documents Table", "text": "

    At this point, we can export the Documents table into the knowledge model.

    "}, {"location": "index/default_dataflow/#phase-6-network-visualization-optional", "title": "Phase 6: Network Visualization (optional)", "text": "

    In this phase of the workflow, we perform some steps to support network visualization of our high-dimensional vector spaces within our existing graphs. At this point there are two logical graphs at play: the Entity-Relationship graph and the Document graph.

    ---\ntitle: Network Visualization Workflows\n---\nflowchart LR\n    ag[Graph Table] --> ge[Node2Vec Graph Embedding] --> ne[Umap Entities] --> ng[Entities Table]
    "}, {"location": "index/default_dataflow/#graph-embedding", "title": "Graph Embedding", "text": "

    In this step, we generate a vector representation of our graph using the Node2Vec algorithm. This will allow us to understand the implicit structure of our graph and provide an additional vector-space in which to search for related concepts during our query phase.

    "}, {"location": "index/default_dataflow/#dimensionality-reduction", "title": "Dimensionality Reduction", "text": "

    For each of the logical graphs, we perform a UMAP dimensionality reduction to generate a 2D representation of the graph. This will allow us to visualize the graph in a 2D space and understand the relationships between the nodes in the graph. The UMAP embeddings are reduced to two dimensions as x/y coordinates.

    "}, {"location": "index/default_dataflow/#phase-7-text-embedding", "title": "Phase 7: Text Embedding", "text": "

    For all artifacts that require downstream vector search, we generate text embeddings as a final step. These embeddings are written directly to a configured vector store. By default we embed entity descriptions, text unit text, and community report text.

    ---\ntitle: Text Embedding Workflows\n---\nflowchart LR\n    textUnits[Text Units] --> text_embed[Text Embedding]\n    graph_outputs[Graph Tables] --> description_embed[Description Embedding]\n    community_report_outputs[Community Reports] --> content_embed[Content Embedding]
    "}, {"location": "index/inputs/", "title": "Inputs", "text": "

    GraphRAG supports several input formats to simplify ingesting your data. The mechanics and features available for input files and text chunking are discussed here.

    "}, {"location": "index/inputs/#input-loading-and-schema", "title": "Input Loading and Schema", "text": "

    All input formats are loaded within GraphRAG and passed to the indexing pipeline as a documents DataFrame. This DataFrame has a row for each document using a shared column schema:

    • id str - ID of the document. This is generated using a hash of the text content to ensure stability across runs.
    • text str - The full text of the document.
    • title str - Name of the document. Some formats allow this to be configured.
    • creation_date str - The creation date of the document, represented as an ISO8601 string. This is harvested from the source file system.
    • metadata dict - Optional additional document metadata. More details below.

    Also see the outputs documentation for the final documents table schema saved to parquet after pipeline completion.

    "}, {"location": "index/inputs/#formats", "title": "Formats", "text": "

    We support three file formats out-of-the-box. This covers the overwhelming majority of use cases we have encountered. If you have a different format, we recommend writing a script to convert to one of these, which are widely used and supported by many tools and libraries.

    "}, {"location": "index/inputs/#plain-text", "title": "Plain Text", "text": "

    Plain text files (typically ending in a .txt extension). With plain text files, we import the entire file contents as the text field, and the title is always the filename.

    "}, {"location": "index/inputs/#comma-delimited", "title": "Comma-delimited", "text": "

    CSV files (typically ending in a .csv extension). These are loaded using pandas' read_csv method with default options. Each row in a CSV file is treated as a single document. If you have multiple CSV files in your input folder, they will be concatenated into a single resulting documents DataFrame.

    With the CSV format you can configure the text_column and title_column if your data has structured content you would prefer to use. If you do not configure these within the input block of your settings.yaml, the title will be the filename as described in the schema above. The text_column is assumed to be \"text\" in your file if not configured specifically. We will also look for and use an \"id\" column if present; otherwise the ID will be generated as described above.

    "}, {"location": "index/inputs/#json", "title": "JSON", "text": "

    JSON files (typically ending in a .json extension) contain structured objects. These are loaded using Python's json.loads method, so your files must be properly compliant. JSON files may contain either a single object or an array of objects at the root; we will check for and handle both cases. As with CSV, multiple files will be concatenated into a final table, and the text_column and title_column config options will be applied to the properties of each loaded object. Note that the specialized jsonl format produced by some libraries (one full JSON object on each line, not in an array) is not currently supported.

    "}, {"location": "index/inputs/#metadata", "title": "Metadata", "text": "

    With the structured file formats (CSV and JSON) you can configure any number of columns to be added to a persisted metadata field in the DataFrame. This is configured by supplying a list of column names to collect. If this is configured, the output metadata column will contain a dict with a key for each column and the value of that column for the document. This metadata can optionally be used later in the GraphRAG pipeline.

    "}, {"location": "index/inputs/#example", "title": "Example", "text": "

    software.csv

    text,title,tag\nMy first program,Hello World,tutorial\nAn early space shooter game,Space Invaders,arcade\n

    settings.yaml

    input:\n    metadata: [title,tag]\n

    Documents DataFrame

    id title text creation_date metadata (generated from text) Hello World My first program (create date of software.csv) { \"title\": \"Hello World\", \"tag\": \"tutorial\" } (generated from text) Space Invaders An early space shooter game (create date of software.csv) { \"title\": \"Space Invaders\", \"tag\": \"arcade\" }"}, {"location": "index/inputs/#chunking-and-metadata", "title": "Chunking and Metadata", "text": "

    As described on the default dataflow page, documents are chunked into smaller \"text units\" for processing. This is done because document content size often exceeds the available context window for a given language model. There are a handful of settings you can adjust for this chunking, the most relevant being the chunk_size and overlap. We now also support a metadata processing scheme that can improve indexing results for some use cases. We will describe this feature in detail here.

    Imagine the following scenario: you are indexing a collection of news articles. Each article text starts with a headline and author, and then proceeds with the content. When documents are chunked, they are split evenly according to your configured chunk size. In other words, the first n tokens are read into a text unit, and then the next n, until the end of the content. This means that front matter at the beginning of the document (such as the headline and author in this example) is not copied to each chunk. It only exists in the first chunk. When we later retrieve those chunks for summarization, they may therefore be missing shared information about the source document that should always be provided to the model. We have configuration options to copy repeated content into each text unit to address this issue.

    "}, {"location": "index/inputs/#input-config", "title": "Input Config", "text": "

    As described above, when documents are imported you can specify a list of metadata columns to include with each row. This must be configured for the per-chunk copying to work.

    "}, {"location": "index/inputs/#chunking-config", "title": "Chunking Config", "text": "

    Next, the chunks block needs to instruct the chunker how to handle this metadata when creating text units. By default, it is ignored. We have two settings to include it:

    • prepend_metadata: This instructs the importer to copy the contents of the metadata column for each row into the start of every single text chunk. This metadata is copied as key: value pairs on new lines.
    • chunk_size_includes_metadata: This tells the chunker how to compute the chunk size when metadata is included. By default, we create the text units using your specified chunk_size and then prepend the metadata. This means that the final text unit lengths may be longer than your configured chunk_size, and it will vary based on the length of the metadata for each document. When this setting is True, we will compute the raw text using the remainder after measuring the metadata length so that the resulting text units always comply with your configured chunk_size.
    "}, {"location": "index/inputs/#examples", "title": "Examples", "text": "

    The following are several examples to help illustrate how the chunking config and metadata prepending work for each file format. Note that we are using word count here as \"tokens\" for the illustration, but language model tokens are not equivalent to words.

    "}, {"location": "index/inputs/#text-files", "title": "Text files", "text": "

    This example uses two individual news article text files.

    --

    File: US to lift most federal COVID-19 vaccine mandates.txt

    Content:

    WASHINGTON (AP) The Biden administration will end most of the last remaining federal COVID-19 vaccine requirements next week when the national public health emergency for the coronavirus ends, the White House said Monday. Vaccine requirements for federal workers and federal contractors, as well as foreign air travelers to the U.S., will end May 11. The government is also beginning the process of lifting shot requirements for Head Start educators, healthcare workers, and noncitizens at U.S. land borders. The requirements are among the last vestiges of some of the more coercive measures taken by the federal government to promote vaccination as the deadly virus raged, and their end marks the latest display of how President Joe Biden's administration is moving to treat COVID-19 as a routine, endemic illness. \"While I believe that these vaccine mandates had a tremendous beneficial impact, we are now at a point where we think that it makes a lot of sense to pull these requirements down,\" White House COVID-19 coordinator Dr. Ashish Jha told The Associated Press on Monday.

    --

    File: NY lawmakers begin debating budget 1 month after due date.txt

    Content:

    ALBANY, N.Y. (AP) New York lawmakers began voting Monday on a $229 billion state budget due a month ago that would raise the minimum wage, crack down on illicit pot shops and ban gas stoves and furnaces in new buildings. Negotiations among Gov. Kathy Hochul and her fellow Democrats in control of the Legislature dragged on past the April 1 budget deadline, largely because of disagreements over changes to the bail law and other policy proposals included in the spending plan. Floor debates on some budget bills began Monday. State Senate Majority Leader Andrea Stewart-Cousins said she expected voting to be wrapped up Tuesday for a budget she said contains \"significant wins\" for New Yorkers. \"I would have liked to have done this sooner. I think we would all agree to that,\" Cousins told reporters before voting began. \"This has been a very policy-laden budget and a lot of the policies had to parsed through.\" Hochul was able to push through a change to the bail law that will eliminate the standard that requires judges to prescribe the \"least restrictive\" means to ensure defendants return to court. Hochul said judges needed the extra discretion. Some liberal lawmakers argued that it would undercut the sweeping bail reforms approved in 2019 and result in more people with low incomes and people of color in pretrial detention. Here are some other policy provisions that will be included in the budget, according to state officials. The minimum wage would be raised to $17 in New York City and some of its suburbs and $16 in the rest of the state by 2026. That's up from $15 in the city and $14.20 upstate.

    --

    settings.yaml

    input:\n    file_type: text\n    metadata: [title]\n\nchunks:\n    size: 100\n    overlap: 0\n    prepend_metadata: true\n    chunk_size_includes_metadata: false\n

    Documents DataFrame

    id title text creation_date metadata (generated from text) US to lift most federal COVID-19 vaccine mandates.txt (full content of text file) (create date of article txt file) { \"title\": \"US to lift most federal COVID-19 vaccine mandates.txt\" } (generated from text) NY lawmakers begin debating budget 1 month after due date.txt (full content of text file) (create date of article txt file) { \"title\": \"NY lawmakers begin debating budget 1 month after due date.txt\" }

    Raw Text Chunks

    content length title: US to lift most federal COVID-19 vaccine mandates.txtWASHINGTON (AP) The Biden administration will end most of the last remaining federal COVID-19 vaccine requirements next week when the national public health emergency for the coronavirus ends, the White House said Monday. Vaccine requirements for federal workers and federal contractors, as well as foreign air travelers to the U.S., will end May 11. The government is also beginning the process of lifting shot requirements for Head Start educators, healthcare workers, and noncitizens at U.S. land borders. The requirements are among the last vestiges of some of the more coercive measures taken by the federal government to promote vaccination as 109 title: US to lift most federal COVID-19 vaccine mandates.txtthe deadly virus raged, and their end marks the latest display of how President Joe Biden's administration is moving to treat COVID-19 as a routine, endemic illness. \"While I believe that these vaccine mandates had a tremendous beneficial impact, we are now at a point where we think that it makes a lot of sense to pull these requirements down,\" White House COVID-19 coordinator Dr. Ashish Jha told The Associated Press on Monday. 82 title: NY lawmakers begin debating budget 1 month after due date.txtALBANY, N.Y. (AP) New York lawmakers began voting Monday on a $229 billion state budget due a month ago that would raise the minimum wage, crack down on illicit pot shops and ban gas stoves and furnaces in new buildings. Negotiations among Gov. Kathy Hochul and her fellow Democrats in control of the Legislature dragged on past the April 1 budget deadline, largely because of disagreements over changes to the bail law and other policy proposals included in the spending plan. Floor debates on some budget bills began Monday. State Senate Majority Leader Andrea Stewart-Cousins said she expected voting to 111 title: NY lawmakers begin debating budget 1 month after due date.txtbe wrapped up Tuesday for a budget she said contains \"significant wins\" for New Yorkers. \"I would have liked to have done this sooner. I think we would all agree to that,\" Cousins told reporters before voting began. \"This has been a very policy-laden budget and a lot of the policies had to parsed through.\" Hochul was able to push through a change to the bail law that will eliminate the standard that requires judges to prescribe the \"least restrictive\" means to ensure defendants return to court. Hochul said judges needed the extra discretion. Some liberal lawmakers argued that it 111 title: NY lawmakers begin debating budget 1 month after due date.txtwould undercut the sweeping bail reforms approved in 2019 and result in more people with low incomes and people of color in pretrial detention. Here are some other policy provisions that will be included in the budget, according to state officials. The minimum wage would be raised to $17 in New York City and some of its suburbs and $16 in the rest of the state by 2026. That's up from $15 in the city and $14.20 upstate. 89

    In this example we can see that the two input documents were parsed into five output text chunks. The title (filename) of each document is prepended but not included in the computed chunk size. Also note that the final text chunk for each document is usually smaller than the chunk size because it contains whatever tokens remain.

    "}, {"location": "index/inputs/#csv-files", "title": "CSV files", "text": "

    This example uses a single CSV file with the same two articles as rows (note that the text content is not properly escaped for actual CSV use).

    --

    File: articles.csv

    Content

    headline,article

    US to lift most federal COVID-19 vaccine mandates,WASHINGTON (AP) The Biden administration will end most of the last remaining federal COVID-19 vaccine requirements next week when the national public health emergency for the coronavirus ends, the White House said Monday. Vaccine requirements for federal workers and federal contractors, as well as foreign air travelers to the U.S., will end May 11. The government is also beginning the process of lifting shot requirements for Head Start educators, healthcare workers, and noncitizens at U.S. land borders. The requirements are among the last vestiges of some of the more coercive measures taken by the federal government to promote vaccination as the deadly virus raged, and their end marks the latest display of how President Joe Biden's administration is moving to treat COVID-19 as a routine, endemic illness. \"While I believe that these vaccine mandates had a tremendous beneficial impact, we are now at a point where we think that it makes a lot of sense to pull these requirements down,\" White House COVID-19 coordinator Dr. Ashish Jha told The Associated Press on Monday.

    NY lawmakers begin debating budget 1 month after due date,ALBANY, N.Y. (AP) New York lawmakers began voting Monday on a $229 billion state budget due a month ago that would raise the minimum wage, crack down on illicit pot shops and ban gas stoves and furnaces in new buildings. Negotiations among Gov. Kathy Hochul and her fellow Democrats in control of the Legislature dragged on past the April 1 budget deadline, largely because of disagreements over changes to the bail law and other policy proposals included in the spending plan. Floor debates on some budget bills began Monday. State Senate Majority Leader Andrea Stewart-Cousins said she expected voting to be wrapped up Tuesday for a budget she said contains \"significant wins\" for New Yorkers. \"I would have liked to have done this sooner. I think we would all agree to that,\" Cousins told reporters before voting began. \"This has been a very policy-laden budget and a lot of the policies had to parsed through.\" Hochul was able to push through a change to the bail law that will eliminate the standard that requires judges to prescribe the \"least restrictive\" means to ensure defendants return to court. Hochul said judges needed the extra discretion. Some liberal lawmakers argued that it would undercut the sweeping bail reforms approved in 2019 and result in more people with low incomes and people of color in pretrial detention. Here are some other policy provisions that will be included in the budget, according to state officials. The minimum wage would be raised to $17 in New York City and some of its suburbs and $16 in the rest of the state by 2026. That's up from $15 in the city and $14.20 upstate.

    --

    settings.yaml

    input:\n    file_type: csv\n    title_column: headline\n    text_column: article\n    metadata: [headline]\n\nchunks:\n    size: 50\n    overlap: 5\n    prepend_metadata: true\n    chunk_size_includes_metadata: true\n

    Documents DataFrame

    id title text creation_date metadata (generated from text) US to lift most federal COVID-19 vaccine mandates (article column content) (create date of articles.csv) { \"headline\": \"US to lift most federal COVID-19 vaccine mandates\" } (generated from text) NY lawmakers begin debating budget 1 month after due date (article column content) (create date of articles.csv) { \"headline\": \"NY lawmakers begin debating budget 1 month after due date\" }

    Raw Text Chunks

    content length title: US to lift most federal COVID-19 vaccine mandatesWASHINGTON (AP) The Biden administration will end most of the last remaining federal COVID-19 vaccine requirements next week when the national public health emergency for the coronavirus ends, the White House said Monday. Vaccine requirements for federal workers and federal contractors, 50 title: US to lift most federal COVID-19 vaccine mandatesfederal workers and federal contractors as well as foreign air travelers to the U.S., will end May 11. The government is also beginning the process of lifting shot requirements for Head Start educators, healthcare workers, and noncitizens at U.S. land borders. 50 title: US to lift most federal COVID-19 vaccine mandatesnoncitizens at U.S. land borders. The requirements are among the last vestiges of some of the more coercive measures taken by the federal government to promote vaccination as the deadly virus raged, and their end marks the latest display of how 50 title: US to lift most federal COVID-19 vaccine mandatesthe latest display of how President Joe Biden's administration is moving to treat COVID-19 as a routine, endemic illness. \"While I believe that these vaccine mandates had a tremendous beneficial impact, we are now at a point where we think that 50 title: US to lift most federal COVID-19 vaccine mandatespoint where we think that it makes a lot of sense to pull these requirements down,\" White House COVID-19 coordinator Dr. Ashish Jha told The Associated Press on Monday. 38 title: NY lawmakers begin debating budget 1 month after due dateALBANY, N.Y. (AP) New York lawmakers began voting Monday on a $229 billion state budget due a month ago that would raise the minimum wage, crack down on illicit pot shops and ban gas stoves and furnaces in new 50 title: NY lawmakers begin debating budget 1 month after due datestoves and furnaces in new buildings. Negotiations among Gov. Kathy Hochul and her fellow Democrats in control of the Legislature dragged on past the April 1 budget deadline, largely because of disagreements over changes to the bail law and 50 title: NY lawmakers begin debating budget 1 month after due dateto the bail law and other policy proposals included in the spending plan. Floor debates on some budget bills began Monday. State Senate Majority Leader Andrea Stewart-Cousins said she expected voting to be wrapped up Tuesday for a budget 50 title: NY lawmakers begin debating budget 1 month after due dateup Tuesday for a budget she said contains \"significant wins\" for New Yorkers. \"I would have liked to have done this sooner. I think we would all agree to that,\" Cousins told reporters before voting began. \"This has been 50 title: NY lawmakers begin debating budget 1 month after due datevoting began. \"This has been a very policy-laden budget and a lot of the policies had to parsed through.\" Hochul was able to push through a change to the bail law that will eliminate the standard that requires judges 50 title: NY lawmakers begin debating budget 1 month after due datethe standard that requires judges to prescribe the \"least restrictive\" means to ensure defendants return to court. Hochul said judges needed the extra discretion. Some liberal lawmakers argued that it would undercut the sweeping bail reforms approved in 2019 50 title: NY lawmakers begin debating budget 1 month after due datebail reforms approved in 2019 and result in more people with low incomes and people of color in pretrial detention. 
Here are some other policy provisions that will be included in the budget, according to state officials. The minimum 50 title: NY lawmakers begin debating budget 1 month after due dateto state officials. The minimum wage would be raised to $17 in be raised to $17 in New York City and some of its suburbs and $16 in the rest of the state by 2026. That's up from $15 50 title: NY lawmakers begin debating budget 1 month after due date2026. That's up from $15 in the city and $14.20 upstate. 22

    In this example we can see that the two input documents were parsed into fourteen output text chunks. The title (headline) of each document is prepended and included in the computed chunk size, so each chunk matches the configured chunk size (except the last one for each document). We've also configured some overlap in these text chunks, so the last five tokens are shared. Why would you use overlap in your text chunks? Consider that when you are splitting documents based on tokens, it is highly likely that sentences or even related concepts will be split into separate chunks. Each text chunk is processed separately by the language model, so this may result in incomplete \"ideas\" at the boundaries of the chunk. Overlap ensures that these split concepts are fully contained in at least one of the chunks.

    "}, {"location": "index/inputs/#json-files", "title": "JSON files", "text": "

    This final example uses a JSON file for each of the same two articles. In this example we'll set the object fields to read, but we will not add metadata to the text chunks.

    --

    File: article1.json

    Content

    {\n    \"headline\": \"US to lift most federal COVID-19 vaccine mandates\",\n    \"content\": \"WASHINGTON (AP) The Biden administration will end most of the last remaining federal COVID-19 vaccine requirements next week when the national public health emergency for the coronavirus ends, the White House said Monday. Vaccine requirements for federal workers and federal contractors, as well as foreign air travelers to the U.S., will end May 11. The government is also beginning the process of lifting shot requirements for Head Start educators, healthcare workers, and noncitizens at U.S. land borders. The requirements are among the last vestiges of some of the more coercive measures taken by the federal government to promote vaccination as the deadly virus raged, and their end marks the latest display of how President Joe Biden's administration is moving to treat COVID-19 as a routine, endemic illness. \"While I believe that these vaccine mandates had a tremendous beneficial impact, we are now at a point where we think that it makes a lot of sense to pull these requirements down,\" White House COVID-19 coordinator Dr. Ashish Jha told The Associated Press on Monday.\"\n}\n

    File: article2.json

    Content

    {\n    \"headline\": \"NY lawmakers begin debating budget 1 month after due date\",\n    \"content\": \"ALBANY, N.Y. (AP) New York lawmakers began voting Monday on a $229 billion state budget due a month ago that would raise the minimum wage, crack down on illicit pot shops and ban gas stoves and furnaces in new buildings. Negotiations among Gov. Kathy Hochul and her fellow Democrats in control of the Legislature dragged on past the April 1 budget deadline, largely because of disagreements over changes to the bail law and other policy proposals included in the spending plan. Floor debates on some budget bills began Monday. State Senate Majority Leader Andrea Stewart-Cousins said she expected voting to be wrapped up Tuesday for a budget she said contains \"significant wins\" for New Yorkers. \"I would have liked to have done this sooner. I think we would all agree to that,\" Cousins told reporters before voting began. \"This has been a very policy-laden budget and a lot of the policies had to parsed through.\" Hochul was able to push through a change to the bail law that will eliminate the standard that requires judges to prescribe the \"least restrictive\" means to ensure defendants return to court. Hochul said judges needed the extra discretion. Some liberal lawmakers argued that it would undercut the sweeping bail reforms approved in 2019 and result in more people with low incomes and people of color in pretrial detention. Here are some other policy provisions that will be included in the budget, according to state officials. The minimum wage would be raised to $17 in New York City and some of its suburbs and $16 in the rest of the state by 2026. That's up from $15 in the city and $14.20 upstate.\"\n}\n

    --

    settings.yaml

    input:\n    file_type: json\n    title_column: headline\n    text_column: content\n\nchunks:\n    size: 100\n    overlap: 10\n

    Documents DataFrame

    id title text creation_date metadata (generated from text) US to lift most federal COVID-19 vaccine mandates (article column content) (create date of article1.json) { } (generated from text) NY lawmakers begin debating budget 1 month after due date (article column content) (create date of article2.json) { }

    Raw Text Chunks

    content length WASHINGTON (AP) The Biden administration will end most of the last remaining federal COVID-19 vaccine requirements next week when the national public health emergency for the coronavirus ends, the White House said Monday. Vaccine requirements for federal workers and federal contractors, as well as foreign air travelers to the U.S., will end May 11. The government is also beginning the process of lifting shot requirements for Head Start educators, healthcare workers, and noncitizens at U.S. land borders. The requirements are among the last vestiges of some of the more coercive measures taken by the federal government to promote vaccination as 100 measures taken by the federal government to promote vaccination as the deadly virus raged, and their end marks the latest display of how President Joe Biden's administration is moving to treat COVID-19 as a routine, endemic illness. \"While I believe that these vaccine mandates had a tremendous beneficial impact, we are now at a point where we think that it makes a lot of sense to pull these requirements down,\" White House COVID-19 coordinator Dr. Ashish Jha told The Associated Press on Monday. 83 ALBANY, N.Y. (AP) New York lawmakers began voting Monday on a $229 billion state budget due a month ago that would raise the minimum wage, crack down on illicit pot shops and ban gas stoves and furnaces in new buildings. Negotiations among Gov. Kathy Hochul and her fellow Democrats in control of the Legislature dragged on past the April 1 budget deadline, largely because of disagreements over changes to the bail law and other policy proposals included in the spending plan. Floor debates on some budget bills began Monday. State Senate Majority Leader Andrea Stewart-Cousins said she expected voting to 100 Senate Majority Leader Andrea Stewart-Cousins said she expected voting to be wrapped up Tuesday for a budget she said contains \"significant wins\" for New Yorkers. \"I would have liked to have done this sooner. I think we would all agree to that,\" Cousins told reporters before voting began. \"This has been a very policy-laden budget and a lot of the policies had to parsed through.\" Hochul was able to push through a change to the bail law that will eliminate the standard that requires judges to prescribe the \"least restrictive\" means to ensure defendants return to court. Hochul said judges 100 means to ensure defendants return to court. Hochul said judges needed the extra discretion. Some liberal lawmakers argued that it would undercut the sweeping bail reforms approved in 2019 and result in more people with low incomes and people of color in pretrial detention. Here are some other policy provisions that will be included in the budget, according to state officials. The minimum wage would be raised to $17 in New York City and some of its suburbs and $16 in the rest of the state by 2026. That's up from $15 in the city and $14.20 upstate. 98

    In this example the two input documents were parsed into five output text chunks. There is no metadata prepended, so each chunk matches the configured chunk size (except the last one for each document). We've also configured some overlap in these text chunks, so the last ten tokens are shared.

    "}, {"location": "index/methods/", "title": "Indexing Methods", "text": "

    GraphRAG is a platform for our research into RAG indexing methods that produce optimal context window content for language models. We have a standard indexing pipeline that uses a language model to extract the graph that our memory model is based upon. We may introduce additional indexing methods from time to time. This page documents those options.

    "}, {"location": "index/methods/#standard-graphrag", "title": "Standard GraphRAG", "text": "

    This is the method described in the original blog post. Standard uses a language model for all reasoning tasks:

    • entity extraction: LLM is prompted to extract named entities and provide a description from each text unit.
    • relationship extraction: LLM is prompted to describe the relationship between each pair of entities in each text unit.
    • entity summarization: LLM is prompted to combine the descriptions for every instance of an entity found across the text units into a single summary.
    • relationship summarization: LLM is prompted to combine the descriptions for every instance of a relationship found across the text units into a single summary.
    • claim extraction (optional): LLM is prompted to extract and describe claims from each text unit.
    • community report generation: entity and relationship descriptions (and optionally claims) for each community are collected and used to prompt the LLM to generate a summary report.

    graphrag index --method standard. This is the default method, so the method param can actually be omitted.

    "}, {"location": "index/methods/#fastgraphrag", "title": "FastGraphRAG", "text": "

    FastGraphRAG is a method that replaces some of the language model reasoning with traditional natural language processing (NLP) methods. This is a hybrid technique that we developed as a faster and cheaper indexing alternative:

    • entity extraction: entities are noun phrases extracted using NLP libraries such as NLTK and spaCy. There is no description; the source text unit is used for this.
    • relationship extraction: relationships are defined as text unit co-occurrence between entity pairs. There is no description.
    • entity summarization: not necessary.
    • relationship summarization: not necessary.
    • claim extraction (optional): unused.
    • community report generation: The direct text unit content containing each entity noun phrase is collected and used to prompt the LLM to generate a summary report.

    graphrag index --method fast

    FastGraphRAG has a handful of NLP options built in. By default we use NLTK + regular expressions for the noun phrase extraction, which is very fast but primarily suitable for English. We have built in two additional methods using spaCy: semantic parsing and CFG. We use the en_core_web_md model by default for spaCy, but note that you can reference any supported model that you have installed.

    Note that we also generally configure the text chunking to produce much smaller chunks (50-100 tokens). This results in a better co-occurrence graph.
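    As a minimal sketch, smaller chunks can be set in your settings.yaml via the chunks settings (chunks.size and chunks.overlap, as referenced in the CLI documentation); the values here are illustrative rather than recommended defaults:

    chunks:\n  size: 100 # illustrative; 50-100 tokens tends to produce a better co-occurrence graph\n  overlap: 20 # illustrative\n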

    \u26a0\ufe0f Note on SpaCy models:

    This package requires SpaCy models to function correctly. If the required model is not installed, the package will automatically download and install it the first time it is used.

    You can install it manually by running python -m spacy download <model_name>, for example python -m spacy download en_core_web_md.

    "}, {"location": "index/methods/#choosing-a-method", "title": "Choosing a Method", "text": "

    Standard GraphRAG provides a rich description of real-world entities and relationships, but is more expensive than FastGraphRAG. We estimate graph extraction to constitute roughly 75% of indexing cost. FastGraphRAG is therefore much cheaper, but the tradeoff is that the extracted graph is less directly relevant for use outside of GraphRAG, and the graph tends to be quite a bit noisier. If high-fidelity entities and graph exploration are important to your use case, we recommend staying with standard GraphRAG. If your use case is primarily aimed at summary questions using global search, FastGraphRAG provides high quality summarization at much lower LLM cost.

    "}, {"location": "index/outputs/", "title": "Outputs", "text": "

    The default pipeline produces a series of output tables that align with the conceptual knowledge model. This page describes the detailed output table schemas. By default we write these tables out as parquet files on disk.

    "}, {"location": "index/outputs/#shared-fields", "title": "Shared fields", "text": "

    All tables have two identifier fields:

    • id (str): Generated UUID, assuring global uniqueness.
    • human_readable_id (int): An incremented short ID created per-run. For example, we use this short ID with generated summaries that print citations so they are easy to cross-reference visually.

    "}, {"location": "index/outputs/#communities", "title": "communities", "text": "

    This is a list of the final communities generated by Leiden. Communities are strictly hierarchical, subdividing into children as the cluster affinity is narrowed.

    • community (int): Leiden-generated cluster ID for the community. Note that these increment with depth, so they are unique through all levels of the community hierarchy. For this table, human_readable_id is a copy of the community ID rather than a plain increment.
    • parent (int): Parent community ID.
    • children (int[]): List of child community IDs.
    • level (int): Depth of the community in the hierarchy.
    • title (str): Friendly name of the community.
    • entity_ids (str[]): List of entities that are members of the community.
    • relationship_ids (str[]): List of relationships that are wholly within the community (source and target are both in the community).
    • text_unit_ids (str[]): List of text units represented within the community.
    • period (str): Date of ingest, used for incremental update merges. ISO8601.
    • size (int): Size of the community (entity count), used for incremental update merges.

    "}, {"location": "index/outputs/#community_reports", "title": "community_reports", "text": "

    This is the list of summarized reports for each community.

    • community (int): Short ID of the community this report applies to.
    • parent (int): Parent community ID.
    • children (int[]): List of child community IDs.
    • level (int): Level of the community this report applies to.
    • title (str): LM-generated title for the report.
    • summary (str): LM-generated summary of the report.
    • full_content (str): LM-generated full report.
    • rank (float): LM-derived relevance ranking of the report based on member entity salience.
    • rating_explanation (str): LM-derived explanation of the rank.
    • findings (dict): LM-derived list of the top 5-10 insights from the community. Contains summary and explanation values.
    • full_content_json (json): Full JSON output as returned by the LM. Most fields are extracted into columns, but this JSON is sent for query summarization so we leave it to allow for prompt tuning to add fields/content by end users.
    • period (str): Date of ingest, used for incremental update merges. ISO8601.
    • size (int): Size of the community (entity count), used for incremental update merges.

    "}, {"location": "index/outputs/#covariates", "title": "covariates", "text": "

    (Optional) If claim extraction is turned on, this is a list of the extracted covariates. Note that claims are typically oriented around identifying malicious behavior such as fraud, so they are not useful for all datasets.

    • covariate_type (str): This is always \"claim\" with our default covariates.
    • type (str): Nature of the claim type.
    • description (str): LM-generated description of the behavior.
    • subject_id (str): Name of the source entity (that is performing the claimed behavior).
    • object_id (str): Name of the target entity (that the claimed behavior is performed on).
    • status (str): LM-derived assessment of the correctness of the claim. One of [TRUE, FALSE, SUSPECTED].
    • start_date (str): LM-derived start of the claimed activity. ISO8601.
    • end_date (str): LM-derived end of the claimed activity. ISO8601.
    • source_text (str): Short string of text containing the claimed behavior.
    • text_unit_id (str): ID of the text unit the claim text was extracted from.

    "}, {"location": "index/outputs/#documents", "title": "documents", "text": "

    List of document content after import.

    • title (str): Filename, unless otherwise configured during CSV import.
    • text (str): Full text of the document.
    • text_unit_ids (str[]): List of text units (chunks) that were parsed from the document.
    • metadata (dict): If specified during CSV import, this is a dict of metadata for the document.

    "}, {"location": "index/outputs/#entities", "title": "entities", "text": "

    List of all entities found in the data by the LM.

    • title (str): Name of the entity.
    • type (str): Type of the entity. By default this will be \"organization\", \"person\", \"geo\", or \"event\" unless configured differently or auto-tuning is used.
    • description (str): Textual description of the entity. Entities may be found in many text units, so this is an LM-derived summary of all descriptions.
    • text_unit_ids (str[]): List of the text units containing the entity.
    • frequency (int): Count of text units the entity was found within.
    • degree (int): Node degree (connectedness) in the graph.
    • x (float): X position of the node for visual layouts. If graph embeddings and UMAP are not turned on, this will be 0.
    • y (float): Y position of the node for visual layouts. If graph embeddings and UMAP are not turned on, this will be 0.

    "}, {"location": "index/outputs/#relationships", "title": "relationships", "text": "

    List of all entity-to-entity relationships found in the data by the LM. This is also the edge list for the graph.

    • source (str): Name of the source entity.
    • target (str): Name of the target entity.
    • description (str): LM-derived description of the relationship. Also see note for entity descriptions.
    • weight (float): Weight of the edge in the graph. This is summed from an LM-derived \"strength\" measure for each relationship instance.
    • combined_degree (int): Sum of source and target node degrees.
    • text_unit_ids (str[]): List of text units the relationship was found within.

    "}, {"location": "index/outputs/#text_units", "title": "text_units", "text": "

    List of all text chunks parsed from the input documents.

    • text (str): Raw full text of the chunk.
    • n_tokens (int): Number of tokens in the chunk. This should normally match the chunk_size config parameter, except for the last chunk which is often shorter.
    • document_ids (str[]): List of document IDs the chunk came from. This is normally only 1 due to our default groupby, but for very short text documents (e.g., microblogs) it can be configured so text units span multiple documents.
    • entity_ids (str[]): List of entities found in the text unit.
    • relationships_ids (str[]): List of relationships found in the text unit.
    • covariate_ids (str[]): Optional list of covariates found in the text unit.

    "}, {"location": "index/overview/", "title": "GraphRAG Indexing \ud83e\udd16", "text": "

    The GraphRAG indexing package is a data pipeline and transformation suite that is designed to extract meaningful, structured data from unstructured text using LLMs.

    Indexing Pipelines are configurable. They are composed of workflows, standard and custom steps, prompt templates, and input/output adapters. Our standard pipeline is designed to:

    • extract entities, relationships and claims from raw text
    • perform community detection in entities
    • generate community summaries and reports at multiple levels of granularity
    • embed entities into a graph vector space
    • embed text chunks into a textual vector space

    The outputs of the pipeline are stored as Parquet tables by default, and embeddings are written to your configured vector store.

    "}, {"location": "index/overview/#getting-started", "title": "Getting Started", "text": ""}, {"location": "index/overview/#requirements", "title": "Requirements", "text": "

    See the requirements section in Get Started for details on setting up a development environment.

    To configure GraphRAG, see the configuration documentation. After you have a config file you can run the pipeline using the CLI or the Python API.

    "}, {"location": "index/overview/#usage", "title": "Usage", "text": ""}, {"location": "index/overview/#cli", "title": "CLI", "text": "
    # Via uv\nuv run poe index --root <data_root> # default config mode\n
    "}, {"location": "index/overview/#python-api", "title": "Python API", "text": "

    Please see the indexing API python file for the recommended method to call directly from Python code.

    "}, {"location": "index/overview/#further-reading", "title": "Further Reading", "text": "
    • To start developing within the GraphRAG project, see getting started
    • To understand the underlying concepts and execution model of the indexing library, see the architecture documentation
    • To read more about configuring the indexing engine, see the configuration documentation
    "}, {"location": "prompt_tuning/auto_prompt_tuning/", "title": "Auto Prompt Tuning \u2699\ufe0f", "text": "

    GraphRAG provides the ability to create domain-adapted prompts for the generation of the knowledge graph. This step is optional, though we highly encourage running it, as it will yield better results when executing an Index Run.

    These are generated by loading the inputs, splitting them into chunks (text units), and then running a series of LLM invocations and template substitutions to generate the final prompts. We suggest using the default values provided by the script, but on this page you'll find the details of each option in case you want to further explore and tweak the prompt tuning algorithm.

    Figure 1: Auto Tuning Conceptual Diagram.

    "}, {"location": "prompt_tuning/auto_prompt_tuning/#prerequisites", "title": "Prerequisites", "text": "

    Before running auto tuning, ensure you have already initialized your workspace with the graphrag init command. This will create the necessary configuration files and the default prompts. Refer to the Init Documentation for more information about the initialization process.
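    For example, assuming an illustrative project directory of ./ragtest, the workspace would be initialized with:

    graphrag init --root ./ragtest\n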

    "}, {"location": "prompt_tuning/auto_prompt_tuning/#usage", "title": "Usage", "text": "

    You can run the main script from the command line with various options:

    graphrag prompt-tune [--root ROOT] [--config CONFIG] [--domain DOMAIN]  [--selection-method METHOD] [--limit LIMIT] [--language LANGUAGE] \\\n[--max-tokens MAX_TOKENS] [--chunk-size CHUNK_SIZE] [--n-subset-max N_SUBSET_MAX] [--k K] \\\n[--min-examples-required MIN_EXAMPLES_REQUIRED] [--discover-entity-types] [--output OUTPUT]\n
    "}, {"location": "prompt_tuning/auto_prompt_tuning/#command-line-options", "title": "Command-Line Options", "text": "
    • --config (required): The path to the configuration file. This is required to load the data and model settings.

    • --root (optional): The data project root directory, including the config files (YML, JSON, or .env). Defaults to the current directory.

    • --domain (optional): The domain related to your input data, such as 'space science', 'microbiology', or 'environmental news'. If left empty, the domain will be inferred from the input data.

    • --selection-method (optional): The method to select documents. Options are all, random, auto or top. Default is random.

    • --limit (optional): The limit of text units to load when using random or top selection. Default is 15.

    • --language (optional): The language to use for input processing. If it is different from the inputs' language, the LLM will translate. Default is \"\" meaning it will be automatically detected from the inputs.

    • --max-tokens (optional): Maximum token count for prompt generation. Default is 2000.

    • --chunk-size (optional): The size in tokens to use for generating text units from input documents. Default is 200.

    • --n-subset-max (optional): The number of text chunks to embed when using auto selection method. Default is 300.

    • --k (optional): The number of documents to select when using auto selection method. Default is 15.

    • --min-examples-required (optional): The minimum number of examples required for entity extraction prompts. Default is 2.

    • --discover-entity-types (optional): Allow the LLM to discover and extract entities automatically. We recommend using this when your data covers a lot of topics or it is highly randomized.

    • --output (optional): The folder to save the generated prompts. Default is \"prompts\".

    "}, {"location": "prompt_tuning/auto_prompt_tuning/#example-usage", "title": "Example Usage", "text": "
    python -m graphrag prompt-tune --root /path/to/project --config /path/to/settings.yaml --domain \"environmental news\" \\\n--selection-method random --limit 10 --language English --max-tokens 2048 --chunk-size 256 --min-examples-required 3 \\\n--no-discover-entity-types --output /path/to/output\n

    or, with minimal configuration (suggested):

    python -m graphrag prompt-tune --root /path/to/project --config /path/to/settings.yaml --no-discover-entity-types\n
    "}, {"location": "prompt_tuning/auto_prompt_tuning/#document-selection-methods", "title": "Document Selection Methods", "text": "

    The auto tuning feature ingests the input data and then divides it into text units according to the chunk size parameter. After that, it uses one of the following selection methods to pick a sample to work with for prompt generation (an example command follows the list below):

    • random: Select text units randomly. This is the default and recommended option.
    • top: Select the head n text units.
    • all: Use all text units for the generation. Use only with small datasets; this option is not usually recommended.
    • auto: Embed text units in a lower-dimensional space and select the k nearest neighbors to the centroid. This is useful when you have a large dataset and want to select a representative sample.
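    For instance, a prompt tuning run that samples with the auto method might look like the following sketch; the project path is illustrative and the numeric values simply restate the documented defaults:

    graphrag prompt-tune --root ./ragtest --selection-method auto --n-subset-max 300 --k 15\n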
    "}, {"location": "prompt_tuning/auto_prompt_tuning/#modify-env-vars", "title": "Modify Env Vars", "text": "

    After running auto tuning, you should modify the following environment variables (or config variables) to pick up the new prompts on your index run. Note: please make sure to update the path to the generated prompts; in this example we are using the default \"prompts\" path.

    • GRAPHRAG_ENTITY_EXTRACTION_PROMPT_FILE = \"prompts/entity_extraction.txt\"

    • GRAPHRAG_COMMUNITY_REPORT_PROMPT_FILE = \"prompts/community_report.txt\"

    • GRAPHRAG_SUMMARIZE_DESCRIPTIONS_PROMPT_FILE = \"prompts/summarize_descriptions.txt\"

    or in your yaml config file:

    entity_extraction:\n  prompt: \"prompts/entity_extraction.txt\"\n\nsummarize_descriptions:\n  prompt: \"prompts/summarize_descriptions.txt\"\n\ncommunity_reports:\n  prompt: \"prompts/community_report.txt\"\n
    "}, {"location": "prompt_tuning/manual_prompt_tuning/", "title": "Manual Prompt Tuning \u2699\ufe0f", "text": "

    The GraphRAG indexer, by default, will run with a handful of prompts that are designed to work well in the broad context of knowledge discovery. However, it is quite common to want to tune the prompts to better suit your specific use case. We provide a means for you to do this by allowing you to specify custom prompt files, each of which uses a series of token replacements internally.

    Each of these prompts may be overridden by writing a custom prompt file in plaintext. We use token-replacements in the form of {token_name}, and the descriptions for the available tokens can be found below.
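    As a purely illustrative sketch (not the shipped prompt), an entity extraction override file might embed these tokens like so:

    Identify all entities of the following types in the text: {entity_types}.\nOutput one (name{tuple_delimiter}type{tuple_delimiter}description) tuple per entity, separating records with {record_delimiter}.\nWhen you are done, output {completion_delimiter}.\nText: {input_text}\n

    The file is then referenced from your configuration (for example via entity_extraction.prompt, as shown on the auto tuning page).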

    "}, {"location": "prompt_tuning/manual_prompt_tuning/#indexing-prompts", "title": "Indexing Prompts", "text": ""}, {"location": "prompt_tuning/manual_prompt_tuning/#entityrelationship-extraction", "title": "Entity/Relationship Extraction", "text": "

    Prompt Source

    "}, {"location": "prompt_tuning/manual_prompt_tuning/#tokens", "title": "Tokens", "text": "
    • {input_text} - The input text to be processed.
    • {entity_types} - A list of entity types
    • {tuple_delimiter} - A delimiter for separating values within a tuple. A single tuple is used to represent an individual entity or relationship.
    • {record_delimiter} - A delimiter for separating tuple instances.
    • {completion_delimiter} - An indicator for when generation is complete.
    "}, {"location": "prompt_tuning/manual_prompt_tuning/#summarize-entityrelationship-descriptions", "title": "Summarize Entity/Relationship Descriptions", "text": "

    Prompt Source

    "}, {"location": "prompt_tuning/manual_prompt_tuning/#tokens_1", "title": "Tokens", "text": "
    • {entity_name} - The name of the entity or the source/target pair of the relationship.
    • {description_list} - A list of descriptions for the entity or relationship.
    "}, {"location": "prompt_tuning/manual_prompt_tuning/#claim-extraction", "title": "Claim Extraction", "text": "

    Prompt Source

    "}, {"location": "prompt_tuning/manual_prompt_tuning/#tokens_2", "title": "Tokens", "text": "
    • {input_text} - The input text to be processed.
    • {tuple_delimiter} - A delimiter for separating values within a tuple. A single tuple is used to represent an individual entity or relationship.
    • {record_delimiter} - A delimiter for separating tuple instances.
    • {completion_delimiter} - An indicator for when generation is complete.
    • {entity_specs} - A list of entity types.
    • {claim_description} - Description of what claims should look like. Default is: \"Any claims or facts that could be relevant to information discovery.\"

    See the configuration documentation for details on how to change this.

    "}, {"location": "prompt_tuning/manual_prompt_tuning/#generate-community-reports", "title": "Generate Community Reports", "text": "

    Prompt Source

    "}, {"location": "prompt_tuning/manual_prompt_tuning/#tokens_3", "title": "Tokens", "text": "
    • {input_text} - The input text to generate the report with. This will contain tables of entities and relationships.
    "}, {"location": "prompt_tuning/manual_prompt_tuning/#query-prompts", "title": "Query Prompts", "text": ""}, {"location": "prompt_tuning/manual_prompt_tuning/#local-search", "title": "Local Search", "text": "

    Prompt Source

    "}, {"location": "prompt_tuning/manual_prompt_tuning/#tokens_4", "title": "Tokens", "text": "
    • {response_type} - Describe how the response should look. We default to \"multiple paragraphs\".
    • {context_data} - The data tables from GraphRAG's index.
    "}, {"location": "prompt_tuning/manual_prompt_tuning/#global-search", "title": "Global Search", "text": "

    Mapper Prompt Source

    Reducer Prompt Source

    Knowledge Prompt Source

    Global search uses a map/reduce approach to summarization. You can tune these prompts independently. This search also includes the ability to adjust the use of general knowledge from the model's training.

    "}, {"location": "prompt_tuning/manual_prompt_tuning/#tokens_5", "title": "Tokens", "text": "
    • {response_type} - Describe how the response should look (reducer only). We default to \"multiple paragraphs\".
    • {context_data} - The data tables from GraphRAG's index.
    "}, {"location": "prompt_tuning/manual_prompt_tuning/#drift-search", "title": "Drift Search", "text": "

    Prompt Source

    "}, {"location": "prompt_tuning/manual_prompt_tuning/#tokens_6", "title": "Tokens", "text": "
    • {response_type} - Describe how the response should look. We default to \"multiple paragraphs\".
    • {context_data} - The data tables from GraphRAG's index.
    • {community_reports} - The most relevant community reports to include in the summarization.
    • {query} - The query text as injected into the context.
    "}, {"location": "prompt_tuning/overview/", "title": "Prompt Tuning \u2699\ufe0f", "text": "

    This page provides an overview of the prompt tuning options available for the GraphRAG indexing engine.

    "}, {"location": "prompt_tuning/overview/#default-prompts", "title": "Default Prompts", "text": "

    The default prompts are the simplest way to get started with the GraphRAG system. They are designed to work out-of-the-box with minimal configuration. More details about each of the default prompts for indexing and query can be found on the manual tuning page.

    "}, {"location": "prompt_tuning/overview/#auto-tuning", "title": "Auto Tuning", "text": "

    Auto Tuning leverages your input data and LLM interactions to create domain-adapted prompts for the generation of the knowledge graph. We highly encourage running it, as it will yield better results when executing an Index Run. For more details about how to use it, please refer to the Auto Tuning documentation.

    "}, {"location": "prompt_tuning/overview/#manual-tuning", "title": "Manual Tuning", "text": "

    Manual tuning is an advanced use-case. Most users will want to use the Auto Tuning feature instead. Details about how to use manual configuration are available in the manual tuning documentation.

    "}, {"location": "query/drift_search/", "title": "DRIFT Search \ud83d\udd0e", "text": ""}, {"location": "query/drift_search/#combining-local-and-global-search", "title": "Combining Local and Global Search", "text": "

    GraphRAG is a technique that uses large language models (LLMs) to create knowledge graphs and summaries from unstructured text documents and leverages them to improve retrieval-augmented generation (RAG) operations on private datasets. It offers comprehensive global overviews of large, private troves of unstructured text documents while also enabling exploration of detailed, localized information. By using LLMs to create comprehensive knowledge graphs that connect and describe entities and relationships contained in those documents, GraphRAG leverages semantic structuring of the data to generate responses to a wide variety of complex user queries.

    DRIFT search (Dynamic Reasoning and Inference with Flexible Traversal) builds upon Microsoft\u2019s GraphRAG technique, combining characteristics of both global and local search to generate detailed responses in a way that balances computational costs with quality outcomes.

    "}, {"location": "query/drift_search/#methodology", "title": "Methodology", "text": "

    Figure 1. An entire DRIFT search hierarchy highlighting the three core phases of the DRIFT search process. A (Primer): DRIFT compares the user\u2019s query with the top K most semantically relevant community reports, generating a broad initial answer and follow-up questions to steer further exploration. B (Follow-Up): DRIFT uses local search to refine queries, producing additional intermediate answers and follow-up questions that enhance specificity, guiding the engine towards context-rich information. A glyph on each node in the diagram shows the confidence the algorithm has to continue the query expansion step. C (Output Hierarchy): The final output is a hierarchical structure of questions and answers ranked by relevance, reflecting a balanced mix of global insights and local refinements, making the results adaptable and comprehensive.

    DRIFT Search introduces a new approach to local search queries by including community information in the search process. This greatly expands the breadth of the query\u2019s starting point and leads to retrieval and usage of a far higher variety of facts in the final answer. This addition expands the GraphRAG query engine by providing a more comprehensive option for local search, which uses community insights to refine a query into detailed follow-up questions.
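    DRIFT is exposed through the same query CLI as the other methods; a minimal sketch of an invocation (the question text is illustrative) is:

    graphrag query --method drift --query \"What themes connect the main entities in this dataset?\"\n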

    "}, {"location": "query/drift_search/#configuration", "title": "Configuration", "text": "

    Below are the key parameters of the DRIFTSearch class:

    • llm: OpenAI model object to be used for response generation
    • context_builder: context builder object to be used for preparing context data from community reports and query information
    • config: model to define the DRIFT Search hyperparameters. DRIFT Config model
    • token_encoder: token encoder for tracking the budget for the algorithm.
    • query_state: a state object as defined in Query State that allows tracking the execution of a DRIFT Search instance, along with follow-ups and DRIFT actions.
    "}, {"location": "query/drift_search/#how-to-use", "title": "How to Use", "text": "

    An example of a drift search scenario can be found in the following notebook.

    "}, {"location": "query/drift_search/#learn-more", "title": "Learn More", "text": "

    For a more in-depth look at the DRIFT search method, please refer to our DRIFT Search blog post

    "}, {"location": "query/global_search/", "title": "Global Search \ud83d\udd0e", "text": ""}, {"location": "query/global_search/#whole-dataset-reasoning", "title": "Whole Dataset Reasoning", "text": "

    Baseline RAG struggles with queries that require aggregation of information across the dataset to compose an answer. Queries such as \u201cWhat are the top 5 themes in the data?\u201d perform terribly because baseline RAG relies on a vector search of semantically similar text content within the dataset. There is nothing in the query to direct it to the correct information.

    However, with GraphRAG we can answer such questions, because the structure of the LLM-generated knowledge graph tells us about the structure (and thus themes) of the dataset as a whole. This allows the private dataset to be organized into meaningful semantic clusters that are pre-summarized. Using our global search method, the LLM uses these clusters to summarize these themes when responding to a user query.

    "}, {"location": "query/global_search/#methodology", "title": "Methodology", "text": "
    ---\ntitle: Global Search Dataflow\n---\n%%{ init: { 'flowchart': { 'curve': 'step' } } }%%\nflowchart LR\n\n    uq[User Query] --- .1\n    ch1[Conversation History] --- .1\n\n    subgraph RIR\n        direction TB\n        ri1[Rated Intermediate<br/>Response 1]~~~ri2[Rated Intermediate<br/>Response 2] -.\"{1..N}\".-rin[Rated Intermediate<br/>Response N]\n    end\n\n    .1--Shuffled Community<br/>Report Batch 1-->RIR\n    .1--Shuffled Community<br/>Report Batch 2-->RIR---.2\n    .1--Shuffled Community<br/>Report Batch N-->RIR\n\n    .2--Ranking +<br/>Filtering-->agr[Aggregated Intermediate<br/>Responses]-->res[Response]\n\n\n\n     classDef green fill:#26B653,stroke:#333,stroke-width:2px,color:#fff;\n     classDef turquoise fill:#19CCD3,stroke:#333,stroke-width:2px,color:#fff;\n     classDef rose fill:#DD8694,stroke:#333,stroke-width:2px,color:#fff;\n     classDef orange fill:#F19914,stroke:#333,stroke-width:2px,color:#fff;\n     classDef purple fill:#B356CD,stroke:#333,stroke-width:2px,color:#fff;\n     classDef invisible fill:#fff,stroke:#fff,stroke-width:0px,color:#fff, width:0px;\n     class uq,ch1 turquoise;\n     class ri1,ri2,rin rose;\n     class agr orange;\n     class res purple;\n     class .1,.2 invisible;\n

    Given a user query and, optionally, the conversation history, the global search method uses a collection of LLM-generated community reports from a specified level of the graph's community hierarchy as context data to generate responses in a map-reduce manner. At the map step, community reports are segmented into text chunks of pre-defined size. Each text chunk is then used to produce an intermediate response containing a list of points, each of which is accompanied by a numerical rating indicating the importance of the point. At the reduce step, a filtered set of the most important points from the intermediate responses is aggregated and used as the context to generate the final response.

    The quality of the global search\u2019s response can be heavily influenced by the level of the community hierarchy chosen for sourcing community reports. Lower hierarchy levels, with their detailed reports, tend to yield more thorough responses, but may also increase the time and LLM resources needed to generate the final response due to the volume of reports.
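    For example, the level is controlled by the query CLI's --community-level flag (default 2 per the CLI reference); the question below is taken from the example above:

    graphrag query --method global --community-level 2 --query \"What are the top 5 themes in the data?\" # default level; adjust to trade thoroughness for cost\n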

    "}, {"location": "query/global_search/#configuration", "title": "Configuration", "text": "

    Below are the key parameters of the GlobalSearch class:

    • llm: OpenAI model object to be used for response generation
    • context_builder: context builder object to be used for preparing context data from community reports
    • map_system_prompt: prompt template used in the map stage. Default template can be found at map_system_prompt
    • reduce_system_prompt: prompt template used in the reduce stage, default template can be found at reduce_system_prompt
    • response_type: free-form text describing the desired response type and format (e.g., Multiple Paragraphs, Multi-Page Report)
    • allow_general_knowledge: setting this to True will include additional instructions to the reduce_system_prompt to prompt the LLM to incorporate relevant real-world knowledge outside of the dataset. Note that this may increase hallucinations, but can be useful for certain scenarios. Default is False.
    • general_knowledge_inclusion_prompt: instruction to add to the reduce_system_prompt if allow_general_knowledge is enabled. Default instruction can be found at general_knowledge_instruction
    • max_data_tokens: token budget for the context data
    • map_llm_params: a dictionary of additional parameters (e.g., temperature, max_tokens) to be passed to the LLM call at the map stage
    • reduce_llm_params: a dictionary of additional parameters (e.g., temperature, max_tokens) to be passed to the LLM call at the reduce stage
    • context_builder_params: a dictionary of additional parameters to be passed to the context_builder object when building context window for the map stage.
    • concurrent_coroutines: controls the degree of parallelism in the map stage.
    • callbacks: optional callback functions, can be used to provide custom event handlers for LLM's completion streaming events
    "}, {"location": "query/global_search/#how-to-use", "title": "How to Use", "text": "

    An example of a global search scenario can be found in the following notebook.

    "}, {"location": "query/local_search/", "title": "Local Search \ud83d\udd0e", "text": ""}, {"location": "query/local_search/#entity-based-reasoning", "title": "Entity-based Reasoning", "text": "

    The local search method combines structured data from the knowledge graph with unstructured data from the input documents to augment the LLM context with relevant entity information at query time. It is well-suited for answering questions that require an understanding of specific entities mentioned in the input documents (e.g., \u201cWhat are the healing properties of chamomile?\u201d).

    "}, {"location": "query/local_search/#methodology", "title": "Methodology", "text": "
    ---\ntitle: Local Search Dataflow\n---\n%%{ init: { 'flowchart': { 'curve': 'step' } } }%%\nflowchart LR\n\n    uq[User Query] ---.1\n    ch1[Conversation<br/>History]---.1\n\n    .1--Entity<br/>Description<br/>Embedding--> ee[Extracted Entities]\n\n    ee[Extracted Entities] ---.2--Entity-Text<br/>Unit Mapping--> ctu[Candidate<br/>Text Units]--Ranking + <br/>Filtering -->ptu[Prioritized<br/>Text Units]---.3\n    .2--Entity-Report<br/>Mapping--> ccr[Candidate<br/>Community Reports]--Ranking + <br/>Filtering -->pcr[Prioritized<br/>Community Reports]---.3\n    .2--Entity-Entity<br/>Relationships--> ce[Candidate<br/>Entities]--Ranking + <br/>Filtering -->pe[Prioritized<br/>Entities]---.3\n    .2--Entity-Entity<br/>Relationships--> cr[Candidate<br/>Relationships]--Ranking + <br/>Filtering -->pr[Prioritized<br/>Relationships]---.3\n    .2--Entity-Covariate<br/>Mappings--> cc[Candidate<br/>Covariates]--Ranking + <br/>Filtering -->pc[Prioritized<br/>Covariates]---.3\n    ch1 -->ch2[Conversation History]---.3\n    .3-->res[Response]\n\n     classDef green fill:#26B653,stroke:#333,stroke-width:2px,color:#fff;\n     classDef turquoise fill:#19CCD3,stroke:#333,stroke-width:2px,color:#fff;\n     classDef rose fill:#DD8694,stroke:#333,stroke-width:2px,color:#fff;\n     classDef orange fill:#F19914,stroke:#333,stroke-width:2px,color:#fff;\n     classDef purple fill:#B356CD,stroke:#333,stroke-width:2px,color:#fff;\n     classDef invisible fill:#fff,stroke:#fff,stroke-width:0px,color:#fff, width:0px;\n     class uq,ch1 turquoise\n     class ee green\n     class ctu,ccr,ce,cr,cc rose\n     class ptu,pcr,pe,pr,pc,ch2 orange\n     class res purple\n     class .1,.2,.3 invisible\n\n

    Given a user query and, optionally, the conversation history, the local search method identifies a set of entities from the knowledge graph that are semantically-related to the user input. These entities serve as access points into the knowledge graph, enabling the extraction of further relevant details such as connected entities, relationships, entity covariates, and community reports. Additionally, it also extracts relevant text chunks from the raw input documents that are associated with the identified entities. These candidate data sources are then prioritized and filtered to fit within a single context window of pre-defined size, which is used to generate a response to the user query.
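    Local search is run through the query CLI; for example, using the chamomile question above:

    graphrag query --method local --query \"What are the healing properties of chamomile?\"\n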

    "}, {"location": "query/local_search/#configuration", "title": "Configuration", "text": "

    Below are the key parameters of the LocalSearch class:

    • llm: OpenAI model object to be used for response generation
    • context_builder: context builder object to be used for preparing context data from collections of knowledge model objects
    • system_prompt: prompt template used to generate the search response. Default template can be found at system_prompt
    • response_type: free-form text describing the desired response type and format (e.g., Multiple Paragraphs, Multi-Page Report)
    • llm_params: a dictionary of additional parameters (e.g., temperature, max_tokens) to be passed to the LLM call
    • context_builder_params: a dictionary of additional parameters to be passed to the context_builder object when building context for the search prompt
    • callbacks: optional callback functions, can be used to provide custom event handlers for LLM's completion streaming events
    "}, {"location": "query/local_search/#how-to-use", "title": "How to Use", "text": "

    An example of a local search scenario can be found in the following notebook.

    "}, {"location": "query/overview/", "title": "Query Engine \ud83d\udd0e", "text": "

    The Query Engine is the retrieval module of the Graph RAG Library. It is one of the two main components of the Graph RAG library, the other being the Indexing Pipeline (see Indexing Pipeline). It is responsible for the following tasks:

    • Local Search
    • Global Search
    • DRIFT Search
    • Question Generation
    "}, {"location": "query/overview/#local-search", "title": "Local Search", "text": "

    Local search method generates answers by combining relevant data from the AI-extracted knowledge-graph with text chunks of the raw documents. This method is suitable for questions that require an understanding of specific entities mentioned in the documents (e.g. What are the healing properties of chamomile?).

    For more details about how Local Search works please refer to the Local Search documentation.

    "}, {"location": "query/overview/#global-search", "title": "Global Search", "text": "

    Global search method generates answers by searching over all AI-generated community reports in a map-reduce fashion. This is a resource-intensive method, but often gives good responses for questions that require an understanding of the dataset as a whole (e.g. What are the most significant values of the herbs mentioned in this notebook?).

    More details can be found in the Global Search documentation.

    "}, {"location": "query/overview/#drift-search", "title": "DRIFT Search", "text": "

    DRIFT Search introduces a new approach to local search queries by including community information in the search process. This greatly expands the breadth of the query\u2019s starting point and leads to retrieval and usage of a far higher variety of facts in the final answer. This addition expands the GraphRAG query engine by providing a more comprehensive option for local search, which uses community insights to refine a query into detailed follow-up questions.

    To learn more about DRIFT Search, please refer to the DRIFT Search documentation.

    "}, {"location": "query/overview/#basic-search", "title": "Basic Search", "text": "

    GraphRAG includes a rudimentary implementation of basic vector RAG to make it easy to compare different search results based on the type of question you are asking. You can specify the top k text unit chunks to include in the summarization context.
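    Basic search is selected the same way as the other query methods; a minimal sketch (question illustrative):

    graphrag query --method basic --query \"What are the healing properties of chamomile?\"\n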

    "}, {"location": "query/overview/#question-generation", "title": "Question Generation", "text": "

    This functionality takes a list of user queries and generates the next candidate questions. This is useful for generating follow-up questions in a conversation or for generating a list of questions for the investigator to dive deeper into the dataset.

    Information about how question generation works can be found at the Question Generation documentation page.

    "}, {"location": "query/question_generation/", "title": "Question Generation \u2754", "text": ""}, {"location": "query/question_generation/#entity-based-question-generation", "title": "Entity-based Question Generation", "text": "

    The question generation method combines structured data from the knowledge graph with unstructured data from the input documents to generate candidate questions related to specific entities.

    "}, {"location": "query/question_generation/#methodology", "title": "Methodology", "text": "

    Given a list of prior user questions, the question generation method uses the same context-building approach employed in local search to extract and prioritize relevant structured and unstructured data, including entities, relationships, covariates, community reports and raw text chunks. These data records are then fitted into a single LLM prompt to generate candidate follow-up questions that represent the most important or urgent information content or themes in the data.

    "}, {"location": "query/question_generation/#configuration", "title": "Configuration", "text": "

    Below are the key parameters of the Question Generation class:

    • llm: OpenAI model object to be used for response generation
    • context_builder: context builder object to be used for preparing context data from collections of knowledge model objects, using the same context builder class as in local search
    • system_prompt: prompt template used to generate candidate questions. Default template can be found at system_prompt
    • llm_params: a dictionary of additional parameters (e.g., temperature, max_tokens) to be passed to the LLM call
    • context_builder_params: a dictionary of additional parameters to be passed to the context_builder object when building context for the question generation prompt
    • callbacks: optional callback functions, can be used to provide custom event handlers for LLM's completion streaming events
    "}, {"location": "query/question_generation/#how-to-use", "title": "How to Use", "text": "

    An example of the question generation function can be found in the following notebook.

    "}, {"location": "query/notebooks/overview/", "title": "API Notebooks", "text": "
    • API Overview Notebook
    "}, {"location": "query/notebooks/overview/#query-engine-notebooks", "title": "Query Engine Notebooks", "text": "

    For examples about running Query please refer to the following notebooks:

    • Global Search Notebook
    • Local Search Notebook
    • DRIFT Search Notebook

    The test dataset for these notebooks can be found in dataset.zip.

    "}]} \ No newline at end of file +{"config": {"lang": ["en"], "separator": "[\\s\\-]+", "pipeline": ["stopWordFilter"]}, "docs": [{"location": "", "title": "Welcome to GraphRAG", "text": "

    \ud83d\udc49 Microsoft Research Blog Post \ud83d\udc49 GraphRAG Arxiv

    Figure 1: An LLM-generated knowledge graph built using GPT-4 Turbo.

    GraphRAG is a structured, hierarchical approach to Retrieval Augmented Generation (RAG), as opposed to naive semantic-search approaches using plain text snippets. The GraphRAG process involves extracting a knowledge graph out of raw text, building a community hierarchy, generating summaries for these communities, and then leveraging these structures when performing RAG-based tasks.

    To learn more about GraphRAG and how it can be used to enhance your language model's ability to reason about your private data, please visit the Microsoft Research Blog Post.

    "}, {"location": "#get-started-with-graphrag", "title": "Get Started with GraphRAG \ud83d\ude80", "text": "

    To start using GraphRAG, check out the Get Started guide. For a deeper dive into the main sub-systems, please visit the docpages for the Indexer and Query packages.

    "}, {"location": "#graphrag-vs-baseline-rag", "title": "GraphRAG vs Baseline RAG \ud83d\udd0d", "text": "

    Retrieval-Augmented Generation (RAG) is a technique to improve LLM outputs using real-world information. This technique is an important part of most LLM-based tools and the majority of RAG approaches use vector similarity as the search technique, which we call Baseline RAG. GraphRAG uses knowledge graphs to provide substantial improvements in question-and-answer performance when reasoning about complex information. RAG techniques have shown promise in helping LLMs to reason about private datasets - data that the LLM is not trained on and has never seen before, such as an enterprise\u2019s proprietary research, business documents, or communications. Baseline RAG was created to help solve this problem, but we observe situations where baseline RAG performs very poorly. For example:

    • Baseline RAG struggles to connect the dots. This happens when answering a question requires traversing disparate pieces of information through their shared attributes in order to provide new synthesized insights.
    • Baseline RAG performs poorly when being asked to holistically understand summarized semantic concepts over large data collections or even singular large documents.

    To address this, the tech community is working to develop methods that extend and enhance RAG. Microsoft Research\u2019s new approach, GraphRAG, creates a knowledge graph based on an input corpus. This graph, along with community summaries and graph machine learning outputs, are used to augment prompts at query time. GraphRAG shows substantial improvement in answering the two classes of questions described above, demonstrating intelligence or mastery that outperforms other approaches previously applied to private datasets.

    "}, {"location": "#the-graphrag-process", "title": "The GraphRAG Process \ud83e\udd16", "text": "

    GraphRAG builds upon our prior research and tooling using graph machine learning. The basic steps of the GraphRAG process are as follows:

    "}, {"location": "#index", "title": "Index", "text": "
    • Slice up an input corpus into a series of TextUnits, which act as analyzable units for the rest of the process, and provide fine-grained references in our outputs.
    • Extract all entities, relationships, and key claims from the TextUnits.
    • Perform a hierarchical clustering of the graph using the Leiden technique. To see this visually, check out Figure 1 above. Each circle is an entity (e.g., a person, place, or organization), with the size representing the degree of the entity, and the color representing its community.
    • Generate summaries of each community and its constituents from the bottom-up. This aids in holistic understanding of the dataset.
    "}, {"location": "#query", "title": "Query", "text": "

    At query time, these structures are used to provide materials for the LLM context window when answering a question. The primary query modes are:

    • Global Search for reasoning about holistic questions about the corpus by leveraging the community summaries.
    • Local Search for reasoning about specific entities by fanning-out to their neighbors and associated concepts.
    • DRIFT Search for reasoning about specific entities by fanning-out to their neighbors and associated concepts, but with the added context of community information.
    "}, {"location": "#prompt-tuning", "title": "Prompt Tuning", "text": "

    Using GraphRAG with your data out of the box may not yield the best possible results. We strongly recommend fine-tuning your prompts by following the Prompt Tuning Guide in our documentation.

    "}, {"location": "#versioning", "title": "Versioning", "text": "

    Please see the breaking changes document for notes on our approach to versioning the project.

    Always run graphrag init --root [path] --force between minor version bumps to ensure you have the latest config format. Run the provided migration notebook between major version bumps if you want to avoid re-indexing prior datasets. Note that this will overwrite your configuration and prompts, so back up first if necessary.

    "}, {"location": "blog_posts/", "title": "Microsoft Research Blog", "text": "
    • GraphRAG: Unlocking LLM discovery on narrative private data

      Published February 13, 2024

      By Jonathan Larson, Senior Principal Data Architect; Steven Truitt, Principal Program Manager

    • GraphRAG: New tool for complex data discovery now on GitHub

      Published July 2, 2024

      By Darren Edge, Senior Director; Ha Trinh, Senior Data Scientist; Steven Truitt, Principal Program Manager; Jonathan Larson, Senior Principal Data Architect

    • GraphRAG auto-tuning provides rapid adaptation to new domains

      Published September 9, 2024

      By Alonso Guevara Fern\u00e1ndez, Sr. Software Engineer; Katy Smith, Data Scientist II; Joshua Bradley, Senior Data Scientist; Darren Edge, Senior Director; Ha Trinh, Senior Data Scientist; Sarah Smith, Senior Program Manager; Ben Cutler, Senior Director; Steven Truitt, Principal Program Manager; Jonathan Larson, Senior Principal Data Architect

    • Introducing DRIFT Search: Combining global and local search methods to improve quality and efficiency

      Published October 31, 2024

  By Julian Whiting, Senior Machine Learning Engineer; Zachary Hills, Senior Software Engineer; Alonso Guevara Fern\u00e1ndez, Sr. Software Engineer; Ha Trinh, Senior Data Scientist; Adam Bradley, Managing Partner, Strategic Research; Jonathan Larson, Senior Principal Data Architect

    • GraphRAG: Improving global search via dynamic community selection

      Published November 15, 2024

      By Bryan Li, Research Intern; Ha Trinh, Senior Data Scientist; Darren Edge, Senior Director; Jonathan Larson, Senior Principal Data Architect

    • LazyGraphRAG: Setting a new standard for quality and cost

      Published November 25, 2024

      By Darren Edge, Senior Director; Ha Trinh, Senior Data Scientist; Jonathan Larson, Senior Principal Data Architect

    • Moving to GraphRAG 1.0 \u2013 Streamlining ergonomics for developers and users

      Published December 16, 2024

      By Nathan Evans, Principal Software Architect; Alonso Guevara Fern\u00e1ndez, Senior Software Engineer; Joshua Bradley, Senior Data Scientist

      "}, {"location": "cli/", "title": "CLI Reference", "text": "

      This page documents the command-line interface of the graphrag library.

      "}, {"location": "cli/#graphrag", "title": "graphrag", "text": "

      GraphRAG: A graph-based retrieval-augmented generation (RAG) system.

      Usage:

       [OPTIONS] COMMAND [ARGS]...\n

      Options:

        --install-completion  Install completion for the current shell.\n  --show-completion     Show completion for the current shell, to copy it or\n                        customize the installation.\n
      "}, {"location": "cli/#index", "title": "index", "text": "

      Build a knowledge graph index.

      Usage:

       index [OPTIONS]\n

      Options:

        -c, --config PATH               The configuration to use.\n  -r, --root PATH                 The project root directory.  \\[default: .]\n  -m, --method [standard|fast|standard-update|fast-update]\n                                  The indexing method to use.  \\[default:\n                                  standard]\n  -v, --verbose                   Run the indexing pipeline with verbose\n                                  logging\n  --memprofile                    Run the indexing pipeline with memory\n                                  profiling\n  --dry-run                       Run the indexing pipeline without executing\n                                  any steps to inspect and validate the\n                                  configuration.\n  --cache / --no-cache            Use LLM cache.  \\[default: cache]\n  --skip-validation               Skip any preflight validation. Useful when\n                                  running no LLM steps.\n  -o, --output PATH               Indexing pipeline output directory.\n                                  Overrides output.base_dir in the\n                                  configuration file.\n
      "}, {"location": "cli/#init", "title": "init", "text": "

      Generate a default configuration file.

      Usage:

       init [OPTIONS]\n

      Options:

        -r, --root PATH  The project root directory.  \\[default: .]\n  -f, --force      Force initialization even if the project already exists.\n
      "}, {"location": "cli/#prompt-tune", "title": "prompt-tune", "text": "

      Generate custom graphrag prompts with your own data (i.e. auto templating).

      Usage:

       prompt-tune [OPTIONS]\n

      Options:

        -r, --root PATH                 The project root directory.  \\[default: .]\n  -c, --config PATH               The configuration to use.\n  -v, --verbose                   Run the prompt tuning pipeline with verbose\n                                  logging.\n  --domain TEXT                   The domain your input data is related to.\n                                  For example 'space science', 'microbiology',\n                                  'environmental news'. If not defined, a\n                                  domain will be inferred from the input data.\n  --selection-method [all|random|top|auto]\n                                  The text chunk selection method.  \\[default:\n                                  random]\n  --n-subset-max INTEGER          The number of text chunks to embed when\n                                  --selection-method=auto.  \\[default: 300]\n  --k INTEGER                     The maximum number of documents to select\n                                  from each centroid when --selection-\n                                  method=auto.  \\[default: 15]\n  --limit INTEGER                 The number of documents to load when\n                                  --selection-method={random,top}.  \\[default:\n                                  15]\n  --max-tokens INTEGER            The max token count for prompt generation.\n                                  \\[default: 2000]\n  --min-examples-required INTEGER\n                                  The minimum number of examples to\n                                  generate/include in the entity extraction\n                                  prompt.  \\[default: 2]\n  --chunk-size INTEGER            The size of each example text chunk.\n                                  Overrides chunks.size in the configuration\n                                  file.  \\[default: 1200]\n  --overlap INTEGER               The overlap size for chunking documents.\n                                  Overrides chunks.overlap in the\n                                  configuration file.  \\[default: 100]\n  --language TEXT                 The primary language used for inputs and\n                                  outputs in graphrag prompts.\n  --discover-entity-types / --no-discover-entity-types\n                                  Discover and extract unspecified entity\n                                  types.  \\[default: discover-entity-types]\n  -o, --output PATH               The directory to save prompts to, relative\n                                  to the project root directory.  \\[default:\n                                  prompts]\n
      "}, {"location": "cli/#query", "title": "query", "text": "

      Query a knowledge graph index.

      Usage:

       query [OPTIONS]\n

      Options:

        -m, --method [local|global|drift|basic]\n                                  The query algorithm to use.  \\[required]\n  -q, --query TEXT                The query to execute.  \\[required]\n  -c, --config PATH               The configuration to use.\n  -v, --verbose                   Run the query with verbose logging.\n  -d, --data PATH                 Index output directory (contains the parquet\n                                  files).\n  -r, --root PATH                 The project root directory.  \\[default: .]\n  --community-level INTEGER       Leiden hierarchy level from which to load\n                                  community reports. Higher values represent\n                                  smaller communities.  \\[default: 2]\n  --dynamic-community-selection / --no-dynamic-selection\n                                  Use global search with dynamic community\n                                  selection.  \\[default: no-dynamic-selection]\n  --response-type TEXT            Free-form description of the desired\n                                  response format (e.g. 'Single Sentence',\n                                  'List of 3-7 Points', etc.).  \\[default:\n                                  Multiple Paragraphs]\n  --streaming / --no-streaming    Print the response in a streaming manner.\n                                  \\[default: no-streaming]\n
      "}, {"location": "cli/#update", "title": "update", "text": "

      Update an existing knowledge graph index.

      Applies a default output configuration (if not provided by config), saving the new index to the local file system in the update_output folder.

      Usage:

       update [OPTIONS]\n

      Options:

        -c, --config PATH               The configuration to use.\n  -r, --root PATH                 The project root directory.  \\[default: .]\n  -m, --method [standard|fast|standard-update|fast-update]\n                                  The indexing method to use.  \\[default:\n                                  standard]\n  -v, --verbose                   Run the indexing pipeline with verbose\n                                  logging.\n  --memprofile                    Run the indexing pipeline with memory\n                                  profiling.\n  --cache / --no-cache            Use LLM cache.  \\[default: cache]\n  --skip-validation               Skip any preflight validation. Useful when\n                                  running no LLM steps.\n  -o, --output PATH               Indexing pipeline output directory.\n                                  Overrides output.base_dir in the\n                                  configuration file.\n
      "}, {"location": "developing/", "title": "Development Guide", "text": ""}, {"location": "developing/#requirements", "title": "Requirements", "text": "Name Installation Purpose Python 3.10-3.12 Download The library is Python-based. uv Instructions uv is used for package management and virtualenv management in Python codebases"}, {"location": "developing/#getting-started", "title": "Getting Started", "text": ""}, {"location": "developing/#install-dependencies", "title": "Install Dependencies", "text": "
      # (optional) create virtual environment\nuv venv --python 3.10\nsource .venv/bin/activate\n\n# install python dependencies\nuv sync --extra dev\n
      "}, {"location": "developing/#execute-the-indexing-engine", "title": "Execute the Indexing Engine", "text": "
      uv run poe index <...args>\n
      "}, {"location": "developing/#executing-queries", "title": "Executing Queries", "text": "
      uv run poe query <...args>\n
      "}, {"location": "developing/#azurite", "title": "Azurite", "text": "

      Some unit and smoke tests use Azurite to emulate Azure resources. This can be started by running:

      ./scripts/start-azurite.sh\n

      or by simply running azurite in the terminal if already installed globally. See the Azurite documentation for more information about how to install and use Azurite.

      "}, {"location": "developing/#lifecycle-scripts", "title": "Lifecycle Scripts", "text": "

Our Python package utilizes uv to manage dependencies and poethepoet to manage build scripts.

      Available scripts are:

      • uv run poe index - Run the Indexing CLI
      • uv run poe query - Run the Query CLI
      • uv build - This will build a wheel file and other distributable artifacts.
      • uv run poe test - This will execute all tests.
      • uv run poe test_unit - This will execute unit tests.
      • uv run poe test_integration - This will execute integration tests.
      • uv run poe test_smoke - This will execute smoke tests.
      • uv run poe test_verbs - This will execute tests of the basic workflows.
      • uv run poe check - This will perform a suite of static checks across the package, including:
      • formatting
      • documentation formatting
      • linting
      • security patterns
      • type-checking
      • uv run poe fix - This will apply any available auto-fixes to the package. Usually this is just formatting fixes.
      • uv run poe fix_unsafe - This will apply any available auto-fixes to the package, including those that may be unsafe.
      • uv run poe format - Explicitly run the formatter across the package.
      "}, {"location": "developing/#troubleshooting", "title": "Troubleshooting", "text": ""}, {"location": "developing/#runtimeerror-llvm-config-failed-executing-please-point-llvm_config-to-the-path-for-llvm-config-when-running-uv-install", "title": "\"RuntimeError: llvm-config failed executing, please point LLVM_CONFIG to the path for llvm-config\" when running uv install", "text": "

      Make sure llvm-9 and llvm-9-dev are installed:

      sudo apt-get install llvm-9 llvm-9-dev

      and then in your bashrc, add

      export LLVM_CONFIG=/usr/bin/llvm-config-9

      "}, {"location": "developing/#llm-call-constantly-exceeds-tpm-rpm-or-time-limits", "title": "LLM call constantly exceeds TPM, RPM or time limits", "text": "

GRAPHRAG_LLM_THREAD_COUNT and GRAPHRAG_EMBEDDING_THREAD_COUNT are both set to 50 by default. You can modify these values to reduce concurrency. Please refer to the Configuration Documents for details.
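If you consistently hit provider limits, the throttling fields on the model definitions are the simplest lever. The following is a minimal sketch (values are illustrative, not recommendations) showing how the tokens_per_minute, requests_per_minute, and concurrent_requests fields described in the Language Model Setup section might be lowered:

models:\n  default_chat_model:\n    api_key: ${GRAPHRAG_API_KEY}\n    type: openai_chat\n    model: gpt-4o\n    tokens_per_minute: 50000      # illustrative value; set below your TPM quota\n    requests_per_minute: 100      # illustrative value; set below your RPM quota\n    concurrent_requests: 5        # lower concurrency if you still see rate-limit errors\n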

      "}, {"location": "get_started/", "title": "Getting Started", "text": ""}, {"location": "get_started/#requirements", "title": "Requirements", "text": "

      Python 3.10-3.12

      To get started with the GraphRAG system, you have a few options:

      \ud83d\udc49 Install from pypi. \ud83d\udc49 Use it from source

      The following is a simple end-to-end example for using the GraphRAG system, using the install from pypi option.

      It shows how to use the system to index some text, and then use the indexed data to answer questions about the documents.

      "}, {"location": "get_started/#install-graphrag", "title": "Install GraphRAG", "text": "
      pip install graphrag\n
      "}, {"location": "get_started/#running-the-indexer", "title": "Running the Indexer", "text": "

      We need to set up a data project and some initial configuration. First let's get a sample dataset ready:

      mkdir -p ./ragtest/input\n

      Get a copy of A Christmas Carol by Charles Dickens from a trusted source:

      curl https://www.gutenberg.org/cache/epub/24022/pg24022.txt -o ./ragtest/input/book.txt\n
      "}, {"location": "get_started/#set-up-your-workspace-variables", "title": "Set Up Your Workspace Variables", "text": "

      To initialize your workspace, first run the graphrag init command. Since we have already configured a directory named ./ragtest in the previous step, run the following command:

      graphrag init --root ./ragtest\n

      This will create two files: .env and settings.yaml in the ./ragtest directory.

      • .env contains the environment variables required to run the GraphRAG pipeline. If you inspect the file, you'll see a single environment variable defined, GRAPHRAG_API_KEY=<API_KEY>. Replace <API_KEY> with your own OpenAI or Azure API key.
      • settings.yaml contains the settings for the pipeline. You can modify this file to change the settings for the pipeline.
      "}, {"location": "get_started/#using-openai", "title": "Using OpenAI", "text": "

      If running in OpenAI mode, you only need to update the value of GRAPHRAG_API_KEY in the .env file with your OpenAI API key.

      "}, {"location": "get_started/#using-azure-openai", "title": "Using Azure OpenAI", "text": "

      In addition to setting your API key, Azure OpenAI users should set the variables below in the settings.yaml file. To find the appropriate sections, just search for the models: root configuration; you should see two sections, one for the default chat endpoint and one for the default embeddings endpoint. Here is an example of what to add to the chat model config:

      type: azure_openai_chat # Or azure_openai_embedding for embeddings\napi_base: https://<instance>.openai.azure.com\napi_version: 2024-02-15-preview # You can customize this for other versions\ndeployment_name: <azure_model_deployment_name>\n
      "}, {"location": "get_started/#using-managed-auth-on-azure", "title": "Using Managed Auth on Azure", "text": "

      To use managed auth, add an additional value to your model config and comment out or remove the api_key line:

auth_type: azure_managed_identity # Default auth_type is api_key\n# api_key: ${GRAPHRAG_API_KEY}\n

      You will also need to login with az login and select the subscription with your endpoint.

      "}, {"location": "get_started/#running-the-indexing-pipeline", "title": "Running the Indexing pipeline", "text": "

      Finally we'll run the pipeline!

      graphrag index --root ./ragtest\n

      This process will take some time to run. This depends on the size of your input data, what model you're using, and the text chunk size being used (these can be configured in your settings.yaml file). Once the pipeline is complete, you should see a new folder called ./ragtest/output with a series of parquet files.

      "}, {"location": "get_started/#using-the-query-engine", "title": "Using the Query Engine", "text": "

      Now let's ask some questions using this dataset.

      Here is an example using Global search to ask a high-level question:

      graphrag query \\\n--root ./ragtest \\\n--method global \\\n--query \"What are the top themes in this story?\"\n

      Here is an example using Local search to ask a more specific question about a particular character:

      graphrag query \\\n--root ./ragtest \\\n--method local \\\n--query \"Who is Scrooge and what are his main relationships?\"\n

      Please refer to Query Engine docs for detailed information about how to leverage our Local and Global search mechanisms for extracting meaningful insights from data after the Indexer has wrapped up execution.

      "}, {"location": "get_started/#going-deeper", "title": "Going Deeper", "text": "
      • For more details about configuring GraphRAG, see the configuration documentation.
      • To learn more about Initialization, refer to the Initialization documentation.
      • For more details about using the CLI, refer to the CLI documentation.
      • Check out our visualization guide for a more interactive experience in debugging and exploring the knowledge graph.
      "}, {"location": "visualization_guide/", "title": "Visualizing and Debugging Your Knowledge Graph", "text": "

      The following step-by-step guide walks through the process to visualize a knowledge graph after it's been constructed by graphrag. Note that some of the settings recommended below are based on our own experience of what works well. Feel free to change and explore other settings for a better visualization experience!

      "}, {"location": "visualization_guide/#1-run-the-pipeline", "title": "1. Run the Pipeline", "text": "

Before building an index, please review your settings.yaml configuration file and ensure that graphml snapshots are enabled.

      snapshots:\n  graphml: true\n
      (Optional) To support other visualization tools and exploration, additional parameters can be enabled that provide access to vector embeddings.
      embed_graph:\n  enabled: true # will generate node2vec embeddings for nodes\numap:\n  enabled: true # will generate UMAP embeddings for nodes, giving the entities table an x/y position to plot\n
      After running the indexing pipeline over your data, there will be an output folder (defined by the storage.base_dir setting).

      • Output Folder: Contains artifacts from the LLM\u2019s indexing pass.
      "}, {"location": "visualization_guide/#2-locate-the-knowledge-graph", "title": "2. Locate the Knowledge Graph", "text": "

      In the output folder, look for a file named graph.graphml. graphml is a standard file format supported by many visualization tools. We recommend trying Gephi.

      "}, {"location": "visualization_guide/#3-open-the-graph-in-gephi", "title": "3. Open the Graph in Gephi", "text": "
      1. Install and open Gephi
      2. Navigate to the output folder containing the various parquet files.
      3. Import the graph.graphml file into Gephi. This will result in a fairly plain view of the undirected graph nodes and edges.
      "}, {"location": "visualization_guide/#4-install-the-leiden-algorithm-plugin", "title": "4. Install the Leiden Algorithm Plugin", "text": "
      1. Go to Tools -> Plugins.
      2. Search for \"Leiden Algorithm\".
      3. Click Install and restart Gephi.
      "}, {"location": "visualization_guide/#5-run-statistics", "title": "5. Run Statistics", "text": "
1. In the Statistics tab on the right, click Run for Average Degree and Leiden Algorithm.
2. For the Leiden Algorithm, adjust the settings:
• Quality function: Modularity
• Resolution: 1
      "}, {"location": "visualization_guide/#6-color-the-graph-by-clusters", "title": "6. Color the Graph by Clusters", "text": "
1. Go to the Appearance pane in the upper left side of Gephi.
2. Select Nodes, then Partition, and click the color palette icon in the upper right.
3. Choose Cluster from the dropdown.
4. Click the Palette... hyperlink, then Generate....
5. Uncheck Limit number of colors, click Generate, and then Ok.
6. Click Apply to color the graph. This will color the graph based on the partitions discovered by Leiden.
      "}, {"location": "visualization_guide/#7-resize-nodes-by-degree-centrality", "title": "7. Resize Nodes by Degree Centrality", "text": "
1. In the Appearance pane in the upper left, select Nodes -> Ranking.
2. Select the Sizing icon in the upper right.
3. Choose Degree and set:
• Min: 10
• Max: 150
4. Click Apply.
      "}, {"location": "visualization_guide/#8-layout-the-graph", "title": "8. Layout the Graph", "text": "
1. In the Layout tab in the lower left, select OpenORD.
2. Set Liquid and Expansion stages to 50, and everything else to 0.
3. Click Run and monitor the progress.
      "}, {"location": "visualization_guide/#9-run-forceatlas2", "title": "9. Run ForceAtlas2", "text": "
1. Select Force Atlas 2 in the layout options.
2. Adjust the settings:
• Scaling: 15
• Dissuade Hubs: checked
• LinLog mode: unchecked
• Prevent Overlap: checked
3. Click Run and wait.
4. Press Stop when it looks like the graph nodes have settled and no longer change position significantly.
      "}, {"location": "visualization_guide/#10-add-text-labels-optional", "title": "10. Add Text Labels (Optional)", "text": "
      1. Turn on text labels in the appropriate section.
      2. Configure and resize them as needed.

      Your final graph should now be visually organized and ready for analysis!

      "}, {"location": "config/init/", "title": "Configuring GraphRAG Indexing", "text": "

To start using GraphRAG, you must generate a configuration file. The init command is the easiest way to get started. It will create .env and settings.yaml files in the specified directory with the necessary configuration settings. It will also output the default LLM prompts used by GraphRAG.

      "}, {"location": "config/init/#usage", "title": "Usage", "text": "
      graphrag init [--root PATH] [--force, --no-force]\n
      "}, {"location": "config/init/#options", "title": "Options", "text": "
      • --root PATH - The project root directory to initialize graphrag at. Default is the current directory.
      • --force, --no-force - Optional, default is --no-force. Overwrite existing configuration and prompt files if they exist.
      "}, {"location": "config/init/#example", "title": "Example", "text": "
      graphrag init --root ./ragtest\n
      "}, {"location": "config/init/#output", "title": "Output", "text": "

      The init command will create the following files in the specified directory:

      • settings.yaml - The configuration settings file. This file contains the configuration settings for GraphRAG.
      • .env - The environment variables file. These are referenced in the settings.yaml file.
• prompts/ - The LLM prompts folder. This contains the default prompts used by GraphRAG. You can modify them or run the Auto Prompt Tuning command to generate new prompts adapted to your data.
      "}, {"location": "config/init/#next-steps", "title": "Next Steps", "text": "

      After initializing your workspace, you can either run the Prompt Tuning command to adapt the prompts to your data or even start running the Indexing Pipeline to index your data. For more information on configuration options available, see the YAML details page.

      "}, {"location": "config/models/", "title": "Language Model Selection and Overriding", "text": "

      This page contains information on selecting a model to use and options to supply your own model for GraphRAG. Note that this is not a guide to finding the right model for your use case.

      "}, {"location": "config/models/#default-model-support", "title": "Default Model Support", "text": "

      GraphRAG was built and tested using OpenAI models, so this is the default model set we support. This is not intended to be a limiter or statement of quality or fitness for your use case, only that it's the set we are most familiar with for prompting, tuning, and debugging.

      GraphRAG also utilizes a language model wrapper library used by several projects within our team, called fnllm. fnllm provides two important functions for GraphRAG: rate limiting configuration to help us maximize throughput for large indexing jobs, and robust caching of API calls to minimize consumption on repeated indexes for testing, experimentation, or incremental ingest. fnllm uses the OpenAI Python SDK under the covers, so OpenAI-compliant endpoints are a base requirement out-of-the-box.

      "}, {"location": "config/models/#model-selection-considerations", "title": "Model Selection Considerations", "text": "

GraphRAG has been most thoroughly tested with the gpt-4 series of models from OpenAI, including gpt-4, gpt-4-turbo, gpt-4o, and gpt-4o-mini. Our arXiv paper, for example, performed quality evaluation using gpt-4-turbo.

      Versions of GraphRAG before 2.2.0 made extensive use of max_tokens and logit_bias to control generated response length or content. The introduction of the o-series of models added new, non-compatible parameters because these models include a reasoning component that has different consumption patterns and response generation attributes than non-reasoning models. GraphRAG 2.2.0 now supports these models, but there are important differences that need to be understood before you switch.

      • Previously, GraphRAG used max_tokens to limit responses in a few locations. This is done so that we can have predictable content sizes when building downstream context windows for summarization. We have now switched from using max_tokens to use a prompted approach, which is working well in our tests. We suggest using max_tokens in your language model config only for budgetary reasons if you want to limit consumption, and not for expected response length control. We now also support the o-series equivalent max_completion_tokens, but if you use this keep in mind that there may be some unknown fixed reasoning consumption amount in addition to the response tokens, so it is not a good technique for response control.
      • Previously, GraphRAG used a combination of max_tokens and logit_bias to strictly control a binary yes/no question during gleanings. This is not possible with reasoning models, so again we have switched to a prompted approach. Our tests with gpt-4o, gpt-4o-mini, and o1 show that this works consistently, but could have issues if you have an older or smaller model.
      • The o-series models are much slower and more expensive. It may be useful to use an asymmetric approach to model use in your config: you can define as many models as you like in the models block of your settings.yaml and reference them by key for every workflow that requires a language model. You could use gpt-4o for indexing and o1 for query, for example. Experiment to find the right balance of cost, speed, and quality for your use case.
• The o-series models contain a form of native chain-of-thought reasoning that is absent in the non-o-series models. GraphRAG's prompts sometimes contain CoT because it was an effective technique with the gpt-4* series. It may be counterproductive with the o-series, so you may want to tune or even re-write large portions of the prompt templates (particularly for graph and claim extraction).

      Example config with asymmetric model use:

      models:\n  extraction_chat_model:\n    api_key: ${GRAPHRAG_API_KEY}\n    type: openai_chat\n    auth_type: api_key\n    model: gpt-4o\n    model_supports_json: true\n  query_chat_model:\n    api_key: ${GRAPHRAG_API_KEY}\n    type: openai_chat\n    auth_type: api_key\n    model: o1\n    model_supports_json: true\n\n...\n\nextract_graph:\n  model_id: extraction_chat_model\n  prompt: \"prompts/extract_graph.txt\"\n  entity_types: [organization,person,geo,event]\n  max_gleanings: 1\n\n...\n\n\nglobal_search:\n  chat_model_id: query_chat_model\n  map_prompt: \"prompts/global_search_map_system_prompt.txt\"\n  reduce_prompt: \"prompts/global_search_reduce_system_prompt.txt\"\n  knowledge_prompt: \"prompts/global_search_knowledge_system_prompt.txt\"\n

      Another option would be to avoid using a language model at all for the graph extraction, instead using the fast indexing method that uses NLP for portions of the indexing phase in lieu of LLM APIs.

      "}, {"location": "config/models/#using-non-openai-models", "title": "Using Non-OpenAI Models", "text": "

      As noted above, our primary experience and focus has been on OpenAI models, so this is what is supported out-of-the-box. Many users have requested support for additional model types, but it's out of the scope of our research to handle the many models available today. There are two approaches you can use to connect to a non-OpenAI model:

      "}, {"location": "config/models/#proxy-apis", "title": "Proxy APIs", "text": "

      Many users have used platforms such as ollama to proxy the underlying model HTTP calls to a different model provider. This seems to work reasonably well, but we frequently see issues with malformed responses (especially JSON), so if you do this please understand that your model needs to reliably return the specific response formats that GraphRAG expects. If you're having trouble with a model, you may need to try prompting to coax the format, or intercepting the response within your proxy to try and handle malformed responses.

      "}, {"location": "config/models/#model-protocol", "title": "Model Protocol", "text": "

      As of GraphRAG 2.0.0, we support model injection through the use of a standard chat and embedding Protocol and an accompanying ModelFactory that you can use to register your model implementation. This is not supported with the CLI, so you'll need to use GraphRAG as a library.

      • Our Protocol is defined here
      • Our base implementation, which wraps fnllm, is here
      • We have a simple mock implementation in our tests that you can reference here

      Once you have a model implementation, you need to register it with our ModelFactory:

      class MyCustomModel:\n    ...\n    # implementation\n\n# elsewhere...\nModelFactory.register_chat(\"my-custom-chat-model\", lambda **kwargs: MyCustomModel(**kwargs))\n

      Then in your config you can reference the type name you used:

      models:\n  default_chat_model:\n    type: my-custom-chat-model\n\n\nextract_graph:\n  model_id: default_chat_model\n  prompt: \"prompts/extract_graph.txt\"\n  entity_types: [organization,person,geo,event]\n  max_gleanings: 1\n

      Note that your custom model will be passed the same params for init and method calls that we use throughout GraphRAG. There is not currently any ability to define custom parameters, so you may need to use closure scope or a factory pattern within your implementation to get custom config values.

      "}, {"location": "config/overview/", "title": "Configuring GraphRAG Indexing", "text": "

      The GraphRAG system is highly configurable. This page provides an overview of the configuration options available for the GraphRAG indexing engine.

      "}, {"location": "config/overview/#default-configuration-mode", "title": "Default Configuration Mode", "text": "

      The default configuration mode is the simplest way to get started with the GraphRAG system. It is designed to work out-of-the-box with minimal configuration. The main ways to set up GraphRAG in Default Configuration mode are via:

      • Init command (recommended first step)
      • Edit settings.yaml for deeper control
      • Purely using environment variables (not recommended)
      "}, {"location": "config/yaml/", "title": "Default Configuration Mode (using YAML/JSON)", "text": "

      The default configuration mode may be configured by using a settings.yml or settings.json file in the data project root. If a .env file is present along with this config file, then it will be loaded, and the environment variables defined therein will be available for token replacements in your configuration document using ${ENV_VAR} syntax. We initialize with YML by default in graphrag init but you may use the equivalent JSON form if preferred.

      Many of these config values have defaults. Rather than replicate them here, please refer to the constants in the code directly.

      For example:

      # .env\nGRAPHRAG_API_KEY=some_api_key\n\n# settings.yml\nllm: \n  api_key: ${GRAPHRAG_API_KEY}\n
      "}, {"location": "config/yaml/#config-sections", "title": "Config Sections", "text": ""}, {"location": "config/yaml/#language-model-setup", "title": "Language Model Setup", "text": ""}, {"location": "config/yaml/#models", "title": "models", "text": "

      This is a dict of model configurations. The dict key is used to reference this configuration elsewhere when a model instance is desired. In this way, you can specify as many different models as you need, and reference them differentially in the workflow steps.

      For example:

      models:\n  default_chat_model:\n    api_key: ${GRAPHRAG_API_KEY}\n    type: openai_chat\n    model: gpt-4o\n    model_supports_json: true\n  default_embedding_model:\n    api_key: ${GRAPHRAG_API_KEY}\n    type: openai_embedding\n    model: text-embedding-ada-002\n

      "}, {"location": "config/yaml/#fields", "title": "Fields", "text": "
      • api_key str - The OpenAI API key to use.
      • auth_type api_key|azure_managed_identity - Indicate how you want to authenticate requests.
      • type openai_chat|azure_openai_chat|openai_embedding|azure_openai_embedding|mock_chat|mock_embeddings - The type of LLM to use.
      • model str - The model name.
      • encoding_model str - The text encoding model to use. Default is to use the encoding model aligned with the language model (i.e., it is retrieved from tiktoken if unset).
      • api_base str - The API base url to use.
      • api_version str - The API version.
      • deployment_name str - The deployment name to use (Azure).
      • organization str - The client organization.
      • proxy str - The proxy URL to use.
      • audience str - (Azure OpenAI only) The URI of the target Azure resource/service for which a managed identity token is requested. Used if api_key is not defined. Default=https://cognitiveservices.azure.com/.default
      • model_supports_json bool - Whether the model supports JSON-mode output.
      • request_timeout float - The per-request timeout.
      • tokens_per_minute int - Set a leaky-bucket throttle on tokens-per-minute.
      • requests_per_minute int - Set a leaky-bucket throttle on requests-per-minute.
      • retry_strategy str - Retry strategy to use, \"native\" is the default and uses the strategy built into the OpenAI SDK. Other allowable values include \"exponential_backoff\", \"random_wait\", and \"incremental_wait\".
      • max_retries int - The maximum number of retries to use.
      • max_retry_wait float - The maximum backoff time.
• concurrent_requests int - The number of open requests to allow at once.
• async_mode asyncio|threaded - The async mode to use. Either asyncio or threaded.
      • responses list[str] - If this model type is mock, this is a list of response strings to return.
      • n int - The number of completions to generate.
      • max_tokens int - The maximum number of output tokens. Not valid for o-series models.
      • temperature float - The temperature to use. Not valid for o-series models.
      • top_p float - The top-p value to use. Not valid for o-series models.
      • frequency_penalty float - Frequency penalty for token generation. Not valid for o-series models.
• presence_penalty float - Presence penalty for token generation. Not valid for o-series models.
      • max_completion_tokens int - Max number of tokens to consume for chat completion. Must be large enough to include an unknown amount for \"reasoning\" by the model. o-series models only.
      • reasoning_effort low|medium|high - Amount of \"thought\" for the model to expend reasoning about a response. o-series models only.
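As a further illustration, here is a hedged sketch of an Azure OpenAI chat model entry using several of the fields above; the endpoint, deployment name, and throttle value are placeholders you would replace with your own:

models:\n  default_chat_model:\n    type: azure_openai_chat\n    auth_type: azure_managed_identity   # or api_key with api_key: ${GRAPHRAG_API_KEY}\n    api_base: https://<instance>.openai.azure.com\n    api_version: 2024-02-15-preview\n    deployment_name: <azure_model_deployment_name>\n    model: gpt-4o\n    model_supports_json: true\n    concurrent_requests: 25             # illustrative throttle\n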
      "}, {"location": "config/yaml/#input-files-and-chunking", "title": "Input Files and Chunking", "text": ""}, {"location": "config/yaml/#input", "title": "input", "text": "

      Our pipeline can ingest .csv, .txt, or .json data from an input location. See the inputs page for more details and examples.

      "}, {"location": "config/yaml/#fields_1", "title": "Fields", "text": "
      • storage StorageConfig
      • type file|blob|cosmosdb - The storage type to use. Default=file
      • base_dir str - The base directory to write output artifacts to, relative to the root.
      • connection_string str - (blob/cosmosdb only) The Azure Storage connection string.
      • container_name str - (blob/cosmosdb only) The Azure Storage container name.
      • storage_account_blob_url str - (blob only) The storage account blob URL to use.
      • cosmosdb_account_blob_url str - (cosmosdb only) The CosmosDB account blob URL to use.
      • file_type text|csv|json - The type of input data to load. Default is text
      • encoding str - The encoding of the input file. Default is utf-8
      • file_pattern str - A regex to match input files. Default is .*\\.csv$, .*\\.txt$, or .*\\.json$ depending on the specified file_type, but you can customize it if needed.
      • file_filter dict - Key/value pairs to filter. Default is None.
      • text_column str - (CSV/JSON only) The text column name. If unset we expect a column named text.
      • title_column str - (CSV/JSON only) The title column name, filename will be used if unset.
      • metadata list[str] - (CSV/JSON only) The additional document attributes fields to keep.
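For example, a minimal sketch of an input block for CSV files; the column names are hypothetical and should match your own data, and the nesting of the storage sub-fields follows the field list above:

input:\n  storage:\n    type: file\n    base_dir: input\n  file_type: csv\n  text_column: body        # hypothetical column holding the document text\n  title_column: headline   # hypothetical column holding the document title\n  metadata: [author, date] # hypothetical extra columns to carry through\n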
      "}, {"location": "config/yaml/#chunks", "title": "chunks", "text": "

      These settings configure how we parse documents into text chunks. This is necessary because very large documents may not fit into a single context window, and graph extraction accuracy can be modulated. Also note the metadata setting in the input document config, which will replicate document metadata into each chunk.

      "}, {"location": "config/yaml/#fields_2", "title": "Fields", "text": "
      • size int - The max chunk size in tokens.
      • overlap int - The chunk overlap in tokens.
      • group_by_columns list[str] - Group documents by these fields before chunking.
      • strategy str[tokens|sentences] - How to chunk the text.
      • encoding_model str - The text encoding model to use for splitting on token boundaries.
      • prepend_metadata bool - Determines if metadata values should be added at the beginning of each chunk. Default=False.
      • chunk_size_includes_metadata bool - Specifies whether the chunk size calculation should include metadata tokens. Default=False.
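A hedged example of a chunks block using the fields above; the sizes are illustrative and should be tuned for your model's context window:

chunks:\n  size: 1200                          # max tokens per chunk\n  overlap: 100                        # tokens shared between adjacent chunks\n  group_by_columns: [id]              # keep chunks within document boundaries\n  strategy: tokens\n  prepend_metadata: true              # copy document metadata into each chunk\n  chunk_size_includes_metadata: false\n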
      "}, {"location": "config/yaml/#outputs-and-storage", "title": "Outputs and Storage", "text": ""}, {"location": "config/yaml/#output", "title": "output", "text": "

This section controls the storage mechanism the pipeline uses for exporting output tables.

      "}, {"location": "config/yaml/#fields_3", "title": "Fields", "text": "
      • type file|memory|blob|cosmosdb - The storage type to use. Default=file
      • base_dir str - The base directory to write output artifacts to, relative to the root.
      • connection_string str - (blob/cosmosdb only) The Azure Storage connection string.
      • container_name str - (blob/cosmosdb only) The Azure Storage container name.
      • storage_account_blob_url str - (blob only) The storage account blob URL to use.
      • cosmosdb_account_blob_url str - (cosmosdb only) The CosmosDB account blob URL to use.
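For instance, a sketch of an output block targeting Azure Blob Storage with managed identity; the account and container names are placeholders:

output:\n  type: blob\n  base_dir: output\n  container_name: graphrag-output                                     # placeholder container\n  storage_account_blob_url: https://<account>.blob.core.windows.net  # placeholder account URL\n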
      "}, {"location": "config/yaml/#update_index_output", "title": "update_index_output", "text": "

This section defines a secondary storage location for running incremental indexing, to preserve your original outputs.

      "}, {"location": "config/yaml/#fields_4", "title": "Fields", "text": "
      • type file|memory|blob|cosmosdb - The storage type to use. Default=file
      • base_dir str - The base directory to write output artifacts to, relative to the root.
      • connection_string str - (blob/cosmosdb only) The Azure Storage connection string.
      • container_name str - (blob/cosmosdb only) The Azure Storage container name.
      • storage_account_blob_url str - (blob only) The storage account blob URL to use.
      • cosmosdb_account_blob_url str - (cosmosdb only) The CosmosDB account blob URL to use.
      "}, {"location": "config/yaml/#cache", "title": "cache", "text": "

      This section controls the cache mechanism used by the pipeline. This is used to cache LLM invocation results for faster performance when re-running the indexing process.

      "}, {"location": "config/yaml/#fields_5", "title": "Fields", "text": "
      • type file|memory|blob|cosmosdb - The storage type to use. Default=file
      • base_dir str - The base directory to write output artifacts to, relative to the root.
      • connection_string str - (blob/cosmosdb only) The Azure Storage connection string.
      • container_name str - (blob/cosmosdb only) The Azure Storage container name.
      • storage_account_blob_url str - (blob only) The storage account blob URL to use.
      • cosmosdb_account_blob_url str - (cosmosdb only) The CosmosDB account blob URL to use.
      "}, {"location": "config/yaml/#reporting", "title": "reporting", "text": "

      This section controls the reporting mechanism used by the pipeline, for common events and error messages. The default is to write reports to a file in the output directory. However, you can also choose to write reports to an Azure Blob Storage container.

      "}, {"location": "config/yaml/#fields_6", "title": "Fields", "text": "
      • type file|blob - The reporting type to use. Default=file
      • base_dir str - The base directory to write reports to, relative to the root.
      • connection_string str - (blob only) The Azure Storage connection string.
      • container_name str - (blob only) The Azure Storage container name.
      • storage_account_blob_url str - The storage account blob URL to use.
      "}, {"location": "config/yaml/#vector_store", "title": "vector_store", "text": "

      Where to put all vectors for the system. Configured for lancedb by default. This is a dict, with the key used to identify individual store parameters (e.g., for text embedding).

      "}, {"location": "config/yaml/#fields_7", "title": "Fields", "text": "
      • type lancedb|azure_ai_search|cosmosdb - Type of vector store. Default=lancedb
      • db_uri str (only for lancedb) - The database uri. Default=storage.base_dir/lancedb
      • url str (only for AI Search) - AI Search endpoint
      • api_key str (optional - only for AI Search) - The AI Search api key to use.
      • audience str (only for AI Search) - Audience for managed identity token if managed identity authentication is used.
      • container_name str - The name of a vector container. This stores all indexes (tables) for a given dataset ingest. Default=default
      • database_name str - (cosmosdb only) Name of the database.
• overwrite bool (only used at index creation time) - Overwrite collection if it exists. Default=True
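As an illustration, here is a sketch of a lancedb store definition; the dict key (default_vector_store here) is simply whatever name you want to reference from embed_text.vector_store_id:

vector_store:\n  default_vector_store:\n    type: lancedb\n    db_uri: output/lancedb   # illustrative path; defaults to storage.base_dir/lancedb\n    container_name: default\n    overwrite: true\n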
      "}, {"location": "config/yaml/#workflow-configurations", "title": "Workflow Configurations", "text": "

      These settings control each individual workflow as they execute.

      "}, {"location": "config/yaml/#workflows", "title": "workflows", "text": "

      list[str] - This is a list of workflow names to run, in order. GraphRAG has built-in pipelines to configure this, but you can run exactly and only what you want by specifying the list here. Useful if you have done part of the processing yourself.

      "}, {"location": "config/yaml/#embed_text", "title": "embed_text", "text": "

      By default, the GraphRAG indexer will only export embeddings required for our query methods. However, the model has embeddings defined for all plaintext fields, and these can be customized by setting the target and names fields.

      Supported embeddings names are:

      • text_unit.text
      • document.text
      • entity.title
      • entity.description
      • relationship.description
      • community.title
      • community.summary
      • community.full_content
      "}, {"location": "config/yaml/#fields_8", "title": "Fields", "text": "
      • model_id str - Name of the model definition to use for text embedding.
      • vector_store_id str - Name of vector store definition to write to.
      • batch_size int - The maximum batch size to use.
• batch_max_tokens int - The maximum number of tokens per batch.
      • names list[str] - List of the embeddings names to run (must be in supported list).
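A hedged sketch tying embed_text to a model and vector store definition, with a names list drawn from the supported embeddings above; the two keys reference entries you would define in the models and vector_store blocks:

embed_text:\n  model_id: default_embedding_model      # key from the models block\n  vector_store_id: default_vector_store  # key from the vector_store block\n  names: [entity.description, community.full_content, text_unit.text]\n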
      "}, {"location": "config/yaml/#extract_graph", "title": "extract_graph", "text": "

      Tune the language model-based graph extraction process.

      "}, {"location": "config/yaml/#fields_9", "title": "Fields", "text": "
      • model_id str - Name of the model definition to use for API calls.
      • prompt str - The prompt file to use.
      • entity_types list[str] - The entity types to identify.
      • max_gleanings int - The maximum number of gleaning cycles to use.
      "}, {"location": "config/yaml/#summarize_descriptions", "title": "summarize_descriptions", "text": ""}, {"location": "config/yaml/#fields_10", "title": "Fields", "text": "
      • model_id str - Name of the model definition to use for API calls.
      • prompt str - The prompt file to use.
      • max_length int - The maximum number of output tokens per summarization.
      • max_input_length int - The maximum number of tokens to collect for summarization (this will limit how many descriptions you send to be summarized for a given entity or relationship).
      "}, {"location": "config/yaml/#extract_graph_nlp", "title": "extract_graph_nlp", "text": "

      Defines settings for NLP-based graph extraction methods.

      "}, {"location": "config/yaml/#fields_11", "title": "Fields", "text": "
      • normalize_edge_weights bool - Whether to normalize the edge weights during graph construction. Default=True.
      • text_analyzer dict - Parameters for the NLP model.
      • extractor_type regex_english|syntactic_parser|cfg - Default=regex_english.
      • model_name str - Name of NLP model (for SpaCy-based models)
      • max_word_length int - Longest word to allow. Default=15.
      • word_delimiter str - Delimiter to split words. Default ' '.
      • include_named_entities bool - Whether to include named entities in noun phrases. Default=True.
      • exclude_nouns list[str] | None - List of nouns to exclude. If None, we use an internal stopword list.
      • exclude_entity_tags list[str] - List of entity tags to ignore.
      • exclude_pos_tags list[str] - List of part-of-speech tags to ignore.
      • noun_phrase_tags list[str] - List of noun phrase tags to ignore.
      • noun_phrase_grammars dict[str, str] - Noun phrase grammars for the model (cfg-only).
      "}, {"location": "config/yaml/#prune_graph", "title": "prune_graph", "text": "

      Parameters for manual graph pruning. This can be used to optimize the modularity of your graph clusters, by removing overly-connected or rare nodes.

      "}, {"location": "config/yaml/#fields_12", "title": "Fields", "text": "
      • min_node_freq int - The minimum node frequency to allow.
      • max_node_freq_std float | None - The maximum standard deviation of node frequency to allow.
      • min_node_degree int - The minimum node degree to allow.
      • max_node_degree_std float | None - The maximum standard deviation of node degree to allow.
      • min_edge_weight_pct float - The minimum edge weight percentile to allow.
      • remove_ego_nodes bool - Remove ego nodes.
      • lcc_only bool - Only use largest connected component.
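For example, a conservative pruning sketch; the thresholds are illustrative and should be tuned against your own graph statistics:

prune_graph:\n  min_node_freq: 2          # drop entities mentioned only once\n  min_node_degree: 1        # drop isolated nodes\n  min_edge_weight_pct: 40   # drop the weakest edges by percentile\n  remove_ego_nodes: false\n  lcc_only: false\n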
      "}, {"location": "config/yaml/#cluster_graph", "title": "cluster_graph", "text": "

      These are the settings used for Leiden hierarchical clustering of the graph to create communities.

      "}, {"location": "config/yaml/#fields_13", "title": "Fields", "text": "
      • max_cluster_size int - The maximum cluster size to export.
      • use_lcc bool - Whether to only use the largest connected component.
      • seed int - A randomization seed to provide if consistent run-to-run results are desired. We do provide a default in order to guarantee clustering stability.
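A minimal sketch of the clustering settings; the values shown are illustrative:

cluster_graph:\n  max_cluster_size: 10   # cap on exported community size\n  use_lcc: true          # cluster only the largest connected component\n  seed: 42               # any fixed seed keeps clustering repeatable run-to-run\n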
      "}, {"location": "config/yaml/#extract_claims", "title": "extract_claims", "text": ""}, {"location": "config/yaml/#fields_14", "title": "Fields", "text": "
      • enabled bool - Whether to enable claim extraction. Off by default, because claim prompts really need user tuning.
      • model_id str - Name of the model definition to use for API calls.
      • prompt str - The prompt file to use.
      • description str - Describes the types of claims we want to extract.
      • max_gleanings int - The maximum number of gleaning cycles to use.
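If you do enable claim extraction, a hedged sketch of the block might look like the following; the prompt path and description are illustrative, and you should use the prompt emitted by graphrag init or your own tuned version:

extract_claims:\n  enabled: true\n  model_id: default_chat_model\n  prompt: \"prompts/extract_claims.txt\"   # illustrative path to a tuned claims prompt\n  description: \"Any claims or facts that could be relevant to information discovery.\"\n  max_gleanings: 1\n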
      "}, {"location": "config/yaml/#community_reports", "title": "community_reports", "text": ""}, {"location": "config/yaml/#fields_15", "title": "Fields", "text": "
      • model_id str - Name of the model definition to use for API calls.
      • prompt str - The prompt file to use.
      • max_length int - The maximum number of output tokens per report.
      • max_input_length int - The maximum number of input tokens to use when generating reports.
      "}, {"location": "config/yaml/#embed_graph", "title": "embed_graph", "text": "

      We use node2vec to embed the graph. This is primarily used for visualization, so it is not turned on by default.

      "}, {"location": "config/yaml/#fields_16", "title": "Fields", "text": "
      • enabled bool - Whether to enable graph embeddings.
      • dimensions int - Number of vector dimensions to produce.
      • num_walks int - The node2vec number of walks.
      • walk_length int - The node2vec walk length.
      • window_size int - The node2vec window size.
      • iterations int - The node2vec number of iterations.
      • random_seed int - The node2vec random seed.
      • strategy dict - Fully override the embed graph strategy.
      "}, {"location": "config/yaml/#umap", "title": "umap", "text": "

      Indicates whether we should run UMAP dimensionality reduction. This is used to provide an x/y coordinate to each graph node, suitable for visualization. If this is not enabled, nodes will receive a 0/0 x/y coordinate. If this is enabled, you must enable graph embedding as well.

      "}, {"location": "config/yaml/#fields_17", "title": "Fields", "text": "
      • enabled bool - Whether to enable UMAP layouts.
      "}, {"location": "config/yaml/#snapshots", "title": "snapshots", "text": ""}, {"location": "config/yaml/#fields_18", "title": "Fields", "text": "
      • embeddings bool - Export embeddings snapshots to parquet.
      • graphml bool - Export graph snapshots to GraphML.
      "}, {"location": "config/yaml/#query", "title": "Query", "text": ""}, {"location": "config/yaml/#local_search", "title": "local_search", "text": ""}, {"location": "config/yaml/#fields_19", "title": "Fields", "text": "
      • chat_model_id str - Name of the model definition to use for Chat Completion calls.
      • embedding_model_id str - Name of the model definition to use for Embedding calls.
      • prompt str - The prompt file to use.
      • text_unit_prop float - The text unit proportion.
      • community_prop float - The community proportion.
      • conversation_history_max_turns int - The conversation history maximum turns.
      • top_k_entities int - The top k mapped entities.
      • top_k_relationships int - The top k mapped relations.
      • max_context_tokens int - The maximum tokens to use building the request context.
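A hedged sketch of a local_search block using the fields above; the proportions and limits are illustrative starting points, not tuned recommendations:

local_search:\n  chat_model_id: default_chat_model\n  embedding_model_id: default_embedding_model\n  text_unit_prop: 0.5          # share of the context budget given to text units\n  community_prop: 0.25         # share of the context budget given to community reports\n  top_k_entities: 10\n  top_k_relationships: 10\n  max_context_tokens: 12000\n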
      "}, {"location": "config/yaml/#global_search", "title": "global_search", "text": ""}, {"location": "config/yaml/#fields_20", "title": "Fields", "text": "
      • chat_model_id str - Name of the model definition to use for Chat Completion calls.
• map_prompt str | None - The global search mapper prompt file to use.
• reduce_prompt str | None - The global search reducer prompt file to use.
• knowledge_prompt str | None - The global search general knowledge prompt file to use.
      • max_context_tokens int - The maximum context size to create, in tokens.
• data_max_tokens int - The maximum tokens to use when constructing the final response from the reduced responses.
      • map_max_length int - The maximum length to request for map responses, in words.
      • reduce_max_length int - The maximum length to request for reduce responses, in words.
• dynamic_search_threshold int - Rating threshold to include a community report.
      • dynamic_search_keep_parent bool - Keep parent community if any of the child communities are relevant.
      • dynamic_search_num_repeats int - Number of times to rate the same community report.
      • dynamic_search_use_summary bool - Use community summary instead of full_context.
      • dynamic_search_max_level int - The maximum level of community hierarchy to consider if none of the processed communities are relevant.
      "}, {"location": "config/yaml/#drift_search", "title": "drift_search", "text": ""}, {"location": "config/yaml/#fields_21", "title": "Fields", "text": "
      • chat_model_id str - Name of the model definition to use for Chat Completion calls.
      • embedding_model_id str - Name of the model definition to use for Embedding calls.
      • prompt str - The prompt file to use.
      • reduce_prompt str - The reducer prompt file to use.
• data_max_tokens int - The maximum tokens for the LLM data context.
      • reduce_max_tokens int - The maximum tokens for the reduce phase. Only use if a non-o-series model.
      • reduce_max_completion_tokens int - The maximum tokens for the reduce phase. Only use for o-series models.
      • concurrency int - The number of concurrent requests.
      • drift_k_followups int - The number of top global results to retrieve.
      • primer_folds int - The number of folds for search priming.
      • primer_llm_max_tokens int - The maximum number of tokens for the LLM in primer.
      • n_depth int - The number of drift search steps to take.
      • local_search_text_unit_prop float - The proportion of search dedicated to text units.
      • local_search_community_prop float - The proportion of search dedicated to community properties.
      • local_search_top_k_mapped_entities int - The number of top K entities to map during local search.
      • local_search_top_k_relationships int - The number of top K relationships to map during local search.
      • local_search_max_data_tokens int - The maximum context size in tokens for local search.
      • local_search_temperature float - The temperature to use for token generation in local search.
      • local_search_top_p float - The top-p value to use for token generation in local search.
      • local_search_n int - The number of completions to generate in local search.
      • local_search_llm_max_gen_tokens int - The maximum number of generated tokens for the LLM in local search. Only use if a non-o-series model.
      • local_search_llm_max_gen_completion_tokens int - The maximum number of generated tokens for the LLM in local search. Only use for o-series models.
      "}, {"location": "config/yaml/#basic_search", "title": "basic_search", "text": ""}, {"location": "config/yaml/#fields_22", "title": "Fields", "text": "
      • chat_model_id str - Name of the model definition to use for Chat Completion calls.
      • embedding_model_id str - Name of the model definition to use for Embedding calls.
      • prompt str - The prompt file to use.
      • k int | None - Number of text units to retrieve from the vector store for context building.
      "}, {"location": "data/operation_dulce/ABOUT/", "title": "About", "text": "

      This document (Operation Dulce) is an AI-generated science fiction novella, included here for the purposes of integration testing.

      "}, {"location": "index/byog/", "title": "Bring Your Own Graph", "text": "

      Several users have asked if they can bring their own existing graph and have it summarized for query with GraphRAG. There are many possible ways to do this, but here we'll describe a simple method that aligns with the existing GraphRAG workflows quite easily.

      To cover the basic use cases for GraphRAG query, you should have two or three tables derived from your data:

      • entities.parquet - this is the list of entities found in the dataset, which are the nodes of the graph.
      • relationships.parquet - this is the list of relationships found in the dataset, which are the edges of the graph.
      • text_units.parquet - this is the source text chunks the graph was extracted from. This is optional depending on the query method you intend to use (described later).

The approach described here will be to run a custom GraphRAG workflow pipeline that assumes the text chunking, entity extraction, and relationship extraction have already occurred.

      "}, {"location": "index/byog/#tables", "title": "Tables", "text": ""}, {"location": "index/byog/#entities", "title": "Entities", "text": "

      See the full entities table schema. For graph summarization purposes, you only need id, title, description, and the list of text_unit_ids.

      The additional properties are used for optional graph visualization purposes.

      "}, {"location": "index/byog/#relationships", "title": "Relationships", "text": "

      See the full relationships table schema. For graph summarization purposes, you only need id, source, target, description, weight, and the list of text_unit_ids.

      Note: the weight field is important because it is used to properly compute Leiden communities!

      "}, {"location": "index/byog/#workflow-configuration", "title": "Workflow Configuration", "text": "

      GraphRAG includes the ability to specify only the specific workflow steps that you need. For basic graph summarization and query, you need the following config in your settings.yaml:

      workflows: [create_communities, create_community_reports]\n

      This will result in only the minimal workflows required for GraphRAG Global Search.

      "}, {"location": "index/byog/#optional-additional-config", "title": "Optional Additional Config", "text": "

      If you would like to run Local, DRIFT, or Basic Search, you will need to include text_units and some embeddings.

      "}, {"location": "index/byog/#text-units", "title": "Text Units", "text": "

      See the full text_units table schema. Text units are chunks of your documents that are sized to ensure they fit into the context window of your model. Some search methods use these, so you may want to include them if you have them.

      "}, {"location": "index/byog/#expanded-config", "title": "Expanded Config", "text": "

      To perform the other search types above, you need some of the content to be embedded. Simply add the embeddings workflow:

      workflows: [create_communities, create_community_reports, generate_text_embeddings]\n
      "}, {"location": "index/byog/#fastgraphrag", "title": "FastGraphRAG", "text": "

      FastGraphRAG uses text_units for the community reports instead of the entity and relationship descriptions. If your graph is sourced in such a way that it does not have descriptions, this might be a useful alternative. In this case, you would update your workflows list to include the text variant of the community reports workflow:

      workflows: [create_communities, create_community_reports_text, generate_text_embeddings]\n

      This method requires that your entities and relationships tables have valid links to a list of text_unit_ids. Also note that generate_text_embeddings is still only required if you are doing searches other than Global Search.

      "}, {"location": "index/byog/#setup", "title": "Setup", "text": "

      Putting it all together:

      • output: Create an output folder and put your entities and relationships (and optionally text_units) parquet files in it.
      • Update your config as noted above to only run the workflows subset you need.
      • Run graphrag index --root <your project root>
      "}, {"location": "index/default_dataflow/", "title": "Indexing Dataflow", "text": ""}, {"location": "index/default_dataflow/#the-graphrag-knowledge-model", "title": "The GraphRAG Knowledge Model", "text": "

      The knowledge model is a specification for data outputs that conform to our data-model definition. You can find these definitions in the python/graphrag/graphrag/model folder within the GraphRAG repository. The following entity types are provided. The fields here represent the fields that are text-embedded by default.

• Document - An input document into the system. These represent either individual rows in a CSV or individual .txt files.
      • TextUnit - A chunk of text to analyze. The size of these chunks, their overlap, and whether they adhere to any data boundaries may be configured below. A common use case is to set CHUNK_BY_COLUMNS to id so that there is a 1-to-many relationship between documents and TextUnits instead of a many-to-many.
      • Entity - An entity extracted from a TextUnit. These represent people, places, events, or some other entity-model that you provide.
      • Relationship - A relationship between two entities.
      • Covariate - Extracted claim information, which contains statements about entities which may be time-bound.
      • Community - Once the graph of entities and relationships is built, we perform hierarchical community detection on them to create a clustering structure.
      • Community Report - The contents of each community are summarized into a generated report, useful for human reading and downstream search.
      "}, {"location": "index/default_dataflow/#the-default-configuration-workflow", "title": "The Default Configuration Workflow", "text": "

      Let's take a look at how the default-configuration workflow transforms text documents into the GraphRAG Knowledge Model. This page gives a general overview of the major steps in this process. To fully configure this workflow, check out the configuration documentation.

      ---\ntitle: Dataflow Overview\n---\nflowchart TB\n    subgraph phase1[Phase 1: Compose TextUnits]\n    documents[Documents] --> chunk[Chunk]\n    chunk --> textUnits[Text Units]\n    end\n    subgraph phase2[Phase 2: Graph Extraction]\n    textUnits --> graph_extract[Entity & Relationship Extraction]\n    graph_extract --> graph_summarize[Entity & Relationship Summarization]\n    graph_summarize --> claim_extraction[Claim Extraction]\n    claim_extraction --> graph_outputs[Graph Tables]\n    end\n    subgraph phase3[Phase 3: Graph Augmentation]\n    graph_outputs --> community_detect[Community Detection]\n    community_detect --> community_outputs[Communities Table]\n    end\n    subgraph phase4[Phase 4: Community Summarization]\n    community_outputs --> summarized_communities[Community Summarization]\n    summarized_communities --> community_report_outputs[Community Reports Table]\n    end\n    subgraph phase5[Phase 5: Document Processing]\n    documents --> link_to_text_units[Link to TextUnits]\n    textUnits --> link_to_text_units\n    link_to_text_units --> document_outputs[Documents Table]\n    end\n    subgraph phase6[Phase 6: Network Visualization]\n    graph_outputs --> graph_embed[Graph Embedding]\n    graph_embed --> umap_entities[Umap Entities]\n    umap_entities --> combine_nodes[Final Entities]\n    end\n    subgraph phase7[Phase 7: Text Embeddings]\n    textUnits --> text_embed[Text Embedding]\n    graph_outputs --> description_embed[Description Embedding]\n    community_report_outputs --> content_embed[Content Embedding]\n    end
      "}, {"location": "index/default_dataflow/#phase-1-compose-textunits", "title": "Phase 1: Compose TextUnits", "text": "

The first phase of the default-configuration workflow is to transform input documents into TextUnits. A TextUnit is a chunk of text that is used for our graph extraction techniques. They are also used as source references by extracted knowledge items, providing breadcrumbs and provenance from concepts back to their original source text.

The chunk size (counted in tokens) is user-configurable. By default this is set to 300 tokens, although we've had positive experience with 1200-token chunks using a single \"glean\" step. (A \"glean\" step is a follow-on extraction.) Larger chunks result in lower-fidelity output and less meaningful reference texts; however, using larger chunks can result in much faster processing time.

The group-by configuration is also user-configurable. By default, we align our chunks to document boundaries, meaning that there is a strict 1-to-many relationship between Documents and TextUnits. In rare cases, this can be turned into a many-to-many relationship. This is useful when the documents are very short and we need several of them to compose a meaningful analysis unit (e.g. Tweets or a chat log).

      ---\ntitle: Documents into Text Chunks\n---\nflowchart LR\n    doc1[Document 1] --> tu1[TextUnit 1]\n    doc1 --> tu2[TextUnit 2]\n    doc2[Document 2] --> tu3[TextUnit 3]\n    doc2 --> tu4[TextUnit 4]\n
      "}, {"location": "index/default_dataflow/#phase-2-graph-extraction", "title": "Phase 2: Graph Extraction", "text": "

      In this phase, we analyze each text unit and extract our graph primitives: Entities, Relationships, and Claims. Entities and Relationships are extracted at once in our entity_extract verb, and claims are extracted in our claim_extract verb. Results are then combined and passed into the following phases of the pipeline.

      ---\ntitle: Graph Extraction\n---\nflowchart LR\n    tu[TextUnit] --> ge[Graph Extraction] --> gs[Graph Summarization]\n    tu --> ce[Claim Extraction]
      "}, {"location": "index/default_dataflow/#entity-relationship-extraction", "title": "Entity & Relationship Extraction", "text": "

      In this first step of graph extraction, we process each text-unit in order to extract entities and relationships out of the raw text using the LLM. The output of this step is a subgraph-per-TextUnit containing a list of entities with a title, type, and description, and a list of relationships with a source, target, and description.

      These subgraphs are merged together - any entities with the same title and type are merged by creating an array of their descriptions. Similarly, any relationships with the same source and target are merged by creating an array of their descriptions.

      "}, {"location": "index/default_dataflow/#entity-relationship-summarization", "title": "Entity & Relationship Summarization", "text": "

      Now that we have a graph of entities and relationships, each with a list of descriptions, we can summarize these lists into a single description per entity and relationship. This is done by asking the LLM for a short summary that captures all of the distinct information from each description. This allows all of our entities and relationships to have a single concise description.

      "}, {"location": "index/default_dataflow/#claim-extraction-optional", "title": "Claim Extraction (optional)", "text": "

      Finally, as an independent workflow, we extract claims from the source TextUnits. These claims represent positive factual statements with an evaluated status and time-bounds. These get exported as a primary artifact called Covariates.

      Note: claim extraction is optional and turned off by default. This is because claim extraction generally requires prompt tuning to be useful.
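
      To turn claim extraction on, enable it in your config. The block name below is an assumption (extract_claims appears in recent settings templates; older versions use claim_extraction), so verify it against your generated settings.yaml:

      extract_claims:  # assumption: block name differs by version (older: claim_extraction)\n    enabled: true\n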

      "}, {"location": "index/default_dataflow/#phase-3-graph-augmentation", "title": "Phase 3: Graph Augmentation", "text": "

      Now that we have a usable graph of entities and relationships, we want to understand their community structure. This gives us an explicit way of understanding the topological structure of our graph.

      ---\ntitle: Graph Augmentation\n---\nflowchart LR\n    cd[Leiden Hierarchical Community Detection] --> ag[Graph Tables]
      "}, {"location": "index/default_dataflow/#community-detection", "title": "Community Detection", "text": "

      In this step, we generate a hierarchy of entity communities using the Hierarchical Leiden Algorithm. This method will apply a recursive community-clustering to our graph until we reach a community-size threshold. This will allow us to understand the community structure of our graph and provide a way to navigate and summarize the graph at different levels of granularity.
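
      As a sketch, the community-size threshold mentioned above corresponds to a setting along these lines. The cluster_graph block and max_cluster_size key reflect recent configuration templates and should be verified against your settings.yaml:

      cluster_graph:\n    # assumption: maximum entities per community before further recursive clustering\n    max_cluster_size: 10\n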

      "}, {"location": "index/default_dataflow/#graph-tables", "title": "Graph Tables", "text": "

      Once our graph augmentation steps are complete, the final Entities, Relationships, and Communities tables are exported.

      "}, {"location": "index/default_dataflow/#phase-4-community-summarization", "title": "Phase 4: Community Summarization", "text": "
      ---\ntitle: Community Summarization\n---\nflowchart LR\n    sc[Generate Community Reports] --> ss[Summarize Community Reports] --> co[Community Reports Table]

      At this point, we have a functional graph of entities and relationships and a hierarchy of communities for the entities.

      Now we want to build on the communities data and generate reports for each community. This gives us a high-level understanding of the graph at several levels of granularity. For example, if community A is the top-level community, we'll get a report about the entire graph. If the community is lower-level, we'll get a report about a local cluster.

      "}, {"location": "index/default_dataflow/#generate-community-reports", "title": "Generate Community Reports", "text": "

      In this step, we generate a summary of each community using the LLM. This will allow us to understand the distinct information contained within each community and provide a scoped understanding of the graph, from either a high-level or a low-level perspective. These reports contain an executive overview and reference the key entities, relationships, and claims within the community sub-structure.

      "}, {"location": "index/default_dataflow/#summarize-community-reports", "title": "Summarize Community Reports", "text": "

      In this step, each community report is then summarized via the LLM for shorthand use.

      "}, {"location": "index/default_dataflow/#community-reports-table", "title": "Community Reports Table", "text": "

      At this point, some bookkeeping work is performed and we export the Community Reports tables.

      "}, {"location": "index/default_dataflow/#phase-5-document-processing", "title": "Phase 5: Document Processing", "text": "

      In this phase of the workflow, we create the Documents table for the knowledge model.

      ---\ntitle: Document Processing\n---\nflowchart LR\n    aug[Augment] --> dp[Link to TextUnits] --> dg[Documents Table]
      "}, {"location": "index/default_dataflow/#augment-with-columns-csv-only", "title": "Augment with Columns (CSV Only)", "text": "

      If the workflow is operating on CSV data, you may configure your workflow to add additional fields to Documents output. These fields should exist on the incoming CSV tables. Details about configuring this can be found in the configuration documentation.

      "}, {"location": "index/default_dataflow/#link-to-textunits", "title": "Link to TextUnits", "text": "

      In this step, we link each document to the text-units that were created in the first phase. This allows us to understand which documents are related to which text-units and vice-versa.

      "}, {"location": "index/default_dataflow/#documents-table", "title": "Documents Table", "text": "

      At this point, we can export the Documents table into the Knowledge Model.

      "}, {"location": "index/default_dataflow/#phase-6-network-visualization-optional", "title": "Phase 6: Network Visualization (optional)", "text": "

      In this phase of the workflow, we perform some steps to support network visualization of our high-dimensional vector spaces within our existing graphs. At this point there are two logical graphs at play: the Entity-Relationship graph and the Document graph.

      ---\ntitle: Network Visualization Workflows\n---\nflowchart LR\n    ag[Graph Table] --> ge[Node2Vec Graph Embedding] --> ne[Umap Entities] --> ng[Entities Table]
      "}, {"location": "index/default_dataflow/#graph-embedding", "title": "Graph Embedding", "text": "

      In this step, we generate a vector representation of our graph using the Node2Vec algorithm. This will allow us to understand the implicit structure of our graph and provide an additional vector-space in which to search for related concepts during our query phase.

      "}, {"location": "index/default_dataflow/#dimensionality-reduction", "title": "Dimensionality Reduction", "text": "

      For each of the logical graphs, we perform a UMAP dimensionality reduction to generate a 2D representation of the graph. This will allow us to visualize the graph in a 2D space and understand the relationships between the nodes in the graph. The UMAP embeddings are reduced to two dimensions as x/y coordinates.
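
      Because this phase is optional, it must be enabled in configuration. A minimal sketch, assuming the embed_graph and umap blocks of recent settings templates (verify against your settings.yaml):

      embed_graph:\n    enabled: true  # generate Node2Vec graph embeddings\numap:\n    enabled: true  # reduce embeddings to 2D x/y coordinates\n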

      "}, {"location": "index/default_dataflow/#phase-7-text-embedding", "title": "Phase 7: Text Embedding", "text": "

      For all artifacts that require downstream vector search, we generate text embeddings as a final step. These embeddings are written directly to a configured vector store. By default we embed entity descriptions, text unit text, and community report text.

      ---\ntitle: Text Embedding Workflows\n---\nflowchart LR\n    textUnits[Text Units] --> text_embed[Text Embedding]\n    graph_outputs[Graph Tables] --> description_embed[Description Embedding]\n    community_report_outputs[Community Reports] --> content_embed[Content Embedding]
      "}, {"location": "index/inputs/", "title": "Inputs", "text": "

      GraphRAG supports several input formats to simplify ingesting your data. The mechanics and features available for input files and text chunking are discussed here.

      "}, {"location": "index/inputs/#input-loading-and-schema", "title": "Input Loading and Schema", "text": "

      All input formats are loaded within GraphRAG and passed to the indexing pipeline as a documents DataFrame. This DataFrame has a row for each document using a shared column schema:

      name type description id str ID of the document. This is generated using a hash of the text content to ensure stability across runs. text str The full text of the document. title str Name of the document. Some formats allow this to be configured. creation_date str The creation date of the document, represented as an ISO8601 string. This is harvested from the source file system. metadata dict Optional additional document metadata. More details below.

      Also see the outputs documentation for the final documents table schema saved to parquet after pipeline completion.

      "}, {"location": "index/inputs/#formats", "title": "Formats", "text": "

      We support three file formats out-of-the-box. This covers the overwhelming majority of use cases we have encountered. If you have a different format, we recommend writing a script to convert to one of these, which are widely used and supported by many tools and libraries.

      "}, {"location": "index/inputs/#plain-text", "title": "Plain Text", "text": "

      Plain text files (typically ending in .txt file extension). With plain text files we import the entire file contents as the text field, and the title is always the filename.
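
      A minimal input block for plain text files might look like the following (file_type matches the examples later on this page; base_dir is an assumed folder name, so point it at wherever your .txt files live):

      input:\n    file_type: text\n    base_dir: \"input\"  # assumption: folder containing your .txt files\n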

      "}, {"location": "index/inputs/#comma-delimited", "title": "Comma-delimited", "text": "

      CSV files (typically ending in a .csv extension). These are loaded using pandas' read_csv method with default options. Each row in a CSV file is treated as a single document. If you have multiple CSV files in your input folder, they will be concatenated into a single resulting documents DataFrame.

      With the CSV format you can configure the text_column and title_column if your data has structured content you would prefer to use. If you do not configure these within the input block of your settings.yaml, the title will be the filename as described in the schema above. The text_column is assumed to be \"text\" in your file if not configured specifically. We will also look for and use an \"id\" column if present; otherwise the ID will be generated as described above.
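
      For example, a sketch of an input block for a CSV with headline and body columns (the column names here are illustrative):

      input:\n    file_type: csv\n    text_column: body      # illustrative column name\n    title_column: headline # illustrative column name\n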

      "}, {"location": "index/inputs/#json", "title": "JSON", "text": "

      JSON files (typically ending in a .json extension) contain structured objects. These are loaded using python's json.loads method, so your files must be valid JSON. A file may contain either a single object or an array of objects at the root; we will check for and handle either of these cases. As with CSV, multiple files will be concatenated into a final table, and the text_column and title_column config options will be applied to the properties of each loaded object. Note that the specialized jsonl format produced by some libraries (one full JSON object on each line, not in an array) is not currently supported.

      "}, {"location": "index/inputs/#metadata", "title": "Metadata", "text": "

      With the structured file formats (CSV and JSON) you can configure any number of columns to be added to a persisted metadata field in the DataFrame. This is configured by supplying a list of column names to collect. If this is configured, the output metadata column will have a dict containing a key for each column, and the value of the column for that document. This metadata can optionally be used later in the GraphRAG pipeline.

      "}, {"location": "index/inputs/#example", "title": "Example", "text": "

      software.csv

      text,title,tag\nMy first program,Hello World,tutorial\nAn early space shooter game,Space Invaders,arcade\n

      settings.yaml

      input:\n    metadata: [title,tag]\n

      Documents DataFrame

      id title text creation_date metadata (generated from text) Hello World My first program (create date of software.csv) { \"title\": \"Hello World\", \"tag\": \"tutorial\" } (generated from text) Space Invaders An early space shooter game (create date of software.csv) { \"title\": \"Space Invaders\", \"tag\": \"arcade\" }"}, {"location": "index/inputs/#chunking-and-metadata", "title": "Chunking and Metadata", "text": "

      As described on the default dataflow page, documents are chunked into smaller \"text units\" for processing. This is done because document content size often exceeds the available context window for a given language model. There are a handful of settings you can adjust for this chunking, the most relevant being the chunk_size and overlap. We now also support a metadata processing scheme that can improve indexing results for some use cases. We will describe this feature in detail here.

      Imagine the following scenario: you are indexing a collection of news articles. Each article text starts with a headline and author, and then proceeds with the content. When documents are chunked, they are split evenly according to your configured chunk size. In other words, the first n tokens are read into a text unit, and then the next n, until the end of the content. This means that front matter at the beginning of the document (such as the headline and author in this example) is not copied to each chunk. It only exists in the first chunk. When we later retrieve those chunks for summarization, they may therefore be missing shared information about the source document that should always be provided to the model. We have configuration options to copy repeated content into each text unit to address this issue.

      "}, {"location": "index/inputs/#input-config", "title": "Input Config", "text": "

      As described above, when documents are imported you can specify a list of metadata columns to include with each row. This must be configured for the per-chunk copying to work.

      "}, {"location": "index/inputs/#chunking-config", "title": "Chunking Config", "text": "

      Next, the chunks block needs to instruct the chunker how to handle this metadata when creating text units. By default, it is ignored. We have two settings to include it:

      • prepend_metadata: This instructs the importer to copy the contents of the metadata column for each row into the start of every text chunk. This metadata is copied as key: value pairs on new lines.
      • chunk_size_includes_metadata: This tells the chunker how to compute the chunk size when metadata is included. By default, we create the text units using your specified chunk_size and then prepend the metadata. This means that the final text unit lengths may be longer than your configured chunk_size, and it will vary based on the length of the metadata for each document. When this setting is True, we will compute the raw text using the remainder after measuring the metadata length so that the resulting text units always comply with your configured chunk_size.
      "}, {"location": "index/inputs/#examples", "title": "Examples", "text": "

      The following are several examples to help illustrate how chunking config and metadata prepending work for each file format. Note that we are using word count here as \"tokens\" for the illustration, but language model tokens are not equivalent to words.

      "}, {"location": "index/inputs/#text-files", "title": "Text files", "text": "

      This example uses two individual news article text files.

      --

      File: US to lift most federal COVID-19 vaccine mandates.txt

      Content:

      WASHINGTON (AP) The Biden administration will end most of the last remaining federal COVID-19 vaccine requirements next week when the national public health emergency for the coronavirus ends, the White House said Monday. Vaccine requirements for federal workers and federal contractors, as well as foreign air travelers to the U.S., will end May 11. The government is also beginning the process of lifting shot requirements for Head Start educators, healthcare workers, and noncitizens at U.S. land borders. The requirements are among the last vestiges of some of the more coercive measures taken by the federal government to promote vaccination as the deadly virus raged, and their end marks the latest display of how President Joe Biden's administration is moving to treat COVID-19 as a routine, endemic illness. \"While I believe that these vaccine mandates had a tremendous beneficial impact, we are now at a point where we think that it makes a lot of sense to pull these requirements down,\" White House COVID-19 coordinator Dr. Ashish Jha told The Associated Press on Monday.

      --

      File: NY lawmakers begin debating budget 1 month after due date.txt

      Content:

      ALBANY, N.Y. (AP) New York lawmakers began voting Monday on a $229 billion state budget due a month ago that would raise the minimum wage, crack down on illicit pot shops and ban gas stoves and furnaces in new buildings. Negotiations among Gov. Kathy Hochul and her fellow Democrats in control of the Legislature dragged on past the April 1 budget deadline, largely because of disagreements over changes to the bail law and other policy proposals included in the spending plan. Floor debates on some budget bills began Monday. State Senate Majority Leader Andrea Stewart-Cousins said she expected voting to be wrapped up Tuesday for a budget she said contains \"significant wins\" for New Yorkers. \"I would have liked to have done this sooner. I think we would all agree to that,\" Cousins told reporters before voting began. \"This has been a very policy-laden budget and a lot of the policies had to parsed through.\" Hochul was able to push through a change to the bail law that will eliminate the standard that requires judges to prescribe the \"least restrictive\" means to ensure defendants return to court. Hochul said judges needed the extra discretion. Some liberal lawmakers argued that it would undercut the sweeping bail reforms approved in 2019 and result in more people with low incomes and people of color in pretrial detention. Here are some other policy provisions that will be included in the budget, according to state officials. The minimum wage would be raised to $17 in New York City and some of its suburbs and $16 in the rest of the state by 2026. That's up from $15 in the city and $14.20 upstate.

      --

      settings.yaml

      input:\n    file_type: text\n    metadata: [title]\n\nchunks:\n    size: 100\n    overlap: 0\n    prepend_metadata: true\n    chunk_size_includes_metadata: false\n

      Documents DataFrame

      id title text creation_date metadata (generated from text) US to lift most federal COVID-19 vaccine mandates.txt (full content of text file) (create date of article txt file) { \"title\": \"US to lift most federal COVID-19 vaccine mandates.txt\" } (generated from text) NY lawmakers begin debating budget 1 month after due date.txt (full content of text file) (create date of article txt file) { \"title\": \"NY lawmakers begin debating budget 1 month after due date.txt\" }

      Raw Text Chunks

      content length title: US to lift most federal COVID-19 vaccine mandates.txtWASHINGTON (AP) The Biden administration will end most of the last remaining federal COVID-19 vaccine requirements next week when the national public health emergency for the coronavirus ends, the White House said Monday. Vaccine requirements for federal workers and federal contractors, as well as foreign air travelers to the U.S., will end May 11. The government is also beginning the process of lifting shot requirements for Head Start educators, healthcare workers, and noncitizens at U.S. land borders. The requirements are among the last vestiges of some of the more coercive measures taken by the federal government to promote vaccination as 109 title: US to lift most federal COVID-19 vaccine mandates.txtthe deadly virus raged, and their end marks the latest display of how President Joe Biden's administration is moving to treat COVID-19 as a routine, endemic illness. \"While I believe that these vaccine mandates had a tremendous beneficial impact, we are now at a point where we think that it makes a lot of sense to pull these requirements down,\" White House COVID-19 coordinator Dr. Ashish Jha told The Associated Press on Monday. 82 title: NY lawmakers begin debating budget 1 month after due date.txtALBANY, N.Y. (AP) New York lawmakers began voting Monday on a $229 billion state budget due a month ago that would raise the minimum wage, crack down on illicit pot shops and ban gas stoves and furnaces in new buildings. Negotiations among Gov. Kathy Hochul and her fellow Democrats in control of the Legislature dragged on past the April 1 budget deadline, largely because of disagreements over changes to the bail law and other policy proposals included in the spending plan. Floor debates on some budget bills began Monday. State Senate Majority Leader Andrea Stewart-Cousins said she expected voting to 111 title: NY lawmakers begin debating budget 1 month after due date.txtbe wrapped up Tuesday for a budget she said contains \"significant wins\" for New Yorkers. \"I would have liked to have done this sooner. I think we would all agree to that,\" Cousins told reporters before voting began. \"This has been a very policy-laden budget and a lot of the policies had to parsed through.\" Hochul was able to push through a change to the bail law that will eliminate the standard that requires judges to prescribe the \"least restrictive\" means to ensure defendants return to court. Hochul said judges needed the extra discretion. Some liberal lawmakers argued that it 111 title: NY lawmakers begin debating budget 1 month after due date.txtwould undercut the sweeping bail reforms approved in 2019 and result in more people with low incomes and people of color in pretrial detention. Here are some other policy provisions that will be included in the budget, according to state officials. The minimum wage would be raised to $17 in New York City and some of its suburbs and $16 in the rest of the state by 2026. That's up from $15 in the city and $14.20 upstate. 89

      In this example we can see that the two input documents were parsed into five output text chunks. The title (filename) of each document is prepended but not included in the computed chunk size. Also note that the final text chunk for each document is usually smaller than the chunk size because it contains only the remaining tokens.

      "}, {"location": "index/inputs/#csv-files", "title": "CSV files", "text": "

      This example uses a single CSV file with the same two articles as rows (note that the text content is not properly escaped for actual CSV use).

      --

      File: articles.csv

      Content

      headline,article

      US to lift most federal COVID-19 vaccine mandates,WASHINGTON (AP) The Biden administration will end most of the last remaining federal COVID-19 vaccine requirements next week when the national public health emergency for the coronavirus ends, the White House said Monday. Vaccine requirements for federal workers and federal contractors, as well as foreign air travelers to the U.S., will end May 11. The government is also beginning the process of lifting shot requirements for Head Start educators, healthcare workers, and noncitizens at U.S. land borders. The requirements are among the last vestiges of some of the more coercive measures taken by the federal government to promote vaccination as the deadly virus raged, and their end marks the latest display of how President Joe Biden's administration is moving to treat COVID-19 as a routine, endemic illness. \"While I believe that these vaccine mandates had a tremendous beneficial impact, we are now at a point where we think that it makes a lot of sense to pull these requirements down,\" White House COVID-19 coordinator Dr. Ashish Jha told The Associated Press on Monday.

      NY lawmakers begin debating budget 1 month after due date,ALBANY, N.Y. (AP) New York lawmakers began voting Monday on a $229 billion state budget due a month ago that would raise the minimum wage, crack down on illicit pot shops and ban gas stoves and furnaces in new buildings. Negotiations among Gov. Kathy Hochul and her fellow Democrats in control of the Legislature dragged on past the April 1 budget deadline, largely because of disagreements over changes to the bail law and other policy proposals included in the spending plan. Floor debates on some budget bills began Monday. State Senate Majority Leader Andrea Stewart-Cousins said she expected voting to be wrapped up Tuesday for a budget she said contains \"significant wins\" for New Yorkers. \"I would have liked to have done this sooner. I think we would all agree to that,\" Cousins told reporters before voting began. \"This has been a very policy-laden budget and a lot of the policies had to parsed through.\" Hochul was able to push through a change to the bail law that will eliminate the standard that requires judges to prescribe the \"least restrictive\" means to ensure defendants return to court. Hochul said judges needed the extra discretion. Some liberal lawmakers argued that it would undercut the sweeping bail reforms approved in 2019 and result in more people with low incomes and people of color in pretrial detention. Here are some other policy provisions that will be included in the budget, according to state officials. The minimum wage would be raised to $17 in New York City and some of its suburbs and $16 in the rest of the state by 2026. That's up from $15 in the city and $14.20 upstate.

      --

      settings.yaml

      input:\n    file_type: csv\n    title_column: headline\n    text_column: article\n    metadata: [headline]\n\nchunks:\n    size: 50\n    overlap: 5\n    prepend_metadata: true\n    chunk_size_includes_metadata: true\n

      Documents DataFrame

      id title text creation_date metadata (generated from text) US to lift most federal COVID-19 vaccine mandates (article column content) (create date of articles.csv) { \"headline\": \"US to lift most federal COVID-19 vaccine mandates\" } (generated from text) NY lawmakers begin debating budget 1 month after due date (article column content) (create date of articles.csv) { \"headline\": \"NY lawmakers begin debating budget 1 month after due date\" }

      Raw Text Chunks

      content length title: US to lift most federal COVID-19 vaccine mandatesWASHINGTON (AP) The Biden administration will end most of the last remaining federal COVID-19 vaccine requirements next week when the national public health emergency for the coronavirus ends, the White House said Monday. Vaccine requirements for federal workers and federal contractors, 50 title: US to lift most federal COVID-19 vaccine mandatesfederal workers and federal contractors as well as foreign air travelers to the U.S., will end May 11. The government is also beginning the process of lifting shot requirements for Head Start educators, healthcare workers, and noncitizens at U.S. land borders. 50 title: US to lift most federal COVID-19 vaccine mandatesnoncitizens at U.S. land borders. The requirements are among the last vestiges of some of the more coercive measures taken by the federal government to promote vaccination as the deadly virus raged, and their end marks the latest display of how 50 title: US to lift most federal COVID-19 vaccine mandatesthe latest display of how President Joe Biden's administration is moving to treat COVID-19 as a routine, endemic illness. \"While I believe that these vaccine mandates had a tremendous beneficial impact, we are now at a point where we think that 50 title: US to lift most federal COVID-19 vaccine mandatespoint where we think that it makes a lot of sense to pull these requirements down,\" White House COVID-19 coordinator Dr. Ashish Jha told The Associated Press on Monday. 38 title: NY lawmakers begin debating budget 1 month after due dateALBANY, N.Y. (AP) New York lawmakers began voting Monday on a $229 billion state budget due a month ago that would raise the minimum wage, crack down on illicit pot shops and ban gas stoves and furnaces in new 50 title: NY lawmakers begin debating budget 1 month after due datestoves and furnaces in new buildings. Negotiations among Gov. Kathy Hochul and her fellow Democrats in control of the Legislature dragged on past the April 1 budget deadline, largely because of disagreements over changes to the bail law and 50 title: NY lawmakers begin debating budget 1 month after due dateto the bail law and other policy proposals included in the spending plan. Floor debates on some budget bills began Monday. State Senate Majority Leader Andrea Stewart-Cousins said she expected voting to be wrapped up Tuesday for a budget 50 title: NY lawmakers begin debating budget 1 month after due dateup Tuesday for a budget she said contains \"significant wins\" for New Yorkers. \"I would have liked to have done this sooner. I think we would all agree to that,\" Cousins told reporters before voting began. \"This has been 50 title: NY lawmakers begin debating budget 1 month after due datevoting began. \"This has been a very policy-laden budget and a lot of the policies had to parsed through.\" Hochul was able to push through a change to the bail law that will eliminate the standard that requires judges 50 title: NY lawmakers begin debating budget 1 month after due datethe standard that requires judges to prescribe the \"least restrictive\" means to ensure defendants return to court. Hochul said judges needed the extra discretion. Some liberal lawmakers argued that it would undercut the sweeping bail reforms approved in 2019 50 title: NY lawmakers begin debating budget 1 month after due datebail reforms approved in 2019 and result in more people with low incomes and people of color in pretrial detention. 
Here are some other policy provisions that will be included in the budget, according to state officials. The minimum 50 title: NY lawmakers begin debating budget 1 month after due dateto state officials. The minimum wage would be raised to $17 in be raised to $17 in New York City and some of its suburbs and $16 in the rest of the state by 2026. That's up from $15 50 title: NY lawmakers begin debating budget 1 month after due date2026. That's up from $15 in the city and $14.20 upstate. 22

      In this example we can see that the two input documents were parsed into fourteen output text chunks. The title (headline) of each document is prepended and included in the computed chunk size, so each chunk matches the configured chunk size (except the last one for each document). We've also configured some overlap in these text chunks, so the last five tokens are shared. Why would you use overlap in your text chunks? Consider that when you are splitting documents based on tokens, it is highly likely that sentences or even related concepts will be split into separate chunks. Each text chunk is processed separately by the language model, so this may result in incomplete \"ideas\" at the boundaries of the chunk. Overlap ensures that these split concepts are fully contained in at least one of the chunks.

      "}, {"location": "index/inputs/#json-files", "title": "JSON files", "text": "

      This final example uses a JSON file for each of the same two articles. In this example we'll set the object fields to read, but we will not add metadata to the text chunks.

      --

      File: article1.json

      Content

      {\n    \"headline\": \"US to lift most federal COVID-19 vaccine mandates\",\n    \"content\": \"WASHINGTON (AP) The Biden administration will end most of the last remaining federal COVID-19 vaccine requirements next week when the national public health emergency for the coronavirus ends, the White House said Monday. Vaccine requirements for federal workers and federal contractors, as well as foreign air travelers to the U.S., will end May 11. The government is also beginning the process of lifting shot requirements for Head Start educators, healthcare workers, and noncitizens at U.S. land borders. The requirements are among the last vestiges of some of the more coercive measures taken by the federal government to promote vaccination as the deadly virus raged, and their end marks the latest display of how President Joe Biden's administration is moving to treat COVID-19 as a routine, endemic illness. \"While I believe that these vaccine mandates had a tremendous beneficial impact, we are now at a point where we think that it makes a lot of sense to pull these requirements down,\" White House COVID-19 coordinator Dr. Ashish Jha told The Associated Press on Monday.\"\n}\n

      File: article2.json

      Content

      {\n    \"headline\": \"NY lawmakers begin debating budget 1 month after due date\",\n    \"content\": \"ALBANY, N.Y. (AP) New York lawmakers began voting Monday on a $229 billion state budget due a month ago that would raise the minimum wage, crack down on illicit pot shops and ban gas stoves and furnaces in new buildings. Negotiations among Gov. Kathy Hochul and her fellow Democrats in control of the Legislature dragged on past the April 1 budget deadline, largely because of disagreements over changes to the bail law and other policy proposals included in the spending plan. Floor debates on some budget bills began Monday. State Senate Majority Leader Andrea Stewart-Cousins said she expected voting to be wrapped up Tuesday for a budget she said contains \"significant wins\" for New Yorkers. \"I would have liked to have done this sooner. I think we would all agree to that,\" Cousins told reporters before voting began. \"This has been a very policy-laden budget and a lot of the policies had to parsed through.\" Hochul was able to push through a change to the bail law that will eliminate the standard that requires judges to prescribe the \"least restrictive\" means to ensure defendants return to court. Hochul said judges needed the extra discretion. Some liberal lawmakers argued that it would undercut the sweeping bail reforms approved in 2019 and result in more people with low incomes and people of color in pretrial detention. Here are some other policy provisions that will be included in the budget, according to state officials. The minimum wage would be raised to $17 in New York City and some of its suburbs and $16 in the rest of the state by 2026. That's up from $15 in the city and $14.20 upstate.\"\n}\n

      --

      settings.yaml

      input:\n    file_type: json\n    title_column: headline\n    text_column: content\n\nchunks:\n    size: 100\n    overlap: 10\n

      Documents DataFrame

      id title text creation_date metadata (generated from text) US to lift most federal COVID-19 vaccine mandates (article column content) (create date of article1.json) { } (generated from text) NY lawmakers begin debating budget 1 month after due date (article column content) (create date of article2.json) { }

      Raw Text Chunks

      content length WASHINGTON (AP) The Biden administration will end most of the last remaining federal COVID-19 vaccine requirements next week when the national public health emergency for the coronavirus ends, the White House said Monday. Vaccine requirements for federal workers and federal contractors, as well as foreign air travelers to the U.S., will end May 11. The government is also beginning the process of lifting shot requirements for Head Start educators, healthcare workers, and noncitizens at U.S. land borders. The requirements are among the last vestiges of some of the more coercive measures taken by the federal government to promote vaccination as 100 measures taken by the federal government to promote vaccination as the deadly virus raged, and their end marks the latest display of how President Joe Biden's administration is moving to treat COVID-19 as a routine, endemic illness. \"While I believe that these vaccine mandates had a tremendous beneficial impact, we are now at a point where we think that it makes a lot of sense to pull these requirements down,\" White House COVID-19 coordinator Dr. Ashish Jha told The Associated Press on Monday. 83 ALBANY, N.Y. (AP) New York lawmakers began voting Monday on a $229 billion state budget due a month ago that would raise the minimum wage, crack down on illicit pot shops and ban gas stoves and furnaces in new buildings. Negotiations among Gov. Kathy Hochul and her fellow Democrats in control of the Legislature dragged on past the April 1 budget deadline, largely because of disagreements over changes to the bail law and other policy proposals included in the spending plan. Floor debates on some budget bills began Monday. State Senate Majority Leader Andrea Stewart-Cousins said she expected voting to 100 Senate Majority Leader Andrea Stewart-Cousins said she expected voting to be wrapped up Tuesday for a budget she said contains \"significant wins\" for New Yorkers. \"I would have liked to have done this sooner. I think we would all agree to that,\" Cousins told reporters before voting began. \"This has been a very policy-laden budget and a lot of the policies had to parsed through.\" Hochul was able to push through a change to the bail law that will eliminate the standard that requires judges to prescribe the \"least restrictive\" means to ensure defendants return to court. Hochul said judges 100 means to ensure defendants return to court. Hochul said judges needed the extra discretion. Some liberal lawmakers argued that it would undercut the sweeping bail reforms approved in 2019 and result in more people with low incomes and people of color in pretrial detention. Here are some other policy provisions that will be included in the budget, according to state officials. The minimum wage would be raised to $17 in New York City and some of its suburbs and $16 in the rest of the state by 2026. That's up from $15 in the city and $14.20 upstate. 98

      In this example the two input documents were parsed into five output text chunks. There is no metadata prepended, so each chunk matches the configured chunk size (except the last one for each document). We've also configured some overlap in these text chunks, so the last ten tokens are shared.

      "}, {"location": "index/methods/", "title": "Indexing Methods", "text": "

      GraphRAG is a platform for our research into RAG indexing methods that produce optimal context window content for language models. We have a standard indexing pipeline that uses a language model to extract the graph that our memory model is based upon. We may introduce additional indexing methods from time to time. This page documents those options.

      "}, {"location": "index/methods/#standard-graphrag", "title": "Standard GraphRAG", "text": "

      This is the method described in the original blog post. Standard uses a language model for all reasoning tasks:

      • entity extraction: LLM is prompted to extract named entities and provide a description from each text unit.
      • relationship extraction: LLM is prompted to describe the relationship between each pair of entities in each text unit.
      • entity summarization: LLM is prompted to combine the descriptions for every instance of an entity found across the text units into a single summary.
      • relationship summarization: LLM is prompted to combine the descriptions for every instance of a relationship found across the text units into a single summary.
      • claim extraction (optional): LLM is prompted to extract and describe claims from each text unit.
      • community report generation: entity and relationship descriptions (and optionally claims) for each community are collected and used to prompt the LLM to generate a summary report.

      graphrag index --method standard. This is the default method, so the method param can actually be omitted.

      "}, {"location": "index/methods/#fastgraphrag", "title": "FastGraphRAG", "text": "

      FastGraphRAG is a method that replaces some of the language model reasoning with traditional natural language processing (NLP) methods. This is a hybrid technique that we developed as a faster and cheaper indexing alternative:

      • entity extraction: entities are noun phrases extracted using NLP libraries such as NLTK and spaCy. There is no description; the source text unit is used for this.
      • relationship extraction: relationships are defined as text unit co-occurrence between entity pairs. There is no description.
      • entity summarization: not necessary.
      • relationship summarization: not necessary.
      • claim extraction (optional): unused.
      • community report generation: The direct text unit content containing each entity noun phrase is collected and used to prompt the LLM to generate a summary report.

      graphrag index --method fast

      FastGraphRAG has a handful of NLP options built in. By default we use NLTK + regular expressions for the noun phrase extraction, which is very fast but primarily suitable for English. We have built in two additional methods using spaCy: semantic parsing and CFG. We use the en_core_web_md model by default for spaCy, but note that you can reference any supported model that you have installed.

      Note that we also generally configure the text chunking to produce much smaller chunks (50-100 tokens). This results in a better co-occurrence graph.
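
      For example, a chunking block tuned for co-occurrence might look like this (the values are illustrative; adjust for your data):

      chunks:\n    size: 100\n    overlap: 0\n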

      \u26a0\ufe0f Note on spaCy models:

      This package requires spaCy models to function correctly. If the required model is not installed, the package will automatically download and install it the first time it is used.

      You can install it manually by running python -m spacy download <model_name>, for example python -m spacy download en_core_web_md.

      "}, {"location": "index/methods/#choosing-a-method", "title": "Choosing a Method", "text": "

      Standard GraphRAG provides a rich description of real-world entities and relationships, but is more expensive than FastGraphRAG. We estimate graph extraction to constitute roughly 75% of indexing cost. FastGraphRAG is therefore much cheaper, but the tradeoff is that the extracted graph is less directly relevant for use outside of GraphRAG, and the graph tends to be quite a bit noisier. If high-fidelity entities and graph exploration are important to your use case, we recommend staying with Standard GraphRAG. If your use case is primarily aimed at summary questions using global search, FastGraphRAG provides high-quality summarization at much lower LLM cost.

      "}, {"location": "index/outputs/", "title": "Outputs", "text": "

      The default pipeline produces a series of output tables that align with the conceptual knowledge model. This page describes the detailed output table schemas. By default we write these tables out as parquet files on disk.
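
      A sketch of the output block controlling where these parquet files are written (key names follow recent configuration templates and the base_dir value is an assumption; verify against your settings.yaml):

      output:\n    type: file\n    base_dir: \"output\"  # assumption: parquet tables such as entities.parquet and relationships.parquet land here\n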

      "}, {"location": "index/outputs/#shared-fields", "title": "Shared fields", "text": "

      All tables have two identifier fields:

      name type description id str Generated UUID, assuring global uniqueness human_readable_id int This is an incremented short ID created per-run. For example, we use this short ID with generated summaries that print citations so they are easy to cross-reference visually."}, {"location": "index/outputs/#communities", "title": "communities", "text": "

      This is a list of the final communities generated by Leiden. Communities are strictly hierarchical, subdividing into children as the cluster affinity is narrowed.

      name type description community int Leiden-generated cluster ID for the community. Note that these increment with depth, so they are unique through all levels of the community hierarchy. For this table, human_readable_id is a copy of the community ID rather than a plain increment. parent int Parent community ID. children int[] List of child community IDs. level int Depth of the community in the hierarchy. title str Friendly name of the community. entity_ids str[] List of entities that are members of the community. relationship_ids str[] List of relationships that are wholly within the community (source and target are both in the community). text_unit_ids str[] List of text units represented within the community. period str Date of ingest, used for incremental update merges. ISO8601 size int Size of the community (entity count), used for incremental update merges."}, {"location": "index/outputs/#community_reports", "title": "community_reports", "text": "

      This is the list of summarized reports for each community.

      name type description community int Short ID of the community this report applies to. parent int Parent community ID. children int[] List of child community IDs. level int Level of the community this report applies to. title str LM-generated title for the report. summary str LM-generated summary of the report. full_content str LM-generated full report. rank float LM-derived relevance ranking of the report based on member entity salience rating_explanation str LM-derived explanation of the rank. findings dict LM-derived list of the top 5-10 insights from the community. Contains summary and explanation values. full_content_json json Full JSON output as returned by the LM. Most fields are extracted into columns, but this JSON is sent for query summarization so we leave it to allow for prompt tuning to add fields/content by end users. period str Date of ingest, used for incremental update merges. ISO8601 size int Size of the community (entity count), used for incremental update merges."}, {"location": "index/outputs/#covariates", "title": "covariates", "text": "

      (Optional) If claim extraction is turned on, this is a list of the extracted covariates. Note that claims are typically oriented around identifying malicious behavior such as fraud, so they are not useful for all datasets.

      name type description covariate_type str This is always \"claim\" with our default covariates. type str Nature of the claim type. description str LM-generated description of the behavior. subject_id str Name of the source entity (that is performing the claimed behavior). object_id str Name of the target entity (that the claimed behavior is performed on). status str LM-derived assessment of the correctness of the claim. One of [TRUE, FALSE, SUSPECTED] start_date str LM-derived start of the claimed activity. ISO8601 end_date str LM-derived end of the claimed activity. ISO8601 source_text str Short string of text containing the claimed behavior. text_unit_id str ID of the text unit the claim text was extracted from."}, {"location": "index/outputs/#documents", "title": "documents", "text": "

      List of document content after import.

      name type description title str Filename, unless otherwise configured during CSV import. text str Full text of the document. text_unit_ids str[] List of text units (chunks) that were parsed from the document. metadata dict If specified during CSV import, this is a dict of metadata for the document."}, {"location": "index/outputs/#entities", "title": "entities", "text": "

      List of all entities found in the data by the LM.

      name type description title str Name of the entity. type str Type of the entity. By default this will be \"organization\", \"person\", \"geo\", or \"event\" unless configured differently or auto-tuning is used. description str Textual description of the entity. Entities may be found in many text units, so this is an LM-derived summary of all descriptions. text_unit_ids str[] List of the text units containing the entity. frequency int Count of text units the entity was found within. degree int Node degree (connectedness) in the graph. x float X position of the node for visual layouts. If graph embeddings and UMAP are not turned on, this will be 0. y float Y position of the node for visual layouts. If graph embeddings and UMAP are not turned on, this will be 0."}, {"location": "index/outputs/#relationships", "title": "relationships", "text": "

      List of all entity-to-entity relationships found in the data by the LM. This is also the edge list for the graph.

      name type description source str Name of the source entity. target str Name of the target entity. description str LM-derived description of the relationship. Also see note for entity descriptions. weight float Weight of the edge in the graph. This is summed from an LM-derived \"strength\" measure for each relationship instance. combined_degree int Sum of source and target node degrees. text_unit_ids str[] List of text units the relationship was found within."}, {"location": "index/outputs/#text_units", "title": "text_units", "text": "

      List of all text chunks parsed from the input documents.

      name type description text str Raw full text of the chunk. n_tokens int Number of tokens in the chunk. This should normally match the chunk_size config parameter, except for the last chunk which is often shorter. document_ids str[] List of document IDs the chunk came from. This is normally only 1 due to our default groupby, but for very short text documents (e.g., microblogs) it can be configured so text units span multiple documents. entity_ids str[] List of entities found in the text unit. relationships_ids str[] List of relationships found in the text unit. covariate_ids str[] Optional list of covariates found in the text unit."}, {"location": "index/overview/", "title": "GraphRAG Indexing \ud83e\udd16", "text": "

      The GraphRAG indexing package is a data pipeline and transformation suite that is designed to extract meaningful, structured data from unstructured text using LLMs.

      Indexing Pipelines are configurable. They are composed of workflows, standard and custom steps, prompt templates, and input/output adapters. Our standard pipeline is designed to:

      • extract entities, relationships and claims from raw text
      • perform community detection on entities
      • generate community summaries and reports at multiple levels of granularity
      • embed entities into a graph vector space
      • embed text chunks into a textual vector space

      The outputs of the pipeline are stored as Parquet tables by default, and embeddings are written to your configured vector store.

      "}, {"location": "index/overview/#getting-started", "title": "Getting Started", "text": ""}, {"location": "index/overview/#requirements", "title": "Requirements", "text": "

      See the requirements section in Get Started for details on setting up a development environment.

      To configure GraphRAG, see the configuration documentation. After you have a config file you can run the pipeline using the CLI or the Python API.

      "}, {"location": "index/overview/#usage", "title": "Usage", "text": ""}, {"location": "index/overview/#cli", "title": "CLI", "text": "
      uv run poe index --root <data_root> # default config mode\n
      "}, {"location": "index/overview/#python-api", "title": "Python API", "text": "

      Please see the indexing API python file for the recommended method to call directly from Python code.

      "}, {"location": "index/overview/#further-reading", "title": "Further Reading", "text": "
      • To start developing within the GraphRAG project, see getting started
      • To understand the underlying concepts and execution model of the indexing library, see the architecture documentation
      • To read more about configuring the indexing engine, see the configuration documentation
      "}, {"location": "prompt_tuning/auto_prompt_tuning/", "title": "Auto Prompt Tuning \u2699\ufe0f", "text": "

      GraphRAG provides the ability to create domain-adapted prompts for the generation of the knowledge graph. This step is optional, though we highly encourage running it, as it will yield better results when executing an index run.

      These are generated by loading the inputs, splitting them into chunks (text units), and then running a series of LLM invocations and template substitutions to generate the final prompts. We suggest using the default values provided by the script, but on this page you'll find the detail of each option in case you want to further explore and tweak the prompt tuning algorithm.

      Figure 1: Auto Tuning Conceptual Diagram.

      "}, {"location": "prompt_tuning/auto_prompt_tuning/#prerequisites", "title": "Prerequisites", "text": "

      Before running auto tuning, ensure you have already initialized your workspace with the graphrag init command. This will create the necessary configuration files and the default prompts. Refer to the Init Documentation for more information about the initialization process.
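
      For example (the project path is illustrative):

      graphrag init --root ./ragtest\n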

      "}, {"location": "prompt_tuning/auto_prompt_tuning/#usage", "title": "Usage", "text": "

      You can run the main script from the command line with various options:

      graphrag prompt-tune [--root ROOT] [--config CONFIG] [--domain DOMAIN]  [--selection-method METHOD] [--limit LIMIT] [--language LANGUAGE] \\\n[--max-tokens MAX_TOKENS] [--chunk-size CHUNK_SIZE] [--n-subset-max N_SUBSET_MAX] [--k K] \\\n[--min-examples-required MIN_EXAMPLES_REQUIRED] [--discover-entity-types] [--output OUTPUT]\n
      "}, {"location": "prompt_tuning/auto_prompt_tuning/#command-line-options", "title": "Command-Line Options", "text": "
      • --config (required): The path to the configuration file. This is required to load the data and model settings.

      • --root (optional): The data project root directory, including the config files (YML, JSON, or .env). Defaults to the current directory.

      • --domain (optional): The domain related to your input data, such as 'space science', 'microbiology', or 'environmental news'. If left empty, the domain will be inferred from the input data.

      • --selection-method (optional): The method to select documents. Options are all, random, auto or top. Default is random.

      • --limit (optional): The limit of text units to load when using random or top selection. Default is 15.

      • --language (optional): The language to use for input processing. If it is different from the inputs' language, the LLM will translate. Default is \"\" meaning it will be automatically detected from the inputs.

      • --max-tokens (optional): Maximum token count for prompt generation. Default is 2000.

      • --chunk-size (optional): The size in tokens to use for generating text units from input documents. Default is 200.

      • --n-subset-max (optional): The number of text chunks to embed when using auto selection method. Default is 300.

      • --k (optional): The number of documents to select when using auto selection method. Default is 15.

      • --min-examples-required (optional): The minimum number of examples required for entity extraction prompts. Default is 2.

      • --discover-entity-types (optional): Allow the LLM to discover and extract entities automatically. We recommend using this when your data covers a lot of topics or it is highly randomized.

      • --output (optional): The folder to save the generated prompts. Default is \"prompts\".

      "}, {"location": "prompt_tuning/auto_prompt_tuning/#example-usage", "title": "Example Usage", "text": "
      python -m graphrag prompt-tune --root /path/to/project --config /path/to/settings.yaml --domain \"environmental news\" \\\n--selection-method random --limit 10 --language English --max-tokens 2048 --chunk-size 256 --min-examples-required 3 \\\n--no-entity-types --output /path/to/output\n

      or, with minimal configuration (suggested):

      python -m graphrag prompt-tune --root /path/to/project --config /path/to/settings.yaml --no-entity-types\n
      "}, {"location": "prompt_tuning/auto_prompt_tuning/#document-selection-methods", "title": "Document Selection Methods", "text": "

The auto tuning feature ingests the input data and then divides it into text units whose size is set by the chunk size parameter. After that, it uses one of the following selection methods to pick a sample to work with for prompt generation:

      • random: Select text units randomly. This is the default and recommended option.
• top: Select the first n text units.
      • all: Use all text units for the generation. Use only with small datasets; this option is not usually recommended.
      • auto: Embed text units in a lower-dimensional space and select the k nearest neighbors to the centroid. This is useful when you have a large dataset and want to select a representative sample.
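To make the auto method more concrete, below is a minimal sketch of centroid-based selection using numpy. This is not GraphRAG's internal implementation, and the embedding matrix is a random stand-in for real chunk embeddings produced by whatever embedding model you have configured.

import numpy as np

def select_auto(embeddings: np.ndarray, k: int = 15) -> list[int]:
    # Keep the k chunks whose embeddings lie closest to the centroid of all chunks.
    centroid = embeddings.mean(axis=0)
    distances = np.linalg.norm(embeddings - centroid, axis=1)
    return np.argsort(distances)[:k].tolist()

# 300 chunks embedded into 1536-dimensional vectors (random stand-ins here).
chunk_embeddings = np.random.rand(300, 1536)
selected_indices = select_auto(chunk_embeddings, k=15)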
      "}, {"location": "prompt_tuning/auto_prompt_tuning/#modify-env-vars", "title": "Modify Env Vars", "text": "

After running auto tuning, you should modify the following environment variables (or config variables) to pick up the new prompts on your index run. Note: make sure to use the correct path to the generated prompts; in this example we use the default \"prompts\" path.

      • GRAPHRAG_ENTITY_EXTRACTION_PROMPT_FILE = \"prompts/entity_extraction.txt\"

      • GRAPHRAG_COMMUNITY_REPORT_PROMPT_FILE = \"prompts/community_report.txt\"

      • GRAPHRAG_SUMMARIZE_DESCRIPTIONS_PROMPT_FILE = \"prompts/summarize_descriptions.txt\"

      or in your yaml config file:

      entity_extraction:\n  prompt: \"prompts/entity_extraction.txt\"\n\nsummarize_descriptions:\n  prompt: \"prompts/summarize_descriptions.txt\"\n\ncommunity_reports:\n  prompt: \"prompts/community_report.txt\"\n
      "}, {"location": "prompt_tuning/manual_prompt_tuning/", "title": "Manual Prompt Tuning \u2699\ufe0f", "text": "

The GraphRAG indexer, by default, will run with a handful of prompts that are designed to work well in the broad context of knowledge discovery. However, it is quite common to want to tune the prompts to better suit your specific use case. You can do this by specifying custom prompt files, each of which uses a series of token replacements internally.

      Each of these prompts may be overridden by writing a custom prompt file in plaintext. We use token-replacements in the form of {token_name}, and the descriptions for the available tokens can be found below.
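As a rough illustration of how such a custom prompt is consumed (the library's internal substitution mechanism may differ), filling the tokens amounts to replacing each {token_name} placeholder with its value; the prompt text and values below are made up for the example:

# Illustrative only: fill {token_name} placeholders in a custom prompt string.
custom_prompt = (
    "-Goal-\n"
    "Extract entities of the following types: {entity_types}\n"
    "-Input-\n"
    "{input_text}\n"
)

values = {
    "entity_types": "organization, person, geo",
    "input_text": "Contoso Ltd. opened a new office in Seattle.",
}

rendered = custom_prompt.format_map(values)
print(rendered)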

      "}, {"location": "prompt_tuning/manual_prompt_tuning/#indexing-prompts", "title": "Indexing Prompts", "text": ""}, {"location": "prompt_tuning/manual_prompt_tuning/#entityrelationship-extraction", "title": "Entity/Relationship Extraction", "text": "

      Prompt Source

      "}, {"location": "prompt_tuning/manual_prompt_tuning/#tokens", "title": "Tokens", "text": "
      • {input_text} - The input text to be processed.
• {entity_types} - A list of entity types.
      • {tuple_delimiter} - A delimiter for separating values within a tuple. A single tuple is used to represent an individual entity or relationship.
      • {record_delimiter} - A delimiter for separating tuple instances.
      • {completion_delimiter} - An indicator for when generation is complete.
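To show how these delimiter tokens fit together, below is a hypothetical extraction output and a small parser. The delimiter strings and the output layout are illustrative assumptions, not necessarily the defaults shipped with GraphRAG.

# Hypothetical delimiter values and model output, used only to illustrate how
# {tuple_delimiter}, {record_delimiter} and {completion_delimiter} relate.
TUPLE_DELIMITER = "<|>"
RECORD_DELIMITER = "##"
COMPLETION_DELIMITER = "<|COMPLETE|>"

raw_output = (
    '("entity"<|>CHAMOMILE<|>PLANT<|>A flowering herb used in teas)##'
    '("relationship"<|>CHAMOMILE<|>TEA<|>Chamomile is brewed into tea<|>8)##'
    "<|COMPLETE|>"
)

records = []
for record in raw_output.split(RECORD_DELIMITER):
    record = record.strip().strip("()")
    if not record or record == COMPLETION_DELIMITER:
        continue
    records.append(record.split(TUPLE_DELIMITER))

print(records)
# [['"entity"', 'CHAMOMILE', 'PLANT', 'A flowering herb used in teas'],
#  ['"relationship"', 'CHAMOMILE', 'TEA', 'Chamomile is brewed into tea', '8']]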
      "}, {"location": "prompt_tuning/manual_prompt_tuning/#summarize-entityrelationship-descriptions", "title": "Summarize Entity/Relationship Descriptions", "text": "

      Prompt Source

      "}, {"location": "prompt_tuning/manual_prompt_tuning/#tokens_1", "title": "Tokens", "text": "
      • {entity_name} - The name of the entity or the source/target pair of the relationship.
      • {description_list} - A list of descriptions for the entity or relationship.
      "}, {"location": "prompt_tuning/manual_prompt_tuning/#claim-extraction", "title": "Claim Extraction", "text": "

      Prompt Source

      "}, {"location": "prompt_tuning/manual_prompt_tuning/#tokens_2", "title": "Tokens", "text": "
      • {input_text} - The input text to be processed.
      • {tuple_delimiter} - A delimiter for separating values within a tuple. A single tuple is used to represent an individual entity or relationship.
      • {record_delimiter} - A delimiter for separating tuple instances.
      • {completion_delimiter} - An indicator for when generation is complete.
      • {entity_specs} - A list of entity types.
      • {claim_description} - Description of what claims should look like. Default is: \"Any claims or facts that could be relevant to information discovery.\"

      See the configuration documentation for details on how to change this.

      "}, {"location": "prompt_tuning/manual_prompt_tuning/#generate-community-reports", "title": "Generate Community Reports", "text": "

      Prompt Source

      "}, {"location": "prompt_tuning/manual_prompt_tuning/#tokens_3", "title": "Tokens", "text": "
      • {input_text} - The input text to generate the report with. This will contain tables of entities and relationships.
      "}, {"location": "prompt_tuning/manual_prompt_tuning/#query-prompts", "title": "Query Prompts", "text": ""}, {"location": "prompt_tuning/manual_prompt_tuning/#local-search", "title": "Local Search", "text": "

      Prompt Source

      "}, {"location": "prompt_tuning/manual_prompt_tuning/#tokens_4", "title": "Tokens", "text": "
      • {response_type} - Describe how the response should look. We default to \"multiple paragraphs\".
      • {context_data} - The data tables from GraphRAG's index.
      "}, {"location": "prompt_tuning/manual_prompt_tuning/#global-search", "title": "Global Search", "text": "

      Mapper Prompt Source

      Reducer Prompt Source

      Knowledge Prompt Source

      Global search uses a map/reduce approach to summarization. You can tune these prompts independently. This search also includes the ability to adjust the use of general knowledge from the model's training.

      "}, {"location": "prompt_tuning/manual_prompt_tuning/#tokens_5", "title": "Tokens", "text": "
      • {response_type} - Describe how the response should look (reducer only). We default to \"multiple paragraphs\".
      • {context_data} - The data tables from GraphRAG's index.
      "}, {"location": "prompt_tuning/manual_prompt_tuning/#drift-search", "title": "Drift Search", "text": "

      Prompt Source

      "}, {"location": "prompt_tuning/manual_prompt_tuning/#tokens_6", "title": "Tokens", "text": "
      • {response_type} - Describe how the response should look. We default to \"multiple paragraphs\".
      • {context_data} - The data tables from GraphRAG's index.
      • {community_reports} - The most relevant community reports to include in the summarization.
      • {query} - The query text as injected into the context.
      "}, {"location": "prompt_tuning/overview/", "title": "Prompt Tuning \u2699\ufe0f", "text": "

      This page provides an overview of the prompt tuning options available for the GraphRAG indexing engine.

      "}, {"location": "prompt_tuning/overview/#default-prompts", "title": "Default Prompts", "text": "

The default prompts are the simplest way to get started with the GraphRAG system. They are designed to work out-of-the-box with minimal configuration. More details about each of the default prompts for indexing and query can be found on the manual tuning page.

      "}, {"location": "prompt_tuning/overview/#auto-tuning", "title": "Auto Tuning", "text": "

Auto Tuning leverages your input data and LLM interactions to create domain-adapted prompts for the generation of the knowledge graph. Running it is highly encouraged, as it will yield better results when executing an Index Run. For more details about how to use it, please refer to the Auto Tuning documentation.

      "}, {"location": "prompt_tuning/overview/#manual-tuning", "title": "Manual Tuning", "text": "

      Manual tuning is an advanced use-case. Most users will want to use the Auto Tuning feature instead. Details about how to use manual configuration are available in the manual tuning documentation.

      "}, {"location": "query/drift_search/", "title": "DRIFT Search \ud83d\udd0e", "text": ""}, {"location": "query/drift_search/#combining-local-and-global-search", "title": "Combining Local and Global Search", "text": "

      GraphRAG is a technique that uses large language models (LLMs) to create knowledge graphs and summaries from unstructured text documents and leverages them to improve retrieval-augmented generation (RAG) operations on private datasets. It offers comprehensive global overviews of large, private troves of unstructured text documents while also enabling exploration of detailed, localized information. By using LLMs to create comprehensive knowledge graphs that connect and describe entities and relationships contained in those documents, GraphRAG leverages semantic structuring of the data to generate responses to a wide variety of complex user queries.

DRIFT search (Dynamic Reasoning and Inference with Flexible Traversal) builds upon Microsoft\u2019s GraphRAG technique, combining characteristics of both global and local search to generate detailed responses in a way that balances computational cost with the quality of the output.

      "}, {"location": "query/drift_search/#methodology", "title": "Methodology", "text": "

      Figure 1. An entire DRIFT search hierarchy highlighting the three core phases of the DRIFT search process. A (Primer): DRIFT compares the user\u2019s query with the top K most semantically relevant community reports, generating a broad initial answer and follow-up questions to steer further exploration. B (Follow-Up): DRIFT uses local search to refine queries, producing additional intermediate answers and follow-up questions that enhance specificity, guiding the engine towards context-rich information. A glyph on each node in the diagram shows the confidence the algorithm has to continue the query expansion step. C (Output Hierarchy): The final output is a hierarchical structure of questions and answers ranked by relevance, reflecting a balanced mix of global insights and local refinements, making the results adaptable and comprehensive.

      DRIFT Search introduces a new approach to local search queries by including community information in the search process. This greatly expands the breadth of the query\u2019s starting point and leads to retrieval and usage of a far higher variety of facts in the final answer. This addition expands the GraphRAG query engine by providing a more comprehensive option for local search, which uses community insights to refine a query into detailed follow-up questions.
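The sketch below illustrates the primer/follow-up flow described above in plain Python. It is a conceptual outline only, not the DRIFTSearch implementation: llm_answer is a stub standing in for real model calls, and the local-search refinement of each follow-up question is reduced to a single stubbed call.

def llm_answer(prompt: str) -> dict:
    # Stub for an LLM call returning an intermediate answer plus follow-up questions.
    return {"answer": f"(answer to: {prompt[:40]}...)", "follow_ups": [f"More detail on: {prompt[:40]}"]}

def drift_sketch(query: str, top_k_reports: list[str], depth: int = 2) -> dict:
    # Primer: answer the query against the most semantically relevant community reports.
    primer = llm_answer(query + "\n\nContext:\n" + "\n".join(top_k_reports))
    root = {"query": query, "answer": primer["answer"], "children": []}
    frontier = [(root, q) for q in primer["follow_ups"]]
    # Follow-up phase: each follow-up is refined (stub standing in for local search),
    # producing intermediate answers and further questions in an output hierarchy.
    for _ in range(depth):
        next_frontier = []
        for parent, question in frontier:
            node = llm_answer(question)
            child = {"query": question, "answer": node["answer"], "children": []}
            parent["children"].append(child)
            next_frontier.extend((child, q) for q in node["follow_ups"])
        frontier = next_frontier
    return root

hierarchy = drift_sketch("What themes run through the dataset?", ["community report A", "community report B"])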

      "}, {"location": "query/drift_search/#configuration", "title": "Configuration", "text": "

      Below are the key parameters of the DRIFTSearch class:

      • llm: OpenAI model object to be used for response generation
      • context_builder: context builder object to be used for preparing context data from community reports and query information
      • config: model to define the DRIFT Search hyperparameters. DRIFT Config model
      • token_encoder: token encoder for tracking the budget for the algorithm.
• query_state: a state object, as defined in Query State, that allows tracking the execution of a DRIFT Search instance, along with follow-ups and DRIFT actions.
      "}, {"location": "query/drift_search/#how-to-use", "title": "How to Use", "text": "

      An example of a drift search scenario can be found in the following notebook.

      "}, {"location": "query/drift_search/#learn-more", "title": "Learn More", "text": "

      For a more in-depth look at the DRIFT search method, please refer to our DRIFT Search blog post

      "}, {"location": "query/global_search/", "title": "Global Search \ud83d\udd0e", "text": ""}, {"location": "query/global_search/#whole-dataset-reasoning", "title": "Whole Dataset Reasoning", "text": "

      Baseline RAG struggles with queries that require aggregation of information across the dataset to compose an answer. Queries such as \u201cWhat are the top 5 themes in the data?\u201d perform terribly because baseline RAG relies on a vector search of semantically similar text content within the dataset. There is nothing in the query to direct it to the correct information.

      However, with GraphRAG we can answer such questions, because the structure of the LLM-generated knowledge graph tells us about the structure (and thus themes) of the dataset as a whole. This allows the private dataset to be organized into meaningful semantic clusters that are pre-summarized. Using our global search method, the LLM uses these clusters to summarize these themes when responding to a user query.

      "}, {"location": "query/global_search/#methodology", "title": "Methodology", "text": "
      ---\ntitle: Global Search Dataflow\n---\n%%{ init: { 'flowchart': { 'curve': 'step' } } }%%\nflowchart LR\n\n    uq[User Query] --- .1\n    ch1[Conversation History] --- .1\n\n    subgraph RIR\n        direction TB\n        ri1[Rated Intermediate<br/>Response 1]~~~ri2[Rated Intermediate<br/>Response 2] -.\"{1..N}\".-rin[Rated Intermediate<br/>Response N]\n    end\n\n    .1--Shuffled Community<br/>Report Batch 1-->RIR\n    .1--Shuffled Community<br/>Report Batch 2-->RIR---.2\n    .1--Shuffled Community<br/>Report Batch N-->RIR\n\n    .2--Ranking +<br/>Filtering-->agr[Aggregated Intermediate<br/>Responses]-->res[Response]\n\n\n\n     classDef green fill:#26B653,stroke:#333,stroke-width:2px,color:#fff;\n     classDef turquoise fill:#19CCD3,stroke:#333,stroke-width:2px,color:#fff;\n     classDef rose fill:#DD8694,stroke:#333,stroke-width:2px,color:#fff;\n     classDef orange fill:#F19914,stroke:#333,stroke-width:2px,color:#fff;\n     classDef purple fill:#B356CD,stroke:#333,stroke-width:2px,color:#fff;\n     classDef invisible fill:#fff,stroke:#fff,stroke-width:0px,color:#fff, width:0px;\n     class uq,ch1 turquoise;\n     class ri1,ri2,rin rose;\n     class agr orange;\n     class res purple;\n     class .1,.2 invisible;\n

Given a user query and, optionally, the conversation history, the global search method uses a collection of LLM-generated community reports from a specified level of the graph's community hierarchy as context data to generate a response in a map-reduce manner. At the map step, community reports are segmented into text chunks of pre-defined size. Each text chunk is then used to produce an intermediate response containing a list of points, each of which is accompanied by a numerical rating indicating the importance of the point. At the reduce step, a filtered set of the most important points from the intermediate responses is aggregated and used as the context to generate the final response.
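The general shape of the map and reduce steps can be sketched as follows. This is only an illustration of the flow, not the library code: rate_points stands in for an LLM call over one shuffled batch of community reports, and the ratings are made up.

def rate_points(report_batch: list[str]) -> list[dict]:
    # Map step stub: return points with importance ratings for one batch of reports.
    return [{"point": f"theme found in {r}", "rating": len(r) % 10} for r in report_batch]

report_batches = [["report 1", "report 2"], ["report 3"]]
intermediate = [p for batch in report_batches for p in rate_points(batch)]

# Reduce step: keep the highest-rated points that fit the context budget,
# then hand them to the LLM as context for the final response.
intermediate.sort(key=lambda p: p["rating"], reverse=True)
reduce_context = [p["point"] for p in intermediate[:5]]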

      The quality of the global search\u2019s response can be heavily influenced by the level of the community hierarchy chosen for sourcing community reports. Lower hierarchy levels, with their detailed reports, tend to yield more thorough responses, but may also increase the time and LLM resources needed to generate the final response due to the volume of reports.

      "}, {"location": "query/global_search/#configuration", "title": "Configuration", "text": "

      Below are the key parameters of the GlobalSearch class:

      • llm: OpenAI model object to be used for response generation
      • context_builder: context builder object to be used for preparing context data from community reports
      • map_system_prompt: prompt template used in the map stage. Default template can be found at map_system_prompt
      • reduce_system_prompt: prompt template used in the reduce stage, default template can be found at reduce_system_prompt
      • response_type: free-form text describing the desired response type and format (e.g., Multiple Paragraphs, Multi-Page Report)
• allow_general_knowledge: setting this to True will include additional instructions to the reduce_system_prompt to prompt the LLM to incorporate relevant real-world knowledge outside of the dataset. Note that this may increase hallucinations, but can be useful for certain scenarios. Default is False
• general_knowledge_inclusion_prompt: instruction to add to the reduce_system_prompt if allow_general_knowledge is enabled. Default instruction can be found at general_knowledge_instruction
      • max_data_tokens: token budget for the context data
      • map_llm_params: a dictionary of additional parameters (e.g., temperature, max_tokens) to be passed to the LLM call at the map stage
• reduce_llm_params: a dictionary of additional parameters (e.g., temperature, max_tokens) to be passed to the LLM call at the reduce stage
• context_builder_params: a dictionary of additional parameters to be passed to the context_builder object when building the context window for the map stage.
      • concurrent_coroutines: controls the degree of parallelism in the map stage.
      • callbacks: optional callback functions, can be used to provide custom event handlers for LLM's completion streaming events
      "}, {"location": "query/global_search/#how-to-use", "title": "How to Use", "text": "

      An example of a global search scenario can be found in the following notebook.

      "}, {"location": "query/local_search/", "title": "Local Search \ud83d\udd0e", "text": ""}, {"location": "query/local_search/#entity-based-reasoning", "title": "Entity-based Reasoning", "text": "

      The local search method combines structured data from the knowledge graph with unstructured data from the input documents to augment the LLM context with relevant entity information at query time. It is well-suited for answering questions that require an understanding of specific entities mentioned in the input documents (e.g., \u201cWhat are the healing properties of chamomile?\u201d).

      "}, {"location": "query/local_search/#methodology", "title": "Methodology", "text": "
      ---\ntitle: Local Search Dataflow\n---\n%%{ init: { 'flowchart': { 'curve': 'step' } } }%%\nflowchart LR\n\n    uq[User Query] ---.1\n    ch1[Conversation<br/>History]---.1\n\n    .1--Entity<br/>Description<br/>Embedding--> ee[Extracted Entities]\n\n    ee[Extracted Entities] ---.2--Entity-Text<br/>Unit Mapping--> ctu[Candidate<br/>Text Units]--Ranking + <br/>Filtering -->ptu[Prioritized<br/>Text Units]---.3\n    .2--Entity-Report<br/>Mapping--> ccr[Candidate<br/>Community Reports]--Ranking + <br/>Filtering -->pcr[Prioritized<br/>Community Reports]---.3\n    .2--Entity-Entity<br/>Relationships--> ce[Candidate<br/>Entities]--Ranking + <br/>Filtering -->pe[Prioritized<br/>Entities]---.3\n    .2--Entity-Entity<br/>Relationships--> cr[Candidate<br/>Relationships]--Ranking + <br/>Filtering -->pr[Prioritized<br/>Relationships]---.3\n    .2--Entity-Covariate<br/>Mappings--> cc[Candidate<br/>Covariates]--Ranking + <br/>Filtering -->pc[Prioritized<br/>Covariates]---.3\n    ch1 -->ch2[Conversation History]---.3\n    .3-->res[Response]\n\n     classDef green fill:#26B653,stroke:#333,stroke-width:2px,color:#fff;\n     classDef turquoise fill:#19CCD3,stroke:#333,stroke-width:2px,color:#fff;\n     classDef rose fill:#DD8694,stroke:#333,stroke-width:2px,color:#fff;\n     classDef orange fill:#F19914,stroke:#333,stroke-width:2px,color:#fff;\n     classDef purple fill:#B356CD,stroke:#333,stroke-width:2px,color:#fff;\n     classDef invisible fill:#fff,stroke:#fff,stroke-width:0px,color:#fff, width:0px;\n     class uq,ch1 turquoise\n     class ee green\n     class ctu,ccr,ce,cr,cc rose\n     class ptu,pcr,pe,pr,pc,ch2 orange\n     class res purple\n     class .1,.2,.3 invisible\n\n

Given a user query and, optionally, the conversation history, the local search method identifies a set of entities from the knowledge graph that are semantically related to the user input. These entities serve as access points into the knowledge graph, enabling the extraction of further relevant details such as connected entities, relationships, entity covariates, and community reports. It also extracts relevant text chunks from the raw input documents that are associated with the identified entities. These candidate data sources are then prioritized and filtered to fit within a single context window of pre-defined size, which is used to generate a response to the user query.
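The prioritization and filtering idea can be sketched as packing ranked candidate records into a fixed budget. This is only a sketch: the candidate records and scores are invented, and a crude whitespace word count stands in for the real token accounting (e.g. tiktoken).

candidates = [
    {"text": "Chamomile is an entity of type PLANT ...", "score": 0.92},
    {"text": "Community report: herbal remedies ...", "score": 0.81},
    {"text": "Raw text chunk mentioning chamomile tea ...", "score": 0.77},
]

budget = 12  # pretend context window size, in "tokens"
context, used = [], 0
for record in sorted(candidates, key=lambda c: c["score"], reverse=True):
    cost = len(record["text"].split())
    if used + cost > budget:
        continue  # skip records that no longer fit in the window
    context.append(record["text"])
    used += cost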

      "}, {"location": "query/local_search/#configuration", "title": "Configuration", "text": "

      Below are the key parameters of the LocalSearch class:

      • llm: OpenAI model object to be used for response generation
      • context_builder: context builder object to be used for preparing context data from collections of knowledge model objects
      • system_prompt: prompt template used to generate the search response. Default template can be found at system_prompt
      • response_type: free-form text describing the desired response type and format (e.g., Multiple Paragraphs, Multi-Page Report)
      • llm_params: a dictionary of additional parameters (e.g., temperature, max_tokens) to be passed to the LLM call
      • context_builder_params: a dictionary of additional parameters to be passed to the context_builder object when building context for the search prompt
      • callbacks: optional callback functions, can be used to provide custom event handlers for LLM's completion streaming events
      "}, {"location": "query/local_search/#how-to-use", "title": "How to Use", "text": "

      An example of a local search scenario can be found in the following notebook.

      "}, {"location": "query/overview/", "title": "Query Engine \ud83d\udd0e", "text": "

The Query Engine is the retrieval module of the Graph RAG Library. It is one of the two main components of the library, the other being the Indexing Pipeline (see Indexing Pipeline). It is responsible for the following tasks:

      • Local Search
      • Global Search
      • DRIFT Search
      • Question Generation
      "}, {"location": "query/overview/#local-search", "title": "Local Search", "text": "

The local search method generates answers by combining relevant data from the AI-extracted knowledge graph with text chunks of the raw documents. This method is suitable for questions that require an understanding of specific entities mentioned in the documents (e.g. What are the healing properties of chamomile?).

      For more details about how Local Search works please refer to the Local Search documentation.

      "}, {"location": "query/overview/#global-search", "title": "Global Search", "text": "

The global search method generates answers by searching over all AI-generated community reports in a map-reduce fashion. This is a resource-intensive method, but it often gives good responses for questions that require an understanding of the dataset as a whole (e.g. What are the most significant values of the herbs mentioned in this notebook?).

For more details, see the Global Search documentation.

      "}, {"location": "query/overview/#drift-search", "title": "DRIFT Search", "text": "

      DRIFT Search introduces a new approach to local search queries by including community information in the search process. This greatly expands the breadth of the query\u2019s starting point and leads to retrieval and usage of a far higher variety of facts in the final answer. This addition expands the GraphRAG query engine by providing a more comprehensive option for local search, which uses community insights to refine a query into detailed follow-up questions.

      To learn more about DRIFT Search, please refer to the DRIFT Search documentation.

      "}, {"location": "query/overview/#basic-search", "title": "Basic Search", "text": "

GraphRAG includes a rudimentary implementation of basic vector RAG to make it easy to compare different search results based on the type of question you are asking. You can specify the top k text unit chunks to include in the summarization context.
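As a rough sketch of what basic vector RAG does with that top-k setting (the embeddings below are random stand-ins; the real pipeline uses the configured embedding model and vector store):

import numpy as np

def top_k_chunks(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 5) -> list[int]:
    # Cosine similarity between the query and every text-unit chunk, highest first.
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    return np.argsort(-sims)[:k].tolist()

query_vec = np.random.rand(1536)
chunk_vecs = np.random.rand(1000, 1536)
context_chunk_ids = top_k_chunks(query_vec, chunk_vecs, k=5)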

      "}, {"location": "query/overview/#question-generation", "title": "Question Generation", "text": "

      This functionality takes a list of user queries and generates the next candidate questions. This is useful for generating follow-up questions in a conversation or for generating a list of questions for the investigator to dive deeper into the dataset.

      Information about how question generation works can be found at the Question Generation documentation page.

      "}, {"location": "query/question_generation/", "title": "Question Generation \u2754", "text": ""}, {"location": "query/question_generation/#entity-based-question-generation", "title": "Entity-based Question Generation", "text": "

      The question generation method combines structured data from the knowledge graph with unstructured data from the input documents to generate candidate questions related to specific entities.

      "}, {"location": "query/question_generation/#methodology", "title": "Methodology", "text": "

      Given a list of prior user questions, the question generation method uses the same context-building approach employed in local search to extract and prioritize relevant structured and unstructured data, including entities, relationships, covariates, community reports and raw text chunks. These data records are then fitted into a single LLM prompt to generate candidate follow-up questions that represent the most important or urgent information content or themes in the data.
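Schematically, this amounts to combining prior questions with the prioritized context records and asking the model for follow-ups. The snippet below is only a sketch: the real class builds its context with the local-search context builder and a system prompt template, and ask_llm is a stub for the actual model call.

def ask_llm(prompt: str) -> list[str]:
    # Stub standing in for the LLM call that proposes candidate follow-up questions.
    return ["What other herbs are discussed alongside chamomile?"]

prior_questions = ["What are the healing properties of chamomile?"]
context_records = ["Entity: CHAMOMILE (PLANT) ...", "Report: herbal remedies community ..."]

prompt = (
    "Previous questions:\n" + "\n".join(prior_questions)
    + "\n\nData context:\n" + "\n".join(context_records)
    + "\n\nGenerate the next candidate follow-up questions."
)
candidate_questions = ask_llm(prompt)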

      "}, {"location": "query/question_generation/#configuration", "title": "Configuration", "text": "

      Below are the key parameters of the Question Generation class:

      • llm: OpenAI model object to be used for response generation
      • context_builder: context builder object to be used for preparing context data from collections of knowledge model objects, using the same context builder class as in local search
      • system_prompt: prompt template used to generate candidate questions. Default template can be found at system_prompt
      • llm_params: a dictionary of additional parameters (e.g., temperature, max_tokens) to be passed to the LLM call
      • context_builder_params: a dictionary of additional parameters to be passed to the context_builder object when building context for the question generation prompt
      • callbacks: optional callback functions, can be used to provide custom event handlers for LLM's completion streaming events
      "}, {"location": "query/question_generation/#how-to-use", "title": "How to Use", "text": "

      An example of the question generation function can be found in the following notebook.

      "}, {"location": "query/notebooks/overview/", "title": "API Notebooks", "text": "
      • API Overview Notebook
      "}, {"location": "query/notebooks/overview/#query-engine-notebooks", "title": "Query Engine Notebooks", "text": "

      For examples about running Query please refer to the following notebooks:

      • Global Search Notebook
      • Local Search Notebook
      • DRIFT Search Notebook

      The test dataset for these notebooks can be found in dataset.zip.

      "}]} \ No newline at end of file diff --git a/sitemap.xml.gz b/sitemap.xml.gz index 70bf7b5c4849bf59c019b083bac90388af5ca011..59f414b9211ef6f1f141e757d62a9dae5c736856 100644 GIT binary patch delta 13 Ucmb=gXP58h;9yXnJCVHt02smplmGw# delta 13 Ucmb=gXP58h;9#(uI+48s02vhnqW}N^