Compare commits

...

42 Commits
v2.2.0 ... main

Author SHA1 Message Date
Alonso Guevara
fdb7e3835b
Release v2.7.0 (#2087)
2025-10-08 21:33:34 -07:00
Nathan Evans
ac8a7f5eef
Housekeeping (#2086)
* Add deprecation warnings for fnllm and multi-search

* Fix dangling token_encoder refs

* Fix local_search notebook

* Fix global search dynamic notebook

* Fix global search notebook

* Fix drift notebook

* Switch example notebooks to use LiteLLM config

* Properly annotate dev deps as a group

* Semver

* Remove --extra dev

* Remove llm_model variable

* Ignore ruff ASYNC240

* Add note about expected broken notebook in docs

* Fix custom vector store notebook

* Push tokenizer throughout
2025-10-07 16:21:24 -07:00
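Note on the first bullet above ("Add deprecation warnings for fnllm and multi-search"): it refers to the standard Python warnings mechanism. A minimal sketch is below; the helper name and warning text are illustrative only, not the project's actual wording.

    import warnings

    def warn_fnllm_deprecated() -> None:
        """Illustrative helper: flag a legacy code path with a DeprecationWarning."""
        warnings.warn(
            "fnllm-based model providers are deprecated; switch to the LiteLLM configuration.",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller, not this helper
        )

    if __name__ == "__main__":
        warnings.simplefilter("always", DeprecationWarning)  # surface the warning in this demo
        warn_fnllm_deprecated()
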
Nathan Evans
6c86b0a7bb
Init config cleanup (#2084)
* Spruce up init_config output, including LiteLLM default

* Remove deployment_name requirement for Azure

* Semver

* Add model_provider

* Add default model_provider

* Remove OBE test

* Update minimal config for tests

* Add model_provider to verb tests
2025-10-06 12:06:41 -07:00
Nathan Evans
2bd3922d8d
Litellm auth fix (#2083)
* Fix scope for Azure auth with LiteLLM

* Change internal language on max_attempts to max_retries

* Rework model config connectivity validation

* Semver

* Switch smoke tests to LiteLLM

* Take out temporary retry_strategy = none since it is not fnllm compatible

* Bump smoke test timeout

* Bump smoke timeout further

* Tune smoke params

* Update smoke test bounds

* Remove covariates from min-csv smoke

* Smoke: adjust communities, remove drift

* Remove secrets where they aren't necessary

* Clean out old env var references
2025-10-06 10:54:21 -07:00
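Note on the max_retries and retry_strategy bullets above (and the "Add jitter to exponential retry" bullet in the Tokenizer PR further down): they describe a generic retry pattern. The sketch below shows exponential backoff with full jitter under assumed defaults; it is not the project's retry implementation.

    import random
    import time

    def retry_with_backoff(call, max_retries: int = 5, base_delay: float = 1.0, max_delay: float = 30.0):
        """Retry a callable with exponential backoff plus full jitter (illustrative only)."""
        for attempt in range(max_retries + 1):
            try:
                return call()
            except Exception:
                if attempt == max_retries:
                    raise
                delay = min(max_delay, base_delay * (2 ** attempt))
                time.sleep(random.uniform(0, delay))  # jitter spreads out concurrent retries

    if __name__ == "__main__":
        state = {"calls": 0}

        def flaky() -> str:
            state["calls"] += 1
            if state["calls"] < 3:
                raise RuntimeError("transient failure")
            return "ok"

        print(retry_with_backoff(flaky))  # succeeds on the third attempt
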
Nathan Evans
7f996cf584
Docs/2.6.0 (#2070)
* Add basic search to overview

* Add info on input documents DataFrame

* Add info on factories to docs

* Add consumption warning and switch to "christmas" for folder name

* Add logger to factories list

* Add litellm docs. (#2058)

* Fix version for input docs

* Spelling

---------

Co-authored-by: Derek Worthen <worthend.derek@gmail.com>
2025-09-23 14:48:28 -07:00
Alonso Guevara
9bc899fe95
Release v2.6.0 (#2068)
2025-09-22 16:16:54 -06:00
Derek Worthen
2b70e4a4f3
Tokenizer (#2051)
* Add LiteLLM chat and embedding model providers.

* Fix code review findings.

* Add litellm.

* Fix formatting.

* Update dictionary.

* Update litellm.

* Fix embedding.

* Remove manual use of tiktoken and replace with
Tokenizer interface. Adds support for encoding
and decoding the models supported by litellm.

* Update litellm.

* Configure litellm to drop unsupported params.

* Cleanup semversioner release notes.

* Add num_tokens util to Tokenizer interface.

* Update litellm service factories.

* Cleanup litellm chat/embedding model argument assignment.

* Update chat and embedding type field for litellm use and future migration away from fnllm.

* Flatten litellm service organization.

* Update litellm.

* Update litellm factory validation.

* Flatten litellm rate limit service organization.

* Update rate limiter - disable with None/null instead of 0.

* Fix usage of get_tokenizer.

* Update litellm service registrations.

* Add jitter to exponential retry.

* Update validation.

* Update validation.

* Add litellm request logging layer.

* Update cache key.

* Update defaults.

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2025-09-22 13:55:14 -06:00
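Note on the bullets above: they replace direct tiktoken usage with a Tokenizer interface exposing encode/decode and a num_tokens utility. Below is a minimal sketch of that shape, assuming a tiktoken-backed implementation for illustration; the actual interface and class names in the codebase may differ.

    from typing import Protocol

    import tiktoken  # example backend only; the real interface also covers LiteLLM-supported models

    class Tokenizer(Protocol):
        def encode(self, text: str) -> list[int]: ...
        def decode(self, tokens: list[int]) -> str: ...
        def num_tokens(self, text: str) -> int: ...

    class TiktokenTokenizer:
        """Hypothetical implementation backing the interface with a tiktoken encoding."""

        def __init__(self, encoding_name: str = "cl100k_base") -> None:
            self._encoding = tiktoken.get_encoding(encoding_name)

        def encode(self, text: str) -> list[int]:
            return self._encoding.encode(text)

        def decode(self, tokens: list[int]) -> str:
            return self._encoding.decode(tokens)

        def num_tokens(self, text: str) -> int:
            return len(self.encode(text))

    if __name__ == "__main__":
        tok: Tokenizer = TiktokenTokenizer()
        ids = tok.encode("GraphRAG tokenizer sketch")
        print(tok.num_tokens("GraphRAG tokenizer sketch"), tok.decode(ids))
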
gaudyb
82cd3b7df2
Custom vector store schema implementation (#2062)
* progress on vector customization

* fix for lancedb vectors

* cosmosdb implementation

* uv run poe format

* clean test for vector store

* semversioner update

* test_factory.py integration test fixes

* fixes for cosmosdb test

* integration test fix for lancedb

* uv fix for format

* test fixes

* fixes for tests

* fix cosmosdb bug

* print statement

* test

* test

* fix cosmosdb bug

* test validation

* validation cosmosdb

* validate cosmosdb

* fix cosmosdb

* fix small feedback from PR

---------

Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local>
2025-09-19 10:11:34 -07:00
Nathan Evans
075cadd59a
Clarify managed auth setup in Azure documentation (#2064)
Updated instructions for using managed auth on Azure.
2025-09-18 14:58:09 -07:00
Nathan Evans
6d7a50b7f0
Remove community reports rate limiter (#2056)
* Remove hard-coded community reports rate limiter

* Semver

* Format

* Add memory cache factory
2025-09-18 13:40:24 -07:00
Nathan Evans
2bf7e7c018
Fix multi-index search (#2063) 2025-09-18 12:49:56 -07:00
Nathan Evans
6c66b7c30f
Configure async for NLP extraction (#2059)
* Make async mode configurable for NLP extraction

* Semver
2025-09-16 11:52:18 -07:00
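Note on "Make async mode configurable for NLP extraction": it refers to choosing how extraction work is dispatched. A rough sketch of a configurable dispatch mode is below; the mode names and the toy extraction function are assumptions for illustration.

    import asyncio
    from concurrent.futures import ThreadPoolExecutor

    def extract_noun_phrases(text: str) -> list[str]:
        # Stand-in for the real NLP extraction step
        return [word for word in text.split() if word.istitle()]

    async def extract_all(texts: list[str], async_mode: str = "asyncio") -> list[list[str]]:
        """Dispatch extraction according to a configurable async mode (sketch only)."""
        if async_mode == "threaded":
            with ThreadPoolExecutor() as pool:
                return list(pool.map(extract_noun_phrases, texts))
        # default "asyncio" mode: fan out on the event loop
        return list(await asyncio.gather(*(asyncio.to_thread(extract_noun_phrases, t) for t in texts)))

    if __name__ == "__main__":
        docs = ["Ebenezer Scrooge met Jacob Marley", "a quiet carol"]
        print(asyncio.run(extract_all(docs, async_mode="threaded")))
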
Chenghua Duan
a398cc38bb
Update command to use no-discover-entity-types (#2038)
"no-entity-types" is an incorrect configuration parameter.

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2025-09-09 16:46:06 -06:00
Derek Worthen
ac95c917d3
Update fnllm. (#2043)
2025-09-05 08:52:05 -07:00
Nathan Evans
1cb20b66f5
Input docs API parameter (#2034)
* Add optional input_documents to index API

* Semver

* Add input dataframe example notebook

* Format

* Fix docs and notebook
2025-09-02 16:15:50 -07:00
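Note on the commit above: it adds an optional input_documents parameter to the index API so callers can pass documents in memory instead of configuring file input. The sketch below only shows the caller's side under assumed column names and an assumed call shape; consult the accompanying notebook and docs for the real signature.

    import pandas as pd

    # In-memory documents DataFrame; the column names here are illustrative assumptions.
    docs = pd.DataFrame([
        {"title": "doc-1.txt", "text": "Marley was dead, to begin with."},
        {"title": "doc-2.txt", "text": "Scrooge was visited by three spirits."},
    ])

    # Hypothetical call shape (requires an installed graphrag package and a loaded config):
    # import graphrag.api as api
    # index_result = await api.build_index(config=config, input_documents=docs)

    print(docs.head())
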
Copilot
2030f94eb4
Refactor CacheFactory, StorageFactory, and VectorStoreFactory to use consistent registration patterns and add custom vector store documentation (#2006)
* Initial plan

* Refactor VectorStoreFactory to use registration functionality like StorageFactory

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Fix linting issues in VectorStoreFactory refactoring

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Remove backward compatibility support from VectorStoreFactory and StorageFactory

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Run ruff check --fix and ruff format, add semversioner file

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* ruff formatting fixes

* Fix pytest errors in storage factory tests by updating PipelineStorage interface implementation

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* ruff formatting fixes

* update storage factory design

* Refactor CacheFactory to use registration functionality like StorageFactory

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* revert copilot changes

* fix copilot changes

* update comments

* Fix failing pytest compatibility for factory tests

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* update class instantiation issue

* ruff fixes

* fix pytest

* add default value

* ruff formatting changes

* ruff fixes

* revert minor changes

* cleanup cache factory

* Update CacheFactory tests to match consistent factory pattern

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* update pytest thresholds

* adjust threshold levels

* Add custom vector store implementation notebook

Create comprehensive notebook demonstrating how to implement and register custom vector stores with GraphRAG as a plug-and-play framework. Includes:

- Complete implementation of SimpleInMemoryVectorStore
- Registration with VectorStoreFactory
- Testing and validation examples
- Configuration examples for GraphRAG settings
- Advanced features and best practices
- Production considerations checklist

The notebook provides a complete walkthrough for developers to understand and implement their own vector store backends.

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* remove sample notebook for now

* update tests

* fix cache pytests

* add pandas-stub to dev dependencies

* disable warning check for well known key

* skip tests when running on ubuntu

* add documentation for custom vector store implementations

* ignore ruff findings in notebooks

* fix merge breakages

* speedup CLI import statements

* remove unnecessary import statements in init file

* Add str type option on storage/cache type

* Fix store name

* Add LoggerFactory

* Fix up logging setup across CLI/API

* Add LoggerFactory test

* Fix err message

* Semver

* Remove enums from factory methods

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>
Co-authored-by: Josh Bradley <joshbradley@microsoft.com>
Co-authored-by: Nathan Evans <github@talkswithnumbers.com>
2025-08-28 13:53:07 -07:00
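Note on the PR above: it converts CacheFactory, StorageFactory, and VectorStoreFactory to a registration pattern keyed by plain strings rather than enums. Below is a minimal, self-contained sketch of that pattern; the class and method names are simplified stand-ins, not the project's actual API.

    from typing import Callable

    class VectorStoreFactory:
        """Simplified stand-in for a registration-based factory."""

        _registry: dict[str, Callable[..., object]] = {}

        @classmethod
        def register(cls, store_type: str, creator: Callable[..., object]) -> None:
            cls._registry[store_type] = creator

        @classmethod
        def create(cls, store_type: str, **kwargs) -> object:
            if store_type not in cls._registry:
                msg = f"Unknown vector store type: {store_type}"
                raise ValueError(msg)
            return cls._registry[store_type](**kwargs)

    class InMemoryVectorStore:
        def __init__(self, collection_name: str = "default") -> None:
            self.collection_name = collection_name
            self.rows: list[dict] = []

    # Plain string keys (no enum) make third-party stores pluggable.
    VectorStoreFactory.register("in_memory", InMemoryVectorStore)
    store = VectorStoreFactory.create("in_memory", collection_name="entities")
    print(type(store).__name__, store.collection_name)
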
Nathan Evans
69ad36e735
Fix id baseline (#2036)
* Fix all human_readable_id columns to start at 0

* Semver
2025-08-27 11:15:21 -07:00
Nathan Evans
30bdb35cc8
Selective embeddings loading (#2035)
* Invert embedding table loading logic

* Semver
2025-08-27 11:12:01 -07:00
Nathan Evans
77fb7d9d7d
Logging improvements (#2030)
* Turn down blob/cosmos exception reporting to match file storage

* Restore indexing-engine.log

* Restore some basic console logging and progress for index CLI

* Semver

* Ignore small ruff complaints

* Fix CLI console printing
2025-08-25 14:56:43 -07:00
Alonso Guevara
469ee8568f
Release v2.5.0 (#2028)
2025-08-14 08:06:52 -06:00
Copilot
7c28c70d5c
Switch from Poetry to uv for package management (#2008)
* Initial plan

* Switch from Poetry to uv for package management

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Clean up build artifacts and update gitignore

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* remove build artifacts

* remove hardcoded version string

* fix calls to pip in cicd

* Update gh-pages.yml workflow to use uv instead of Poetry

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* ruff formatting fixes

* update cicd workflow with latest uv action

* fix command to retrieve package version

* update development instructions

* remove Poetry references

* Replace deprecated azuright action with npm-based Azurite installation

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* skip api version check for azurite

* add semversioner file

* update more changes from switching to UV

* Migrate unified-search-app from Poetry to uv package management

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* minor typo update

* minor Dockerfile update

* update cicd thresholds

* update pytest thresholds

* ruff fixes

* ruff fixes

* remove legacy npm settings that no longer apply

* Update Unified Search App Readme

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>
Co-authored-by: Josh Bradley <joshbradley@microsoft.com>
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2025-08-13 18:57:25 -06:00
Alonso Guevara
5713205210
Feat/additional context (#2021)
* Users/snehitgajjar/add optional api param for pipeline state (#2019)

* Add support for additional context for PipelineState

* Clean up

* Clean up

* Clean up

* Nit

---------

Co-authored-by: Snehit Gajjar <snehitgajjar@microsoft.com>

* Semver

* Update graphrag/api/index.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Remove additional_context from serialization

---------

Co-authored-by: Snehit Gajjar <snehit.gajjar@gmail.com>
Co-authored-by: Snehit Gajjar <snehitgajjar@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-08-08 16:59:24 -06:00
Alonso Guevara
1da1380615
Release v2.4.0 (#1994)
* Release v2.4.0

* Update changelog
2025-07-14 18:54:27 -06:00
Alonso Guevara
dce02563eb
Fix/fnllm embedding limiter defaults (#1993)
* default embeddings tpm/rpm to null

* Semver
2025-07-14 18:00:45 -06:00
Copilot
13bf315a35
Refactor StorageFactory class to use registration functionality (#1944)
* Initial plan for issue

* Refactored StorageFactory to use a registration-based approach

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Added semversioner change record

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Fix Python CI test failures and improve code quality

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* ruff formatting fixes

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>
Co-authored-by: Josh Bradley <joshbradley@microsoft.com>
2025-07-10 12:08:44 -06:00
Copilot
e84df28e64
Improve internal logging functionality by using Python's standard logging module (#1956)
* Initial plan for issue

* Implement standard logging module and integrate with existing loggers

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Add test cases and improve documentation for standard logging

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Apply ruff formatting and add semversioner file for logging improvements

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Remove custom logger classes and refactor to use standard logging only

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Apply ruff formatting to resolve CI/CD test failures

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Add semversioner file and fix linting issues

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* ruff fixes

* fix spelling error

* Remove StandardProgressLogger and refactor to use standard logging

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Remove LoggerFactory and custom loggers, refactor to use standard logging

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Fix pyright error: use logger.info() instead of calling logger as function in cosmosdb_pipeline_storage.py

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* ruff fixes

* Remove deprecated logger files that were marked as deprecated placeholders

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Replace custom get_logger with standard Python logging

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Fix linting issues found by ruff check --fix

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* apply ruff check fixes

* add word to dictionary

* Fix type checker error in ModelManager.__new__ method

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Refactor multiple logging.getLogger() calls to use single logger per file

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Remove progress_logger parameter from build_index() and logger parameter from generate_indexing_prompts()

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Remove logger parameter from run_pipeline and standardize logger naming

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Replace logger parameter with log_level parameter in CLI commands

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Fix import ordering in notebook files to pass poetry poe check

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Remove --logger parameter from smoke test command

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Fix Windows CI/CD issue with log file cleanup in tests

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Add StreamHandler to root logger in __main__.py for CLI logging

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Only add StreamHandler if root logger doesn't have existing StreamHandler

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Fix import ordering in notebook files to pass ruff checks

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Replace logging.StreamHandler with colorlog.StreamHandler for colorized log output

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Regenerate poetry.lock file after adding colorlog dependency

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Fix import ordering in notebook files to pass ruff checks

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* move printing of dataframes to debug level

* remove colorlog for now

* Refactor workflow callbacks to inherit from logging.Handler

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Fix linting issues in workflow callback handlers

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Fix pyright type errors in blob and file workflow callbacks

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Refactor pipeline logging to use pure logging.Handler subclasses

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Rename workflow callback classes to workflow logger classes and move to logger directory

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* update dictionary

* apply ruff fixes

* fix function name

* simplify logger code

* update

* Remove error, warning, and log methods from WorkflowCallbacks and replace with standard logging

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* ruff fixes

* Fix pyright errors by removing WorkflowCallbacks from strategy type signatures

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Remove ConsoleWorkflowLogger and apply consistent formatter to all handlers

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* apply ruff fixes

* Refactor pipeline_logger.py to use standard FileHandler and remove FileWorkflowLogger

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Remove conditional azure import checks from blob_workflow_logger.py

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Fix pyright type checking errors in mock_provider.py and utils.py

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Run ruff check --fix to fix import ordering in notebooks

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Merge configure_logging and create_pipeline_logger into init_loggers function

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Remove configure_logging and create_pipeline_logger functions, replace all usage with init_loggers

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* apply ruff fixes

* cleanup unused code

* Update init_loggers to accept GraphRagConfig instead of ReportingConfig

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* apply ruff check fixes

* Fix test failures by providing valid GraphRagConfig with required model configurations

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* apply ruff fixes

* remove logging_workflow_callback

* cleanup logging messages

* Add logging to track progress of pandas DataFrame apply operation in create_base_text_units

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* cleanup logger logic throughout codebase

* update

* more cleanup of old loggers

* small logger cleanup

* final code cleanup and added loggers to query

* add verbose logging to query

* minor code cleanup

* Fix broken unit tests for chunk_text and standard_logging

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* apply ruff fixes

* Fix test_chunk_text by mocking progress_ticker function instead of ProgressTicker class

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* remove unnecessary logger

* remove rich and fix type annotation

* revert test formatting changes made by copilot

* promote graphrag logs to root logger

* add correct semversioner file

* revert change to file

* revert formatting changes that have no effect

* fix changes after merge with main

* revert unnecessary copilot changes

* remove whitespace

* cleanup docstring

* simplify some logic with less code

* update poetry lock file

* ruff fixes

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>
Co-authored-by: Josh Bradley <joshbradley@microsoft.com>
2025-07-09 18:29:03 -06:00
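Note on the PR above: it replaces custom logger classes with Python's standard logging module, merges setup into an init_loggers function, promotes graphrag logs to the root logger, and applies one formatter across handlers. A rough sketch of that wiring is below; the signature of init_loggers and the format string are assumptions.

    import logging

    def init_loggers(log_file: str = "indexing-engine.log", level: int = logging.INFO) -> None:
        """Sketch of standard-library logging setup; not the project's exact function."""
        formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s - %(message)s")
        root = logging.getLogger()  # graphrag logs are promoted to the root logger
        root.setLevel(level)
        for handler in (logging.FileHandler(log_file), logging.StreamHandler()):
            handler.setFormatter(formatter)  # consistent formatter across all handlers
            root.addHandler(handler)

    logger = logging.getLogger(__name__)  # one module-level logger per file

    if __name__ == "__main__":
        init_loggers()
        logger.info("pipeline started")
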
Nathan Evans
27c6de846f
Update docs for 2.0+ (#1984)
* Update docs

* Fix prompt links
2025-06-23 13:49:47 -07:00
Nathan Evans
1df89727c3
Pipeline registration (#1940)
* Move covariate run conditional

* All pipeline registration

* Fix method name construction

* Rename context storage -> output_storage

* Rename OutputConfig as generic StorageConfig

* Reuse Storage model under InputConfig

* Move input storage creation out of document loading

* Move document loading into workflows

* Semver

* Fix smoke test config for new workflows

* Fix unit tests

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
2025-06-12 16:14:39 -07:00
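Note on the pipeline registration bullets above: they describe workflows being looked up by name, with document loading moving inside a workflow. The sketch below is a generic name-to-function registry under assumed names; it does not mirror GraphRAG's actual workflow API.

    from typing import Callable

    WorkflowFn = Callable[[dict], dict]
    _workflows: dict[str, WorkflowFn] = {}

    def register_workflow(name: str, fn: WorkflowFn) -> None:
        _workflows[name] = fn

    def build_pipeline(names: list[str]) -> list[WorkflowFn]:
        return [_workflows[name] for name in names]

    def load_documents(state: dict) -> dict:
        # Document loading handled inside a workflow rather than before the pipeline runs
        state["documents"] = ["doc-1", "doc-2"]
        return state

    register_workflow("load_documents", load_documents)

    state: dict = {}
    for step in build_pipeline(["load_documents"]):
        state = step(state)
    print(state)
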
Nathan Evans
17e431cf42
Update typer (#1958)
2025-06-02 14:20:21 -07:00
Alonso Guevara
4a42ac81af
Release v2.3.0 (#1951)
2025-05-23 15:19:29 -06:00
Alonso Guevara
f1e2041f07
Fix/drift search reduce (#1948)
* Fix Reduce Response for non streaming calls

* Semver
2025-05-23 08:07:09 -06:00
Alonso Guevara
7fba9522d4
Task/raw model answer (#1947)
* Add full_response to llm provider output

* Semver

* Small leftover cleanup

* Add pyi to suppress Pyright errors. full_content is optional

* Format

* Add missing stubs
2025-05-22 08:22:44 -06:00
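Note on "Add full_response to llm provider output": the provider result carries the raw model payload alongside the parsed text, and a follow-up bullet notes the raw field is optional. The dataclass below is only an illustration of that shape, not the project's actual response type.

    from dataclasses import dataclass
    from typing import Any

    @dataclass
    class ModelOutput:
        """Illustrative shape: parsed text plus the optional raw provider payload."""

        content: str
        full_response: dict[str, Any] | None = None

    raw = {"choices": [{"message": {"content": "42"}}], "usage": {"total_tokens": 7}}
    out = ModelOutput(content="42", full_response=raw)
    print(out.content, out.full_response["usage"]["total_tokens"] if out.full_response else None)
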
Alonso Guevara
fb4fe72a73
Fix/global reduce prompt (#1942)
* Add missing string formatter

* Semver
2025-05-20 17:00:32 -06:00
Copilot
f5a472ab14
Upgrade pyarrow dependency to >=17.0.0 to fix CVE-2024-52338 (#1939) 2025-05-20 18:34:28 -04:00
Alonso Guevara
24018c6155
Task/remove dynamic retries (#1941)
* Remove max retries. Update Typer args

* Format

* Semver

* Fix typo

* Ruff and Typos

* Format
2025-05-20 11:48:27 -06:00
Nathan Evans
36948b8d2e
Various minor updates (#1932)
* Add text unit ids to Community model

* Add graph utilities

* Turn off LCC for clustering by default

* Simplify embeddings config/flow

* Semver
2025-05-16 14:48:53 -07:00
Alonso Guevara
ee1b2db4a0
Update to latest fnllm (#1930)
* Update to latest fnllm

* Semver + smoke tests

* Add --method to smoke tests indexing

* format...

* Adjust embeddings limiter
2025-05-15 14:57:01 -06:00
Alonso Guevara
56a865bff0
Release v2.2.1 (#1910)
2025-04-30 18:15:01 -06:00
Alonso Guevara
8fb95a6209
Fix/community report tuning (#1909)
* Fix community report prompt tuning

* Semver

* Format ...
2025-04-30 17:44:31 -06:00
Andres Morales
8c81cc1563
Update Index as workflows (#1908)
* Incremental index as workflow

* Update function docs

* fix state management

* Remove update workflows when specifying workflows in the config

* Fix ruff errors

* Add semver

* Remove callbacks param
2025-04-30 16:25:36 -06:00
Nathan Evans
832abf1e0c
Fix graph creation (#1905)
* Add edge weight to all graph creation

* Semver
2025-04-29 18:18:49 -07:00
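Note on "Add edge weight to all graph creation": every relationship edge gets an explicit weight when the graph is built. A toy networkx sketch of weighted edge construction is below; the aggregation rule shown (counting duplicate pairs) is an illustrative assumption.

    import networkx as nx

    # Toy relationship list; repeated pairs are folded into a single weighted edge.
    relationships = [("Scrooge", "Marley"), ("Scrooge", "Marley"), ("Scrooge", "Cratchit")]

    graph = nx.Graph()
    for source, target in relationships:
        if graph.has_edge(source, target):
            graph[source][target]["weight"] += 1.0
        else:
            graph.add_edge(source, target, weight=1.0)  # every edge carries an explicit weight

    print(sorted(graph.edges(data="weight")))
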
Nathan Evans
25bbae8642
Docs: Add models page (#1842)
* Add models page

* Update config docs for new params

* Spelling

* Add comment on CoT with o-series

* Add notes about managed identity

* Update the viz guide

* Spruce up the getting started wording

* Capitalization

* Add BYOG page

* More BYOG edits

* Update dictionary

* Change example model name
2025-04-28 17:36:08 -07:00
317 changed files with 17325 additions and 14723 deletions

View File

@@ -6,8 +6,7 @@ permissions:
 contents: write
 env:
-POETRY_VERSION: '1.8.3'
-PYTHON_VERSION: '3.11'
+PYTHON_VERSION: "3.11"
 jobs:
 build:
@@ -16,8 +15,6 @@ jobs:
 GH_PAGES: 1
 DEBUG: 1
 GRAPHRAG_API_KEY: ${{ secrets.GRAPHRAG_API_KEY }}
-GRAPHRAG_LLM_MODEL: ${{ secrets.GRAPHRAG_LLM_MODEL }}
-GRAPHRAG_EMBEDDING_MODEL: ${{ secrets.GRAPHRAG_EMBEDDING_MODEL }}
 steps:
 - uses: actions/checkout@v4
@@ -29,18 +26,16 @@ jobs:
 with:
 python-version: ${{ env.PYTHON_VERSION }}
-- name: Install Poetry ${{ env.POETRY_VERSION }}
-uses: abatilo/actions-poetry@v3.0.0
-with:
-poetry-version: ${{ env.POETRY_VERSION }}
+- name: Install uv
+uses: astral-sh/setup-uv@v6
-- name: poetry intsall
+- name: Install dependencies
 shell: bash
-run: poetry install
+run: uv sync
 - name: mkdocs build
 shell: bash
-run: poetry run poe build_docs
+run: uv run poe build_docs
 - name: List Docsite Contents
 run: find site

View File

@@ -26,9 +26,6 @@ concurrency:
 # Only run the for the latest commit
 cancel-in-progress: true
-env:
-POETRY_VERSION: 1.8.3
 jobs:
 python-ci:
 # skip draft PRs
@@ -51,7 +48,7 @@ jobs:
 filters: |
 python:
 - 'graphrag/**/*'
-- 'poetry.lock'
+- 'uv.lock'
 - 'pyproject.toml'
 - '**/*.py'
 - '**/*.toml'
@@ -64,30 +61,27 @@ jobs:
 with:
 python-version: ${{ matrix.python-version }}
-- name: Install Poetry
-uses: abatilo/actions-poetry@v3.0.0
-with:
-poetry-version: $POETRY_VERSION
+- name: Install uv
+uses: astral-sh/setup-uv@v6
 - name: Install dependencies
 shell: bash
 run: |
-poetry self add setuptools wheel
-poetry run python -m pip install gensim
-poetry install
+uv sync
+uv pip install gensim
 - name: Check
 run: |
-poetry run poe check
+uv run poe check
 - name: Build
 run: |
-poetry build
+uv build
 - name: Unit Test
 run: |
-poetry run poe test_unit
+uv run poe test_unit
 - name: Verb Test
 run: |
-poetry run poe test_verbs
+uv run poe test_verbs

View File

@ -26,9 +26,6 @@ concurrency:
# only run the for the latest commit
cancel-in-progress: true
env:
POETRY_VERSION: 1.8.3
jobs:
python-ci:
# skip draft PRs
@ -51,7 +48,7 @@ jobs:
filters: |
python:
- 'graphrag/**/*'
- 'poetry.lock'
- 'uv.lock'
- 'pyproject.toml'
- '**/*.py'
- '**/*.toml'
@ -64,25 +61,24 @@ jobs:
with:
python-version: ${{ matrix.python-version }}
- name: Install Poetry
uses: abatilo/actions-poetry@v3.0.0
with:
poetry-version: $POETRY_VERSION
- name: Install uv
uses: astral-sh/setup-uv@v6
- name: Install dependencies
shell: bash
run: |
poetry self add setuptools wheel
poetry run python -m pip install gensim
poetry install
uv sync
uv pip install gensim
- name: Build
run: |
poetry build
uv build
- name: Install Azurite
id: azuright
uses: potatoqualitee/azuright@v1.1
- name: Install and start Azurite
shell: bash
run: |
npm install -g azurite
azurite --silent --skipApiVersionCheck --location /tmp/azurite --debug /tmp/azurite-debug.log &
# For more information on installation/setup of Azure Cosmos DB Emulator
# https://learn.microsoft.com/en-us/azure/cosmos-db/how-to-develop-emulator?tabs=docker-linux%2Cpython&pivots=api-nosql
@ -97,4 +93,4 @@ jobs:
- name: Integration Test
run: |
poetry run poe test_integration
uv run poe test_integration

View File

@ -26,9 +26,6 @@ concurrency:
# Only run the for the latest commit
cancel-in-progress: true
env:
POETRY_VERSION: 1.8.3
jobs:
python-ci:
# skip draft PRs
@ -41,8 +38,6 @@ jobs:
env:
DEBUG: 1
GRAPHRAG_API_KEY: ${{ secrets.OPENAI_NOTEBOOK_KEY }}
GRAPHRAG_LLM_MODEL: ${{ secrets.GRAPHRAG_LLM_MODEL }}
GRAPHRAG_EMBEDDING_MODEL: ${{ secrets.GRAPHRAG_EMBEDDING_MODEL }}
runs-on: ${{ matrix.os }}
steps:
@ -54,7 +49,7 @@ jobs:
filters: |
python:
- 'graphrag/**/*'
- 'poetry.lock'
- 'uv.lock'
- 'pyproject.toml'
- '**/*.py'
- '**/*.toml'
@ -66,18 +61,15 @@ jobs:
with:
python-version: ${{ matrix.python-version }}
- name: Install Poetry
uses: abatilo/actions-poetry@v3.0.0
with:
poetry-version: $POETRY_VERSION
- name: Install uv
uses: astral-sh/setup-uv@v6
- name: Install dependencies
shell: bash
run: |
poetry self add setuptools wheel
poetry run python -m pip install gensim
poetry install
uv sync
uv pip install gensim
- name: Notebook Test
run: |
poetry run poe test_notebook
uv run poe test_notebook

View File

@ -6,7 +6,6 @@ on:
branches: [main]
env:
POETRY_VERSION: "1.8.3"
PYTHON_VERSION: "3.10"
jobs:
@ -31,21 +30,19 @@ jobs:
with:
python-version: ${{ env.PYTHON_VERSION }}
- name: Install Poetry
uses: abatilo/actions-poetry@v3.0.0
with:
poetry-version: ${{ env.POETRY_VERSION }}
- name: Install uv
uses: astral-sh/setup-uv@v6
- name: Install dependencies
shell: bash
run: poetry install
run: uv sync
- name: Export Publication Version
run: echo "version=`poetry version --short`" >> $GITHUB_OUTPUT
run: echo "version=$(uv version --short)" >> $GITHUB_OUTPUT
- name: Build Distributable
shell: bash
run: poetry build
run: uv build
- name: Publish package distributions to PyPI
uses: pypa/gh-action-pypi-publish@release/v1

View File

@ -26,9 +26,6 @@ concurrency:
# Only run the for the latest commit
cancel-in-progress: true
env:
POETRY_VERSION: 1.8.3
jobs:
python-ci:
# skip draft PRs
@ -40,20 +37,8 @@ jobs:
fail-fast: false # Continue running all jobs even if one fails
env:
DEBUG: 1
GRAPHRAG_LLM_TYPE: "azure_openai_chat"
GRAPHRAG_EMBEDDING_TYPE: "azure_openai_embedding"
GRAPHRAG_API_KEY: ${{ secrets.OPENAI_API_KEY }}
GRAPHRAG_API_BASE: ${{ secrets.GRAPHRAG_API_BASE }}
GRAPHRAG_API_VERSION: ${{ secrets.GRAPHRAG_API_VERSION }}
GRAPHRAG_LLM_DEPLOYMENT_NAME: ${{ secrets.GRAPHRAG_LLM_DEPLOYMENT_NAME }}
GRAPHRAG_EMBEDDING_DEPLOYMENT_NAME: ${{ secrets.GRAPHRAG_EMBEDDING_DEPLOYMENT_NAME }}
GRAPHRAG_LLM_MODEL: ${{ secrets.GRAPHRAG_LLM_MODEL }}
GRAPHRAG_EMBEDDING_MODEL: ${{ secrets.GRAPHRAG_EMBEDDING_MODEL }}
# We have Windows + Linux runners in 3.10, so we need to divide the rate limits by 2
GRAPHRAG_LLM_TPM: 200_000 # 400_000 / 2
GRAPHRAG_LLM_RPM: 1_000 # 2_000 / 2
GRAPHRAG_EMBEDDING_TPM: 225_000 # 450_000 / 2
GRAPHRAG_EMBEDDING_RPM: 1_000 # 2_000 / 2
# Azure AI Search config
AZURE_AI_SEARCH_URL_ENDPOINT: ${{ secrets.AZURE_AI_SEARCH_URL_ENDPOINT }}
AZURE_AI_SEARCH_API_KEY: ${{ secrets.AZURE_AI_SEARCH_API_KEY }}
@ -68,7 +53,7 @@ jobs:
filters: |
python:
- 'graphrag/**/*'
- 'poetry.lock'
- 'uv.lock'
- 'pyproject.toml'
- '**/*.py'
- '**/*.toml'
@ -81,33 +66,32 @@ jobs:
with:
python-version: ${{ matrix.python-version }}
- name: Install Poetry
uses: abatilo/actions-poetry@v3.0.0
with:
poetry-version: $POETRY_VERSION
- name: Install uv
uses: astral-sh/setup-uv@v6
- name: Install dependencies
shell: bash
run: |
poetry self add setuptools wheel
poetry run python -m pip install gensim
poetry install
uv sync
uv pip install gensim
- name: Build
run: |
poetry build
uv build
- name: Install Azurite
id: azuright
uses: potatoqualitee/azuright@v1.1
- name: Install and start Azurite
shell: bash
run: |
npm install -g azurite
azurite --silent --skipApiVersionCheck --location /tmp/azurite --debug /tmp/azurite-debug.log &
- name: Smoke Test
if: steps.changes.outputs.python == 'true'
run: |
poetry run poe test_smoke
uv run poe test_smoke
- uses: actions/upload-artifact@v4
if: always()
with:
name: smoke-test-artifacts-${{ matrix.python-version }}-${{ matrix.poetry-version }}-${{ runner.os }}
name: smoke-test-artifacts-${{ matrix.python-version }}-${{ runner.os }}
path: tests/fixtures/*

2
.gitignore vendored
View File

@ -1,6 +1,8 @@
# Python Artifacts
python/*/lib/
dist/
build/
*.egg-info/
# Test Output
.coverage

18
.semversioner/2.2.1.json Normal file
View File

@ -0,0 +1,18 @@
{
"changes": [
{
"description": "Fix Community Report prompt tuning response",
"type": "patch"
},
{
"description": "Fix graph creation missing edge weights.",
"type": "patch"
},
{
"description": "Update as workflows",
"type": "patch"
}
],
"created_at": "2025-04-30T23:50:31+00:00",
"version": "2.2.1"
}

34
.semversioner/2.3.0.json Normal file
View File

@ -0,0 +1,34 @@
{
"changes": [
{
"description": "Remove Dynamic Max Retries support. Refactor typer typing in cli interface",
"type": "minor"
},
{
"description": "Update fnllm to latest. Update default graphrag configuration",
"type": "minor"
},
{
"description": "A few fixes and enhancements for better reuse and flow.",
"type": "patch"
},
{
"description": "Add full llm response to LLM PRovider output",
"type": "patch"
},
{
"description": "Fix Drift Reduce Response for non streaming calls",
"type": "patch"
},
{
"description": "Fix global search prompt to include missing formatting key",
"type": "patch"
},
{
"description": "Upgrade pyarrow dependency to >=17.0.0 to fix CVE-2024-52338",
"type": "patch"
}
],
"created_at": "2025-05-23T21:02:47+00:00",
"version": "2.3.0"
}

26
.semversioner/2.4.0.json Normal file
View File

@ -0,0 +1,26 @@
{
"changes": [
{
"description": "Allow injection of custom pipelines.",
"type": "minor"
},
{
"description": "Refactored StorageFactory to use a registration-based approach",
"type": "minor"
},
{
"description": "Fix default values for tpm and rpm limiters on embeddings",
"type": "patch"
},
{
"description": "Update typer.",
"type": "patch"
},
{
"description": "cleaned up logging to follow python standards.",
"type": "patch"
}
],
"created_at": "2025-07-15T00:04:15+00:00",
"version": "2.4.0"
}

14
.semversioner/2.5.0.json Normal file
View File

@ -0,0 +1,14 @@
{
"changes": [
{
"description": "Add additional context variable to build index signature for custom parameter bag",
"type": "minor"
},
{
"description": "swap package management from Poetry -> UV",
"type": "minor"
}
],
"created_at": "2025-08-14T00:59:46+00:00",
"version": "2.5.0"
}

54
.semversioner/2.6.0.json Normal file
View File

@ -0,0 +1,54 @@
{
"changes": [
{
"description": "Add LiteLLM chat and embedding model providers.",
"type": "minor"
},
{
"description": "Add LoggerFactory and clean up related API.",
"type": "minor"
},
{
"description": "Add config for NLP async mode.",
"type": "minor"
},
{
"description": "Add optional input documents to indexing API.",
"type": "minor"
},
{
"description": "add customization to vector store",
"type": "minor"
},
{
"description": "Add gpt-5 support by updating fnllm dependency.",
"type": "patch"
},
{
"description": "Fix all human_readable_id fields to be 0-based.",
"type": "patch"
},
{
"description": "Fix multi-index search.",
"type": "patch"
},
{
"description": "Improve upon recent logging refactor",
"type": "patch"
},
{
"description": "Make cache, storage, and vector_store factories consistent with similar registration support",
"type": "patch"
},
{
"description": "Remove hard-coded community rate limiter.",
"type": "patch"
},
{
"description": "generate_text_embeddings only loads tables if embedding field is specified.",
"type": "patch"
}
],
"created_at": "2025-09-22T21:44:51+00:00",
"version": "2.6.0"
}

18
.semversioner/2.7.0.json Normal file
View File

@ -0,0 +1,18 @@
{
"changes": [
{
"description": "Set LiteLLM as default in init_content.",
"type": "minor"
},
{
"description": "Fix Azure auth scope issue with LiteLLM.",
"type": "patch"
},
{
"description": "Housekeeping toward 2.7.",
"type": "patch"
}
],
"created_at": "2025-10-08T22:39:42+00:00",
"version": "2.7.0"
}

57
.vscode/launch.json vendored
View File

@ -6,21 +6,24 @@
"name": "Indexer",
"type": "debugpy",
"request": "launch",
"module": "poetry",
"module": "graphrag",
"args": [
"poe", "index",
"--root", "<path_to_ragtest_root_demo>"
"index",
"--root",
"<path_to_index_folder>"
],
"console": "integratedTerminal"
},
{
"name": "Query",
"type": "debugpy",
"request": "launch",
"module": "poetry",
"module": "graphrag",
"args": [
"poe", "query",
"--root", "<path_to_ragtest_root_demo>",
"--method", "global",
"query",
"--root",
"<path_to_index_folder>",
"--method", "basic",
"--query", "What are the top themes in this story",
]
},
@ -28,12 +31,48 @@
"name": "Prompt Tuning",
"type": "debugpy",
"request": "launch",
"module": "poetry",
"module": "uv",
"args": [
"poe", "prompt-tune",
"--config",
"<path_to_ragtest_root_demo>/settings.yaml",
]
}
},
{
"name": "Debug Integration Pytest",
"type": "debugpy",
"request": "launch",
"module": "pytest",
"args": [
"./tests/integration/vector_stores",
"-k", "test_azure_ai_search"
],
"console": "integratedTerminal",
"justMyCode": false
},
{
"name": "Debug Verbs Pytest",
"type": "debugpy",
"request": "launch",
"module": "pytest",
"args": [
"./tests/verbs",
"-k", "test_generate_text_embeddings"
],
"console": "integratedTerminal",
"justMyCode": false
},
{
"name": "Debug Smoke Pytest",
"type": "debugpy",
"request": "launch",
"module": "pytest",
"args": [
"./tests/smoke",
"-k", "test_fixtures"
],
"console": "integratedTerminal",
"justMyCode": false
},
]
}

37
.vscode/settings.json vendored
View File

@ -1,43 +1,8 @@
{
"search.exclude": {
"**/.yarn": true,
"**/.pnp.*": true
},
"editor.formatOnSave": false,
"eslint.nodePath": ".yarn/sdks",
"typescript.tsdk": ".yarn/sdks/typescript/lib",
"typescript.enablePromptUseWorkspaceTsdk": true,
"javascript.preferences.importModuleSpecifier": "relative",
"javascript.preferences.importModuleSpecifierEnding": "js",
"typescript.preferences.importModuleSpecifier": "relative",
"typescript.preferences.importModuleSpecifierEnding": "js",
"explorer.fileNesting.enabled": true,
"explorer.fileNesting.patterns": {
"*.ts": "${capture}.ts, ${capture}.hooks.ts, ${capture}.hooks.tsx, ${capture}.contexts.ts, ${capture}.stories.tsx, ${capture}.story.tsx, ${capture}.spec.tsx, ${capture}.base.ts, ${capture}.base.tsx, ${capture}.types.ts, ${capture}.styles.ts, ${capture}.styles.tsx, ${capture}.utils.ts, ${capture}.utils.tsx, ${capture}.constants.ts, ${capture}.module.scss, ${capture}.module.css, ${capture}.md",
"*.js": "${capture}.js.map, ${capture}.min.js, ${capture}.d.ts",
"*.jsx": "${capture}.js",
"*.tsx": "${capture}.ts, ${capture}.hooks.ts, ${capture}.hooks.tsx, ${capture}.contexts.ts, ${capture}.stories.tsx, ${capture}.story.tsx, ${capture}.spec.tsx, ${capture}.base.ts, ${capture}.base.tsx, ${capture}.types.ts, ${capture}.styles.ts, ${capture}.styles.tsx, ${capture}.utils.ts, ${capture}.utils.tsx, ${capture}.constants.ts, ${capture}.module.scss, ${capture}.module.css, ${capture}.md, ${capture}.css",
"tsconfig.json": "tsconfig.*.json",
"package.json": "package-lock.json, turbo.json, tsconfig.json, rome.json, biome.json, .npmignore, dictionary.txt, cspell.config.yaml",
"README.md": "*.md, LICENSE, CODEOWNERS",
".eslintrc": ".eslintignore",
".prettierrc": ".prettierignore",
".gitattributes": ".gitignore",
".yarnrc.yml": "yarn.lock, .pnp.*",
"jest.config.js": "jest.setup.mjs",
"pyproject.toml": "poetry.lock, poetry.toml, mkdocs.yaml",
"cspell.config.yaml": "dictionary.txt"
},
"azureFunctions.postDeployTask": "npm install (functions)",
"azureFunctions.projectLanguage": "TypeScript",
"azureFunctions.projectRuntime": "~4",
"debug.internalConsoleOptions": "neverOpen",
"azureFunctions.preDeployTask": "npm prune (functions)",
"appService.zipIgnorePattern": [
"node_modules{,/**}",
".vscode{,/**}"
],
"python.defaultInterpreterPath": "python/services/.venv/bin/python",
"python.defaultInterpreterPath": "${workspaceRoot}/.venv/bin/python",
"python.languageServer": "Pylance",
"cSpell.customDictionaries": {
"project-words": {

View File

@ -1,6 +1,56 @@
# Changelog
Note: version releases in the 0.x.y range may introduce breaking changes.
## 2.7.0
- minor: Set LiteLLM as default in init_content.
- patch: Fix Azure auth scope issue with LiteLLM.
- patch: Housekeeping toward 2.7.
## 2.6.0
- minor: Add LiteLLM chat and embedding model providers.
- minor: Add LoggerFactory and clean up related API.
- minor: Add config for NLP async mode.
- minor: Add optional input documents to indexing API.
- minor: add customization to vector store
- patch: Add gpt-5 support by updating fnllm dependency.
- patch: Fix all human_readable_id fields to be 0-based.
- patch: Fix multi-index search.
- patch: Improve upon recent logging refactor
- patch: Make cache, storage, and vector_store factories consistent with similar registration support
- patch: Remove hard-coded community rate limiter.
- patch: generate_text_embeddings only loads tables if embedding field is specified.
## 2.5.0
- minor: Add additional context variable to build index signature for custom parameter bag
- minor: swap package management from Poetry -> UV
## 2.4.0
- minor: Allow injection of custom pipelines.
- minor: Refactored StorageFactory to use a registration-based approach
- patch: Fix default values for tpm and rpm limiters on embeddings
- patch: Update typer.
- patch: cleaned up logging to follow python standards.
## 2.3.0
- minor: Remove Dynamic Max Retries support. Refactor typer typing in cli interface
- minor: Update fnllm to latest. Update default graphrag configuration
- patch: A few fixes and enhancements for better reuse and flow.
- patch: Add full llm response to LLM PRovider output
- patch: Fix Drift Reduce Response for non streaming calls
- patch: Fix global search prompt to include missing formatting key
- patch: Upgrade pyarrow dependency to >=17.0.0 to fix CVE-2024-52338
## 2.2.1
- patch: Fix Community Report prompt tuning response
- patch: Fix graph creation missing edge weights.
- patch: Update as workflows
## 2.2.0
- minor: Support OpenAI reasoning models.

View File

@ -22,7 +22,7 @@ or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any addi
2. Create a new branch for your contribution: `git checkout -b my-contribution`.
3. Make your changes and ensure that the code passes all tests.
4. Commit your changes: `git commit -m "Add my contribution"`.
5. Create and commit a semver impact document by running `poetry run semversioner add-change -t <major|minor|patch> -d <description>`.
5. Create and commit a semver impact document by running `uv run semversioner add-change -t <major|minor|patch> -d <description>`.
6. Push your changes to your forked repository: `git push origin my-contribution`.
7. Open a pull request to the main repository.

View File

@ -5,29 +5,29 @@
| Name | Installation | Purpose |
| ------------------- | ------------------------------------------------------------ | ----------------------------------------------------------------------------------- |
| Python 3.10 or 3.11 | [Download](https://www.python.org/downloads/) | The library is Python-based. |
| Poetry | [Instructions](https://python-poetry.org/docs/#installation) | Poetry is used for package management and virtualenv management in Python codebases |
| uv | [Instructions](https://docs.astral.sh/uv/) | uv is used for package management and virtualenv management in Python codebases |
# Getting Started
## Install Dependencies
```shell
# install python dependencies
poetry install
uv sync
```
## Execute the indexing engine
```shell
poetry run poe index <...args>
uv run poe index <...args>
```
## Execute prompt tuning
```shell
poetry run poe prompt_tune <...args>
uv run poe prompt_tune <...args>
```
## Execute Queries
```shell
poetry run poe query <...args>
uv run poe query <...args>
```
## Repository Structure
@ -63,7 +63,7 @@ Where appropriate, the factories expose a registration method for users to provi
We use [semversioner](https://github.com/raulgomis/semversioner) to automate and enforce semantic versioning in the release process. Our CI/CD pipeline checks that all PR's include a json file generated by semversioner. When submitting a PR, please run:
```shell
poetry run semversioner add-change -t patch -d "<a small sentence describing changes made>."
uv run semversioner add-change -t patch -d "<a small sentence describing changes made>."
```
# Azurite
@ -78,29 +78,29 @@ or by simply running `azurite` in the terminal if already installed globally. Se
# Lifecycle Scripts
Our Python package utilizes Poetry to manage dependencies and [poethepoet](https://pypi.org/project/poethepoet/) to manage custom build scripts.
Our Python package utilizes uv to manage dependencies and [poethepoet](https://pypi.org/project/poethepoet/) to manage custom build scripts.
Available scripts are:
- `poetry run poe index` - Run the Indexing CLI
- `poetry run poe query` - Run the Query CLI
- `poetry build` - This invokes `poetry build`, which will build a wheel file and other distributable artifacts.
- `poetry run poe test` - This will execute all tests.
- `poetry run poe test_unit` - This will execute unit tests.
- `poetry run poe test_integration` - This will execute integration tests.
- `poetry run poe test_smoke` - This will execute smoke tests.
- `poetry run poe check` - This will perform a suite of static checks across the package, including:
- `uv run poe index` - Run the Indexing CLI
- `uv run poe query` - Run the Query CLI
- `uv build` - This will build a wheel file and other distributable artifacts.
- `uv run poe test` - This will execute all tests.
- `uv run poe test_unit` - This will execute unit tests.
- `uv run poe test_integration` - This will execute integration tests.
- `uv run poe test_smoke` - This will execute smoke tests.
- `uv run poe check` - This will perform a suite of static checks across the package, including:
- formatting
- documentation formatting
- linting
- security patterns
- type-checking
- `poetry run poe fix` - This will apply any available auto-fixes to the package. Usually this is just formatting fixes.
- `poetry run poe fix_unsafe` - This will apply any available auto-fixes to the package, including those that may be unsafe.
- `poetry run poe format` - Explicitly run the formatter across the package.
- `uv run poe fix` - This will apply any available auto-fixes to the package. Usually this is just formatting fixes.
- `uv run poe fix_unsafe` - This will apply any available auto-fixes to the package, including those that may be unsafe.
- `uv run poe format` - Explicitly run the formatter across the package.
## Troubleshooting
### "RuntimeError: llvm-config failed executing, please point LLVM_CONFIG to the path for llvm-config" when running poetry install
### "RuntimeError: llvm-config failed executing, please point LLVM_CONFIG to the path for llvm-config" when running uv sync
Make sure llvm-9 and llvm-9-dev are installed:
@ -110,13 +110,8 @@ and then in your bashrc, add
`export LLVM_CONFIG=/usr/bin/llvm-config-9`
### "numba/\_pymodule.h:6:10: fatal error: Python.h: No such file or directory" when running poetry install
### "numba/\_pymodule.h:6:10: fatal error: Python.h: No such file or directory" when running uv sync
Make sure you have python3.10-dev installed or more generally `python<version>-dev`
`sudo apt-get install python3.10-dev`
### LLM call constantly exceeds TPM, RPM or time limits
`GRAPHRAG_LLM_THREAD_COUNT` and `GRAPHRAG_EMBEDDING_THREAD_COUNT` are both set to 50 by default. You can modify these values
to reduce concurrency. Please refer to the [Configuration Documents](https://microsoft.github.io/graphrag/config/overview/)

View File

@ -1,6 +1,5 @@
# GraphRAG
👉 [Use the GraphRAG Accelerator solution](https://github.com/Azure-Samples/graphrag-accelerator) <br/>
👉 [Microsoft Research Blog Post](https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/)<br/>
👉 [Read the docs](https://microsoft.github.io/graphrag)<br/>
👉 [GraphRAG Arxiv](https://arxiv.org/pdf/2404.16130)
@ -28,7 +27,7 @@ To learn more about GraphRAG and how it can be used to enhance your LLM's abilit
## Quickstart
To get started with the GraphRAG system we recommend trying the [Solution Accelerator](https://github.com/Azure-Samples/graphrag-accelerator) package. This provides a user-friendly end-to-end experience with Azure resources.
To get started with the GraphRAG system we recommend trying the [command line quickstart](https://microsoft.github.io/graphrag/get_started/).
## Repository Guidance

View File

@ -12,6 +12,12 @@ There are five surface areas that may be impacted on any given release. They are
> TL;DR: Always run `graphrag init --path [path] --force` between minor version bumps to ensure you have the latest config format. Run the provided migration notebook between major version bumps if you want to avoid re-indexing prior datasets. Note that this will overwrite your configuration and prompts, so backup if necessary.
# v2
Run the [migration notebook](./docs/examples_notebooks/index_migration_to_v2.ipynb) to convert older tables to the v2 format.
The v2 release renamed all of our index tables to simply name the items each table contains. The previous naming was a leftover requirement of our use of DataShaper, which is no longer necessary.
# v1
Run the [migration notebook](./docs/examples_notebooks/index_migration_to_v1.ipynb) to convert older tables to the v1 format.
@ -27,7 +33,7 @@ All of the breaking changes listed below are accounted for in the four steps abo
- Alignment of fields from `create_final_entities` (such as name -> title) with `create_final_nodes`, and removal of redundant content across these tables
- Rename of `document.raw_content` to `document.text`
- Rename of `entity.name` to `entity.title`
- Rename `rank` to `combined_degree` in `create_final_relationships` and removal of `source_degree` and `target_degree`fields
- Rename `rank` to `combined_degree` in `create_final_relationships` and removal of `source_degree` and `target_degree` fields
- Fixed community tables to use a proper UUID for the `id` field, and retain `community` and `human_readable_id` for the short IDs
- Removal of all embeddings columns from parquet files in favor of direct vector store writes

View File

@ -79,6 +79,9 @@ mkdocs
fnllm
typer
spacy
kwargs
ollama
litellm
# Library Methods
iterrows
@ -100,6 +103,9 @@ itertuples
isin
nocache
nbconvert
levelno
acompletion
aembedding
# HTML
nbsp
@ -184,12 +190,14 @@ Verdantis's
# English
skippable
upvote
unconfigured
# Misc
Arxiv
kwds
jsons
txts
byog
# Dulce
astrotechnician

View File

@ -4,7 +4,7 @@ As of version 1.3, GraphRAG no longer supports a full complement of pre-built en
The only standard environment variable we expect, and include in the default settings.yml, is `GRAPHRAG_API_KEY`. If you are already using a number of the previous GRAPHRAG_* environment variables, you can insert them with template syntax into settings.yml and they will be adopted.
> **The environment variables below are documented as an aid for migration, but they WILL NOT be read unless you use template syntax in your settings.yml.**
> **The environment variables below are documented as an aid for migration, but they WILL NOT be read unless you use template syntax in your settings.yml. We also WILL NOT be updating this page as the main config object changes.**
---
@ -178,11 +178,11 @@ This section controls the cache mechanism used by the pipeline. This is used to
### Reporting
This section controls the reporting mechanism used by the pipeline, for common events and error messages. The default is to write reports to a file in the output directory. However, you can also choose to write reports to the console or to an Azure Blob Storage container.
This section controls the reporting mechanism used by the pipeline, for common events and error messages. The default is to write reports to a file in the output directory. However, you can also choose to write reports to an Azure Blob Storage container.
| Parameter | Description | Type | Required or Optional | Default |
| --------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ----- | -------------------- | ------- |
| `GRAPHRAG_REPORTING_TYPE` | The type of reporter to use. Options are `file`, `console`, or `blob` | `str` | optional | `file` |
| `GRAPHRAG_REPORTING_TYPE` | The type of reporter to use. Options are `file` or `blob` | `str` | optional | `file` |
| `GRAPHRAG_REPORTING_STORAGE_ACCOUNT_BLOB_URL` | The Azure Storage blob endpoint to use when in `blob` mode and using managed identity. Will have the format `https://<storage_account_name>.blob.core.windows.net` | `str` | optional | None |
| `GRAPHRAG_REPORTING_CONNECTION_STRING` | The Azure Storage connection string to use when in `blob` mode. | `str` | optional | None |
| `GRAPHRAG_REPORTING_CONTAINER_NAME` | The Azure Storage container name to use when in `blob` mode. | `str` | optional | None |

View File

@ -29,4 +29,4 @@ The `init` command will create the following files in the specified directory:
## Next Steps
After initializing your workspace, you can either run the [Prompt Tuning](../prompt_tuning/auto_prompt_tuning.md) command to adapt the prompts to your data or even start running the [Indexing Pipeline](../index/overview.md) to index your data. For more information on configuring GraphRAG, see the [Configuration](overview.md) documentation.
After initializing your workspace, you can either run the [Prompt Tuning](../prompt_tuning/auto_prompt_tuning.md) command to adapt the prompts to your data or even start running the [Indexing Pipeline](../index/overview.md) to index your data. For more information on configuration options available, see the [YAML details page](yaml.md).

130
docs/config/models.md Normal file
View File

@ -0,0 +1,130 @@
# Language Model Selection and Overriding
This page contains information on selecting a model to use and options to supply your own model for GraphRAG. Note that this is not a guide to finding the right model for your use case.
## Default Model Support
GraphRAG was built and tested using OpenAI models, so this is the default model set we support. This is not intended to be a limiter or statement of quality or fitness for your use case, only that it's the set we are most familiar with for prompting, tuning, and debugging.
GraphRAG also utilizes a language model wrapper library used by several projects within our team, called fnllm. fnllm provides two important functions for GraphRAG: rate limiting configuration to help us maximize throughput for large indexing jobs, and robust caching of API calls to minimize consumption on repeated indexes for testing, experimentation, or incremental ingest. fnllm uses the OpenAI Python SDK under the covers, so OpenAI-compliant endpoints are a base requirement out-of-the-box.
Starting with version 2.6.0, GraphRAG supports using [LiteLLM](https://docs.litellm.ai/) instead of fnllm for calling language models. LiteLLM provides support for 100+ models, though it is important to note that the model you choose must support returning [structured outputs](https://openai.com/index/introducing-structured-outputs-in-the-api/) adhering to a [JSON schema](https://docs.litellm.ai/docs/completion/json_mode).
Example using LiteLLM as the language model tool for GraphRAG:
```yaml
models:
default_chat_model:
type: chat
auth_type: api_key
api_key: ${GEMINI_API_KEY}
model_provider: gemini
model: gemini-2.5-flash-lite
default_embedding_model:
type: embedding
auth_type: api_key
api_key: ${GEMINI_API_KEY}
model_provider: gemini
model: gemini-embedding-001
```
To use LiteLLM one must
- Set `type` to either `chat` or `embedding`.
- Provide a `model_provider`, e.g., `openai`, `azure`, `gemini`, etc.
- Set the `model` to one supported by the `model_provider`'s API.
- Provide a `deployment_name` if using `azure` as the `model_provider`.
See [Detailed Configuration](yaml.md) for more details on configuration. [View LiteLLM basic usage](https://docs.litellm.ai/docs/#basic-usage) for details on how models are called (the `model_provider` is the portion prior to `/` while the `model` is the portion following the `/`).
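For Azure OpenAI through LiteLLM, the same pattern applies with a `deployment_name` added. Below is a minimal sketch; the resource URL, API version, and deployment names are illustrative placeholders rather than tested values:
```yaml
models:
  default_chat_model:
    type: chat
    auth_type: api_key
    api_key: ${GRAPHRAG_API_KEY}
    model_provider: azure
    model: gpt-4o                                  # replace with the model behind your deployment
    deployment_name: my-gpt-4o-deployment          # placeholder deployment name
    api_base: https://<your-resource>.openai.azure.com
    api_version: "2024-10-21"                      # example version; use one your resource supports
  default_embedding_model:
    type: embedding
    auth_type: api_key
    api_key: ${GRAPHRAG_API_KEY}
    model_provider: azure
    model: text-embedding-3-small
    deployment_name: my-embedding-deployment       # placeholder deployment name
    api_base: https://<your-resource>.openai.azure.com
    api_version: "2024-10-21"
```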
## Model Selection Considerations
GraphRAG has been most thoroughly tested with the gpt-4 series of models from OpenAI, including gpt-4, gpt-4-turbo, gpt-4o, and gpt-4o-mini. Our [arXiv paper](https://arxiv.org/abs/2404.16130), for example, performed quality evaluation using gpt-4-turbo. As stated above, non-OpenAI models are supported from GraphRAG 2.6.0 onwards through the use of LiteLLM, but the gpt-4 series from OpenAI remains the most tested and supported suite of models for GraphRAG.
Versions of GraphRAG before 2.2.0 made extensive use of `max_tokens` and `logit_bias` to control generated response length or content. The introduction of the o-series of models added new, non-compatible parameters because these models include a reasoning component that has different consumption patterns and response generation attributes than non-reasoning models. GraphRAG 2.2.0 now supports these models, but there are important differences that need to be understood before you switch.
- Previously, GraphRAG used `max_tokens` to limit responses in a few locations. This was done so that we can have predictable content sizes when building downstream context windows for summarization. We have now switched from `max_tokens` to a prompted approach, which is working well in our tests. We suggest using `max_tokens` in your language model config only for budgetary reasons if you want to limit consumption, and not for expected response length control. We now also support the o-series equivalent `max_completion_tokens`, but if you use this, keep in mind that there may be some unknown fixed reasoning consumption in addition to the response tokens, so it is not a good technique for response control.
- Previously, GraphRAG used a combination of `max_tokens` and `logit_bias` to strictly control a binary yes/no question during gleanings. This is not possible with reasoning models, so again we have switched to a prompted approach. Our tests with gpt-4o, gpt-4o-mini, and o1 show that this works consistently, but could have issues if you have an older or smaller model.
- The o-series models are much slower and more expensive. It may be useful to use an asymmetric approach to model use in your config: you can define as many models as you like in the `models` block of your settings.yaml and reference them by key for every workflow that requires a language model. You could use gpt-4o for indexing and o1 for query, for example. Experiment to find the right balance of cost, speed, and quality for your use case.
- The o-series models contain a form of native chain-of-thought reasoning that is absent in the non-o-series models. GraphRAG's prompts sometimes contain CoT because it was an effective technique with the gpt-4* series. It may be counterproductive with the o-series, so you may want to tune or even re-write large portions of the prompt templates (particularly for graph and claim extraction).
Example config with asymmetric model use:
```yaml
models:
extraction_chat_model:
api_key: ${GRAPHRAG_API_KEY}
type: openai_chat
auth_type: api_key
model: gpt-4o
model_supports_json: true
query_chat_model:
api_key: ${GRAPHRAG_API_KEY}
type: openai_chat
auth_type: api_key
model: o1
model_supports_json: true
...
extract_graph:
model_id: extraction_chat_model
prompt: "prompts/extract_graph.txt"
entity_types: [organization,person,geo,event]
max_gleanings: 1
...
global_search:
chat_model_id: query_chat_model
map_prompt: "prompts/global_search_map_system_prompt.txt"
reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
knowledge_prompt: "prompts/global_search_knowledge_system_prompt.txt"
```
Another option would be to avoid using a language model at all for the graph extraction, instead using the `fast` [indexing method](../index/methods.md) that uses NLP for portions of the indexing phase in lieu of LLM APIs.
## Using Non-OpenAI Models
As shown above, non-OpenAI models may be used via LiteLLM starting with GraphRAG version 2.6.0, but cases may still exist in which some users wish to use models not supported by LiteLLM. There are two approaches one can use to connect to unsupported models:
### Proxy APIs
Many users have used platforms such as [ollama](https://ollama.com/) and [LiteLLM Proxy Server](https://docs.litellm.ai/docs/simple_proxy) to proxy the underlying model HTTP calls to a different model provider. This seems to work reasonably well, but we frequently see issues with malformed responses (especially JSON), so if you do this, please understand that your model needs to reliably return the specific response formats that GraphRAG expects. If you're having trouble with a model, you may need to try prompting to coax the format, or intercepting the response within your proxy to try and handle malformed responses.
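As a rough illustration, an OpenAI-compatible proxy is usually reached by pointing `api_base` at it. The sketch below assumes an ollama server on its default port and a locally pulled model; the model name and API key are placeholders, not recommendations:
```yaml
models:
  default_chat_model:
    type: openai_chat                      # OpenAI-compatible client
    auth_type: api_key
    api_key: not-used                      # placeholder; many proxies ignore the key
    model: llama3.1                        # placeholder model name served by the proxy
    api_base: http://localhost:11434/v1    # assumed ollama OpenAI-compatible endpoint
    model_supports_json: true              # only set this if the model reliably returns JSON
```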
### Model Protocol
As of GraphRAG 2.0.0, we support model injection through the use of a standard chat and embedding Protocol and an accompanying ModelFactory that you can use to register your model implementation. This is not supported with the CLI, so you'll need to use GraphRAG as a library.
- Our Protocol is [defined here](https://github.com/microsoft/graphrag/blob/main/graphrag/language_model/protocol/base.py)
- Our base implementation, which wraps fnllm, [is here](https://github.com/microsoft/graphrag/blob/main/graphrag/language_model/providers/fnllm/models.py)
- We have a simple mock implementation in our tests that you can [reference here](https://github.com/microsoft/graphrag/blob/main/tests/mock_provider.py)
Once you have a model implementation, you need to register it with our ModelFactory:
```python
class MyCustomModel:
...
# implementation
# elsewhere...
ModelFactory.register_chat("my-custom-chat-model", lambda **kwargs: MyCustomModel(**kwargs))
```
Then in your config you can reference the type name you used:
```yaml
models:
default_chat_model:
type: my-custom-chat-model
extract_graph:
model_id: default_chat_model
prompt: "prompts/extract_graph.txt"
entity_types: [organization,person,geo,event]
max_gleanings: 1
```
Note that your custom model will be passed the same params for init and method calls that we use throughout GraphRAG. There is not currently any ability to define custom parameters, so you may need to use closure scope or a factory pattern within your implementation to get custom config values.
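For instance, extra settings can be captured in closure scope when you register the model. The sketch below assumes the `MyCustomModel` shown above accepts a hypothetical `api_url` argument; GraphRAG itself knows nothing about that parameter:
```python
def register_my_model(api_url: str) -> None:
    """Register a chat model whose extra config is captured in closure scope."""

    def create_model(**kwargs):
        # kwargs are the standard params GraphRAG passes to every model;
        # api_url comes from the enclosing scope because GraphRAG cannot
        # pass custom parameters through its config.
        return MyCustomModel(api_url=api_url, **kwargs)

    # Same ModelFactory as in the registration example above.
    ModelFactory.register_chat("my-custom-chat-model", create_model)


# Call this before loading the config that references "my-custom-chat-model".
register_my_model(api_url="https://llm.internal.example.com")
```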

View File

@ -4,8 +4,8 @@ The GraphRAG system is highly configurable. This page provides an overview of th
## Default Configuration Mode
The default configuration mode is the simplest way to get started with the GraphRAG system. It is designed to work out-of-the-box with minimal configuration. The primary configuration sections for the Indexing Engine pipelines are described below. The main ways to set up GraphRAG in Default Configuration mode are via:
The default configuration mode is the simplest way to get started with the GraphRAG system. It is designed to work out-of-the-box with minimal configuration. The main ways to set up GraphRAG in Default Configuration mode are via:
- [Init command](init.md) (recommended)
- [Using YAML for deeper control](yaml.md)
- [Init command](init.md) (recommended first step)
- [Edit settings.yaml for deeper control](yaml.md)
- [Purely using environment variables](env_vars.md) (not recommended)

View File

@ -40,8 +40,9 @@ models:
#### Fields
- `api_key` **str** - The OpenAI API key to use.
- `auth_type` **api_key|managed_identity** - Indicate how you want to authenticate requests.
- `type` **openai_chat|azure_openai_chat|openai_embedding|azure_openai_embedding|mock_chat|mock_embeddings** - The type of LLM to use.
- `auth_type` **api_key|azure_managed_identity** - Indicate how you want to authenticate requests.
- `type` **chat**|**embedding**|**openai_chat|azure_openai_chat|openai_embedding|azure_openai_embedding|mock_chat|mock_embeddings** - The type of LLM to use.
- `model_provider` **str|None** - The model provider to use, e.g., openai, azure, anthropic, etc. Required when `type == chat|embedding`. When `type == chat|embedding`, [LiteLLM](https://docs.litellm.ai/) is used under the hood, which has support for calling 100+ models. [View LiteLLM basic usage](https://docs.litellm.ai/docs/#basic-usage) for details on how models are called (the `model_provider` is the portion prior to `/` while the `model` is the portion following the `/`). [View Language Model Selection](models.md) for more details and examples on using LiteLLM.
- `model` **str** - The model name.
- `encoding_model` **str** - The text encoding model to use. Default is to use the encoding model aligned with the language model (i.e., it is retrieved from tiktoken if unset).
- `api_base` **str** - The API base url to use.
@ -60,27 +61,31 @@ models:
- `concurrent_requests` **int** The number of open requests to allow at once.
- `async_mode` **asyncio|threaded** The async mode to use. Either `asyncio` or `threaded`.
- `responses` **list[str]** - If this model type is mock, this is a list of response strings to return.
- `max_tokens` **int** - The maximum number of output tokens.
- `temperature` **float** - The temperature to use.
- `top_p` **float** - The top-p value to use.
- `n` **int** - The number of completions to generate.
- `frequency_penalty` **float** - Frequency penalty for token generation.
- `presence_penalty` **float** - Frequency penalty for token generation.
- `max_tokens` **int** - The maximum number of output tokens. Not valid for o-series models.
- `temperature` **float** - The temperature to use. Not valid for o-series models.
- `top_p` **float** - The top-p value to use. Not valid for o-series models.
- `frequency_penalty` **float** - Frequency penalty for token generation. Not valid for o-series models.
- `presence_penalty` **float** - Presence penalty for token generation. Not valid for o-series models.
- `max_completion_tokens` **int** - Max number of tokens to consume for chat completion. Must be large enough to include an unknown amount for "reasoning" by the model. o-series models only.
- `reasoning_effort` **low|medium|high** - Amount of "thought" for the model to expend reasoning about a response. o-series models only.
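To make the o-series split above concrete, here is a minimal sketch with two model definitions: one using the classic sampling parameters and one using the o-series-only fields. Model names and values are illustrative, not recommendations:
```yaml
models:
  gpt4o_chat_model:
    type: openai_chat
    api_key: ${GRAPHRAG_API_KEY}
    model: gpt-4o
    max_tokens: 4000              # valid only for non-o-series models
    temperature: 0
    top_p: 1
  o1_chat_model:
    type: openai_chat
    api_key: ${GRAPHRAG_API_KEY}
    model: o1
    max_completion_tokens: 8000   # must cover hidden reasoning tokens plus the response
    reasoning_effort: medium      # low|medium|high
```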
## Input Files and Chunking
### input
Our pipeline can ingest .csv, .txt, or .json data from an input folder. See the [inputs page](../index/inputs.md) for more details and examples.
Our pipeline can ingest .csv, .txt, or .json data from an input location. See the [inputs page](../index/inputs.md) for more details and examples.
#### Fields
- `type` **file|blob** - The input type to use. Default=`file`
- `storage` **StorageConfig**
- `type` **file|blob|cosmosdb** - The storage type to use. Default=`file`
- `base_dir` **str** - The base directory to write output artifacts to, relative to the root.
- `connection_string` **str** - (blob/cosmosdb only) The Azure Storage connection string.
- `container_name` **str** - (blob/cosmosdb only) The Azure Storage container name.
- `storage_account_blob_url` **str** - (blob only) The storage account blob URL to use.
- `cosmosdb_account_blob_url` **str** - (cosmosdb only) The CosmosDB account blob URL to use.
- `file_type` **text|csv|json** - The type of input data to load. Default is `text`
- `base_dir` **str** - The base directory to read input from, relative to the root.
- `connection_string` **str** - (blob only) The Azure Storage connection string.
- `storage_account_blob_url` **str** - The storage account blob URL to use.
- `container_name` **str** - (blob only) The Azure Storage container name.
- `encoding` **str** - The encoding of the input file. Default is `utf-8`
- `file_pattern` **str** - A regex to match input files. Default is `.*\.csv$`, `.*\.txt$`, or `.*\.json$` depending on the specified `file_type`, but you can customize it if needed.
- `file_filter` **dict** - Key/value pairs to filter. Default is None.
@ -145,11 +150,11 @@ This section controls the cache mechanism used by the pipeline. This is used to
### reporting
This section controls the reporting mechanism used by the pipeline, for common events and error messages. The default is to write reports to a file in the output directory. However, you can also choose to write reports to the console or to an Azure Blob Storage container.
This section controls the reporting mechanism used by the pipeline, for common events and error messages. The default is to write reports to a file in the output directory. However, you can also choose to write reports to an Azure Blob Storage container.
#### Fields
- `type` **file|console|blob** - The reporting type to use. Default=`file`
- `type` **file|blob** - The reporting type to use. Default=`file`
- `base_dir` **str** - The base directory to write reports to, relative to the root.
- `connection_string` **str** - (blob only) The Azure Storage connection string.
- `container_name` **str** - (blob only) The Azure Storage container name.
@ -199,8 +204,7 @@ Supported embeddings names are:
- `vector_store_id` **str** - Name of vector store definition to write to.
- `batch_size` **int** - The maximum batch size to use.
- `batch_max_tokens` **int** - The maximum batch # of tokens.
- `target` **required|all|selected|none** - Determines which set of embeddings to export.
- `names` **list[str]** - If target=selected, this should be an explicit list of the embeddings names we support.
- `names` **list[str]** - List of the embeddings names to run (must be in supported list).
### extract_graph
@ -212,7 +216,6 @@ Tune the language model-based graph extraction process.
- `prompt` **str** - The prompt file to use.
- `entity_types` **list[str]** - The entity types to identify.
- `max_gleanings` **int** - The maximum number of gleaning cycles to use.
- `encoding_model` **str** - The text encoding model to use. Default is to use the encoding model aligned with the language model (i.e., it is retrieved from tiktoken if unset). This is only used for gleanings during the logit_bias check.
### summarize_descriptions
@ -221,6 +224,7 @@ Tune the language model-based graph extraction process.
- `model_id` **str** - Name of the model definition to use for API calls.
- `prompt` **str** - The prompt file to use.
- `max_length` **int** - The maximum number of output tokens per summarization.
- `max_input_length` **int** - The maximum number of tokens to collect for summarization (this will limit how many descriptions you send to be summarized for a given entity or relationship).
### extract_graph_nlp
@ -274,7 +278,6 @@ These are the settings used for Leiden hierarchical clustering of the graph to c
- `prompt` **str** - The prompt file to use.
- `description` **str** - Describes the types of claims we want to extract.
- `max_gleanings` **int** - The maximum number of gleaning cycles to use.
- `encoding_model` **str** - The text encoding model to use. Default is to use the encoding model aligned with the language model (i.e., it is retrieved from tiktoken if unset). This is only used for gleanings during the logit_bias check.
### community_reports
@ -329,11 +332,7 @@ Indicates whether we should run UMAP dimensionality reduction. This is used to p
- `conversation_history_max_turns` **int** - The conversation history maximum turns.
- `top_k_entities` **int** - The top k mapped entities.
- `top_k_relationships` **int** - The top k mapped relations.
- `temperature` **float | None** - The temperature to use for token generation.
- `top_p` **float | None** - The top-p value to use for token generation.
- `n` **int | None** - The number of completions to generate.
- `max_tokens` **int** - The maximum tokens.
- `llm_max_tokens` **int** - The LLM maximum tokens.
- `max_context_tokens` **int** - The maximum tokens to use building the request context.
### global_search
@ -346,20 +345,14 @@ Indicates whether we should run UMAP dimensionality reduction. This is used to p
- `map_prompt` **str | None** - The global search mapper prompt to use.
- `reduce_prompt` **str | None** - The global search reducer to use.
- `knowledge_prompt` **str | None** - The global search general prompt to use.
- `temperature` **float | None** - The temperature to use for token generation.
- `top_p` **float | None** - The top-p value to use for token generation.
- `n` **int | None** - The number of completions to generate.
- `max_tokens` **int** - The maximum context size in tokens.
- `data_max_tokens` **int** - The data llm maximum tokens.
- `map_max_tokens` **int** - The map llm maximum tokens.
- `reduce_max_tokens` **int** - The reduce llm maximum tokens.
- `concurrency` **int** - The number of concurrent requests.
- `dynamic_search_llm` **str** - LLM model to use for dynamic community selection.
- `max_context_tokens` **int** - The maximum context size to create, in tokens.
- `data_max_tokens` **int** - The maximum tokens to use when constructing the final response from the reduced responses.
- `map_max_length` **int** - The maximum length to request for map responses, in words.
- `reduce_max_length` **int** - The maximum length to request for reduce responses, in words.
- `dynamic_search_threshold` **int** - Rating threshold to include a community report.
- `dynamic_search_keep_parent` **bool** - Keep parent community if any of the child communities are relevant.
- `dynamic_search_num_repeats` **int** - Number of times to rate the same community report.
- `dynamic_search_use_summary` **bool** - Use community summary instead of full_context.
- `dynamic_search_concurrent_coroutines` **int** - Number of concurrent coroutines to rate community reports.
- `dynamic_search_max_level` **int** - The maximum level of community hierarchy to consider if none of the processed communities are relevant.
### drift_search
@ -370,11 +363,9 @@ Indicates whether we should run UMAP dimensionality reduction. This is used to p
- `embedding_model_id` **str** - Name of the model definition to use for Embedding calls.
- `prompt` **str** - The prompt file to use.
- `reduce_prompt` **str** - The reducer prompt file to use.
- `temperature` **float** - The temperature to use for token generation.",
- `top_p` **float** - The top-p value to use for token generation.
- `n` **int** - The number of completions to generate.
- `max_tokens` **int** - The maximum context size in tokens.
- `data_max_tokens` **int** - The data llm maximum tokens.
- `reduce_max_tokens` **int** - The maximum tokens for the reduce phase. Only use if a non-o-series model.
- `reduce_max_completion_tokens` **int** - The maximum tokens for the reduce phase. Only use for o-series models.
- `concurrency` **int** - The number of concurrent requests.
- `drift_k_followups` **int** - The number of top global results to retrieve.
- `primer_folds` **int** - The number of folds for search priming.
@ -388,7 +379,8 @@ Indicates whether we should run UMAP dimensionality reduction. This is used to p
- `local_search_temperature` **float** - The temperature to use for token generation in local search.
- `local_search_top_p` **float** - The top-p value to use for token generation in local search.
- `local_search_n` **int** - The number of completions to generate in local search.
- `local_search_llm_max_gen_tokens` **int** - The maximum number of generated tokens for the LLM in local search.
- `local_search_llm_max_gen_tokens` **int** - The maximum number of generated tokens for the LLM in local search. Only use if a non-o-series model.
- `local_search_llm_max_gen_completion_tokens` **int** - The maximum number of generated tokens for the LLM in local search. Only use for o-series models.
### basic_search
@ -397,13 +389,4 @@ Indicates whether we should run UMAP dimensionality reduction. This is used to p
- `chat_model_id` **str** - Name of the model definition to use for Chat Completion calls.
- `embedding_model_id` **str** - Name of the model definition to use for Embedding calls.
- `prompt` **str** - The prompt file to use.
- `text_unit_prop` **float** - The text unit proportion.
- `community_prop` **float** - The community proportion.
- `conversation_history_max_turns` **int** - The conversation history maximum turns.
- `top_k_entities` **int** - The top k mapped entities.
- `top_k_relationships` **int** - The top k mapped relations.
- `temperature` **float | None** - The temperature to use for token generation.
- `top_p` **float | None** - The top-p value to use for token generation.
- `n` **int | None** - The number of completions to generate.
- `max_tokens` **int** - The maximum tokens.
- `llm_max_tokens` **int** - The LLM maximum tokens.
- `k` **int | None** - Number of text units to retrieve from the vector store for context building.

View File

@ -5,27 +5,27 @@
| Name | Installation | Purpose |
| ------------------- | ------------------------------------------------------------ | ----------------------------------------------------------------------------------- |
| Python 3.10-3.12 | [Download](https://www.python.org/downloads/) | The library is Python-based. |
| Poetry | [Instructions](https://python-poetry.org/docs/#installation) | Poetry is used for package management and virtualenv management in Python codebases |
| uv | [Instructions](https://docs.astral.sh/uv/) | uv is used for package management and virtualenv management in Python codebases |
# Getting Started
## Install Dependencies
```sh
# Install Python dependencies.
poetry install
# install python dependencies
uv sync
```
## Execute the Indexing Engine
```sh
poetry run poe index <...args>
uv run poe index <...args>
```
## Executing Queries
```sh
poetry run poe query <...args>
uv run poe query <...args>
```
# Azurite
@ -40,31 +40,31 @@ or by simply running `azurite` in the terminal if already installed globally. Se
# Lifecycle Scripts
Our Python package utilizes Poetry to manage dependencies and [poethepoet](https://pypi.org/project/poethepoet/) to manage build scripts.
Our Python package utilizes uv to manage dependencies and [poethepoet](https://pypi.org/project/poethepoet/) to manage build scripts.
Available scripts are:
- `poetry run poe index` - Run the Indexing CLI
- `poetry run poe query` - Run the Query CLI
- `poetry build` - This invokes `poetry build`, which will build a wheel file and other distributable artifacts.
- `poetry run poe test` - This will execute all tests.
- `poetry run poe test_unit` - This will execute unit tests.
- `poetry run poe test_integration` - This will execute integration tests.
- `poetry run poe test_smoke` - This will execute smoke tests.
- `poetry run poe test_verbs` - This will execute tests of the basic workflows.
- `poetry run poe check` - This will perform a suite of static checks across the package, including:
- `uv run poe index` - Run the Indexing CLI
- `uv run poe query` - Run the Query CLI
- `uv build` - This will build a wheel file and other distributable artifacts.
- `uv run poe test` - This will execute all tests.
- `uv run poe test_unit` - This will execute unit tests.
- `uv run poe test_integration` - This will execute integration tests.
- `uv run poe test_smoke` - This will execute smoke tests.
- `uv run poe test_verbs` - This will execute tests of the basic workflows.
- `uv run poe check` - This will perform a suite of static checks across the package, including:
- formatting
- documentation formatting
- linting
- security patterns
- type-checking
- `poetry run poe fix` - This will apply any available auto-fixes to the package. Usually this is just formatting fixes.
- `poetry run poe fix_unsafe` - This will apply any available auto-fixes to the package, including those that may be unsafe.
- `poetry run poe format` - Explicitly run the formatter across the package.
- `uv run poe fix` - This will apply any available auto-fixes to the package. Usually this is just formatting fixes.
- `uv run poe fix_unsafe` - This will apply any available auto-fixes to the package, including those that may be unsafe.
- `uv run poe format` - Explicitly run the formatter across the package.
## Troubleshooting
### "RuntimeError: llvm-config failed executing, please point LLVM_CONFIG to the path for llvm-config" when running poetry install
### "RuntimeError: llvm-config failed executing, please point LLVM_CONFIG to the path for llvm-config" when running uv install
Make sure llvm-9 and llvm-9-dev are installed:
@ -73,14 +73,3 @@ Make sure llvm-9 and llvm-9-dev are installed:
and then in your bashrc, add
`export LLVM_CONFIG=/usr/bin/llvm-config-9`
### "numba/\_pymodule.h:6:10: fatal error: Python.h: No such file or directory" when running poetry install
Make sure you have python3.10-dev installed or more generally `python<version>-dev`
`sudo apt-get install python3.10-dev`
### LLM call constantly exceeds TPM, RPM or time limits
`GRAPHRAG_LLM_THREAD_COUNT` and `GRAPHRAG_EMBEDDING_THREAD_COUNT` are both set to 50 by default. You can modify these values
to reduce concurrency. Please refer to the [Configuration Documents](config/overview.md)

View File

@ -67,6 +67,8 @@
"metadata": {},
"outputs": [],
"source": [
"# note that we expect this to fail on the deployed docs because the PROJECT_DIRECTORY is not set to a real location.\n",
"# if you run this notebook locally, make sure to point at a location containing your settings.yaml\n",
"graphrag_config = load_config(Path(PROJECT_DIRECTORY))"
]
},

View File

@ -0,0 +1,680 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Copyright (c) 2024 Microsoft Corporation.\n",
"# Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Bring-Your-Own Vector Store\n",
"\n",
"This notebook demonstrates how to implement a custom vector store and register for usage with GraphRAG.\n",
"\n",
"## Overview\n",
"\n",
"GraphRAG uses a plug-and-play architecture that allow for easy integration of custom vector stores (outside of what is natively supported) by following a factory design pattern. This allows you to:\n",
"\n",
"- **Extend functionality**: Add support for new vector database backends\n",
"- **Customize behavior**: Implement specialized search logic or data structures\n",
"- **Integrate existing systems**: Connect GraphRAG to your existing vector database infrastructure\n",
"\n",
"### What You'll Learn\n",
"\n",
"1. Understanding the `BaseVectorStore` interface\n",
"2. Implementing a custom vector store class\n",
"3. Registering your vector store with the `VectorStoreFactory`\n",
"4. Testing and validating your implementation\n",
"5. Configuring GraphRAG to use your custom vector store\n",
"\n",
"Let's get started!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1: Import Required Dependencies\n",
"\n",
"First, let's import the necessary GraphRAG components and other dependencies we'll need.\n",
"\n",
"```bash\n",
"pip install graphrag\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from typing import Any\n",
"\n",
"import numpy as np\n",
"import yaml\n",
"\n",
"from graphrag.config.models.vector_store_schema_config import VectorStoreSchemaConfig\n",
"from graphrag.data_model.types import TextEmbedder\n",
"\n",
"# GraphRAG vector store components\n",
"from graphrag.vector_stores.base import (\n",
" BaseVectorStore,\n",
" VectorStoreDocument,\n",
" VectorStoreSearchResult,\n",
")\n",
"from graphrag.vector_stores.factory import VectorStoreFactory"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2: Understand the BaseVectorStore Interface\n",
"\n",
"Before using a custom vector store, let's examine the `BaseVectorStore` interface to understand what methods need to be implemented."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Let's inspect the BaseVectorStore class to understand the required methods\n",
"import inspect\n",
"\n",
"print(\"BaseVectorStore Abstract Methods:\")\n",
"print(\"=\" * 40)\n",
"\n",
"abstract_methods = []\n",
"for name, method in inspect.getmembers(BaseVectorStore, predicate=inspect.isfunction):\n",
" if getattr(method, \"__isabstractmethod__\", False):\n",
" signature = inspect.signature(method)\n",
" abstract_methods.append(f\"• {name}{signature}\")\n",
" print(f\"• {name}{signature}\")\n",
"\n",
"print(f\"\\nTotal abstract methods to implement: {len(abstract_methods)}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3: Implement a Custom Vector Store\n",
"\n",
"Now let's implement a simple in-memory vector store as an example. This vector store will:\n",
"\n",
"- Store documents and vectors in memory using Python data structures\n",
"- Support all required BaseVectorStore methods\n",
"\n",
"**Note**: This is a simplified example for demonstration. Production vector stores would typically use optimized libraries like FAISS, more sophisticated indexing, and persistent storage."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class SimpleInMemoryVectorStore(BaseVectorStore):\n",
" \"\"\"A simple in-memory vector store implementation for demonstration purposes.\n",
"\n",
" This vector store stores documents and their embeddings in memory and provides\n",
" basic similarity search functionality using cosine similarity.\n",
"\n",
" WARNING: This is for demonstration only - not suitable for production use.\n",
" For production, consider using optimized vector databases like LanceDB,\n",
" Azure AI Search, or other specialized vector stores.\n",
" \"\"\"\n",
"\n",
" # Internal storage for documents and vectors\n",
" documents: dict[str, VectorStoreDocument]\n",
" vectors: dict[str, np.ndarray]\n",
" connected: bool\n",
"\n",
" def __init__(self, **kwargs: Any):\n",
" \"\"\"Initialize the in-memory vector store.\"\"\"\n",
" super().__init__(**kwargs)\n",
"\n",
" self.documents: dict[str, VectorStoreDocument] = {}\n",
" self.vectors: dict[str, np.ndarray] = {}\n",
" self.connected = False\n",
"\n",
" print(f\"🚀 SimpleInMemoryVectorStore initialized for index: {self.index_name}\")\n",
"\n",
" def connect(self, **kwargs: Any) -> None:\n",
" \"\"\"Connect to the vector storage (no-op for in-memory store).\"\"\"\n",
" self.connected = True\n",
" print(f\"✅ Connected to in-memory vector store: {self.index_name}\")\n",
"\n",
" def load_documents(\n",
" self, documents: list[VectorStoreDocument], overwrite: bool = True\n",
" ) -> None:\n",
" \"\"\"Load documents into the vector store.\"\"\"\n",
" if not self.connected:\n",
" msg = \"Vector store not connected. Call connect() first.\"\n",
" raise RuntimeError(msg)\n",
"\n",
" if overwrite:\n",
" self.documents.clear()\n",
" self.vectors.clear()\n",
"\n",
" loaded_count = 0\n",
" for doc in documents:\n",
" if doc.vector is not None:\n",
" doc_id = str(doc.id)\n",
" self.documents[doc_id] = doc\n",
" self.vectors[doc_id] = np.array(doc.vector, dtype=np.float32)\n",
" loaded_count += 1\n",
"\n",
" print(f\"📚 Loaded {loaded_count} documents into vector store\")\n",
"\n",
" def _cosine_similarity(self, vec1: np.ndarray, vec2: np.ndarray) -> float:\n",
" \"\"\"Calculate cosine similarity between two vectors.\"\"\"\n",
" # Normalize vectors\n",
" norm1 = np.linalg.norm(vec1)\n",
" norm2 = np.linalg.norm(vec2)\n",
"\n",
" if norm1 == 0 or norm2 == 0:\n",
" return 0.0\n",
"\n",
" return float(np.dot(vec1, vec2) / (norm1 * norm2))\n",
"\n",
" def similarity_search_by_vector(\n",
" self, query_embedding: list[float], k: int = 10, **kwargs: Any\n",
" ) -> list[VectorStoreSearchResult]:\n",
" \"\"\"Perform similarity search using a query vector.\"\"\"\n",
" if not self.connected:\n",
" msg = \"Vector store not connected. Call connect() first.\"\n",
" raise RuntimeError(msg)\n",
"\n",
" if not self.vectors:\n",
" return []\n",
"\n",
" query_vec = np.array(query_embedding, dtype=np.float32)\n",
" similarities = []\n",
"\n",
" # Calculate similarity with all stored vectors\n",
" for doc_id, stored_vec in self.vectors.items():\n",
" similarity = self._cosine_similarity(query_vec, stored_vec)\n",
" similarities.append((doc_id, similarity))\n",
"\n",
" # Sort by similarity (descending) and take top k\n",
" similarities.sort(key=lambda x: x[1], reverse=True)\n",
" top_k = similarities[:k]\n",
"\n",
" # Create search results\n",
" results = []\n",
" for doc_id, score in top_k:\n",
" document = self.documents[doc_id]\n",
" result = VectorStoreSearchResult(document=document, score=score)\n",
" results.append(result)\n",
"\n",
" return results\n",
"\n",
" def similarity_search_by_text(\n",
" self, text: str, text_embedder: TextEmbedder, k: int = 10, **kwargs: Any\n",
" ) -> list[VectorStoreSearchResult]:\n",
" \"\"\"Perform similarity search using text (which gets embedded first).\"\"\"\n",
" # Embed the text first\n",
" query_embedding = text_embedder(text)\n",
"\n",
" # Use vector search with the embedding\n",
" return self.similarity_search_by_vector(query_embedding, k, **kwargs)\n",
"\n",
" def filter_by_id(self, include_ids: list[str] | list[int]) -> Any:\n",
" \"\"\"Build a query filter to filter documents by id.\n",
"\n",
" For this simple implementation, we return the list of IDs as the filter.\n",
" \"\"\"\n",
" return [str(id_) for id_ in include_ids]\n",
"\n",
" def search_by_id(self, id: str) -> VectorStoreDocument:\n",
" \"\"\"Search for a document by id.\"\"\"\n",
" doc_id = str(id)\n",
" if doc_id not in self.documents:\n",
" msg = f\"Document with id '{id}' not found\"\n",
" raise KeyError(msg)\n",
"\n",
" return self.documents[doc_id]\n",
"\n",
" def get_stats(self) -> dict[str, Any]:\n",
" \"\"\"Get statistics about the vector store (custom method).\"\"\"\n",
" return {\n",
" \"index_name\": self.index_name,\n",
" \"document_count\": len(self.documents),\n",
" \"vector_count\": len(self.vectors),\n",
" \"connected\": self.connected,\n",
" \"vector_dimension\": len(next(iter(self.vectors.values())))\n",
" if self.vectors\n",
" else 0,\n",
" }\n",
"\n",
"\n",
"print(\"✅ SimpleInMemoryVectorStore class defined!\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 4: Register the Custom Vector Store\n",
"\n",
"Now let's register our custom vector store with the `VectorStoreFactory` so it can be used throughout GraphRAG."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Register our custom vector store with a unique identifier\n",
"CUSTOM_VECTOR_STORE_TYPE = \"simple_memory\"\n",
"\n",
"# Register the vector store class\n",
"VectorStoreFactory.register(CUSTOM_VECTOR_STORE_TYPE, SimpleInMemoryVectorStore)\n",
"\n",
"print(f\"✅ Registered custom vector store with type: '{CUSTOM_VECTOR_STORE_TYPE}'\")\n",
"\n",
"# Verify registration\n",
"available_types = VectorStoreFactory.get_vector_store_types()\n",
"print(f\"\\n📋 Available vector store types: {available_types}\")\n",
"print(\n",
" f\"🔍 Is our custom type supported? {VectorStoreFactory.is_supported_type(CUSTOM_VECTOR_STORE_TYPE)}\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 5: Test the Custom Vector Store\n",
"\n",
"Let's create some sample data and test our custom vector store implementation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create sample documents with mock embeddings\n",
"def create_mock_embedding(dimension: int = 384) -> list[float]:\n",
" \"\"\"Create a random embedding vector for testing.\"\"\"\n",
" return np.random.normal(0, 1, dimension).tolist()\n",
"\n",
"\n",
"# Sample documents\n",
"sample_documents = [\n",
" VectorStoreDocument(\n",
" id=\"doc_1\",\n",
" text=\"GraphRAG is a powerful knowledge graph extraction and reasoning framework.\",\n",
" vector=create_mock_embedding(),\n",
" attributes={\"category\": \"technology\", \"source\": \"documentation\"},\n",
" ),\n",
" VectorStoreDocument(\n",
" id=\"doc_2\",\n",
" text=\"Vector stores enable efficient similarity search over high-dimensional data.\",\n",
" vector=create_mock_embedding(),\n",
" attributes={\"category\": \"technology\", \"source\": \"research\"},\n",
" ),\n",
" VectorStoreDocument(\n",
" id=\"doc_3\",\n",
" text=\"Machine learning models can process and understand natural language text.\",\n",
" vector=create_mock_embedding(),\n",
" attributes={\"category\": \"AI\", \"source\": \"article\"},\n",
" ),\n",
" VectorStoreDocument(\n",
" id=\"doc_4\",\n",
" text=\"Custom implementations allow for specialized behavior and integration.\",\n",
" vector=create_mock_embedding(),\n",
" attributes={\"category\": \"development\", \"source\": \"tutorial\"},\n",
" ),\n",
"]\n",
"\n",
"print(f\"📝 Created {len(sample_documents)} sample documents\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Test creating vector store using the factory\n",
"schema = VectorStoreSchemaConfig(index_name=\"test_collection\")\n",
"\n",
"# Create vector store instance using factory\n",
"vector_store = VectorStoreFactory.create_vector_store(\n",
" CUSTOM_VECTOR_STORE_TYPE, vector_store_schema_config=schema\n",
")\n",
"\n",
"print(f\"✅ Created vector store instance: {type(vector_store).__name__}\")\n",
"print(f\"📊 Initial stats: {vector_store.get_stats()}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Connect and load documents\n",
"vector_store.connect()\n",
"vector_store.load_documents(sample_documents)\n",
"\n",
"print(f\"📊 Updated stats: {vector_store.get_stats()}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Test similarity search\n",
"query_vector = create_mock_embedding() # Random query vector for testing\n",
"\n",
"search_results = vector_store.similarity_search_by_vector(\n",
" query_vector,\n",
" k=3, # Get top 3 similar documents\n",
")\n",
"\n",
"print(f\"🔍 Found {len(search_results)} similar documents:\\n\")\n",
"\n",
"for i, result in enumerate(search_results, 1):\n",
" doc = result.document\n",
" print(f\"{i}. ID: {doc.id}\")\n",
" print(f\" Text: {doc.text[:60]}...\")\n",
" print(f\" Similarity Score: {result.score:.4f}\")\n",
" print(f\" Category: {doc.attributes.get('category', 'N/A')}\")\n",
" print()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Test search by ID\n",
"try:\n",
" found_doc = vector_store.search_by_id(\"doc_2\")\n",
" print(\"✅ Found document by ID:\")\n",
" print(f\" ID: {found_doc.id}\")\n",
" print(f\" Text: {found_doc.text}\")\n",
" print(f\" Attributes: {found_doc.attributes}\")\n",
"except KeyError as e:\n",
" print(f\"❌ Error: {e}\")\n",
"\n",
"# Test filter by ID\n",
"id_filter = vector_store.filter_by_id([\"doc_1\", \"doc_3\"])\n",
"print(f\"\\n🔧 ID filter result: {id_filter}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 6: Configuration for GraphRAG\n",
"\n",
"Now let's see how you would configure GraphRAG to use your custom vector store in a settings file."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Example GraphRAG yaml settings\n",
"example_settings = {\n",
" \"vector_store\": {\n",
" \"default_vector_store\": {\n",
" \"type\": CUSTOM_VECTOR_STORE_TYPE, # \"simple_memory\"\n",
" \"collection_name\": \"graphrag_entities\",\n",
" # Add any custom parameters your vector store needs\n",
" \"custom_parameter\": \"custom_value\",\n",
" }\n",
" },\n",
" # Other GraphRAG configuration...\n",
" \"models\": {\n",
" \"default_embedding_model\": {\n",
" \"type\": \"openai_embedding\",\n",
" \"model\": \"text-embedding-3-small\",\n",
" }\n",
" },\n",
"}\n",
"\n",
"# Convert to YAML format for settings.yml\n",
"yaml_config = yaml.dump(example_settings, default_flow_style=False, indent=2)\n",
"\n",
"print(\"📄 Example settings.yml configuration:\")\n",
"print(\"=\" * 40)\n",
"print(yaml_config)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 7: Integration with GraphRAG Pipeline\n",
"\n",
"Here's how your custom vector store would be used in a typical GraphRAG pipeline."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Example of how GraphRAG would use your custom vector store\n",
"def simulate_graphrag_pipeline():\n",
" \"\"\"Simulate how GraphRAG would use the custom vector store.\"\"\"\n",
" print(\"🚀 Simulating GraphRAG pipeline with custom vector store...\\n\")\n",
"\n",
" # 1. GraphRAG creates vector store using factory\n",
" schema = VectorStoreSchemaConfig(index_name=\"graphrag_entities\")\n",
"\n",
" store = VectorStoreFactory.create_vector_store(\n",
" CUSTOM_VECTOR_STORE_TYPE,\n",
" vector_store_schema_config=schema,\n",
" similarity_threshold=0.3,\n",
" )\n",
" store.connect()\n",
"\n",
" print(\"✅ Step 1: Vector store created and connected\")\n",
"\n",
" # 2. During indexing, GraphRAG loads extracted entities\n",
" entity_documents = [\n",
" VectorStoreDocument(\n",
" id=f\"entity_{i}\",\n",
" text=f\"Entity {i} description: Important concept in the knowledge graph\",\n",
" vector=create_mock_embedding(),\n",
" attributes={\"type\": \"entity\", \"importance\": i % 3 + 1},\n",
" )\n",
" for i in range(10)\n",
" ]\n",
"\n",
" store.load_documents(entity_documents)\n",
" print(f\"✅ Step 2: Loaded {len(entity_documents)} entity documents\")\n",
"\n",
" # 3. During query time, GraphRAG searches for relevant entities\n",
" query_embedding = create_mock_embedding()\n",
" relevant_entities = store.similarity_search_by_vector(query_embedding, k=5)\n",
"\n",
" print(f\"✅ Step 3: Found {len(relevant_entities)} relevant entities for query\")\n",
"\n",
" # 4. GraphRAG uses these entities for context building\n",
" context_entities = [result.document for result in relevant_entities]\n",
"\n",
" print(\"✅ Step 4: Context built using retrieved entities\")\n",
" print(f\"📊 Final stats: {store.get_stats()}\")\n",
"\n",
" return context_entities\n",
"\n",
"\n",
"# Run the simulation\n",
"context = simulate_graphrag_pipeline()\n",
"print(f\"\\n🎯 Retrieved {len(context)} entities for context building\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 8: Testing and Validation\n",
"\n",
"Let's create a comprehensive test suite to ensure our vector store works correctly."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def test_custom_vector_store():\n",
" \"\"\"Comprehensive test suite for the custom vector store.\"\"\"\n",
" print(\"🧪 Running comprehensive vector store tests...\\n\")\n",
"\n",
" # Test 1: Basic functionality\n",
" print(\"Test 1: Basic functionality\")\n",
" store = VectorStoreFactory.create_vector_store(\n",
" CUSTOM_VECTOR_STORE_TYPE,\n",
" vector_store_schema_config=VectorStoreSchemaConfig(index_name=\"test\"),\n",
" )\n",
" store.connect()\n",
"\n",
" # Load test documents\n",
" test_docs = sample_documents[:2]\n",
" store.load_documents(test_docs)\n",
"\n",
" assert len(store.documents) == 2, \"Should have 2 documents\"\n",
" assert len(store.vectors) == 2, \"Should have 2 vectors\"\n",
" print(\"✅ Basic functionality test passed\")\n",
"\n",
" # Test 2: Search functionality\n",
" print(\"\\nTest 2: Search functionality\")\n",
" query_vec = create_mock_embedding()\n",
" results = store.similarity_search_by_vector(query_vec, k=5)\n",
"\n",
" assert len(results) <= 2, \"Should not return more results than documents\"\n",
" assert all(isinstance(r, VectorStoreSearchResult) for r in results), (\n",
" \"Should return VectorStoreSearchResult objects\"\n",
" )\n",
" assert all(-1 <= r.score <= 1 for r in results), (\n",
" \"Similarity scores should be between -1 and 1\"\n",
" )\n",
" print(\"✅ Search functionality test passed\")\n",
"\n",
" # Test 3: Search by ID\n",
" print(\"\\nTest 3: Search by ID\")\n",
" found_doc = store.search_by_id(\"doc_1\")\n",
" assert found_doc.id == \"doc_1\", \"Should find correct document\"\n",
"\n",
" try:\n",
" store.search_by_id(\"nonexistent\")\n",
" assert False, \"Should raise KeyError for nonexistent ID\"\n",
" except KeyError:\n",
" pass # Expected\n",
"\n",
" print(\"✅ Search by ID test passed\")\n",
"\n",
" # Test 4: Filter functionality\n",
" print(\"\\nTest 4: Filter functionality\")\n",
" filter_result = store.filter_by_id([\"doc_1\", \"doc_2\"])\n",
" assert filter_result == [\"doc_1\", \"doc_2\"], \"Should return filtered IDs\"\n",
" print(\"✅ Filter functionality test passed\")\n",
"\n",
" # Test 5: Error handling\n",
" print(\"\\nTest 5: Error handling\")\n",
" disconnected_store = VectorStoreFactory.create_vector_store(\n",
" CUSTOM_VECTOR_STORE_TYPE,\n",
" vector_store_schema_config=VectorStoreSchemaConfig(index_name=\"test2\"),\n",
" )\n",
"\n",
" try:\n",
" disconnected_store.load_documents(test_docs)\n",
" assert False, \"Should raise error when not connected\"\n",
" except RuntimeError:\n",
" pass # Expected\n",
"\n",
" try:\n",
" disconnected_store.similarity_search_by_vector(query_vec)\n",
" assert False, \"Should raise error when not connected\"\n",
" except RuntimeError:\n",
" pass # Expected\n",
"\n",
" print(\"✅ Error handling test passed\")\n",
"\n",
" print(\"\\n🎉 All tests passed! Your custom vector store is working correctly.\")\n",
"\n",
"\n",
"# Run the tests\n",
"test_custom_vector_store()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summary and Next Steps\n",
"\n",
"Congratulations! You've successfully learned how to implement and register a custom vector store with GraphRAG. Here's what you accomplished:\n",
"\n",
"### What You Built\n",
"- ✅ **Custom Vector Store Class**: Implemented `SimpleInMemoryVectorStore` with all required methods\n",
"- ✅ **Factory Integration**: Registered your vector store with `VectorStoreFactory`\n",
"- ✅ **Comprehensive Testing**: Validated functionality with a full test suite\n",
"- ✅ **Configuration Examples**: Learned how to configure GraphRAG to use your vector store\n",
"\n",
"### Key Takeaways\n",
"1. **Interface Compliance**: Always implement all methods from `BaseVectorStore`\n",
"2. **Factory Pattern**: Use `VectorStoreFactory.register()` to make your vector store available\n",
"3. **Configuration**: Vector stores are configured in GraphRAG settings files\n",
"4. **Testing**: Thoroughly test all functionality before deploying\n",
"\n",
"### Next Steps\n",
"Check out the API Overview notebook to learn how to index and query data via the graphrag API.\n",
"\n",
"### Resources\n",
"- [GraphRAG Documentation](https://microsoft.github.io/graphrag/)\n",
"\n",
"Happy building! 🚀"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "graphrag",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@ -20,11 +20,11 @@
"from pathlib import Path\n",
"\n",
"import pandas as pd\n",
"import tiktoken\n",
"\n",
"from graphrag.config.enums import ModelType\n",
"from graphrag.config.models.drift_search_config import DRIFTSearchConfig\n",
"from graphrag.config.models.language_model_config import LanguageModelConfig\n",
"from graphrag.config.models.vector_store_schema_config import VectorStoreSchemaConfig\n",
"from graphrag.language_model.manager import ModelManager\n",
"from graphrag.query.indexer_adapters import (\n",
" read_indexer_entities,\n",
@ -37,6 +37,7 @@
" DRIFTSearchContextBuilder,\n",
")\n",
"from graphrag.query.structured_search.drift_search.search import DRIFTSearch\n",
"from graphrag.tokenizer.get_tokenizer import get_tokenizer\n",
"from graphrag.vector_stores.lancedb import LanceDBVectorStore\n",
"\n",
"INPUT_DIR = \"./inputs/operation dulce\"\n",
@ -62,12 +63,16 @@
"# load description embeddings to an in-memory lancedb vectorstore\n",
"# to connect to a remote db, specify url and port values.\n",
"description_embedding_store = LanceDBVectorStore(\n",
" collection_name=\"default-entity-description\",\n",
" vector_store_schema_config=VectorStoreSchemaConfig(\n",
" index_name=\"default-entity-description\"\n",
" ),\n",
")\n",
"description_embedding_store.connect(db_uri=LANCEDB_URI)\n",
"\n",
"full_content_embedding_store = LanceDBVectorStore(\n",
" collection_name=\"default-community-full_content\",\n",
" vector_store_schema_config=VectorStoreSchemaConfig(\n",
" index_name=\"default-community-full_content\"\n",
" )\n",
")\n",
"full_content_embedding_store.connect(db_uri=LANCEDB_URI)\n",
"\n",
@ -94,33 +99,33 @@
"outputs": [],
"source": [
"api_key = os.environ[\"GRAPHRAG_API_KEY\"]\n",
"llm_model = os.environ[\"GRAPHRAG_LLM_MODEL\"]\n",
"embedding_model = os.environ[\"GRAPHRAG_EMBEDDING_MODEL\"]\n",
"\n",
"chat_config = LanguageModelConfig(\n",
" api_key=api_key,\n",
" type=ModelType.OpenAIChat,\n",
" model=llm_model,\n",
" type=ModelType.Chat,\n",
" model_provider=\"openai\",\n",
" model=\"gpt-4.1\",\n",
" max_retries=20,\n",
")\n",
"chat_model = ModelManager().get_or_create_chat_model(\n",
" name=\"local_search\",\n",
" model_type=ModelType.OpenAIChat,\n",
" model_type=ModelType.Chat,\n",
" config=chat_config,\n",
")\n",
"\n",
"token_encoder = tiktoken.encoding_for_model(llm_model)\n",
"tokenizer = get_tokenizer(chat_config)\n",
"\n",
"embedding_config = LanguageModelConfig(\n",
" api_key=api_key,\n",
" type=ModelType.OpenAIEmbedding,\n",
" model=embedding_model,\n",
" type=ModelType.Embedding,\n",
" model_provider=\"openai\",\n",
" model=\"text-embedding-3-small\",\n",
" max_retries=20,\n",
")\n",
"\n",
"text_embedder = ModelManager().get_or_create_embedding_model(\n",
" name=\"local_search_embedding\",\n",
" model_type=ModelType.OpenAIEmbedding,\n",
" model_type=ModelType.Embedding,\n",
" config=embedding_config,\n",
")"
]
@ -173,12 +178,12 @@
" reports=reports,\n",
" entity_text_embeddings=description_embedding_store,\n",
" text_units=text_units,\n",
" token_encoder=token_encoder,\n",
" tokenizer=tokenizer,\n",
" config=drift_params,\n",
")\n",
"\n",
"search = DRIFTSearch(\n",
" model=chat_model, context_builder=context_builder, token_encoder=token_encoder\n",
" model=chat_model, context_builder=context_builder, tokenizer=tokenizer\n",
")"
]
},
@ -212,7 +217,7 @@
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"display_name": "graphrag",
"language": "python",
"name": "python3"
},
@ -226,7 +231,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
"version": "3.12.10"
}
},
"nbformat": 4,

View File

@ -19,7 +19,6 @@
"import os\n",
"\n",
"import pandas as pd\n",
"import tiktoken\n",
"\n",
"from graphrag.config.enums import ModelType\n",
"from graphrag.config.models.language_model_config import LanguageModelConfig\n",
@ -32,7 +31,8 @@
"from graphrag.query.structured_search.global_search.community_context import (\n",
" GlobalCommunityContext,\n",
")\n",
"from graphrag.query.structured_search.global_search.search import GlobalSearch"
"from graphrag.query.structured_search.global_search.search import GlobalSearch\n",
"from graphrag.tokenizer.get_tokenizer import get_tokenizer"
]
},
{
@ -58,21 +58,21 @@
"outputs": [],
"source": [
"api_key = os.environ[\"GRAPHRAG_API_KEY\"]\n",
"llm_model = os.environ[\"GRAPHRAG_LLM_MODEL\"]\n",
"\n",
"config = LanguageModelConfig(\n",
" api_key=api_key,\n",
" type=ModelType.OpenAIChat,\n",
" model=llm_model,\n",
" type=ModelType.Chat,\n",
" model_provider=\"openai\",\n",
" model=\"gpt-4.1\",\n",
" max_retries=20,\n",
")\n",
"model = ModelManager().get_or_create_chat_model(\n",
" name=\"global_search\",\n",
" model_type=ModelType.OpenAIChat,\n",
" model_type=ModelType.Chat,\n",
" config=config,\n",
")\n",
"\n",
"token_encoder = tiktoken.encoding_for_model(llm_model)"
"tokenizer = get_tokenizer(config)"
]
},
{
@ -142,7 +142,7 @@
" community_reports=reports,\n",
" communities=communities,\n",
" entities=entities, # default to None if you don't want to use community weights for ranking\n",
" token_encoder=token_encoder,\n",
" tokenizer=tokenizer,\n",
")"
]
},
@ -193,7 +193,7 @@
"search_engine = GlobalSearch(\n",
" model=model,\n",
" context_builder=context_builder,\n",
" token_encoder=token_encoder,\n",
" tokenizer=tokenizer,\n",
" max_data_tokens=12_000, # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 5000)\n",
" map_llm_params=map_llm_params,\n",
" reduce_llm_params=reduce_llm_params,\n",
@ -241,7 +241,7 @@
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"display_name": "graphrag",
"language": "python",
"name": "python3"
},
@ -255,7 +255,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
"version": "3.12.10"
}
},
"nbformat": 4,

View File

@ -19,7 +19,6 @@
"import os\n",
"\n",
"import pandas as pd\n",
"import tiktoken\n",
"\n",
"from graphrag.config.enums import ModelType\n",
"from graphrag.config.models.language_model_config import LanguageModelConfig\n",
@ -57,22 +56,24 @@
"metadata": {},
"outputs": [],
"source": [
"from graphrag.tokenizer.get_tokenizer import get_tokenizer\n",
"\n",
"api_key = os.environ[\"GRAPHRAG_API_KEY\"]\n",
"llm_model = os.environ[\"GRAPHRAG_LLM_MODEL\"]\n",
"\n",
"config = LanguageModelConfig(\n",
" api_key=api_key,\n",
" type=ModelType.OpenAIChat,\n",
" model=llm_model,\n",
" type=ModelType.Chat,\n",
" model_provider=\"openai\",\n",
" model=\"gpt-4.1\",\n",
" max_retries=20,\n",
")\n",
"model = ModelManager().get_or_create_chat_model(\n",
" name=\"global_search\",\n",
" model_type=ModelType.OpenAIChat,\n",
" model_type=ModelType.Chat,\n",
" config=config,\n",
")\n",
"\n",
"token_encoder = tiktoken.encoding_for_model(llm_model)"
"tokenizer = get_tokenizer(config)"
]
},
{
@ -155,11 +156,11 @@
" community_reports=reports,\n",
" communities=communities,\n",
" entities=entities, # default to None if you don't want to use community weights for ranking\n",
" token_encoder=token_encoder,\n",
" tokenizer=tokenizer,\n",
" dynamic_community_selection=True,\n",
" dynamic_community_selection_kwargs={\n",
" \"model\": model,\n",
" \"token_encoder\": token_encoder,\n",
" \"tokenizer\": tokenizer,\n",
" },\n",
")"
]
@ -211,7 +212,7 @@
"search_engine = GlobalSearch(\n",
" model=model,\n",
" context_builder=context_builder,\n",
" token_encoder=token_encoder,\n",
" tokenizer=tokenizer,\n",
" max_data_tokens=12_000, # change this based on the token limit you have on your model (if you are using a model with 8k limit, a good setting could be 5000)\n",
" map_llm_params=map_llm_params,\n",
" reduce_llm_params=reduce_llm_params,\n",
@ -255,7 +256,7 @@
"prompt_tokens = result.prompt_tokens_categories[\"build_context\"]\n",
"output_tokens = result.output_tokens_categories[\"build_context\"]\n",
"print(\n",
" f\"Build context ({llm_model})\\nLLM calls: {llm_calls}. Prompt tokens: {prompt_tokens}. Output tokens: {output_tokens}.\"\n",
" f\"Build context LLM calls: {llm_calls}. Prompt tokens: {prompt_tokens}. Output tokens: {output_tokens}.\"\n",
")\n",
"# inspect number of LLM calls and tokens in map-reduce\n",
"llm_calls = result.llm_calls_categories[\"map\"] + result.llm_calls_categories[\"reduce\"]\n",
@ -266,14 +267,14 @@
" result.output_tokens_categories[\"map\"] + result.output_tokens_categories[\"reduce\"]\n",
")\n",
"print(\n",
" f\"Map-reduce ({llm_model})\\nLLM calls: {llm_calls}. Prompt tokens: {prompt_tokens}. Output tokens: {output_tokens}.\"\n",
" f\"Map-reduce LLM calls: {llm_calls}. Prompt tokens: {prompt_tokens}. Output tokens: {output_tokens}.\"\n",
")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"display_name": "graphrag",
"language": "python",
"name": "python3"
},
@ -287,7 +288,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
"version": "3.12.10"
}
},
"nbformat": 4,

View File

@ -202,10 +202,11 @@
"metadata": {},
"outputs": [],
"source": [
"from graphrag.index.flows.generate_text_embeddings import generate_text_embeddings\n",
"\n",
"from graphrag.cache.factory import CacheFactory\n",
"from graphrag.callbacks.noop_workflow_callbacks import NoopWorkflowCallbacks\n",
"from graphrag.config.embeddings import get_embedded_fields, get_embedding_settings\n",
"from graphrag.index.flows.generate_text_embeddings import generate_text_embeddings\n",
"\n",
"# We only need to re-run the embeddings workflow, to ensure that embeddings for all required search fields are in place\n",
"# We'll construct the context and run this function flow directly to avoid everything else\n",

View File

@ -0,0 +1,194 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Copyright (c) 2024 Microsoft Corporation.\n",
"# Licensed under the MIT License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example of indexing from an existing in-memory dataframe\n",
"\n",
"Newer versions of GraphRAG let you submit a dataframe directly instead of running through the input processing step. This notebook demonstrates with regular or update runs.\n",
"\n",
"If performing an update, the assumption is that your dataframe contains only the new documents to add to the index."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"from pprint import pprint\n",
"\n",
"import pandas as pd\n",
"\n",
"import graphrag.api as api\n",
"from graphrag.config.load_config import load_config\n",
"from graphrag.index.typing.pipeline_run_result import PipelineRunResult"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"PROJECT_DIRECTORY = \"<your project directory>\"\n",
"UPDATE = False\n",
"FILENAME = \"new_documents.parquet\" if UPDATE else \"<original_documents>.parquet\"\n",
"inputs = pd.read_parquet(f\"{PROJECT_DIRECTORY}/input/{FILENAME}\")\n",
"# Only the bare minimum for input. These are the same fields that would be present after the load_input_documents workflow\n",
"inputs = inputs.loc[:, [\"id\", \"title\", \"text\", \"creation_date\"]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Generate a `GraphRagConfig` object"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"graphrag_config = load_config(Path(PROJECT_DIRECTORY))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Indexing API\n",
"\n",
"*Indexing* is the process of ingesting raw text data and constructing a knowledge graph. GraphRAG currently supports plaintext (`.txt`) and `.csv` file formats."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Build an index"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"index_result: list[PipelineRunResult] = await api.build_index(\n",
" config=graphrag_config, input_documents=inputs, is_update_run=UPDATE\n",
")\n",
"\n",
"# index_result is a list of workflows that make up the indexing pipeline that was run\n",
"for workflow_result in index_result:\n",
" status = f\"error\\n{workflow_result.errors}\" if workflow_result.errors else \"success\"\n",
" print(f\"Workflow Name: {workflow_result.workflow}\\tStatus: {status}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Query an index\n",
"\n",
"To query an index, several index files must first be read into memory and passed to the query API. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"entities = pd.read_parquet(f\"{PROJECT_DIRECTORY}/output/entities.parquet\")\n",
"communities = pd.read_parquet(f\"{PROJECT_DIRECTORY}/output/communities.parquet\")\n",
"community_reports = pd.read_parquet(\n",
" f\"{PROJECT_DIRECTORY}/output/community_reports.parquet\"\n",
")\n",
"\n",
"response, context = await api.global_search(\n",
" config=graphrag_config,\n",
" entities=entities,\n",
" communities=communities,\n",
" community_reports=community_reports,\n",
" community_level=2,\n",
" dynamic_community_selection=False,\n",
" response_type=\"Multiple Paragraphs\",\n",
" query=\"What are the top five themes of the dataset?\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The response object is the official reponse from graphrag while the context object holds various metadata regarding the querying process used to obtain the final response."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(response)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Digging into the context a bit more provides users with extremely granular information such as what sources of data (down to the level of text chunks) were ultimately retrieved and used as part of the context sent to the LLM model)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pprint(context) # noqa: T203"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "graphrag",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.10"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View File

@ -19,8 +19,8 @@
"import os\n",
"\n",
"import pandas as pd\n",
"import tiktoken\n",
"\n",
"from graphrag.config.models.vector_store_schema_config import VectorStoreSchemaConfig\n",
"from graphrag.query.context_builder.entity_extraction import EntityVectorStoreKey\n",
"from graphrag.query.indexer_adapters import (\n",
" read_indexer_covariates,\n",
@ -102,7 +102,9 @@
"# load description embeddings to an in-memory lancedb vectorstore\n",
"# to connect to a remote db, specify url and port values.\n",
"description_embedding_store = LanceDBVectorStore(\n",
" collection_name=\"default-entity-description\",\n",
" vector_store_schema_config=VectorStoreSchemaConfig(\n",
" index_name=\"default-entity-description\"\n",
" )\n",
")\n",
"description_embedding_store.connect(db_uri=LANCEDB_URI)\n",
"\n",
@ -195,37 +197,38 @@
"from graphrag.config.enums import ModelType\n",
"from graphrag.config.models.language_model_config import LanguageModelConfig\n",
"from graphrag.language_model.manager import ModelManager\n",
"from graphrag.tokenizer.get_tokenizer import get_tokenizer\n",
"\n",
"api_key = os.environ[\"GRAPHRAG_API_KEY\"]\n",
"llm_model = os.environ[\"GRAPHRAG_LLM_MODEL\"]\n",
"embedding_model = os.environ[\"GRAPHRAG_EMBEDDING_MODEL\"]\n",
"\n",
"chat_config = LanguageModelConfig(\n",
" api_key=api_key,\n",
" type=ModelType.OpenAIChat,\n",
" model=llm_model,\n",
" type=ModelType.Chat,\n",
" model_provider=\"openai\",\n",
" model=\"gpt-4.1\",\n",
" max_retries=20,\n",
")\n",
"chat_model = ModelManager().get_or_create_chat_model(\n",
" name=\"local_search\",\n",
" model_type=ModelType.OpenAIChat,\n",
" model_type=ModelType.Chat,\n",
" config=chat_config,\n",
")\n",
"\n",
"token_encoder = tiktoken.encoding_for_model(llm_model)\n",
"\n",
"embedding_config = LanguageModelConfig(\n",
" api_key=api_key,\n",
" type=ModelType.OpenAIEmbedding,\n",
" model=embedding_model,\n",
" type=ModelType.Embedding,\n",
" model_provider=\"openai\",\n",
" model=\"text-embedding-3-small\",\n",
" max_retries=20,\n",
")\n",
"\n",
"text_embedder = ModelManager().get_or_create_embedding_model(\n",
" name=\"local_search_embedding\",\n",
" model_type=ModelType.OpenAIEmbedding,\n",
" model_type=ModelType.Embedding,\n",
" config=embedding_config,\n",
")"
")\n",
"\n",
"tokenizer = get_tokenizer(chat_config)"
]
},
{
@ -251,7 +254,7 @@
" entity_text_embeddings=description_embedding_store,\n",
" embedding_vectorstore_key=EntityVectorStoreKey.ID, # if the vectorstore uses entity title as ids, set this to EntityVectorStoreKey.TITLE\n",
" text_embedder=text_embedder,\n",
" token_encoder=token_encoder,\n",
" tokenizer=tokenizer,\n",
")"
]
},
@ -314,7 +317,7 @@
"search_engine = LocalSearch(\n",
" model=chat_model,\n",
" context_builder=context_builder,\n",
" token_encoder=token_encoder,\n",
" tokenizer=tokenizer,\n",
" model_params=model_params,\n",
" context_builder_params=local_context_params,\n",
" response_type=\"multiple paragraphs\", # free form text describing the response type and format, can be anything, e.g. prioritized list, single paragraph, multiple paragraphs, multiple-page report\n",
@ -426,7 +429,7 @@
"question_generator = LocalQuestionGen(\n",
" model=chat_model,\n",
" context_builder=context_builder,\n",
" token_encoder=token_encoder,\n",
" tokenizer=tokenizer,\n",
" model_params=model_params,\n",
" context_builder_params=local_context_params,\n",
")"
@ -451,7 +454,7 @@
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"display_name": "graphrag",
"language": "python",
"name": "python3"
},
@ -465,7 +468,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
"version": "3.12.10"
}
},
"nbformat": 4,

View File

@ -1,22 +1,18 @@
# Getting Started
⚠️ GraphRAG can consume a lot of LLM resources! We strongly recommend starting with the tutorial dataset here until you understand how the system works, and consider experimenting with fast/inexpensive models first before committing to a big indexing job.
## Requirements
[Python 3.10-3.12](https://www.python.org/downloads/)
To get started with the GraphRAG system, you have a few options:
👉 [Use the GraphRAG Accelerator solution](https://github.com/Azure-Samples/graphrag-accelerator) <br/>
👉 [Install from pypi](https://pypi.org/project/graphrag/). <br/>
👉 [Use it from source](developing.md)<br/>
## Quickstart
The following is a simple end-to-end example for using the GraphRAG system, using the install from pypi option.
To get started with the GraphRAG system, we recommend trying the [Solution Accelerator](https://github.com/Azure-Samples/graphrag-accelerator) package. This provides a user-friendly end-to-end experience with Azure resources.
# Overview
The following is a simple end-to-end example for using the GraphRAG system.
It shows how to use the system to index some text, and then use the indexed data to answer questions about the documents.
# Install GraphRAG
@ -25,45 +21,43 @@ It shows how to use the system to index some text, and then use the indexed data
pip install graphrag
```
The graphrag library includes a CLI for a no-code approach to getting started. Please review the full [CLI documentation](cli.md) for further detail.
# Running the Indexer
We need to set up a data project and some initial configuration. First let's get a sample dataset ready:
```sh
mkdir -p ./ragtest/input
mkdir -p ./christmas/input
```
Get a copy of A Christmas Carol by Charles Dickens from a trusted source:
```sh
curl https://www.gutenberg.org/cache/epub/24022/pg24022.txt -o ./ragtest/input/book.txt
curl https://www.gutenberg.org/cache/epub/24022/pg24022.txt -o ./christmas/input/book.txt
```
## Set Up Your Workspace Variables
To initialize your workspace, first run the `graphrag init` command.
Since we have already configured a directory named `./ragtest` in the previous step, run the following command:
Since we have already configured a directory named `./christmas` in the previous step, run the following command:
```sh
graphrag init --root ./ragtest
graphrag init --root ./christmas
```
This will create two files: `.env` and `settings.yaml` in the `./ragtest` directory.
This will create two files: `.env` and `settings.yaml` in the `./christmas` directory.
- `.env` contains the environment variables required to run the GraphRAG pipeline. If you inspect the file, you'll see a single environment variable defined,
`GRAPHRAG_API_KEY=<API_KEY>`. This is the API key for the OpenAI API or Azure OpenAI endpoint. You can replace this with your own API key. If you are using another form of authentication (i.e. managed identity), please delete this file.
`GRAPHRAG_API_KEY=<API_KEY>`. Replace `<API_KEY>` with your own OpenAI or Azure API key.
- `settings.yaml` contains the settings for the pipeline. You can modify this file to change the settings for the pipeline.
<br/>
#### <ins>OpenAI and Azure OpenAI</ins>
### Using OpenAI
If running in OpenAI mode, update the value of `GRAPHRAG_API_KEY` in the `.env` file with your OpenAI API key.
If running in OpenAI mode, you only need to update the value of `GRAPHRAG_API_KEY` in the `.env` file with your OpenAI API key.
#### <ins>Azure OpenAI</ins>
### Using Azure OpenAI
In addition, Azure OpenAI users should set the following variables in the settings.yaml file. To find the appropriate sections, just search for the `llm:` configuration, you should see two sections, one for the chat endpoint and one for the embeddings endpoint. Here is an example of how to configure the chat endpoint:
In addition to setting your API key, Azure OpenAI users should set the variables below in the settings.yaml file. To find the appropriate sections, just search for the `models:` root configuration; you should see two sections, one for the default chat endpoint and one for the default embeddings endpoint. Here is an example of what to add to the chat model config:
```yaml
type: azure_openai_chat # Or azure_openai_embedding for embeddings
@ -72,34 +66,37 @@ api_version: 2024-02-15-preview # You can customize this for other versions
deployment_name: <azure_model_deployment_name>
```
- For more details about configuring GraphRAG, see the [configuration documentation](config/overview.md).
- To learn more about Initialization, refer to the [Initialization documentation](config/init.md).
- For more details about using the CLI, refer to the [CLI documentation](cli.md).
#### Using Managed Auth on Azure
To use managed auth, edit the auth_type in your model config and *remove* the api_key line:
```yaml
auth_type: azure_managed_identity # Default auth_type is api_key
```
You will also need to login with [az login](https://learn.microsoft.com/en-us/cli/azure/authenticate-azure-cli) and select the subscription with your endpoint.
## Running the Indexing pipeline
Finally we'll run the pipeline!
```sh
graphrag index --root ./ragtest
graphrag index --root ./christmas
```
![pipeline executing from the CLI](img/pipeline-running.png)
This process will take some time to run. This depends on the size of your input data, what model you're using, and the text chunk size being used (these can be configured in your `settings.yml` file).
Once the pipeline is complete, you should see a new folder called `./ragtest/output` with a series of parquet files.
This process will take some time to run. This depends on the size of your input data, what model you're using, and the text chunk size being used (these can be configured in your `settings.yaml` file).
Once the pipeline is complete, you should see a new folder called `./christmas/output` with a series of parquet files.
# Using the Query Engine
## Running the Query Engine
Now let's ask some questions using this dataset.
Here is an example using Global search to ask a high-level question:
```sh
graphrag query \
--root ./ragtest \
--root ./christmas \
--method global \
--query "What are the top themes in this story?"
```
@ -108,12 +105,16 @@ Here is an example using Local search to ask a more specific question about a pa
```sh
graphrag query \
--root ./ragtest \
--root ./christmas \
--method local \
--query "Who is Scrooge and what are his main relationships?"
```
Please refer to [Query Engine](query/overview.md) docs for detailed information about how to leverage our Local and Global search mechanisms for extracting meaningful insights from data after the Indexer has wrapped up execution.
# Visualizing the Graph
Check out our [visualization guide](visualization_guide.md) for a more interactive experience in debugging and exploring the knowledge graph.
# Going Deeper
- For more details about configuring GraphRAG, see the [configuration documentation](config/overview.md).
- To learn more about Initialization, refer to the [Initialization documentation](config/init.md).
- For more details about using the CLI, refer to the [CLI documentation](cli.md).
- Check out our [visualization guide](visualization_guide.md) for a more interactive experience in debugging and exploring the knowledge graph.

View File

@ -1,7 +1,6 @@
# Welcome to GraphRAG
👉 [Microsoft Research Blog Post](https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/) <br/>
👉 [GraphRAG Accelerator](https://github.com/Azure-Samples/graphrag-accelerator) <br/>
👉 [GraphRAG Arxiv](https://arxiv.org/pdf/2404.16130)
<p align="center">
@ -16,10 +15,6 @@ approaches using plain text snippets. The GraphRAG process involves extracting a
To learn more about GraphRAG and how it can be used to enhance your language model's ability to reason about your private data, please visit the [Microsoft Research Blog Post](https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/).
## Solution Accelerator 🚀
To quickstart the GraphRAG system we recommend trying the [Solution Accelerator](https://github.com/Azure-Samples/graphrag-accelerator) package. This provides a user-friendly end-to-end experience with Azure resources.
## Get Started with GraphRAG 🚀
To start using GraphRAG, check out the [_Get Started_](get_started.md) guide.
@ -52,6 +47,7 @@ At query time, these structures are used to provide materials for the LLM contex
- [_Global Search_](query/global_search.md) for reasoning about holistic questions about the corpus by leveraging the community summaries.
- [_Local Search_](query/local_search.md) for reasoning about specific entities by fanning-out to their neighbors and associated concepts.
- [_DRIFT Search_](query/drift_search.md) for reasoning about specific entities by fanning-out to their neighbors and associated concepts, but with the added context of community information.
- _Basic Search_ for those times when your query is best answered by baseline RAG (standard top _k_ vector search).
### Prompt Tuning

View File

@ -32,3 +32,20 @@ The GraphRAG library was designed with LLM interactions in mind, and a common se
Because of these potential error cases, we've added a cache layer around LLM interactions.
When completion requests are made using the same input set (prompt and tuning parameters), we return a cached result if one exists.
This allows our indexer to be more resilient to network issues, to act idempotently, and to provide a more efficient end-user experience.
### Providers & Factories
Several subsystems within GraphRAG use a factory pattern to register and retrieve provider implementations. This allows deep customization to support models, storage, and other components that you may use but that aren't built directly into GraphRAG.
The following subsystems use a factory pattern that allows you to register your own implementations:
- [language model](https://github.com/microsoft/graphrag/blob/main/graphrag/language_model/factory.py) - implement your own `chat` and `embed` methods to use a model provider of choice beyond the built-in OpenAI/Azure support
- [cache](https://github.com/microsoft/graphrag/blob/main/graphrag/cache/factory.py) - create your own cache storage location in addition to the file, blob, and CosmosDB ones we provide
- [logger](https://github.com/microsoft/graphrag/blob/main/graphrag/logger/factory.py) - create your own log writing location in addition to the built-in file and blob storage
- [storage](https://github.com/microsoft/graphrag/blob/main/graphrag/storage/factory.py) - create your own storage provider (database, etc.) beyond the file, blob, and CosmosDB ones built in
- [vector store](https://github.com/microsoft/graphrag/blob/main/graphrag/vector_stores/factory.py) - implement your own vector store in addition to the built-in LanceDB, Azure AI Search, and CosmosDB options
- [pipeline + workflows](https://github.com/microsoft/graphrag/blob/main/graphrag/index/workflows/factory.py) - implement your own workflow steps with a custom `run_workflow` function, or register an entire pipeline (list of named workflows)
The links for each of these subsystems point to the source code of the factory, which includes registration of the default built-in implementations. In addition, we have a detailed discussion of [language models](../config/models.md), which includes an example of a custom provider, and a [sample notebook](../examples_notebooks/custom_vector_store.ipynb) that demonstrates a custom vector store.
All of these factories allow you to register an implementation using any string name you would like, even overriding the built-in ones directly.
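For example, a minimal sketch of the registration flow (mirroring the custom vector store sample notebook) might look like the following; `MyVectorStore` and its module are hypothetical placeholders for your own `BaseVectorStore` subclass:

```python
from graphrag.config.models.vector_store_schema_config import VectorStoreSchemaConfig
from graphrag.vector_stores.factory import VectorStoreFactory

# Hypothetical import: MyVectorStore is your own BaseVectorStore subclass,
# e.g. something like the SimpleInMemoryVectorStore built in the sample notebook.
from my_package.vector_store import MyVectorStore

# Register the class under any string name; the same call can override a built-in type.
VectorStoreFactory.register("my_vector_store", MyVectorStore)

# GraphRAG (or your own code) can now construct the store by that name. The registered
# name is also what you would reference as the vector store `type` in settings.yaml.
store = VectorStoreFactory.create_vector_store(
    "my_vector_store",
    vector_store_schema_config=VectorStoreSchemaConfig(index_name="entities"),
)
store.connect()
```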

docs/index/byog.md
View File

@ -0,0 +1,70 @@
# Bring Your Own Graph
Several users have asked if they can bring their own existing graph and have it summarized for query with GraphRAG. There are many possible ways to do this, but here we'll describe a simple method that aligns with the existing GraphRAG workflows quite easily.
To cover the basic use cases for GraphRAG query, you should have two or three tables derived from your data:
- entities.parquet - this is the list of entities found in the dataset, which are the nodes of the graph.
- relationships.parquet - this is the list of relationships found in the dataset, which are the edges of the graph.
- text_units.parquet - this is the source text chunks the graph was extracted from. This is optional depending on the query method you intend to use (described later).
The approach described here will be to run a custom GraphRAG workflow pipeline that assumes the text chunking, entity extraction, and relationship extraction have already occurred.
## Tables
### Entities
See the full entities [table schema](./outputs.md#entities). For graph summarization purposes, you only need id, title, description, and the list of text_unit_ids.
The additional properties are used for optional graph visualization purposes.
### Relationships
See the full relationships [table schema](./outputs.md#relationships). For graph summarization purposes, you only need id, source, target, description, weight, and the list of text_unit_ids.
> Note: the `weight` field is important because it is used to properly compute Leiden communities!
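As an illustration, here is a minimal sketch (with made-up data) of shaping an existing graph into these two tables and writing them to the project's output folder; the titles, descriptions, and `chunk-*` text unit ids are hypothetical, and the columns follow the schemas described above:

```python
import uuid
from pathlib import Path

import pandas as pd

Path("./output").mkdir(parents=True, exist_ok=True)

# Minimal entities table: id, title, description, and the text units each entity appears in.
entities = pd.DataFrame([
    {
        "id": str(uuid.uuid4()),
        "title": "ALICE",
        "description": "Alice is a researcher mentioned throughout the corpus.",
        "text_unit_ids": ["chunk-1", "chunk-2"],
    },
    {
        "id": str(uuid.uuid4()),
        "title": "PROJECT X",
        "description": "Project X is a research initiative led by Alice.",
        "text_unit_ids": ["chunk-2"],
    },
])

# Minimal relationships table: source/target are assumed to reference entity titles,
# and weight is required so Leiden communities can be computed correctly.
relationships = pd.DataFrame([
    {
        "id": str(uuid.uuid4()),
        "source": "ALICE",
        "target": "PROJECT X",
        "description": "Alice leads Project X.",
        "weight": 1.0,
        "text_unit_ids": ["chunk-2"],
    },
])

# Place the tables where the remaining workflows expect to find them.
entities.to_parquet("./output/entities.parquet")
relationships.to_parquet("./output/relationships.parquet")
```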
## Workflow Configuration
GraphRAG includes the ability to specify *only* the specific workflow steps that you need. For basic graph summarization and query, you need the following config in your settings.yaml:
```yaml
workflows: [create_communities, create_community_reports]
```
This will result in only the minimal workflows required for GraphRAG [Global Search](../query/global_search.md).
## Optional Additional Config
If you would like to run [Local](../query/local_search.md), [DRIFT](../query/drift_search.md), or [Basic](../query/overview.md#basic-search) Search, you will need to include text_units and some embeddings.
### Text Units
See the full text_units [table schema](./outputs.md#text_units). Text units are chunks of your documents that are sized to ensure they fit into the context window of your model. Some search methods use these, so you may want to include them if you have them.
### Expanded Config
To perform the other search types above, you need some of the content to be embedded. Simply add the embeddings workflow:
```yaml
workflows: [create_communities, create_community_reports, generate_text_embeddings]
```
### FastGraphRAG
[FastGraphRAG](./methods.md#fastgraphrag) uses text_units for the community reports instead of the entity and relationship descriptions. If your graph is sourced in such a way that it does not have descriptions, this might be a useful alternative. In this case, you would update your workflows list to include the text variant of the community reports workflow:
```yaml
workflows: [create_communities, create_community_reports_text, generate_text_embeddings]
```
This method requires that your entities and relationships tables have valid links to a list of text_unit_ids. Also note that `generate_text_embeddings` is still only required if you are doing searches other than Global Search.
## Setup
Putting it all together:
- `output`: Create an output folder and put your entities and relationships (and optionally text_units) parquet files in it.
- Update your config as noted above to only run the workflows subset you need.
- Run `graphrag index --root <your project root>`

View File

@ -16,6 +16,10 @@ All input formats are loaded within GraphRAG and passed to the indexing pipeline
Also see the [outputs](outputs.md) documentation for the final documents table schema saved to parquet after pipeline completion.
## Bring-your-own DataFrame
As of version 2.6.0, GraphRAG's [indexing API method](https://github.com/microsoft/graphrag/blob/main/graphrag/api/index.py) allows you to pass in your own pandas DataFrame and bypass all of the input loading/parsing described in the next section. This is convenient if you have content in a format or storage location we don't support out-of-the-box. __You must ensure that your input DataFrame conforms to the schema described above.__ All of the chunking behavior described later will proceed exactly the same.
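A minimal sketch of this path, assuming a hypothetical project directory `./my_project` that already contains a valid `settings.yaml`, might look like the following; the document rows are made-up examples using the same minimal fields shown in the bring-your-own-dataframe sample notebook:

```python
import asyncio
from pathlib import Path

import pandas as pd

import graphrag.api as api
from graphrag.config.load_config import load_config

# Build (or load) your own DataFrame with the minimal expected columns.
docs = pd.DataFrame([
    {"id": "1", "title": "doc one", "text": "Full text of the first document...", "creation_date": "2024-01-01"},
    {"id": "2", "title": "doc two", "text": "Full text of the second document...", "creation_date": "2024-01-02"},
])


async def main() -> None:
    config = load_config(Path("./my_project"))  # hypothetical project root with settings.yaml
    # Pass the DataFrame straight to the indexing API, bypassing input loading/parsing.
    results = await api.build_index(config=config, input_documents=docs)
    for workflow_result in results:
        status = "error" if workflow_result.errors else "success"
        print(f"{workflow_result.workflow}: {status}")


asyncio.run(main())
```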
## Formats
We support three file formats out-of-the-box. This covers the overwhelming majority of use cases we have encountered. If you have a different format, we recommend writing a script to convert to one of these, which are widely used and supported by many tools and libraries.

View File

@ -10,7 +10,7 @@ This is the method described in the original [blog post](https://www.microsoft.c
- relationship extraction: LLM is prompted to describe the relationship between each pair of entities in each text unit.
- entity summarization: LLM is prompted to combine the descriptions for every instance of an entity found across the text units into a single summary.
- relationship summarization: LLM is prompted to combine the descriptions for every instance of a relationship found across the text units into a single summary.
- claim extraction (optiona): LLM is prompted to extract and describe claims from each text unit.
- claim extraction (optional): LLM is prompted to extract and describe claims from each text unit.
- community report generation: entity and relationship descriptions (and optionally claims) for each community are collected and used to prompt the LLM to generate a summary report.
`graphrag index --method standard`. This is the default method, so the method param can actually be omitted.
@ -23,7 +23,7 @@ FastGraphRAG is a method that substitutes some of the language model reasoning f
- relationship extraction: relationships are defined as text unit co-occurrence between entity pairs. There is no description.
- entity summarization: not necessary.
- relationship summarization: not necessary.
- claim extraction (optiona): unused.
- claim extraction (optional): unused.
- community report generation: The direct text unit content containing each entity noun phrase is collected and used to prompt the LLM to generate a summary report.
`graphrag index --method fast`
@ -41,4 +41,4 @@ You can install it manually by running `python -m spacy download <model_name>`,
## Choosing a Method
Standard GraphRAG provides a rich description of real-world entities and relationships, but is more expensive than FastGraphRAG. We estimate graph extraction to constitute roughly 75% of indexing cost. FastGraphRAG is therefore much cheaper, but the tradeoff is that the extracted graph is less directly relevant for use outside of GraphRAG, and the graph tends to be quite a bit noisier. If high fidelity entities and graph exploration are important to your use case, we recommend staying with traditional GraphRAG. If your use case is primarily aimed at summary questions using global search, FastGraphRAG is a reasonable and cheaper alternative.
Standard GraphRAG provides a rich description of real-world entities and relationships, but is more expensive than FastGraphRAG. We estimate graph extraction to constitute roughly 75% of indexing cost. FastGraphRAG is therefore much cheaper, but the tradeoff is that the extracted graph is less directly relevant for use outside of GraphRAG, and the graph tends to be quite a bit noisier. If high fidelity entities and graph exploration are important to your use case, we recommend staying with traditional GraphRAG. If your use case is primarily aimed at summary questions using global search, FastGraphRAG provides high quality summarization at much less LLM cost.

View File

@ -26,8 +26,7 @@ After you have a config file you can run the pipeline using the CLI or the Pytho
### CLI
```bash
# Via Poetry
poetry run poe index --root <data_root> # default config mode
uv run poe index --root <data_root> # default config mode
```
### Python API
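A minimal sketch of the equivalent Python call (assumes the project root already contains a `settings.yaml`; see the indexing API elsewhere in this changeset for the full signature):

```python
import asyncio
from pathlib import Path

import graphrag.api as api
from graphrag.config.load_config import load_config

config = load_config(Path("<data_root>"))  # placeholder: your project root
outputs = asyncio.run(api.build_index(config=config))
for result in outputs:
    status = "errors" if result.errors else "ok"
    print(result.workflow, status)
```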

View File

@ -58,13 +58,13 @@ graphrag prompt-tune [--root ROOT] [--config CONFIG] [--domain DOMAIN] [--selec
```bash
python -m graphrag prompt-tune --root /path/to/project --config /path/to/settings.yaml --domain "environmental news" \
--selection-method random --limit 10 --language English --max-tokens 2048 --chunk-size 256 --min-examples-required 3 \
--no-entity-types --output /path/to/output
--no-discover-entity-types --output /path/to/output
```
or, with minimal configuration (suggested):
```bash
python -m graphrag prompt-tune --root /path/to/project --config /path/to/settings.yaml --no-entity-types
python -m graphrag prompt-tune --root /path/to/project --config /path/to/settings.yaml --no-discover-entity-types
```
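The same tuning can also be driven from Python. A hedged sketch (the output file name at the end is an assumption; consult the prompt tuning API in this changeset for the full parameter list and defaults):

```python
import asyncio
from pathlib import Path

import graphrag.api as api
from graphrag.config.load_config import load_config

root = "/path/to/project"
config = load_config(Path(root))
extract_prompt, summarize_prompt, report_prompt = asyncio.run(
    api.generate_indexing_prompts(config=config, root=root)
)
# Write the generated prompts wherever your config points to, e.g. the default "prompts" folder.
Path("prompts/extract_graph.txt").write_text(extract_prompt, encoding="utf-8")
```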
## Document Selection Methods
@ -79,15 +79,7 @@ After that, it uses one of the following selection methods to pick a sample to w
## Modify Env Vars
After running auto tuning, you should modify the following environment variables (or config variables) to pick up the new prompts on your index run. Note: Please make sure to update the correct path to the generated prompts, in this example we are using the default "prompts" path.
- `GRAPHRAG_ENTITY_EXTRACTION_PROMPT_FILE` = "prompts/entity_extraction.txt"
- `GRAPHRAG_COMMUNITY_REPORT_PROMPT_FILE` = "prompts/community_report.txt"
- `GRAPHRAG_SUMMARIZE_DESCRIPTIONS_PROMPT_FILE` = "prompts/summarize_descriptions.txt"
or in your yaml config file:
After running auto tuning, you should modify the following config variables to pick up the new prompts on your index run. Note: please make sure to update the path to the generated prompts; this example uses the default "prompts" path.
```yaml
entity_extraction:

View File

@ -10,7 +10,7 @@ Each of these prompts may be overridden by writing a custom prompt file in plain
### Entity/Relationship Extraction
[Prompt Source](http://github.com/microsoft/graphrag/blob/main/graphrag/prompts/index/entity_extraction.py)
[Prompt Source](http://github.com/microsoft/graphrag/blob/main/graphrag/prompts/index/extract_graph.py)
#### Tokens
@ -31,7 +31,7 @@ Each of these prompts may be overridden by writing a custom prompt file in plain
### Claim Extraction
[Prompt Source](http://github.com/microsoft/graphrag/blob/main/graphrag/prompts/index/claim_extraction.py)
[Prompt Source](http://github.com/microsoft/graphrag/blob/main/graphrag/prompts/index/extract_claims.py)
#### Tokens

View File

@ -24,7 +24,7 @@ Below are the key parameters of the [DRIFTSearch class](https://github.com/micro
- `llm`: OpenAI model object to be used for response generation
- `context_builder`: [context builder](https://github.com/microsoft/graphrag/blob/main/graphrag/query/structured_search/drift_search/drift_context.py) object to be used for preparing context data from community reports and query information
- `config`: model to define the DRIFT Search hyperparameters. [DRIFT Config model](https://github.com/microsoft/graphrag/blob/main/graphrag/config/models/drift_search_config.py)
- `token_encoder`: token encoder for tracking the budget for the algorithm.
- `tokenizer`: token encoder for tracking the budget for the algorithm.
- `query_state`: a state object as defined in [Query State](https://github.com/microsoft/graphrag/blob/main/graphrag/query/structured_search/drift_search/state.py) that allows tracking the execution of a DRIFT Search instance, alongside follow-ups and [DRIFT actions](https://github.com/microsoft/graphrag/blob/main/graphrag/query/structured_search/drift_search/action.py).
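Below is an illustrative wiring sketch for these parameters (the module paths and defaults are assumptions; the linked source files are authoritative, and usage details follow in the next section):

```python
from graphrag.config.models.drift_search_config import DRIFTSearchConfig
from graphrag.query.structured_search.drift_search.state import QueryState

drift_config = DRIFTSearchConfig()  # hyperparameters for the DRIFT algorithm
query_state = QueryState()          # tracks follow-ups and DRIFT actions across a run

# search = DRIFTSearch(
#     llm=chat_model,                   # model object for response generation (built elsewhere)
#     context_builder=context_builder,  # DRIFT context builder (built elsewhere)
#     config=drift_config,
#     tokenizer=tokenizer,
#     query_state=query_state,
# )
```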
## How to Use

View File

@ -1,6 +1,7 @@
# API Notebooks
- [API Overview Notebook](../../examples_notebooks/api_overview.ipynb)
- [Bring-Your-Own Vector Store](../../examples_notebooks/custom_vector_store.ipynb)
# Query Engine Notebooks

View File

@ -26,6 +26,10 @@ DRIFT Search introduces a new approach to local search queries by including comm
To learn more about DRIFT Search, please refer to the [DRIFT Search](drift_search.md) documentation.
## Basic Search
GraphRAG includes a rudimentary implementation of basic vector RAG to make it easy to compare different search results based on the type of question you are asking. You can specify the top `k` text unit chunks to include in the summarization context.
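A hedged example of running basic search through the API (assumes the default output location for the text units parquet; the project root and query string are placeholders):

```python
import asyncio
from pathlib import Path

import pandas as pd

import graphrag.api as api
from graphrag.config.load_config import load_config

root = Path("./my_project")
config = load_config(root)
text_units = pd.read_parquet(root / "output" / "text_units.parquet")

response, context = asyncio.run(
    api.basic_search(config=config, text_units=text_units, query="What are the main themes?")
)
print(response)
```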
## Question Generation
This functionality takes a list of user queries and generates the next candidate questions. This is useful for generating follow-up questions in a conversation or for generating a list of questions for the investigator to dive deeper into the dataset.

View File

@ -13,19 +13,19 @@ snapshots:
embed_graph:
enabled: true # will generate node2vec embeddings for nodes
umap:
enabled: true # will generate UMAP embeddings for nodes
enabled: true # will generate UMAP embeddings for nodes, giving the entities table an x/y position to plot
```
After running the indexing pipeline over your data, there will be an output folder (defined by the `storage.base_dir` setting).
- **Output Folder**: Contains artifacts from the LLM's indexing pass.
## 2. Locate the Knowledge Graph
In the output folder, look for a file named `merged_graph.graphml`. graphml is a standard [file format](http://graphml.graphdrawing.org) supported by many visualization tools. We recommend trying [Gephi](https://gephi.org).
In the output folder, look for a file named `graph.graphml`. graphml is a standard [file format](http://graphml.graphdrawing.org) supported by many visualization tools. We recommend trying [Gephi](https://gephi.org).
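If you want to sanity-check the file before opening it in a visualization tool, a small programmatic inspection works too (networkx is assumed to be available in your environment):

```python
import networkx as nx

graph = nx.read_graphml("output/graph.graphml")
print(graph.number_of_nodes(), "nodes /", graph.number_of_edges(), "edges")
```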
## 3. Open the Graph in Gephi
1. Install and open Gephi
2. Navigate to the `output` folder containing the various parquet files.
3. Import the `merged_graph.graphml` file into Gephi. This will result in a fairly plain view of the undirected graph nodes and edges.
3. Import the `graph.graphml` file into Gephi. This will result in a fairly plain view of the undirected graph nodes and edges.
<p align="center">
<img src="../img/viz_guide/gephi-initial-graph-example.png" alt="A basic graph visualization by Gephi" width="300"/>

View File

@ -29,6 +29,9 @@
"\n",
"import pandas as pd\n",
"import tiktoken\n",
"from graphrag.query.llm.oai.chat_openai import ChatOpenAI\n",
"from graphrag.query.llm.oai.embedding import OpenAIEmbedding\n",
"from graphrag.query.llm.oai.typing import OpenaiApiType\n",
"\n",
"from graphrag.query.context_builder.entity_extraction import EntityVectorStoreKey\n",
"from graphrag.query.indexer_adapters import (\n",
@ -38,9 +41,6 @@
" read_indexer_reports,\n",
" read_indexer_text_units,\n",
")\n",
"from graphrag.query.llm.oai.chat_openai import ChatOpenAI\n",
"from graphrag.query.llm.oai.embedding import OpenAIEmbedding\n",
"from graphrag.query.llm.oai.typing import OpenaiApiType\n",
"from graphrag.query.structured_search.local_search.mixed_context import (\n",
" LocalSearchMixedContext,\n",
")\n",

View File

@ -9,29 +9,32 @@ Backwards compatibility is not guaranteed at this time.
"""
import logging
from typing import Any
from graphrag.callbacks.reporting import create_pipeline_reporter
import pandas as pd
from graphrag.callbacks.noop_workflow_callbacks import NoopWorkflowCallbacks
from graphrag.callbacks.workflow_callbacks import WorkflowCallbacks
from graphrag.config.enums import IndexingMethod
from graphrag.config.models.graph_rag_config import GraphRagConfig
from graphrag.index.run.run_pipeline import run_pipeline
from graphrag.index.run.utils import create_callback_chain
from graphrag.index.typing.pipeline_run_result import PipelineRunResult
from graphrag.index.typing.workflow import WorkflowFunction
from graphrag.index.workflows.factory import PipelineFactory
from graphrag.logger.base import ProgressLogger
from graphrag.logger.null_progress import NullProgressLogger
from graphrag.logger.standard_logging import init_loggers
log = logging.getLogger(__name__)
logger = logging.getLogger(__name__)
async def build_index(
config: GraphRagConfig,
method: IndexingMethod = IndexingMethod.Standard,
method: IndexingMethod | str = IndexingMethod.Standard,
is_update_run: bool = False,
memory_profile: bool = False,
callbacks: list[WorkflowCallbacks] | None = None,
progress_logger: ProgressLogger | None = None,
additional_context: dict[str, Any] | None = None,
verbose: bool = False,
input_documents: pd.DataFrame | None = None,
) -> list[PipelineRunResult]:
"""Run the pipeline with the given configuration.
@ -45,26 +48,31 @@ async def build_index(
Whether to enable memory profiling.
callbacks : list[WorkflowCallbacks] | None default=None
A list of callbacks to register.
progress_logger : ProgressLogger | None default=None
The progress logger.
additional_context : dict[str, Any] | None default=None
Additional context to pass to the pipeline run. This can be accessed in the pipeline state under the 'additional_context' key.
input_documents : pd.DataFrame | None default=None.
Override document loading and parsing and supply your own dataframe of documents to index.
Returns
-------
list[PipelineRunResult]
The list of pipeline run results
"""
logger = progress_logger or NullProgressLogger()
# create a pipeline reporter and add to any additional callbacks
callbacks = callbacks or []
callbacks.append(create_pipeline_reporter(config.reporting, None))
init_loggers(config=config, verbose=verbose)
workflow_callbacks = create_callback_chain(callbacks, logger)
# Create callbacks for pipeline lifecycle events if provided
workflow_callbacks = (
create_callback_chain(callbacks) if callbacks else NoopWorkflowCallbacks()
)
outputs: list[PipelineRunResult] = []
if memory_profile:
log.warning("New pipeline does not yet support memory profiling.")
logger.warning("New pipeline does not yet support memory profiling.")
logger.info("Initializing indexing pipeline...")
# todo: this could propagate out to the cli for better clarity, but will be a breaking api change
method = _get_method(method, is_update_run)
pipeline = PipelineFactory.create_pipeline(config, method)
workflow_callbacks.pipeline_start(pipeline.names())
@ -73,20 +81,21 @@ async def build_index(
pipeline,
config,
callbacks=workflow_callbacks,
logger=logger,
is_update_run=is_update_run,
additional_context=additional_context,
input_documents=input_documents,
):
outputs.append(output)
if output.errors and len(output.errors) > 0:
logger.error(output.workflow)
logger.error("Workflow %s completed with errors", output.workflow)
else:
logger.success(output.workflow)
logger.info(str(output.result))
logger.info("Workflow %s completed successfully", output.workflow)
logger.debug(str(output.result))
workflow_callbacks.pipeline_end(outputs)
return outputs
def register_workflow_function(name: str, workflow: WorkflowFunction):
"""Register a custom workflow function. You can then include the name in the settings.yaml workflows list."""
PipelineFactory.register(name, workflow)
def _get_method(method: IndexingMethod | str, is_update_run: bool) -> str:
m = method.value if isinstance(method, IndexingMethod) else method
return f"{m}-update" if is_update_run else m
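# Illustrative usage sketch (assumptions: this module is importable as graphrag.api.index,
# and the stub below must be adapted to the real WorkflowFunction signature defined in
# graphrag.index.typing.workflow):
#
#     from graphrag.api.index import register_workflow_function
#
#     async def my_custom_workflow(config, context):
#         ...  # hypothetical custom workflow body
#
#     register_workflow_function("my_custom_workflow", my_custom_workflow)
#
# The registered name ("my_custom_workflow") can then be listed in the settings.yaml workflows list.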

View File

@ -11,16 +11,17 @@ WARNING: This API is under development and may undergo changes in future release
Backwards compatibility is not guaranteed at this time.
"""
import logging
from typing import Annotated
import annotated_types
from pydantic import PositiveInt, validate_call
from graphrag.callbacks.noop_workflow_callbacks import NoopWorkflowCallbacks
from graphrag.config.defaults import graphrag_config_defaults, language_model_defaults
from graphrag.config.defaults import graphrag_config_defaults
from graphrag.config.models.graph_rag_config import GraphRagConfig
from graphrag.language_model.manager import ModelManager
from graphrag.logger.base import ProgressLogger
from graphrag.logger.standard_logging import init_loggers
from graphrag.prompt_tune.defaults import MAX_TOKEN_COUNT, PROMPT_TUNING_MODEL_ID
from graphrag.prompt_tune.generator.community_report_rating import (
generate_community_report_rating,
@ -46,13 +47,14 @@ from graphrag.prompt_tune.generator.language import detect_language
from graphrag.prompt_tune.generator.persona import generate_persona
from graphrag.prompt_tune.loader.input import load_docs_in_chunks
from graphrag.prompt_tune.types import DocSelectionType
from graphrag.tokenizer.get_tokenizer import get_tokenizer
logger = logging.getLogger(__name__)
@validate_call(config={"arbitrary_types_allowed": True})
async def generate_indexing_prompts(
config: GraphRagConfig,
logger: ProgressLogger,
root: str,
chunk_size: PositiveInt = graphrag_config_defaults.chunks.size,
overlap: Annotated[
int, annotated_types.Gt(-1)
@ -66,14 +68,13 @@ async def generate_indexing_prompts(
min_examples_required: PositiveInt = 2,
n_subset_max: PositiveInt = 300,
k: PositiveInt = 15,
verbose: bool = False,
) -> tuple[str, str, str]:
"""Generate indexing prompts.
Parameters
----------
- config: The GraphRag configuration.
- logger: The logger to use for progress updates.
- root: The root directory.
- output_path: The path to store the prompts.
- chunk_size: The chunk token size to use for input text units.
- limit: The limit of chunks to load.
@ -90,10 +91,11 @@ async def generate_indexing_prompts(
-------
tuple[str, str, str]: entity extraction prompt, entity summarization prompt, community summarization prompt
"""
init_loggers(config=config, verbose=verbose, filename="prompt-tuning.log")
# Retrieve documents
logger.info("Chunking documents...")
doc_list = await load_docs_in_chunks(
root=root,
config=config,
limit=limit,
select_method=selection_method,
@ -109,15 +111,6 @@ async def generate_indexing_prompts(
logger.info("Retrieving language model configuration...")
default_llm_settings = config.get_language_model_config(PROMPT_TUNING_MODEL_ID)
# if max_retries is not set, inject a dynamically assigned value based on the number of expected LLM calls
# to be made or fallback to a default value in the worst case
if default_llm_settings.max_retries < -1:
default_llm_settings.max_retries = min(
len(doc_list), language_model_defaults.max_retries
)
msg = f"max_retries not set, using default value: {default_llm_settings.max_retries}"
logger.warning(msg)
logger.info("Creating language model...")
llm = ModelManager().register_chat(
name="prompt_tuning",
@ -174,7 +167,7 @@ async def generate_indexing_prompts(
examples=examples,
language=language,
json_mode=False, # config.llm.model_supports_json should be used, but these prompts are used in non-json mode by the index engine
encoding_model=extract_graph_llm_settings.encoding_model,
tokenizer=get_tokenizer(model_config=extract_graph_llm_settings),
max_token_count=max_tokens,
min_examples_required=min_examples_required,
)
@ -198,9 +191,9 @@ async def generate_indexing_prompts(
language=language,
)
logger.info(f"\nGenerated domain: {domain}") # noqa: G004
logger.info(f"\nDetected language: {language}") # noqa: G004
logger.info(f"\nGenerated persona: {persona}") # noqa: G004
logger.debug("Generated domain: %s", domain)
logger.debug("Detected language: %s", language)
logger.debug("Generated persona: %s", persona)
return (
extract_graph_prompt,

View File

@ -17,6 +17,7 @@ WARNING: This API is under development and may undergo changes in future release
Backwards compatibility is not guaranteed at this time.
"""
import logging
from collections.abc import AsyncGenerator
from typing import Any
@ -31,7 +32,7 @@ from graphrag.config.embeddings import (
text_unit_text_embedding,
)
from graphrag.config.models.graph_rag_config import GraphRagConfig
from graphrag.logger.print_progress import PrintProgressLogger
from graphrag.logger.standard_logging import init_loggers
from graphrag.query.factory import (
get_basic_search_engine,
get_drift_search_engine,
@ -50,11 +51,13 @@ from graphrag.query.indexer_adapters import (
from graphrag.utils.api import (
get_embedding_store,
load_search_prompt,
truncate,
update_context_data,
)
from graphrag.utils.cli import redact
logger = PrintProgressLogger("")
# Initialize standard logger
logger = logging.getLogger(__name__)
@validate_call(config={"arbitrary_types_allowed": True})
@ -68,6 +71,7 @@ async def global_search(
response_type: str,
query: str,
callbacks: list[QueryCallbacks] | None = None,
verbose: bool = False,
) -> tuple[
str | dict[str, Any] | list[dict[str, Any]],
str | list[pd.DataFrame] | dict[str, pd.DataFrame],
@ -88,11 +92,9 @@ async def global_search(
Returns
-------
TODO: Document the search response type and format.
Raises
------
TODO: Document any exceptions to expect.
"""
init_loggers(config=config, verbose=verbose, filename="query.log")
callbacks = callbacks or []
full_response = ""
context_data = {}
@ -105,6 +107,7 @@ async def global_search(
local_callbacks.on_context = on_context
callbacks.append(local_callbacks)
logger.debug("Executing global search query: %s", query)
async for chunk in global_search_streaming(
config=config,
entities=entities,
@ -117,6 +120,7 @@ async def global_search(
callbacks=callbacks,
):
full_response += chunk
logger.debug("Query response: %s", truncate(full_response, 400))
return full_response, context_data
@ -131,6 +135,7 @@ def global_search_streaming(
response_type: str,
query: str,
callbacks: list[QueryCallbacks] | None = None,
verbose: bool = False,
) -> AsyncGenerator:
"""Perform a global search and return the context data and response via a generator.
@ -150,11 +155,9 @@ def global_search_streaming(
Returns
-------
TODO: Document the search response type and format.
Raises
------
TODO: Document any exceptions to expect.
"""
init_loggers(config=config, verbose=verbose, filename="query.log")
communities_ = read_indexer_communities(communities, community_reports)
reports = read_indexer_reports(
community_reports,
@ -173,6 +176,7 @@ def global_search_streaming(
config.root_dir, config.global_search.knowledge_prompt
)
logger.debug("Executing streaming global search query: %s", query)
search_engine = get_global_search_engine(
config,
reports=reports,
@ -201,6 +205,7 @@ async def multi_index_global_search(
streaming: bool,
query: str,
callbacks: list[QueryCallbacks] | None = None,
verbose: bool = False,
) -> tuple[
str | dict[str, Any] | list[dict[str, Any]],
str | list[pd.DataFrame] | dict[str, pd.DataFrame],
@ -223,11 +228,13 @@ async def multi_index_global_search(
Returns
-------
TODO: Document the search response type and format.
Raises
------
TODO: Document any exceptions to expect.
"""
init_loggers(config=config, verbose=verbose, filename="query.log")
logger.warning(
"Multi-index search is deprecated and will be removed in GraphRAG v3."
)
# Streaming not supported yet
if streaming:
message = "Streaming not yet implemented for multi_global_search"
@ -311,6 +318,7 @@ async def multi_index_global_search(
communities_dfs, axis=0, ignore_index=True, sort=False
)
logger.debug("Executing multi-index global search query: %s", query)
result = await global_search(
config,
entities=entities_combined,
@ -326,6 +334,7 @@ async def multi_index_global_search(
# Update the context data by linking index names and community ids
context = update_context_data(result[1], links)
logger.debug("Query response: %s", truncate(result[0], 400)) # type: ignore
return (result[0], context)
@ -342,6 +351,7 @@ async def local_search(
response_type: str,
query: str,
callbacks: list[QueryCallbacks] | None = None,
verbose: bool = False,
) -> tuple[
str | dict[str, Any] | list[dict[str, Any]],
str | list[pd.DataFrame] | dict[str, pd.DataFrame],
@ -362,11 +372,9 @@ async def local_search(
Returns
-------
TODO: Document the search response type and format.
Raises
------
TODO: Document any exceptions to expect.
"""
init_loggers(config=config, verbose=verbose, filename="query.log")
callbacks = callbacks or []
full_response = ""
context_data = {}
@ -379,6 +387,7 @@ async def local_search(
local_callbacks.on_context = on_context
callbacks.append(local_callbacks)
logger.debug("Executing local search query: %s", query)
async for chunk in local_search_streaming(
config=config,
entities=entities,
@ -393,6 +402,7 @@ async def local_search(
callbacks=callbacks,
):
full_response += chunk
logger.debug("Query response: %s", truncate(full_response, 400))
return full_response, context_data
@ -409,6 +419,7 @@ def local_search_streaming(
response_type: str,
query: str,
callbacks: list[QueryCallbacks] | None = None,
verbose: bool = False,
) -> AsyncGenerator:
"""Perform a local search and return the context data and response via a generator.
@ -427,16 +438,14 @@ def local_search_streaming(
Returns
-------
TODO: Document the search response type and format.
Raises
------
TODO: Document any exceptions to expect.
"""
init_loggers(config=config, verbose=verbose, filename="query.log")
vector_store_args = {}
for index, store in config.vector_store.items():
vector_store_args[index] = store.model_dump()
msg = f"Vector Store Args: {redact(vector_store_args)}"
logger.info(msg)
logger.debug(msg)
description_embedding_store = get_embedding_store(
config_args=vector_store_args,
@ -447,6 +456,7 @@ def local_search_streaming(
covariates_ = read_indexer_covariates(covariates) if covariates is not None else []
prompt = load_search_prompt(config.root_dir, config.local_search.prompt)
logger.debug("Executing streaming local search query: %s", query)
search_engine = get_local_search_engine(
config=config,
reports=read_indexer_reports(community_reports, communities, community_level),
@ -477,6 +487,7 @@ async def multi_index_local_search(
streaming: bool,
query: str,
callbacks: list[QueryCallbacks] | None = None,
verbose: bool = False,
) -> tuple[
str | dict[str, Any] | list[dict[str, Any]],
str | list[pd.DataFrame] | dict[str, pd.DataFrame],
@ -500,11 +511,12 @@ async def multi_index_local_search(
Returns
-------
TODO: Document the search response type and format.
Raises
------
TODO: Document any exceptions to expect.
"""
init_loggers(config=config, verbose=verbose, filename="query.log")
logger.warning(
"Multi-index search is deprecated and will be removed in GraphRAG v3."
)
# Streaming not supported yet
if streaming:
message = "Streaming not yet implemented for multi_index_local_search"
@ -670,6 +682,7 @@ async def multi_index_local_search(
covariates_combined = pd.concat(
covariates_dfs, axis=0, ignore_index=True, sort=False
)
logger.debug("Executing multi-index local search query: %s", query)
result = await local_search(
config,
entities=entities_combined,
@ -687,6 +700,7 @@ async def multi_index_local_search(
# Update the context data by linking index names and community ids
context = update_context_data(result[1], links)
logger.debug("Query response: %s", truncate(result[0], 400)) # type: ignore
return (result[0], context)
@ -702,6 +716,7 @@ async def drift_search(
response_type: str,
query: str,
callbacks: list[QueryCallbacks] | None = None,
verbose: bool = False,
) -> tuple[
str | dict[str, Any] | list[dict[str, Any]],
str | list[pd.DataFrame] | dict[str, pd.DataFrame],
@ -721,11 +736,9 @@ async def drift_search(
Returns
-------
TODO: Document the search response type and format.
Raises
------
TODO: Document any exceptions to expect.
"""
init_loggers(config=config, verbose=verbose, filename="query.log")
callbacks = callbacks or []
full_response = ""
context_data = {}
@ -738,6 +751,7 @@ async def drift_search(
local_callbacks.on_context = on_context
callbacks.append(local_callbacks)
logger.debug("Executing drift search query: %s", query)
async for chunk in drift_search_streaming(
config=config,
entities=entities,
@ -751,6 +765,7 @@ async def drift_search(
callbacks=callbacks,
):
full_response += chunk
logger.debug("Query response: %s", truncate(full_response, 400))
return full_response, context_data
@ -766,6 +781,7 @@ def drift_search_streaming(
response_type: str,
query: str,
callbacks: list[QueryCallbacks] | None = None,
verbose: bool = False,
) -> AsyncGenerator:
"""Perform a DRIFT search and return the context data and response.
@ -782,16 +798,14 @@ def drift_search_streaming(
Returns
-------
TODO: Document the search response type and format.
Raises
------
TODO: Document any exceptions to expect.
"""
init_loggers(config=config, verbose=verbose, filename="query.log")
vector_store_args = {}
for index, store in config.vector_store.items():
vector_store_args[index] = store.model_dump()
msg = f"Vector Store Args: {redact(vector_store_args)}"
logger.info(msg)
logger.debug(msg)
description_embedding_store = get_embedding_store(
config_args=vector_store_args,
@ -811,6 +825,7 @@ def drift_search_streaming(
config.root_dir, config.drift_search.reduce_prompt
)
logger.debug("Executing streaming drift search query: %s", query)
search_engine = get_drift_search_engine(
config=config,
reports=reports,
@ -840,6 +855,7 @@ async def multi_index_drift_search(
streaming: bool,
query: str,
callbacks: list[QueryCallbacks] | None = None,
verbose: bool = False,
) -> tuple[
str | dict[str, Any] | list[dict[str, Any]],
str | list[pd.DataFrame] | dict[str, pd.DataFrame],
@ -862,11 +878,13 @@ async def multi_index_drift_search(
Returns
-------
TODO: Document the search response type and format.
Raises
------
TODO: Document any exceptions to expect.
"""
init_loggers(config=config, verbose=verbose, filename="query.log")
logger.warning(
"Multi-index search is deprecated and will be removed in GraphRAG v3."
)
# Streaming not supported yet
if streaming:
message = "Streaming not yet implemented for multi_drift_search"
@ -1009,6 +1027,7 @@ async def multi_index_drift_search(
text_units_dfs, axis=0, ignore_index=True, sort=False
)
logger.debug("Executing multi-index drift search query: %s", query)
result = await drift_search(
config,
entities=entities_combined,
@ -1029,6 +1048,8 @@ async def multi_index_drift_search(
context[key] = update_context_data(result[1][key], links)
else:
context = result[1]
logger.debug("Query response: %s", truncate(result[0], 400)) # type: ignore
return (result[0], context)
@ -1038,6 +1059,7 @@ async def basic_search(
text_units: pd.DataFrame,
query: str,
callbacks: list[QueryCallbacks] | None = None,
verbose: bool = False,
) -> tuple[
str | dict[str, Any] | list[dict[str, Any]],
str | list[pd.DataFrame] | dict[str, pd.DataFrame],
@ -1053,11 +1075,9 @@ async def basic_search(
Returns
-------
TODO: Document the search response type and format.
Raises
------
TODO: Document any exceptions to expect.
"""
init_loggers(config=config, verbose=verbose, filename="query.log")
callbacks = callbacks or []
full_response = ""
context_data = {}
@ -1070,6 +1090,7 @@ async def basic_search(
local_callbacks.on_context = on_context
callbacks.append(local_callbacks)
logger.debug("Executing basic search query: %s", query)
async for chunk in basic_search_streaming(
config=config,
text_units=text_units,
@ -1077,6 +1098,7 @@ async def basic_search(
callbacks=callbacks,
):
full_response += chunk
logger.debug("Query response: %s", truncate(full_response, 400))
return full_response, context_data
@ -1086,6 +1108,7 @@ def basic_search_streaming(
text_units: pd.DataFrame,
query: str,
callbacks: list[QueryCallbacks] | None = None,
verbose: bool = False,
) -> AsyncGenerator:
"""Perform a local search and return the context data and response via a generator.
@ -1098,28 +1121,27 @@ def basic_search_streaming(
Returns
-------
TODO: Document the search response type and format.
Raises
------
TODO: Document any exceptions to expect.
"""
init_loggers(config=config, verbose=verbose, filename="query.log")
vector_store_args = {}
for index, store in config.vector_store.items():
vector_store_args[index] = store.model_dump()
msg = f"Vector Store Args: {redact(vector_store_args)}"
logger.info(msg)
logger.debug(msg)
description_embedding_store = get_embedding_store(
embedding_store = get_embedding_store(
config_args=vector_store_args,
embedding_name=text_unit_text_embedding,
)
prompt = load_search_prompt(config.root_dir, config.basic_search.prompt)
logger.debug("Executing streaming basic search query: %s", query)
search_engine = get_basic_search_engine(
config=config,
text_units=read_indexer_text_units(text_units),
text_unit_embeddings=description_embedding_store,
text_unit_embeddings=embedding_store,
system_prompt=prompt,
callbacks=callbacks,
)
@ -1134,6 +1156,7 @@ async def multi_index_basic_search(
streaming: bool,
query: str,
callbacks: list[QueryCallbacks] | None = None,
verbose: bool = False,
) -> tuple[
str | dict[str, Any] | list[dict[str, Any]],
str | list[pd.DataFrame] | dict[str, pd.DataFrame],
@ -1151,11 +1174,13 @@ async def multi_index_basic_search(
Returns
-------
TODO: Document the search response type and format.
Raises
------
TODO: Document any exceptions to expect.
"""
init_loggers(config=config, verbose=verbose, filename="query.log")
logger.warning(
"Multi-index search is deprecated and will be removed in GraphRAG v3."
)
# Streaming not supported yet
if streaming:
message = "Streaming not yet implemented for multi_basic_search"
@ -1192,6 +1217,7 @@ async def multi_index_basic_search(
text_units_dfs, axis=0, ignore_index=True, sort=False
)
logger.debug("Executing multi-index basic search query: %s", query)
return await basic_search(
config,
text_units=text_units_combined,

View File

@ -1,23 +1,24 @@
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License
"""A module containing create_cache method definition."""
"""Factory functions for creating a cache."""
from __future__ import annotations
from typing import TYPE_CHECKING, ClassVar
from graphrag.config.enums import CacheType
from graphrag.storage.blob_pipeline_storage import create_blob_storage
from graphrag.storage.cosmosdb_pipeline_storage import create_cosmosdb_storage
from graphrag.storage.file_pipeline_storage import FilePipelineStorage
if TYPE_CHECKING:
from graphrag.cache.pipeline_cache import PipelineCache
from graphrag.cache.json_pipeline_cache import JsonPipelineCache
from graphrag.cache.memory_pipeline_cache import InMemoryCache
from graphrag.cache.noop_pipeline_cache import NoopPipelineCache
from graphrag.config.enums import CacheType
from graphrag.storage.blob_pipeline_storage import BlobPipelineStorage
from graphrag.storage.cosmosdb_pipeline_storage import CosmosDBPipelineStorage
from graphrag.storage.file_pipeline_storage import FilePipelineStorage
if TYPE_CHECKING:
from collections.abc import Callable
from graphrag.cache.pipeline_cache import PipelineCache
class CacheFactory:
@ -25,39 +26,90 @@ class CacheFactory:
Includes a method for users to register a custom cache implementation.
Configuration arguments are passed to each cache implementation as kwargs (where possible)
Configuration arguments are passed to each cache implementation as kwargs
for individual enforcement of required/optional arguments.
"""
cache_types: ClassVar[dict[str, type]] = {}
_registry: ClassVar[dict[str, Callable[..., PipelineCache]]] = {}
@classmethod
def register(cls, cache_type: str, cache: type):
"""Register a custom cache implementation."""
cls.cache_types[cache_type] = cache
def register(cls, cache_type: str, creator: Callable[..., PipelineCache]) -> None:
"""Register a custom cache implementation.
Args:
cache_type: The type identifier for the cache.
creator: A class or callable that creates an instance of PipelineCache.
"""
cls._registry[cache_type] = creator
@classmethod
def create_cache(
cls, cache_type: CacheType | str | None, root_dir: str, kwargs: dict
) -> PipelineCache:
"""Create or get a cache from the provided type."""
if not cache_type:
return NoopPipelineCache()
match cache_type:
case CacheType.none:
return NoopPipelineCache()
case CacheType.memory:
return InMemoryCache()
case CacheType.file:
return JsonPipelineCache(
FilePipelineStorage(root_dir=root_dir).child(kwargs["base_dir"])
)
case CacheType.blob:
return JsonPipelineCache(create_blob_storage(**kwargs))
case CacheType.cosmosdb:
return JsonPipelineCache(create_cosmosdb_storage(**kwargs))
case _:
if cache_type in cls.cache_types:
return cls.cache_types[cache_type](**kwargs)
msg = f"Unknown cache type: {cache_type}"
raise ValueError(msg)
def create_cache(cls, cache_type: str, kwargs: dict) -> PipelineCache:
"""Create a cache object from the provided type.
Args:
cache_type: The type of cache to create.
kwargs: Additional keyword arguments for the cache constructor.
Returns
-------
A PipelineCache instance.
Raises
------
ValueError: If the cache type is not registered.
"""
if cache_type not in cls._registry:
msg = f"Unknown cache type: {cache_type}"
raise ValueError(msg)
return cls._registry[cache_type](**kwargs)
@classmethod
def get_cache_types(cls) -> list[str]:
"""Get the registered cache implementations."""
return list(cls._registry.keys())
@classmethod
def is_supported_type(cls, cache_type: str) -> bool:
"""Check if the given cache type is supported."""
return cache_type in cls._registry
# --- register built-in cache implementations ---
def create_file_cache(root_dir: str, base_dir: str, **kwargs) -> PipelineCache:
"""Create a file-based cache implementation."""
# Create storage with base_dir in kwargs since FilePipelineStorage expects it there
storage_kwargs = {"base_dir": root_dir, **kwargs}
storage = FilePipelineStorage(**storage_kwargs).child(base_dir)
return JsonPipelineCache(storage)
def create_blob_cache(**kwargs) -> PipelineCache:
"""Create a blob storage-based cache implementation."""
storage = BlobPipelineStorage(**kwargs)
return JsonPipelineCache(storage)
def create_cosmosdb_cache(**kwargs) -> PipelineCache:
"""Create a CosmosDB-based cache implementation."""
storage = CosmosDBPipelineStorage(**kwargs)
return JsonPipelineCache(storage)
def create_noop_cache(**_kwargs) -> PipelineCache:
"""Create a no-op cache implementation."""
return NoopPipelineCache()
def create_memory_cache(**kwargs) -> PipelineCache:
"""Create a memory cache implementation."""
return InMemoryCache(**kwargs)
# --- register built-in cache implementations ---
CacheFactory.register(CacheType.none.value, create_noop_cache)
CacheFactory.register(CacheType.memory.value, create_memory_cache)
CacheFactory.register(CacheType.file.value, create_file_cache)
CacheFactory.register(CacheType.blob.value, create_blob_cache)
CacheFactory.register(CacheType.cosmosdb.value, create_cosmosdb_cache)
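# --- illustrative usage sketch (not part of this module) ---
# Registering a custom cache type so it can be selected via configuration.
# "my-cache" and create_my_cache are hypothetical names; InMemoryCache is used
# only so the creator returns a valid PipelineCache in this sketch.
def create_my_cache(**kwargs) -> PipelineCache:
    """Hypothetical creator for a user-defined cache backend."""
    return InMemoryCache()

CacheFactory.register("my-cache", create_my_cache)
assert CacheFactory.is_supported_type("my-cache")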

View File

@ -4,29 +4,43 @@
"""A logger that emits updates from the indexing engine to the console."""
from graphrag.callbacks.noop_workflow_callbacks import NoopWorkflowCallbacks
from graphrag.index.typing.pipeline_run_result import PipelineRunResult
from graphrag.logger.progress import Progress
# ruff: noqa: T201
class ConsoleWorkflowCallbacks(NoopWorkflowCallbacks):
"""A logger that writes to a console."""
def error(
self,
message: str,
cause: BaseException | None = None,
stack: str | None = None,
details: dict | None = None,
):
"""Handle when an error occurs."""
print(message, str(cause), stack, details) # noqa T201
_verbose = False
def warning(self, message: str, details: dict | None = None):
"""Handle when a warning occurs."""
_print_warning(message)
def __init__(self, verbose=False):
self._verbose = verbose
def log(self, message: str, details: dict | None = None):
"""Handle when a log message is produced."""
print(message, details) # noqa T201
def pipeline_start(self, names: list[str]) -> None:
"""Execute this callback to signal when the entire pipeline starts."""
print("Starting pipeline with workflows:", ", ".join(names))
def pipeline_end(self, results: list[PipelineRunResult]) -> None:
"""Execute this callback to signal when the entire pipeline ends."""
print("Pipeline complete")
def _print_warning(skk):
print("\033[93m {}\033[00m".format(skk)) # noqa T201
def workflow_start(self, name: str, instance: object) -> None:
"""Execute this callback when a workflow starts."""
print(f"Starting workflow: {name}")
def workflow_end(self, name: str, instance: object) -> None:
"""Execute this callback when a workflow ends."""
print("") # account for potential return on prior progress
print(f"Workflow complete: {name}")
if self._verbose:
print(instance)
def progress(self, progress: Progress) -> None:
"""Handle when progress occurs."""
complete = progress.completed_items or 0
total = progress.total_items or 1
percent = round((complete / total) * 100)
start = f" {complete} / {total} "
print(f"{start:{'.'}<{percent}}", flush=True, end="\r")

View File

@ -1,78 +0,0 @@
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License
"""A logger that emits updates from the indexing engine to a local file."""
import json
import logging
from io import TextIOWrapper
from pathlib import Path
from graphrag.callbacks.noop_workflow_callbacks import NoopWorkflowCallbacks
log = logging.getLogger(__name__)
class FileWorkflowCallbacks(NoopWorkflowCallbacks):
"""A logger that writes to a local file."""
_out_stream: TextIOWrapper
def __init__(self, directory: str):
"""Create a new file-based workflow logger."""
Path(directory).mkdir(parents=True, exist_ok=True)
self._out_stream = open( # noqa: PTH123, SIM115
Path(directory) / "logs.json", "a", encoding="utf-8", errors="strict"
)
def error(
self,
message: str,
cause: BaseException | None = None,
stack: str | None = None,
details: dict | None = None,
):
"""Handle when an error occurs."""
self._out_stream.write(
json.dumps(
{
"type": "error",
"data": message,
"stack": stack,
"source": str(cause),
"details": details,
},
indent=4,
ensure_ascii=False,
)
+ "\n"
)
message = f"{message} details={details}"
log.info(message)
def warning(self, message: str, details: dict | None = None):
"""Handle when a warning occurs."""
self._out_stream.write(
json.dumps(
{"type": "warning", "data": message, "details": details},
ensure_ascii=False,
)
+ "\n"
)
_print_warning(message)
def log(self, message: str, details: dict | None = None):
"""Handle when a log message is produced."""
self._out_stream.write(
json.dumps(
{"type": "log", "data": message, "details": details}, ensure_ascii=False
)
+ "\n"
)
message = f"{message} details={details}"
log.info(message)
def _print_warning(skk):
log.warning(skk)

View File

@ -9,13 +9,13 @@ from graphrag.logger.progress import Progress
class NoopWorkflowCallbacks(WorkflowCallbacks):
"""A no-op implementation of WorkflowCallbacks."""
"""A no-op implementation of WorkflowCallbacks that logs all events to standard logging."""
def pipeline_start(self, names: list[str]) -> None:
"""Execute this callback when a the entire pipeline starts."""
"""Execute this callback to signal when the entire pipeline starts."""
def pipeline_end(self, results: list[PipelineRunResult]) -> None:
"""Execute this callback when the entire pipeline ends."""
"""Execute this callback to signal when the entire pipeline ends."""
def workflow_start(self, name: str, instance: object) -> None:
"""Execute this callback when a workflow starts."""
@ -25,18 +25,3 @@ class NoopWorkflowCallbacks(WorkflowCallbacks):
def progress(self, progress: Progress) -> None:
"""Handle when progress occurs."""
def error(
self,
message: str,
cause: BaseException | None = None,
stack: str | None = None,
details: dict | None = None,
) -> None:
"""Handle when an error occurs."""
def warning(self, message: str, details: dict | None = None) -> None:
"""Handle when a warning occurs."""
def log(self, message: str, details: dict | None = None) -> None:
"""Handle when a log message occurs."""

View File

@ -1,42 +0,0 @@
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License
"""A workflow callback manager that emits updates."""
from graphrag.callbacks.noop_workflow_callbacks import NoopWorkflowCallbacks
from graphrag.logger.base import ProgressLogger
from graphrag.logger.progress import Progress
class ProgressWorkflowCallbacks(NoopWorkflowCallbacks):
"""A callbackmanager that delegates to a ProgressLogger."""
_root_progress: ProgressLogger
_progress_stack: list[ProgressLogger]
def __init__(self, progress: ProgressLogger) -> None:
"""Create a new ProgressWorkflowCallbacks."""
self._progress = progress
self._progress_stack = [progress]
def _pop(self) -> None:
self._progress_stack.pop()
def _push(self, name: str) -> None:
self._progress_stack.append(self._latest.child(name))
@property
def _latest(self) -> ProgressLogger:
return self._progress_stack[-1]
def workflow_start(self, name: str, instance: object) -> None:
"""Execute this callback when a workflow starts."""
self._push(name)
def workflow_end(self, name: str, instance: object) -> None:
"""Execute this callback when a workflow ends."""
self._pop()
def progress(self, progress: Progress) -> None:
"""Handle when progress occurs."""
self._latest(progress)

View File

@ -1,39 +0,0 @@
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License
"""A module containing the pipeline reporter factory."""
from __future__ import annotations
from pathlib import Path
from typing import TYPE_CHECKING
from graphrag.callbacks.blob_workflow_callbacks import BlobWorkflowCallbacks
from graphrag.callbacks.console_workflow_callbacks import ConsoleWorkflowCallbacks
from graphrag.callbacks.file_workflow_callbacks import FileWorkflowCallbacks
from graphrag.config.enums import ReportingType
from graphrag.config.models.reporting_config import ReportingConfig
if TYPE_CHECKING:
from graphrag.callbacks.workflow_callbacks import WorkflowCallbacks
def create_pipeline_reporter(
config: ReportingConfig | None, root_dir: str | None
) -> WorkflowCallbacks:
"""Create a logger for the given pipeline config."""
config = config or ReportingConfig(base_dir="logs", type=ReportingType.file)
match config.type:
case ReportingType.file:
return FileWorkflowCallbacks(
str(Path(root_dir or "") / (config.base_dir or ""))
)
case ReportingType.console:
return ConsoleWorkflowCallbacks()
case ReportingType.blob:
return BlobWorkflowCallbacks(
config.connection_string,
config.container_name,
base_dir=config.base_dir,
storage_account_blob_url=config.storage_account_blob_url,
)

View File

@ -35,21 +35,3 @@ class WorkflowCallbacks(Protocol):
def progress(self, progress: Progress) -> None:
"""Handle when progress occurs."""
...
def error(
self,
message: str,
cause: BaseException | None = None,
stack: str | None = None,
details: dict | None = None,
) -> None:
"""Handle when an error occurs."""
...
def warning(self, message: str, details: dict | None = None) -> None:
"""Handle when a warning occurs."""
...
def log(self, message: str, details: dict | None = None) -> None:
"""Handle when a log message occurs."""
...

View File

@ -50,27 +50,3 @@ class WorkflowCallbacksManager(WorkflowCallbacks):
for callback in self._callbacks:
if hasattr(callback, "progress"):
callback.progress(progress)
def error(
self,
message: str,
cause: BaseException | None = None,
stack: str | None = None,
details: dict | None = None,
) -> None:
"""Handle when an error occurs."""
for callback in self._callbacks:
if hasattr(callback, "error"):
callback.error(message, cause, stack, details)
def warning(self, message: str, details: dict | None = None) -> None:
"""Handle when a warning occurs."""
for callback in self._callbacks:
if hasattr(callback, "warning"):
callback.warning(message, details)
def log(self, message: str, details: dict | None = None) -> None:
"""Handle when a log message occurs."""
for callback in self._callbacks:
if hasattr(callback, "log"):
callback.log(message, details)

View File

@ -10,49 +10,27 @@ import warnings
from pathlib import Path
import graphrag.api as api
from graphrag.callbacks.console_workflow_callbacks import ConsoleWorkflowCallbacks
from graphrag.config.enums import CacheType, IndexingMethod
from graphrag.config.load_config import load_config
from graphrag.config.logging import enable_logging_with_config
from graphrag.index.validate_config import validate_config_names
from graphrag.logger.base import ProgressLogger
from graphrag.logger.factory import LoggerFactory, LoggerType
from graphrag.utils.cli import redact
# Ignore warnings from numba
warnings.filterwarnings("ignore", message=".*NumbaDeprecationWarning.*")
log = logging.getLogger(__name__)
logger = logging.getLogger(__name__)
def _logger(logger: ProgressLogger):
def info(msg: str, verbose: bool = False):
log.info(msg)
if verbose:
logger.info(msg)
def error(msg: str, verbose: bool = False):
log.error(msg)
if verbose:
logger.error(msg)
def success(msg: str, verbose: bool = False):
log.info(msg)
if verbose:
logger.success(msg)
return info, error, success
def _register_signal_handlers(logger: ProgressLogger):
def _register_signal_handlers():
import signal
def handle_signal(signum, _):
# Handle the signal here
logger.info(f"Received signal {signum}, exiting...") # noqa: G004
logger.dispose()
logger.debug(f"Received signal {signum}, exiting...") # noqa: G004
for task in asyncio.all_tasks():
task.cancel()
logger.info("All tasks cancelled. Exiting...")
logger.debug("All tasks cancelled. Exiting...")
# Register signal handlers for SIGINT and SIGHUP
signal.signal(signal.SIGINT, handle_signal)
@ -67,7 +45,6 @@ def index_cli(
verbose: bool,
memprofile: bool,
cache: bool,
logger: LoggerType,
config_filepath: Path | None,
dry_run: bool,
skip_validation: bool,
@ -80,7 +57,6 @@ def index_cli(
cli_overrides["reporting.base_dir"] = str(output_dir)
cli_overrides["update_index_output.base_dir"] = str(output_dir)
config = load_config(root_dir, config_filepath, cli_overrides)
_run_index(
config=config,
method=method,
@ -88,7 +64,6 @@ def index_cli(
verbose=verbose,
memprofile=memprofile,
cache=cache,
logger=logger,
dry_run=dry_run,
skip_validation=skip_validation,
)
@ -100,7 +75,6 @@ def update_cli(
verbose: bool,
memprofile: bool,
cache: bool,
logger: LoggerType,
config_filepath: Path | None,
skip_validation: bool,
output_dir: Path | None,
@ -121,7 +95,6 @@ def update_cli(
verbose=verbose,
memprofile=memprofile,
cache=cache,
logger=logger,
dry_run=False,
skip_validation=skip_validation,
)
@ -134,39 +107,35 @@ def _run_index(
verbose,
memprofile,
cache,
logger,
dry_run,
skip_validation,
):
progress_logger = LoggerFactory().create_logger(logger)
info, error, success = _logger(progress_logger)
# Configure the root logger with the specified log level
from graphrag.logger.standard_logging import init_loggers
# Initialize loggers and reporting config
init_loggers(
config=config,
verbose=verbose,
)
if not cache:
config.cache.type = CacheType.none
enabled_logging, log_path = enable_logging_with_config(config, verbose)
if enabled_logging:
info(f"Logging enabled at {log_path}", True)
else:
info(
f"Logging not enabled for config {redact(config.model_dump())}",
True,
)
if not skip_validation:
validate_config_names(progress_logger, config)
validate_config_names(config)
info(f"Starting pipeline run. {dry_run=}", verbose)
info(
f"Using default configuration: {redact(config.model_dump())}",
verbose,
logger.info("Starting pipeline run. %s", dry_run)
logger.info(
"Using default configuration: %s",
redact(config.model_dump()),
)
if dry_run:
info("Dry run complete, exiting...", True)
logger.info("Dry run complete, exiting...", True)
sys.exit(0)
_register_signal_handlers(progress_logger)
_register_signal_handlers()
outputs = asyncio.run(
api.build_index(
@ -174,19 +143,19 @@ def _run_index(
method=method,
is_update_run=is_update_run,
memory_profile=memprofile,
progress_logger=progress_logger,
callbacks=[ConsoleWorkflowCallbacks(verbose=verbose)],
verbose=verbose,
)
)
encountered_errors = any(
output.errors and len(output.errors) > 0 for output in outputs
)
progress_logger.stop()
if encountered_errors:
error(
"Errors occurred during the pipeline run, see logs for more details.", True
logger.error(
"Errors occurred during the pipeline run, see logs for more details."
)
else:
success("All workflows completed successfully.", True)
logger.info("All workflows completed successfully.")
sys.exit(1 if encountered_errors else 0)

View File

@ -3,10 +3,10 @@
"""CLI implementation of the initialization subcommand."""
import logging
from pathlib import Path
from graphrag.config.init_content import INIT_DOTENV, INIT_YAML
from graphrag.logger.factory import LoggerFactory, LoggerType
from graphrag.prompts.index.community_report import (
COMMUNITY_REPORT_PROMPT,
)
@ -31,6 +31,8 @@ from graphrag.prompts.query.global_search_reduce_system_prompt import (
from graphrag.prompts.query.local_search_system_prompt import LOCAL_SEARCH_SYSTEM_PROMPT
from graphrag.prompts.query.question_gen_system_prompt import QUESTION_SYSTEM_PROMPT
logger = logging.getLogger(__name__)
def initialize_project_at(path: Path, force: bool) -> None:
"""
@ -48,8 +50,7 @@ def initialize_project_at(path: Path, force: bool) -> None:
ValueError
If the project already exists and force is False.
"""
progress_logger = LoggerFactory().create_logger(LoggerType.RICH)
progress_logger.info(f"Initializing project at {path}") # noqa: G004
logger.info("Initializing project at %s", path)
root = Path(path)
if not root.exists():
root.mkdir(parents=True, exist_ok=True)

View File

@ -7,13 +7,11 @@ import os
import re
from collections.abc import Callable
from pathlib import Path
from typing import Annotated
import typer
from graphrag.config.defaults import graphrag_config_defaults
from graphrag.config.enums import IndexingMethod, SearchMethod
from graphrag.logger.types import LoggerType
from graphrag.prompt_tune.defaults import LIMIT, MAX_TOKEN_COUNT, N_SUBSET_MAX, K
from graphrag.prompt_tune.types import DocSelectionType
@ -78,25 +76,40 @@ def path_autocomplete(
return completer
CONFIG_AUTOCOMPLETE = path_autocomplete(
file_okay=True,
dir_okay=False,
match_wildcard="*.yaml",
readable=True,
)
ROOT_AUTOCOMPLETE = path_autocomplete(
file_okay=False,
dir_okay=True,
writable=True,
match_wildcard="*",
)
@app.command("init")
def _initialize_cli(
root: Annotated[
Path,
typer.Option(
help="The project root directory.",
dir_okay=True,
writable=True,
resolve_path=True,
autocompletion=path_autocomplete(
file_okay=False, dir_okay=True, writable=True, match_wildcard="*"
),
),
],
force: Annotated[
bool,
typer.Option(help="Force initialization even if the project already exists."),
] = False,
):
root: Path = typer.Option(
Path(),
"--root",
"-r",
help="The project root directory.",
dir_okay=True,
writable=True,
resolve_path=True,
autocompletion=ROOT_AUTOCOMPLETE,
),
force: bool = typer.Option(
False,
"--force",
"-f",
help="Force initialization even if the project already exists.",
),
) -> None:
"""Generate a default configuration file."""
from graphrag.cli.initialize import initialize_project_at
@ -105,60 +118,75 @@ def _initialize_cli(
@app.command("index")
def _index_cli(
config: Annotated[
Path | None,
typer.Option(
help="The configuration to use.", exists=True, file_okay=True, readable=True
config: Path | None = typer.Option(
None,
"--config",
"-c",
help="The configuration to use.",
exists=True,
file_okay=True,
readable=True,
autocompletion=CONFIG_AUTOCOMPLETE,
),
root: Path = typer.Option(
Path(),
"--root",
"-r",
help="The project root directory.",
exists=True,
dir_okay=True,
writable=True,
resolve_path=True,
autocompletion=ROOT_AUTOCOMPLETE,
),
method: IndexingMethod = typer.Option(
IndexingMethod.Standard.value,
"--method",
"-m",
help="The indexing method to use.",
),
verbose: bool = typer.Option(
False,
"--verbose",
"-v",
help="Run the indexing pipeline with verbose logging",
),
memprofile: bool = typer.Option(
False,
"--memprofile",
help="Run the indexing pipeline with memory profiling",
),
dry_run: bool = typer.Option(
False,
"--dry-run",
help=(
"Run the indexing pipeline without executing any steps "
"to inspect and validate the configuration."
),
] = None,
root: Annotated[
Path,
typer.Option(
help="The project root directory.",
exists=True,
dir_okay=True,
writable=True,
resolve_path=True,
autocompletion=path_autocomplete(
file_okay=False, dir_okay=True, writable=True, match_wildcard="*"
),
),
cache: bool = typer.Option(
True,
"--cache/--no-cache",
help="Use LLM cache.",
),
skip_validation: bool = typer.Option(
False,
"--skip-validation",
help="Skip any preflight validation. Useful when running no LLM steps.",
),
output: Path | None = typer.Option(
None,
"--output",
"-o",
help=(
"Indexing pipeline output directory. "
"Overrides output.base_dir in the configuration file."
),
] = Path(), # set default to current directory
method: Annotated[
IndexingMethod, typer.Option(help="The indexing method to use.")
] = IndexingMethod.Standard,
verbose: Annotated[
bool, typer.Option(help="Run the indexing pipeline with verbose logging")
] = False,
memprofile: Annotated[
bool, typer.Option(help="Run the indexing pipeline with memory profiling")
] = False,
logger: Annotated[
LoggerType, typer.Option(help="The progress logger to use.")
] = LoggerType.RICH,
dry_run: Annotated[
bool,
typer.Option(
help="Run the indexing pipeline without executing any steps to inspect and validate the configuration."
),
] = False,
cache: Annotated[bool, typer.Option(help="Use LLM cache.")] = True,
skip_validation: Annotated[
bool,
typer.Option(
help="Skip any preflight validation. Useful when running no LLM steps."
),
] = False,
output: Annotated[
Path | None,
typer.Option(
help="Indexing pipeline output directory. Overrides output.base_dir in the configuration file.",
dir_okay=True,
writable=True,
resolve_path=True,
),
] = None,
):
dir_okay=True,
writable=True,
resolve_path=True,
),
) -> None:
"""Build a knowledge graph index."""
from graphrag.cli.index import index_cli
@ -167,7 +195,6 @@ def _index_cli(
verbose=verbose,
memprofile=memprofile,
cache=cache,
logger=LoggerType(logger),
config_filepath=config,
dry_run=dry_run,
skip_validation=skip_validation,
@ -178,51 +205,67 @@ def _index_cli(
@app.command("update")
def _update_cli(
config: Annotated[
Path | None,
typer.Option(
help="The configuration to use.", exists=True, file_okay=True, readable=True
config: Path | None = typer.Option(
None,
"--config",
"-c",
help="The configuration to use.",
exists=True,
file_okay=True,
readable=True,
autocompletion=CONFIG_AUTOCOMPLETE,
),
root: Path = typer.Option(
Path(),
"--root",
"-r",
help="The project root directory.",
exists=True,
dir_okay=True,
writable=True,
resolve_path=True,
autocompletion=ROOT_AUTOCOMPLETE,
),
method: IndexingMethod = typer.Option(
IndexingMethod.Standard.value,
"--method",
"-m",
help="The indexing method to use.",
),
verbose: bool = typer.Option(
False,
"--verbose",
"-v",
help="Run the indexing pipeline with verbose logging.",
),
memprofile: bool = typer.Option(
False,
"--memprofile",
help="Run the indexing pipeline with memory profiling.",
),
cache: bool = typer.Option(
True,
"--cache/--no-cache",
help="Use LLM cache.",
),
skip_validation: bool = typer.Option(
False,
"--skip-validation",
help="Skip any preflight validation. Useful when running no LLM steps.",
),
output: Path | None = typer.Option(
None,
"--output",
"-o",
help=(
"Indexing pipeline output directory. "
"Overrides output.base_dir in the configuration file."
),
] = None,
root: Annotated[
Path,
typer.Option(
help="The project root directory.",
exists=True,
dir_okay=True,
writable=True,
resolve_path=True,
),
] = Path(), # set default to current directory
method: Annotated[
IndexingMethod, typer.Option(help="The indexing method to use.")
] = IndexingMethod.Standard,
verbose: Annotated[
bool, typer.Option(help="Run the indexing pipeline with verbose logging")
] = False,
memprofile: Annotated[
bool, typer.Option(help="Run the indexing pipeline with memory profiling")
] = False,
logger: Annotated[
LoggerType, typer.Option(help="The progress logger to use.")
] = LoggerType.RICH,
cache: Annotated[bool, typer.Option(help="Use LLM cache.")] = True,
skip_validation: Annotated[
bool,
typer.Option(
help="Skip any preflight validation. Useful when running no LLM steps."
),
] = False,
output: Annotated[
Path | None,
typer.Option(
help="Indexing pipeline output directory. Overrides output.base_dir in the configuration file.",
dir_okay=True,
writable=True,
resolve_path=True,
),
] = None,
):
dir_okay=True,
writable=True,
resolve_path=True,
),
) -> None:
"""
Update an existing knowledge graph index.
@ -235,7 +278,6 @@ def _update_cli(
verbose=verbose,
memprofile=memprofile,
cache=cache,
logger=LoggerType(logger),
config_filepath=config,
skip_validation=skip_validation,
output_dir=output,
@ -245,104 +287,102 @@ def _update_cli(
@app.command("prompt-tune")
def _prompt_tune_cli(
root: Annotated[
Path,
typer.Option(
help="The project root directory.",
exists=True,
dir_okay=True,
writable=True,
resolve_path=True,
autocompletion=path_autocomplete(
file_okay=False, dir_okay=True, writable=True, match_wildcard="*"
),
root: Path = typer.Option(
Path(),
"--root",
"-r",
help="The project root directory.",
exists=True,
dir_okay=True,
writable=True,
resolve_path=True,
autocompletion=ROOT_AUTOCOMPLETE,
),
config: Path | None = typer.Option(
None,
"--config",
"-c",
help="The configuration to use.",
exists=True,
file_okay=True,
readable=True,
autocompletion=CONFIG_AUTOCOMPLETE,
),
verbose: bool = typer.Option(
False,
"--verbose",
"-v",
help="Run the prompt tuning pipeline with verbose logging.",
),
domain: str | None = typer.Option(
None,
"--domain",
help=(
"The domain your input data is related to. "
"For example 'space science', 'microbiology', 'environmental news'. "
"If not defined, a domain will be inferred from the input data."
),
] = Path(), # set default to current directory
config: Annotated[
Path | None,
typer.Option(
help="The configuration to use.",
exists=True,
file_okay=True,
readable=True,
autocompletion=path_autocomplete(
file_okay=True, dir_okay=False, match_wildcard="*"
),
),
] = None,
verbose: Annotated[
bool, typer.Option(help="Run the prompt tuning pipeline with verbose logging")
] = False,
logger: Annotated[
LoggerType, typer.Option(help="The progress logger to use.")
] = LoggerType.RICH,
domain: Annotated[
str | None,
typer.Option(
help="The domain your input data is related to. For example 'space science', 'microbiology', 'environmental news'. If not defined, a domain will be inferred from the input data."
),
] = None,
selection_method: Annotated[
DocSelectionType, typer.Option(help="The text chunk selection method.")
] = DocSelectionType.RANDOM,
n_subset_max: Annotated[
int,
typer.Option(
help="The number of text chunks to embed when --selection-method=auto."
),
] = N_SUBSET_MAX,
k: Annotated[
int,
typer.Option(
help="The maximum number of documents to select from each centroid when --selection-method=auto."
),
] = K,
limit: Annotated[
int,
typer.Option(
help="The number of documents to load when --selection-method={random,top}."
),
] = LIMIT,
max_tokens: Annotated[
int, typer.Option(help="The max token count for prompt generation.")
] = MAX_TOKEN_COUNT,
min_examples_required: Annotated[
int,
typer.Option(
help="The minimum number of examples to generate/include in the entity extraction prompt."
),
] = 2,
chunk_size: Annotated[
int,
typer.Option(
help="The size of each example text chunk. Overrides chunks.size in the configuration file."
),
] = graphrag_config_defaults.chunks.size,
overlap: Annotated[
int,
typer.Option(
help="The overlap size for chunking documents. Overrides chunks.overlap in the configuration file"
),
] = graphrag_config_defaults.chunks.overlap,
language: Annotated[
str | None,
typer.Option(
help="The primary language used for inputs and outputs in graphrag prompts."
),
] = None,
discover_entity_types: Annotated[
bool, typer.Option(help="Discover and extract unspecified entity types.")
] = True,
output: Annotated[
Path,
typer.Option(
help="The directory to save prompts to, relative to the project root directory.",
dir_okay=True,
writable=True,
resolve_path=True,
),
] = Path("prompts"),
):
),
selection_method: DocSelectionType = typer.Option(
DocSelectionType.RANDOM.value,
"--selection-method",
help="The text chunk selection method.",
),
n_subset_max: int = typer.Option(
N_SUBSET_MAX,
"--n-subset-max",
help="The number of text chunks to embed when --selection-method=auto.",
),
k: int = typer.Option(
K,
"--k",
help="The maximum number of documents to select from each centroid when --selection-method=auto.",
),
limit: int = typer.Option(
LIMIT,
"--limit",
help="The number of documents to load when --selection-method={random,top}.",
),
max_tokens: int = typer.Option(
MAX_TOKEN_COUNT,
"--max-tokens",
help="The max token count for prompt generation.",
),
min_examples_required: int = typer.Option(
2,
"--min-examples-required",
help="The minimum number of examples to generate/include in the entity extraction prompt.",
),
chunk_size: int = typer.Option(
graphrag_config_defaults.chunks.size,
"--chunk-size",
help="The size of each example text chunk. Overrides chunks.size in the configuration file.",
),
overlap: int = typer.Option(
graphrag_config_defaults.chunks.overlap,
"--overlap",
help="The overlap size for chunking documents. Overrides chunks.overlap in the configuration file.",
),
language: str | None = typer.Option(
None,
"--language",
help="The primary language used for inputs and outputs in graphrag prompts.",
),
discover_entity_types: bool = typer.Option(
True,
"--discover-entity-types/--no-discover-entity-types",
help="Discover and extract unspecified entity types.",
),
output: Path = typer.Option(
Path("prompts"),
"--output",
"-o",
help="The directory to save prompts to, relative to the project root directory.",
dir_okay=True,
writable=True,
resolve_path=True,
),
) -> None:
"""Generate custom graphrag prompts with your own data (i.e. auto templating)."""
import asyncio
@ -355,7 +395,6 @@ def _prompt_tune_cli(
config=config,
domain=domain,
verbose=verbose,
logger=logger,
selection_method=selection_method,
limit=limit,
max_tokens=max_tokens,
@ -373,66 +412,83 @@ def _prompt_tune_cli(
@app.command("query")
def _query_cli(
method: Annotated[SearchMethod, typer.Option(help="The query algorithm to use.")],
query: Annotated[str, typer.Option(help="The query to execute.")],
config: Annotated[
Path | None,
typer.Option(
help="The configuration to use.",
exists=True,
file_okay=True,
readable=True,
autocompletion=path_autocomplete(
file_okay=True, dir_okay=False, match_wildcard="*"
),
method: SearchMethod = typer.Option(
...,
"--method",
"-m",
help="The query algorithm to use.",
),
query: str = typer.Option(
...,
"--query",
"-q",
help="The query to execute.",
),
config: Path | None = typer.Option(
None,
"--config",
"-c",
help="The configuration to use.",
exists=True,
file_okay=True,
readable=True,
autocompletion=CONFIG_AUTOCOMPLETE,
),
verbose: bool = typer.Option(
False,
"--verbose",
"-v",
help="Run the query with verbose logging.",
),
data: Path | None = typer.Option(
None,
"--data",
"-d",
help="Index output directory (contains the parquet files).",
exists=True,
dir_okay=True,
readable=True,
resolve_path=True,
autocompletion=ROOT_AUTOCOMPLETE,
),
root: Path = typer.Option(
Path(),
"--root",
"-r",
help="The project root directory.",
exists=True,
dir_okay=True,
writable=True,
resolve_path=True,
autocompletion=ROOT_AUTOCOMPLETE,
),
community_level: int = typer.Option(
2,
"--community-level",
help=(
"Leiden hierarchy level from which to load community reports. "
"Higher values represent smaller communities."
),
] = None,
data: Annotated[
Path | None,
typer.Option(
help="Indexing pipeline output directory (i.e. contains the parquet files).",
exists=True,
dir_okay=True,
readable=True,
resolve_path=True,
autocompletion=path_autocomplete(
file_okay=False, dir_okay=True, match_wildcard="*"
),
),
dynamic_community_selection: bool = typer.Option(
False,
"--dynamic-community-selection/--no-dynamic-selection",
help="Use global search with dynamic community selection.",
),
response_type: str = typer.Option(
"Multiple Paragraphs",
"--response-type",
help=(
"Free-form description of the desired response format "
"(e.g. 'Single Sentence', 'List of 3-7 Points', etc.)."
),
] = None,
root: Annotated[
Path,
typer.Option(
help="The project root directory.",
exists=True,
dir_okay=True,
writable=True,
resolve_path=True,
autocompletion=path_autocomplete(
file_okay=False, dir_okay=True, match_wildcard="*"
),
),
] = Path(), # set default to current directory
community_level: Annotated[
int,
typer.Option(
help="The community level in the Leiden community hierarchy from which to load community reports. Higher values represent reports from smaller communities."
),
] = 2,
dynamic_community_selection: Annotated[
bool,
typer.Option(help="Use global search with dynamic community selection."),
] = False,
response_type: Annotated[
str,
typer.Option(
help="Free form text describing the response type and format, can be anything, e.g. Multiple Paragraphs, Single Paragraph, Single Sentence, List of 3-7 Points, Single Page, Multi-Page Report. Default: Multiple Paragraphs"
),
] = "Multiple Paragraphs",
streaming: Annotated[
bool, typer.Option(help="Print response in a streaming manner.")
] = False,
):
),
streaming: bool = typer.Option(
False,
"--streaming/--no-streaming",
help="Print the response in a streaming manner.",
),
) -> None:
"""Query a knowledge graph index."""
from graphrag.cli.query import (
run_basic_search,
@ -451,6 +507,7 @@ def _query_cli(
response_type=response_type,
streaming=streaming,
query=query,
verbose=verbose,
)
case SearchMethod.GLOBAL:
run_global_search(
@ -462,6 +519,7 @@ def _query_cli(
response_type=response_type,
streaming=streaming,
query=query,
verbose=verbose,
)
case SearchMethod.DRIFT:
run_drift_search(
@ -472,6 +530,7 @@ def _query_cli(
streaming=streaming,
response_type=response_type,
query=query,
verbose=verbose,
)
case SearchMethod.BASIC:
run_basic_search(
@ -480,6 +539,7 @@ def _query_cli(
root_dir=root,
streaming=streaming,
query=query,
verbose=verbose,
)
case _:
raise ValueError(INVALID_METHOD_ERROR)
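Note: the CLI changes above move every option from the Annotated[...] declaration style to explicit typer.Option(...) defaults with short flags (-c, -r, -m, -o) and shared autocompletion callbacks. A minimal, standalone sketch of that style, using a hypothetical command and option names rather than the project's actual file:

from pathlib import Path

import typer

app = typer.Typer()


@app.command("index")
def index(
    root: Path = typer.Option(
        Path(),
        "--root",
        "-r",
        help="The project root directory.",
        exists=True,
        dir_okay=True,
        resolve_path=True,
    ),
    verbose: bool = typer.Option(
        False,
        "--verbose",
        "-v",
        help="Run with verbose logging.",
    ),
) -> None:
    """Hypothetical command showing the explicit-flag option style."""
    typer.echo(f"root={root} verbose={verbose}")


if __name__ == "__main__":
    app()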

View File

@ -3,13 +3,11 @@
"""CLI implementation of the prompt-tune subcommand."""
import logging
from pathlib import Path
import graphrag.api as api
from graphrag.cli.index import _logger
from graphrag.config.load_config import load_config
from graphrag.config.logging import enable_logging_with_config
from graphrag.logger.factory import LoggerFactory, LoggerType
from graphrag.prompt_tune.generator.community_report_summarization import (
COMMUNITY_SUMMARIZATION_FILENAME,
)
@ -21,13 +19,14 @@ from graphrag.prompt_tune.generator.extract_graph_prompt import (
)
from graphrag.utils.cli import redact
logger = logging.getLogger(__name__)
async def prompt_tune(
root: Path,
config: Path | None,
domain: str | None,
verbose: bool,
logger: LoggerType,
selection_method: api.DocSelectionType,
limit: int,
max_tokens: int,
@ -47,8 +46,7 @@ async def prompt_tune(
- config: The configuration file.
- root: The root directory.
- domain: The domain to map the input documents to.
- verbose: Whether to enable verbose logging.
- logger: The logger to use.
- verbose: Enable verbose logging.
- selection_method: The chunk selection method.
- limit: The limit of chunks to load.
- max_tokens: The maximum number of tokens to use on entity extraction prompts.
@ -70,24 +68,20 @@ async def prompt_tune(
if overlap != graph_config.chunks.overlap:
graph_config.chunks.overlap = overlap
progress_logger = LoggerFactory().create_logger(logger)
info, error, success = _logger(progress_logger)
# configure the root logger with the specified log level
from graphrag.logger.standard_logging import init_loggers
enabled_logging, log_path = enable_logging_with_config(
graph_config, verbose, filename="prompt-tune.log"
# initialize loggers with config
init_loggers(config=graph_config, verbose=verbose, filename="prompt-tuning.log")
logger.info("Starting prompt tune.")
logger.info(
"Using default configuration: %s",
redact(graph_config.model_dump()),
)
if enabled_logging:
info(f"Logging enabled at {log_path}", verbose)
else:
info(
f"Logging not enabled for config {redact(graph_config.model_dump())}",
verbose,
)
prompts = await api.generate_indexing_prompts(
config=graph_config,
root=str(root_path),
logger=progress_logger,
chunk_size=chunk_size,
overlap=overlap,
limit=limit,
@ -99,24 +93,25 @@ async def prompt_tune(
min_examples_required=min_examples_required,
n_subset_max=n_subset_max,
k=k,
verbose=verbose,
)
output_path = output.resolve()
if output_path:
info(f"Writing prompts to {output_path}")
logger.info("Writing prompts to %s", output_path)
output_path.mkdir(parents=True, exist_ok=True)
extract_graph_prompt_path = output_path / EXTRACT_GRAPH_FILENAME
entity_summarization_prompt_path = output_path / ENTITY_SUMMARIZATION_FILENAME
community_summarization_prompt_path = (
output_path / COMMUNITY_SUMMARIZATION_FILENAME
)
# Write files to output path
# write files to output path
with extract_graph_prompt_path.open("wb") as file:
file.write(prompts[0].encode(encoding="utf-8", errors="strict"))
with entity_summarization_prompt_path.open("wb") as file:
file.write(prompts[1].encode(encoding="utf-8", errors="strict"))
with community_summarization_prompt_path.open("wb") as file:
file.write(prompts[2].encode(encoding="utf-8", errors="strict"))
success(f"Prompts written to {output_path}")
logger.info("Prompts written to %s", output_path)
else:
error("No output path provided. Skipping writing prompts.")
logger.error("No output path provided. Skipping writing prompts.")

View File

@ -12,14 +12,13 @@ import graphrag.api as api
from graphrag.callbacks.noop_query_callbacks import NoopQueryCallbacks
from graphrag.config.load_config import load_config
from graphrag.config.models.graph_rag_config import GraphRagConfig
from graphrag.logger.print_progress import PrintProgressLogger
from graphrag.utils.api import create_storage_from_config
from graphrag.utils.storage import load_table_from_storage, storage_has_table
if TYPE_CHECKING:
import pandas as pd
logger = PrintProgressLogger("")
# ruff: noqa: T201
def run_global_search(
@ -31,6 +30,7 @@ def run_global_search(
response_type: str,
streaming: bool,
query: str,
verbose: bool,
):
"""Perform a global search with a given query.
@ -59,10 +59,6 @@ def run_global_search(
final_community_reports_list = dataframe_dict["community_reports"]
index_names = dataframe_dict["index_names"]
logger.success(
f"Running Multi-index Global Search: {dataframe_dict['index_names']}"
)
response, context_data = asyncio.run(
api.multi_index_global_search(
config=config,
@ -75,11 +71,10 @@ def run_global_search(
response_type=response_type,
streaming=streaming,
query=query,
verbose=verbose,
)
)
logger.success(f"Global Search Response:\n{response}")
# NOTE: we return the response and context data here purely as a complete demonstration of the API.
# External users should use the API directly to get the response and context data.
print(response)
return response, context_data
# Otherwise, call the Single-Index Global Search API
@ -110,11 +105,12 @@ def run_global_search(
response_type=response_type,
query=query,
callbacks=[callbacks],
verbose=verbose,
):
full_response += stream_chunk
print(stream_chunk, end="") # noqa: T201
sys.stdout.flush() # flush output buffer to display text immediately
print() # noqa: T201
print(stream_chunk, end="")
sys.stdout.flush()
print()
return full_response, context_data
return asyncio.run(run_streaming_search())
@ -129,11 +125,11 @@ def run_global_search(
dynamic_community_selection=dynamic_community_selection,
response_type=response_type,
query=query,
verbose=verbose,
)
)
logger.success(f"Global Search Response:\n{response}")
# NOTE: we return the response and context data here purely as a complete demonstration of the API.
# External users should use the API directly to get the response and context data.
print(response)
return response, context_data
@ -145,6 +141,7 @@ def run_local_search(
response_type: str,
streaming: bool,
query: str,
verbose: bool,
):
"""Perform a local search with a given query.
@ -178,10 +175,6 @@ def run_local_search(
final_relationships_list = dataframe_dict["relationships"]
index_names = dataframe_dict["index_names"]
logger.success(
f"Running Multi-index Local Search: {dataframe_dict['index_names']}"
)
# If any covariates tables are missing from any index, set the covariates list to None
if len(dataframe_dict["covariates"]) != dataframe_dict["num_indexes"]:
final_covariates_list = None
@ -202,11 +195,11 @@ def run_local_search(
response_type=response_type,
streaming=streaming,
query=query,
verbose=verbose,
)
)
logger.success(f"Local Search Response:\n{response}")
# NOTE: we return the response and context data here purely as a complete demonstration of the API.
# External users should use the API directly to get the response and context data.
print(response)
return response, context_data
# Otherwise, call the Single-Index Local Search API
@ -242,11 +235,12 @@ def run_local_search(
response_type=response_type,
query=query,
callbacks=[callbacks],
verbose=verbose,
):
full_response += stream_chunk
print(stream_chunk, end="") # noqa: T201
sys.stdout.flush() # flush output buffer to display text immediately
print() # noqa: T201
print(stream_chunk, end="")
sys.stdout.flush()
print()
return full_response, context_data
return asyncio.run(run_streaming_search())
@ -263,11 +257,11 @@ def run_local_search(
community_level=community_level,
response_type=response_type,
query=query,
verbose=verbose,
)
)
logger.success(f"Local Search Response:\n{response}")
# NOTE: we return the response and context data here purely as a complete demonstration of the API.
# External users should use the API directly to get the response and context data.
print(response)
return response, context_data
@ -279,6 +273,7 @@ def run_drift_search(
response_type: str,
streaming: bool,
query: str,
verbose: bool,
):
"""Perform a local search with a given query.
@ -310,10 +305,6 @@ def run_drift_search(
final_relationships_list = dataframe_dict["relationships"]
index_names = dataframe_dict["index_names"]
logger.success(
f"Running Multi-index Drift Search: {dataframe_dict['index_names']}"
)
response, context_data = asyncio.run(
api.multi_index_drift_search(
config=config,
@ -327,11 +318,11 @@ def run_drift_search(
response_type=response_type,
streaming=streaming,
query=query,
verbose=verbose,
)
)
logger.success(f"DRIFT Search Response:\n{response}")
# NOTE: we return the response and context data here purely as a complete demonstration of the API.
# External users should use the API directly to get the response and context data.
print(response)
return response, context_data
# Otherwise, call the Single-Index Drift Search API
@ -365,11 +356,12 @@ def run_drift_search(
response_type=response_type,
query=query,
callbacks=[callbacks],
verbose=verbose,
):
full_response += stream_chunk
print(stream_chunk, end="") # noqa: T201
sys.stdout.flush() # flush output buffer to display text immediately
print() # noqa: T201
print(stream_chunk, end="")
sys.stdout.flush()
print()
return full_response, context_data
return asyncio.run(run_streaming_search())
@ -386,11 +378,11 @@ def run_drift_search(
community_level=community_level,
response_type=response_type,
query=query,
verbose=verbose,
)
)
logger.success(f"DRIFT Search Response:\n{response}")
# NOTE: we return the response and context data here purely as a complete demonstration of the API.
# External users should use the API directly to get the response and context data.
print(response)
return response, context_data
@ -400,6 +392,7 @@ def run_basic_search(
root_dir: Path,
streaming: bool,
query: str,
verbose: bool,
):
"""Perform a basics search with a given query.
@ -423,10 +416,6 @@ def run_basic_search(
final_text_units_list = dataframe_dict["text_units"]
index_names = dataframe_dict["index_names"]
logger.success(
f"Running Multi-index Basic Search: {dataframe_dict['index_names']}"
)
response, context_data = asyncio.run(
api.multi_index_basic_search(
config=config,
@ -434,11 +423,11 @@ def run_basic_search(
index_names=index_names,
streaming=streaming,
query=query,
verbose=verbose,
)
)
logger.success(f"Basic Search Response:\n{response}")
# NOTE: we return the response and context data here purely as a complete demonstration of the API.
# External users should use the API directly to get the response and context data.
print(response)
return response, context_data
# Otherwise, call the Single-Index Basic Search API
@ -461,11 +450,13 @@ def run_basic_search(
config=config,
text_units=final_text_units,
query=query,
callbacks=[callbacks],
verbose=verbose,
):
full_response += stream_chunk
print(stream_chunk, end="") # noqa: T201
sys.stdout.flush() # flush output buffer to display text immediately
print() # noqa: T201
print(stream_chunk, end="")
sys.stdout.flush()
print()
return full_response, context_data
return asyncio.run(run_streaming_search())
@ -475,11 +466,11 @@ def run_basic_search(
config=config,
text_units=final_text_units,
query=query,
verbose=verbose,
)
)
logger.success(f"Basic Search Response:\n{response}")
# NOTE: we return the response and context data here purely as a complete demonstration of the API.
# External users should use the API directly to get the response and context data.
print(response)
return response, context_data
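Note: all four run_*_search functions share the streaming loop shown above: accumulate chunks, print each one without a newline, and flush stdout so partial output appears immediately. A self-contained sketch of that loop with a stand-in generator in place of the real search APIs:

import asyncio
import sys
from collections.abc import AsyncIterator


async def fake_stream() -> AsyncIterator[str]:
    # Stand-in for the streaming search APIs; yields response chunks.
    for chunk in ["Knowledge ", "graph ", "answer."]:
        yield chunk
        await asyncio.sleep(0)


async def run_streaming_search() -> str:
    full_response = ""
    async for stream_chunk in fake_stream():
        full_response += stream_chunk
        print(stream_chunk, end="")
        sys.stdout.flush()  # flush the buffer so text shows right away
    print()
    return full_response


if __name__ == "__main__":
    asyncio.run(run_streaming_search())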

View File

@ -3,42 +3,75 @@
"""Common default configuration values."""
from collections.abc import Callable
from dataclasses import dataclass, field
from pathlib import Path
from typing import ClassVar
from graphrag.config.embeddings import default_embeddings
from graphrag.config.enums import (
AsyncType,
AuthType,
CacheType,
ChunkStrategyType,
InputFileType,
InputType,
ModelType,
NounPhraseExtractorType,
OutputType,
ReportingType,
TextEmbeddingTarget,
StorageType,
VectorStoreType,
)
from graphrag.index.operations.build_noun_graph.np_extractors.stop_words import (
EN_STOP_WORDS,
)
from graphrag.vector_stores.factory import VectorStoreType
from graphrag.language_model.providers.litellm.services.rate_limiter.rate_limiter import (
RateLimiter,
)
from graphrag.language_model.providers.litellm.services.rate_limiter.static_rate_limiter import (
StaticRateLimiter,
)
from graphrag.language_model.providers.litellm.services.retry.exponential_retry import (
ExponentialRetry,
)
from graphrag.language_model.providers.litellm.services.retry.incremental_wait_retry import (
IncrementalWaitRetry,
)
from graphrag.language_model.providers.litellm.services.retry.native_wait_retry import (
NativeRetry,
)
from graphrag.language_model.providers.litellm.services.retry.random_wait_retry import (
RandomWaitRetry,
)
from graphrag.language_model.providers.litellm.services.retry.retry import Retry
DEFAULT_OUTPUT_BASE_DIR = "output"
DEFAULT_CHAT_MODEL_ID = "default_chat_model"
DEFAULT_CHAT_MODEL_TYPE = ModelType.OpenAIChat
DEFAULT_CHAT_MODEL_TYPE = ModelType.Chat
DEFAULT_CHAT_MODEL = "gpt-4-turbo-preview"
DEFAULT_CHAT_MODEL_AUTH_TYPE = AuthType.APIKey
DEFAULT_EMBEDDING_MODEL_ID = "default_embedding_model"
DEFAULT_EMBEDDING_MODEL_TYPE = ModelType.OpenAIEmbedding
DEFAULT_EMBEDDING_MODEL_TYPE = ModelType.Embedding
DEFAULT_EMBEDDING_MODEL = "text-embedding-3-small"
DEFAULT_EMBEDDING_MODEL_AUTH_TYPE = AuthType.APIKey
DEFAULT_MODEL_PROVIDER = "openai"
DEFAULT_VECTOR_STORE_ID = "default_vector_store"
ENCODING_MODEL = "cl100k_base"
COGNITIVE_SERVICES_AUDIENCE = "https://cognitiveservices.azure.com/.default"
DEFAULT_RETRY_SERVICES: dict[str, Callable[..., Retry]] = {
"native": NativeRetry,
"exponential_backoff": ExponentialRetry,
"random_wait": RandomWaitRetry,
"incremental_wait": IncrementalWaitRetry,
}
DEFAULT_RATE_LIMITER_SERVICES: dict[str, Callable[..., RateLimiter]] = {
"static": StaticRateLimiter,
}
@dataclass
class BasicSearchDefaults:
"""Default values for basic search."""
@ -54,7 +87,7 @@ class BasicSearchDefaults:
class CacheDefaults:
"""Default values for cache."""
type = CacheType.file
type: ClassVar[CacheType] = CacheType.file
base_dir: str = "cache"
connection_string: None = None
container_name: None = None
@ -69,7 +102,7 @@ class ChunksDefaults:
size: int = 1200
overlap: int = 100
group_by_columns: list[str] = field(default_factory=lambda: ["id"])
strategy = ChunkStrategyType.tokens
strategy: ClassVar[ChunkStrategyType] = ChunkStrategyType.tokens
encoding_model: str = "cl100k_base"
prepend_metadata: bool = False
chunk_size_includes_metadata: bool = False
@ -119,8 +152,8 @@ class DriftSearchDefaults:
local_search_temperature: float = 0
local_search_top_p: float = 1
local_search_n: int = 1
local_search_llm_max_gen_tokens = None
local_search_llm_max_gen_completion_tokens = None
local_search_llm_max_gen_tokens: int | None = None
local_search_llm_max_gen_completion_tokens: int | None = None
chat_model_id: str = DEFAULT_CHAT_MODEL_ID
embedding_model_id: str = DEFAULT_EMBEDDING_MODEL_ID
@ -146,9 +179,8 @@ class EmbedTextDefaults:
model: str = "text-embedding-3-small"
batch_size: int = 16
batch_max_tokens: int = 8191
target = TextEmbeddingTarget.required
model_id: str = DEFAULT_EMBEDDING_MODEL_ID
names: list[str] = field(default_factory=list)
names: list[str] = field(default_factory=lambda: default_embeddings)
strategy: None = None
vector_store_id: str = DEFAULT_VECTOR_STORE_ID
@ -184,7 +216,9 @@ class ExtractGraphDefaults:
class TextAnalyzerDefaults:
"""Default values for text analyzer."""
extractor_type = NounPhraseExtractorType.RegexEnglish
extractor_type: ClassVar[NounPhraseExtractorType] = (
NounPhraseExtractorType.RegexEnglish
)
model_name: str = "en_core_web_md"
max_word_length: int = 15
word_delimiter: str = " "
@ -213,6 +247,7 @@ class ExtractGraphNLPDefaults:
normalize_edge_weights: bool = True
text_analyzer: TextAnalyzerDefaults = field(default_factory=TextAnalyzerDefaults)
concurrent_requests: int = 25
async_mode: AsyncType = AsyncType.Threaded
@dataclass
@ -234,16 +269,31 @@ class GlobalSearchDefaults:
chat_model_id: str = DEFAULT_CHAT_MODEL_ID
@dataclass
class StorageDefaults:
"""Default values for storage."""
type: ClassVar[StorageType] = StorageType.file
base_dir: str = DEFAULT_OUTPUT_BASE_DIR
connection_string: None = None
container_name: None = None
storage_account_blob_url: None = None
cosmosdb_account_url: None = None
@dataclass
class InputStorageDefaults(StorageDefaults):
"""Default values for input storage."""
base_dir: str = "input"
@dataclass
class InputDefaults:
"""Default values for input."""
type = InputType.file
file_type = InputFileType.text
base_dir: str = "input"
connection_string: None = None
storage_account_blob_url: None = None
container_name: None = None
storage: InputStorageDefaults = field(default_factory=InputStorageDefaults)
file_type: ClassVar[InputFileType] = InputFileType.text
encoding: str = "utf-8"
file_pattern: str = ""
file_filter: None = None
@ -257,7 +307,8 @@ class LanguageModelDefaults:
"""Default values for language model."""
api_key: None = None
auth_type = AuthType.APIKey
auth_type: ClassVar[AuthType] = AuthType.APIKey
model_provider: str | None = None
encoding_model: str = ""
max_tokens: int | None = None
temperature: float = 0
@ -275,9 +326,10 @@ class LanguageModelDefaults:
proxy: None = None
audience: None = None
model_supports_json: None = None
tokens_per_minute: int = 50_000
requests_per_minute: int = 1_000
retry_strategy: str = "native"
tokens_per_minute: None = None
requests_per_minute: None = None
rate_limit_strategy: str | None = "static"
retry_strategy: str = "exponential_backoff"
max_retries: int = 10
max_retry_wait: float = 10.0
concurrent_requests: int = 25
@ -301,15 +353,10 @@ class LocalSearchDefaults:
@dataclass
class OutputDefaults:
class OutputDefaults(StorageDefaults):
"""Default values for output."""
type = OutputType.file
base_dir: str = DEFAULT_OUTPUT_BASE_DIR
connection_string: None = None
container_name: None = None
storage_account_blob_url: None = None
cosmosdb_account_url: None = None
@dataclass
@ -329,7 +376,7 @@ class PruneGraphDefaults:
class ReportingDefaults:
"""Default values for reporting."""
type = ReportingType.file
type: ClassVar[ReportingType] = ReportingType.file
base_dir: str = "logs"
connection_string: None = None
container_name: None = None
@ -364,21 +411,17 @@ class UmapDefaults:
@dataclass
class UpdateIndexOutputDefaults:
class UpdateIndexOutputDefaults(StorageDefaults):
"""Default values for update index output."""
type = OutputType.file
base_dir: str = "update_output"
connection_string: None = None
container_name: None = None
storage_account_blob_url: None = None
@dataclass
class VectorStoreDefaults:
"""Default values for vector stores."""
type = VectorStoreType.LanceDB.value
type: ClassVar[str] = VectorStoreType.LanceDB.value
db_uri: str = str(Path(DEFAULT_OUTPUT_BASE_DIR) / "lancedb")
container_name: str = "default"
overwrite: bool = True
@ -386,6 +429,7 @@ class VectorStoreDefaults:
api_key: None = None
audience: None = None
database_name: None = None
schema: None = None
@dataclass
@ -395,6 +439,7 @@ class GraphRagConfigDefaults:
root_dir: str = ""
models: dict = field(default_factory=dict)
reporting: ReportingDefaults = field(default_factory=ReportingDefaults)
storage: StorageDefaults = field(default_factory=StorageDefaults)
output: OutputDefaults = field(default_factory=OutputDefaults)
outputs: None = None
update_index_output: UpdateIndexOutputDefaults = field(
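Note: several of the defaults above gain ClassVar annotations (for example type: ClassVar[CacheType] = CacheType.file). In a dataclass, a ClassVar-annotated attribute stays a class-level constant and is not turned into an __init__ parameter or a dataclass field. A tiny illustration with a hypothetical enum:

from dataclasses import dataclass, fields
from enum import Enum
from typing import ClassVar


class CacheKind(str, Enum):
    # Hypothetical stand-in for graphrag's CacheType enum.
    file = "file"
    blob = "blob"


@dataclass
class CacheDefaultsSketch:
    # ClassVar keeps `kind` out of __init__ and out of dataclass fields().
    kind: ClassVar[CacheKind] = CacheKind.file
    base_dir: str = "cache"


print([f.name for f in fields(CacheDefaultsSketch)])  # ['base_dir']
print(CacheDefaultsSketch().base_dir, CacheDefaultsSketch.kind.value)  # cache file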

View File

@ -3,9 +3,6 @@
"""A module containing embeddings values."""
from graphrag.config.enums import TextEmbeddingTarget
from graphrag.config.models.graph_rag_config import GraphRagConfig
entity_title_embedding = "entity.title"
entity_description_embedding = "entity.description"
relationship_description_embedding = "relationship.description"
@ -25,70 +22,21 @@ all_embeddings: set[str] = {
community_full_content_embedding,
text_unit_text_embedding,
}
required_embeddings: set[str] = {
default_embeddings: list[str] = [
entity_description_embedding,
community_full_content_embedding,
text_unit_text_embedding,
}
]
def get_embedded_fields(settings: GraphRagConfig) -> set[str]:
"""Get the fields to embed based on the enum or specifically selected embeddings."""
match settings.embed_text.target:
case TextEmbeddingTarget.all:
return all_embeddings
case TextEmbeddingTarget.required:
return required_embeddings
case TextEmbeddingTarget.selected:
return set(settings.embed_text.names)
case TextEmbeddingTarget.none:
return set()
case _:
msg = f"Unknown embeddings target: {settings.embed_text.target}"
raise ValueError(msg)
def get_embedding_settings(
settings: GraphRagConfig,
vector_store_params: dict | None = None,
) -> dict:
"""Transform GraphRAG config into settings for workflows."""
# TEMP
embeddings_llm_settings = settings.get_language_model_config(
settings.embed_text.model_id
)
vector_store_settings = settings.get_vector_store_config(
settings.embed_text.vector_store_id
).model_dump()
#
# If we get to this point, settings.vector_store is defined, and there's a specific setting for this embedding.
# settings.vector_store.base contains connection information, or may be undefined
# settings.vector_store.<vector_name> contains the specific settings for this embedding
#
strategy = settings.embed_text.resolved_strategy(
embeddings_llm_settings
) # get the default strategy
strategy.update({
"vector_store": {
**(vector_store_params or {}),
**(vector_store_settings),
}
}) # update the default strategy with the vector store settings
# This ensures the vector store config is part of the strategy and not the global config
return {
"strategy": strategy,
}
def create_collection_name(
def create_index_name(
container_name: str, embedding_name: str, validate: bool = True
) -> str:
"""
Create a collection name for the embedding store.
Create an index name for the embedding store.
Within any given vector store, we can have multiple sets of embeddings organized into projects.
The `container` param is used for this partitioning, and is added as a prefix to the collection name for differentiation.
The `container` param is used for this partitioning, and is added as a prefix to the index name for differentiation.
The embedding name is fixed, with the available list defined in graphrag.index.config.embeddings
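Note: the rename from create_collection_name to create_index_name keeps the same idea: the container name partitions one vector store into projects by prefixing the fixed embedding name. A purely hypothetical sketch of that prefixing (the real function's separator and validation are defined in the library):

def create_index_name_sketch(container_name: str, embedding_name: str) -> str:
    # Illustration only: join the project container prefix with the fixed
    # embedding name so multiple indexes can coexist in one vector store.
    return f"{container_name}-{embedding_name}"


print(create_index_name_sketch("default", "entity.description"))
# -> default-entity.description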

View File

@ -42,20 +42,7 @@ class InputFileType(str, Enum):
return f'"{self.value}"'
class InputType(str, Enum):
"""The input type for the pipeline."""
file = "file"
"""The file storage type."""
blob = "blob"
"""The blob storage type."""
def __repr__(self):
"""Get a string representation."""
return f'"{self.value}"'
class OutputType(str, Enum):
class StorageType(str, Enum):
"""The output type for the pipeline."""
file = "file"
@ -72,13 +59,19 @@ class OutputType(str, Enum):
return f'"{self.value}"'
class VectorStoreType(str, Enum):
"""The supported vector store types."""
LanceDB = "lancedb"
AzureAISearch = "azure_ai_search"
CosmosDB = "cosmosdb"
class ReportingType(str, Enum):
"""The reporting configuration type for the pipeline."""
file = "file"
"""The file reporting configuration type."""
console = "console"
"""The console reporting configuration type."""
blob = "blob"
"""The blob reporting configuration type."""
@ -87,29 +80,18 @@ class ReportingType(str, Enum):
return f'"{self.value}"'
class TextEmbeddingTarget(str, Enum):
"""The target to use for text embeddings."""
all = "all"
required = "required"
selected = "selected"
none = "none"
def __repr__(self):
"""Get a string representation."""
return f'"{self.value}"'
class ModelType(str, Enum):
"""LLMType enum class definition."""
# Embeddings
OpenAIEmbedding = "openai_embedding"
AzureOpenAIEmbedding = "azure_openai_embedding"
Embedding = "embedding"
# Chat Completion
OpenAIChat = "openai_chat"
AzureOpenAIChat = "azure_openai_chat"
Chat = "chat"
# Debug
MockChat = "mock_chat"
@ -165,6 +147,10 @@ class IndexingMethod(str, Enum):
"""Traditional GraphRAG indexing, with all graph construction and summarization performed by a language model."""
Fast = "fast"
"""Fast indexing, using NLP for graph construction and language model for summarization."""
StandardUpdate = "standard-update"
"""Incremental update with standard indexing."""
FastUpdate = "fast-update"
"""Incremental update with fast indexing."""
class NounPhraseExtractorType(str, Enum):
@ -176,3 +162,15 @@ class NounPhraseExtractorType(str, Enum):
"""Noun phrase extractor based on dependency parsing and NER using SpaCy."""
CFG = "cfg"
"""Noun phrase extractor combining CFG-based noun-chunk extraction and NER."""
class ModularityMetric(str, Enum):
"""Enum for the modularity metric to use."""
Graph = "graph"
"""Graph modularity metric."""
LCC = "lcc"
WeightedComponents = "weighted_components"
"""Weighted components modularity metric."""

View File

@ -33,15 +33,6 @@ class AzureApiVersionMissingError(ValueError):
super().__init__(msg)
class AzureDeploymentNameMissingError(ValueError):
"""Azure Deployment Name missing error."""
def __init__(self, llm_type: str) -> None:
"""Init method definition."""
msg = f"Deployment name is required for {llm_type}. Please rerun `graphrag init` set the deployment_name."
super().__init__(msg)
class LanguageModelConfigMissingError(ValueError):
"""Missing model configuration error."""

View File

@ -0,0 +1,38 @@
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License
"""A module containing get_embedding_settings."""
from graphrag.config.models.graph_rag_config import GraphRagConfig
def get_embedding_settings(
settings: GraphRagConfig,
vector_store_params: dict | None = None,
) -> dict:
"""Transform GraphRAG config into settings for workflows."""
embeddings_llm_settings = settings.get_language_model_config(
settings.embed_text.model_id
)
vector_store_settings = settings.get_vector_store_config(
settings.embed_text.vector_store_id
).model_dump()
#
# If we get to this point, settings.vector_store is defined, and there's a specific setting for this embedding.
# settings.vector_store.base contains connection information, or may be undefined
# settings.vector_store.<vector_name> contains the specific settings for this embedding
#
strategy = settings.embed_text.resolved_strategy(
embeddings_llm_settings
) # get the default strategy
strategy.update({
"vector_store": {
**(vector_store_params or {}),
**(vector_store_settings),
}
}) # update the default strategy with the vector store settings
# This ensures the vector store config is part of the strategy and not the global config
return {
"strategy": strategy,
}
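Note: the strategy update above merges two dicts with the configured vector store settings spread last, so they win over any call-time vector_store_params. A minimal, standalone illustration of that merge order (hypothetical keys):

vector_store_params = {"container_name": "runtime-override", "overwrite": False}
vector_store_settings = {"type": "lancedb", "container_name": "default"}

# Later spreads win: the configured container_name overrides the call-time one.
merged = {**(vector_store_params or {}), **vector_store_settings}
print(merged)
# {'container_name': 'default', 'overwrite': False, 'type': 'lancedb'}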

View File

@ -19,48 +19,42 @@ INIT_YAML = f"""\
models:
{defs.DEFAULT_CHAT_MODEL_ID}:
type: {defs.DEFAULT_CHAT_MODEL_TYPE.value} # or azure_openai_chat
# api_base: https://<instance>.openai.azure.com
# api_version: 2024-05-01-preview
type: {defs.DEFAULT_CHAT_MODEL_TYPE.value}
model_provider: {defs.DEFAULT_MODEL_PROVIDER}
auth_type: {defs.DEFAULT_CHAT_MODEL_AUTH_TYPE.value} # or azure_managed_identity
api_key: ${{GRAPHRAG_API_KEY}} # set this in the generated .env file
# audience: "https://cognitiveservices.azure.com/.default"
# organization: <organization_id>
api_key: ${{GRAPHRAG_API_KEY}} # set this in the generated .env file, or remove if managed identity
model: {defs.DEFAULT_CHAT_MODEL}
# deployment_name: <azure_model_deployment_name>
# encoding_model: {defs.ENCODING_MODEL} # automatically set by tiktoken if left undefined
model_supports_json: true # recommended if this is available for your model.
concurrent_requests: {language_model_defaults.concurrent_requests} # max number of simultaneous LLM requests allowed
async_mode: {language_model_defaults.async_mode.value} # or asyncio
retry_strategy: native
max_retries: -1 # set to -1 for dynamic retry logic (most optimal setting based on server response)
tokens_per_minute: 0 # set to 0 to disable rate limiting
requests_per_minute: 0 # set to 0 to disable rate limiting
{defs.DEFAULT_EMBEDDING_MODEL_ID}:
type: {defs.DEFAULT_EMBEDDING_MODEL_TYPE.value} # or azure_openai_embedding
# api_base: https://<instance>.openai.azure.com
# api_version: 2024-05-01-preview
auth_type: {defs.DEFAULT_EMBEDDING_MODEL_AUTH_TYPE.value} # or azure_managed_identity
api_key: ${{GRAPHRAG_API_KEY}}
# audience: "https://cognitiveservices.azure.com/.default"
# organization: <organization_id>
model: {defs.DEFAULT_EMBEDDING_MODEL}
# deployment_name: <azure_model_deployment_name>
# encoding_model: {defs.ENCODING_MODEL} # automatically set by tiktoken if left undefined
model_supports_json: true # recommended if this is available for your model.
concurrent_requests: {language_model_defaults.concurrent_requests} # max number of simultaneous LLM requests allowed
concurrent_requests: {language_model_defaults.concurrent_requests}
async_mode: {language_model_defaults.async_mode.value} # or asyncio
retry_strategy: native
max_retries: -1 # set to -1 for dynamic retry logic (most optimal setting based on server response)
tokens_per_minute: 0 # set to 0 to disable rate limiting
requests_per_minute: 0 # set to 0 to disable rate limiting
retry_strategy: {language_model_defaults.retry_strategy}
max_retries: {language_model_defaults.max_retries}
tokens_per_minute: null
requests_per_minute: null
{defs.DEFAULT_EMBEDDING_MODEL_ID}:
type: {defs.DEFAULT_EMBEDDING_MODEL_TYPE.value}
model_provider: {defs.DEFAULT_MODEL_PROVIDER}
auth_type: {defs.DEFAULT_EMBEDDING_MODEL_AUTH_TYPE.value}
api_key: ${{GRAPHRAG_API_KEY}}
model: {defs.DEFAULT_EMBEDDING_MODEL}
# api_base: https://<instance>.openai.azure.com
# api_version: 2024-05-01-preview
concurrent_requests: {language_model_defaults.concurrent_requests}
async_mode: {language_model_defaults.async_mode.value} # or asyncio
retry_strategy: {language_model_defaults.retry_strategy}
max_retries: {language_model_defaults.max_retries}
tokens_per_minute: null
requests_per_minute: null
### Input settings ###
input:
type: {graphrag_config_defaults.input.type.value} # or blob
storage:
type: {graphrag_config_defaults.input.storage.type.value} # or blob
base_dir: "{graphrag_config_defaults.input.storage.base_dir}"
file_type: {graphrag_config_defaults.input.file_type.value} # [csv, text, json]
base_dir: "{graphrag_config_defaults.input.base_dir}"
chunks:
size: {graphrag_config_defaults.chunks.size}
@ -80,7 +74,7 @@ cache:
base_dir: "{graphrag_config_defaults.cache.base_dir}"
reporting:
type: {graphrag_config_defaults.reporting.type.value} # [file, blob, cosmosdb]
type: {graphrag_config_defaults.reporting.type.value} # [file, blob]
base_dir: "{graphrag_config_defaults.reporting.base_dir}"
vector_store:
@ -88,7 +82,6 @@ vector_store:
type: {vector_store_defaults.type}
db_uri: {vector_store_defaults.db_uri}
container_name: {vector_store_defaults.container_name}
overwrite: {vector_store_defaults.overwrite}
### Workflow settings ###
@ -110,6 +103,7 @@ summarize_descriptions:
extract_graph_nlp:
text_analyzer:
extractor_type: {graphrag_config_defaults.extract_graph_nlp.text_analyzer.extractor_type.value} # [regex_english, syntactic_parser, cfg]
async_mode: {graphrag_config_defaults.extract_graph_nlp.async_mode.value} # or asyncio
cluster_graph:
max_cluster_size: {graphrag_config_defaults.cluster_graph.max_cluster_size}

View File

@ -1,61 +0,0 @@
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License
"""Logging utilities. A unified way for enabling logging."""
import logging
from pathlib import Path
from graphrag.config.enums import ReportingType
from graphrag.config.models.graph_rag_config import GraphRagConfig
def enable_logging(log_filepath: str | Path, verbose: bool = False) -> None:
"""Enable logging to a file.
Parameters
----------
log_filepath : str | Path
The path to the log file.
verbose : bool, default=False
Whether to log debug messages.
"""
log_filepath = Path(log_filepath)
log_filepath.parent.mkdir(parents=True, exist_ok=True)
log_filepath.touch(exist_ok=True)
logging.basicConfig(
filename=log_filepath,
filemode="a",
format="%(asctime)s,%(msecs)d %(name)s %(levelname)s %(message)s",
datefmt="%H:%M:%S",
level=logging.DEBUG if verbose else logging.INFO,
)
def enable_logging_with_config(
config: GraphRagConfig, verbose: bool = False, filename: str = "indexing-engine.log"
) -> tuple[bool, str]:
"""Enable logging to a file based on the config.
Parameters
----------
config : GraphRagConfig
The configuration.
timestamp_value : str
The timestamp value representing the directory to place the log files.
verbose : bool, default=False
Whether to log debug messages.
Returns
-------
tuple[bool, str]
A tuple of a boolean indicating if logging was enabled and the path to the log file.
(False, "") if logging was not enabled.
(True, str) if logging was enabled.
"""
if config.reporting.type == ReportingType.file:
log_path = Path(config.reporting.base_dir) / filename
enable_logging(log_path, verbose)
return (True, str(log_path))
return (False, "")

View File

@ -12,7 +12,7 @@ from graphrag.config.enums import CacheType
class CacheConfig(BaseModel):
"""The default configuration section for Cache."""
type: CacheType = Field(
type: CacheType | str = Field(
description="The cache type to use.",
default=graphrag_config_defaults.cache.type,
)

View File

@ -6,7 +6,7 @@
from pydantic import BaseModel, Field
from graphrag.config.defaults import graphrag_config_defaults
from graphrag.config.enums import NounPhraseExtractorType
from graphrag.config.enums import AsyncType, NounPhraseExtractorType
class TextAnalyzerConfig(BaseModel):
@ -68,3 +68,7 @@ class ExtractGraphNLPConfig(BaseModel):
description="The number of threads to use for the extraction process.",
default=graphrag_config_defaults.extract_graph_nlp.concurrent_requests,
)
async_mode: AsyncType = Field(
description="The async mode to use.",
default=graphrag_config_defaults.extract_graph_nlp.async_mode,
)

View File

@ -11,6 +11,7 @@ from pydantic import BaseModel, Field, model_validator
import graphrag.config.defaults as defs
from graphrag.config.defaults import graphrag_config_defaults
from graphrag.config.enums import VectorStoreType
from graphrag.config.errors import LanguageModelConfigMissingError
from graphrag.config.models.basic_search_config import BasicSearchConfig
from graphrag.config.models.cache_config import CacheConfig
@ -26,17 +27,22 @@ from graphrag.config.models.global_search_config import GlobalSearchConfig
from graphrag.config.models.input_config import InputConfig
from graphrag.config.models.language_model_config import LanguageModelConfig
from graphrag.config.models.local_search_config import LocalSearchConfig
from graphrag.config.models.output_config import OutputConfig
from graphrag.config.models.prune_graph_config import PruneGraphConfig
from graphrag.config.models.reporting_config import ReportingConfig
from graphrag.config.models.snapshots_config import SnapshotsConfig
from graphrag.config.models.storage_config import StorageConfig
from graphrag.config.models.summarize_descriptions_config import (
SummarizeDescriptionsConfig,
)
from graphrag.config.models.text_embedding_config import TextEmbeddingConfig
from graphrag.config.models.umap_config import UmapConfig
from graphrag.config.models.vector_store_config import VectorStoreConfig
from graphrag.vector_stores.factory import VectorStoreType
from graphrag.language_model.providers.litellm.services.rate_limiter.rate_limiter_factory import (
RateLimiterFactory,
)
from graphrag.language_model.providers.litellm.services.retry.retry_factory import (
RetryFactory,
)
class GraphRagConfig(BaseModel):
@ -89,6 +95,47 @@ class GraphRagConfig(BaseModel):
if defs.DEFAULT_EMBEDDING_MODEL_ID not in self.models:
raise LanguageModelConfigMissingError(defs.DEFAULT_EMBEDDING_MODEL_ID)
def _validate_retry_services(self) -> None:
"""Validate the retry services configuration."""
retry_factory = RetryFactory()
for model_id, model in self.models.items():
if model.retry_strategy != "none":
if model.retry_strategy not in retry_factory:
msg = f"Retry strategy '{model.retry_strategy}' for model '{model_id}' is not registered. Available strategies: {', '.join(retry_factory.keys())}"
raise ValueError(msg)
_ = retry_factory.create(
strategy=model.retry_strategy,
max_retries=model.max_retries,
max_retry_wait=model.max_retry_wait,
)
def _validate_rate_limiter_services(self) -> None:
"""Validate the rate limiter services configuration."""
rate_limiter_factory = RateLimiterFactory()
for model_id, model in self.models.items():
if model.rate_limit_strategy is not None:
if model.rate_limit_strategy not in rate_limiter_factory:
msg = f"Rate Limiter strategy '{model.rate_limit_strategy}' for model '{model_id}' is not registered. Available strategies: {', '.join(rate_limiter_factory.keys())}"
raise ValueError(msg)
rpm = (
model.requests_per_minute
if type(model.requests_per_minute) is int
else None
)
tpm = (
model.tokens_per_minute
if type(model.tokens_per_minute) is int
else None
)
if rpm is not None or tpm is not None:
_ = rate_limiter_factory.create(
strategy=model.rate_limit_strategy, rpm=rpm, tpm=tpm
)
input: InputConfig = Field(
description="The input configuration.", default=InputConfig()
)
@ -102,21 +149,31 @@ class GraphRagConfig(BaseModel):
else:
self.input.file_pattern = f".*\\.{self.input.file_type.value}$"
def _validate_input_base_dir(self) -> None:
"""Validate the input base directory."""
if self.input.storage.type == defs.StorageType.file:
if self.input.storage.base_dir.strip() == "":
msg = "input storage base directory is required for file input storage. Please rerun `graphrag init` and set the input storage configuration."
raise ValueError(msg)
self.input.storage.base_dir = str(
(Path(self.root_dir) / self.input.storage.base_dir).resolve()
)
chunks: ChunkingConfig = Field(
description="The chunking configuration to use.",
default=ChunkingConfig(),
)
"""The chunking configuration to use."""
output: OutputConfig = Field(
output: StorageConfig = Field(
description="The output configuration.",
default=OutputConfig(),
default=StorageConfig(),
)
"""The output configuration."""
def _validate_output_base_dir(self) -> None:
"""Validate the output base directory."""
if self.output.type == defs.OutputType.file:
if self.output.type == defs.StorageType.file:
if self.output.base_dir.strip() == "":
msg = "output base directory is required for file output. Please rerun `graphrag init` and set the output configuration."
raise ValueError(msg)
@ -124,7 +181,7 @@ class GraphRagConfig(BaseModel):
(Path(self.root_dir) / self.output.base_dir).resolve()
)
outputs: dict[str, OutputConfig] | None = Field(
outputs: dict[str, StorageConfig] | None = Field(
description="A list of output configurations used for multi-index query.",
default=graphrag_config_defaults.outputs,
)
@ -133,7 +190,7 @@ class GraphRagConfig(BaseModel):
"""Validate the outputs dict base directories."""
if self.outputs:
for output in self.outputs.values():
if output.type == defs.OutputType.file:
if output.type == defs.StorageType.file:
if output.base_dir.strip() == "":
msg = "Output base directory is required for file output. Please rerun `graphrag init` and set the output configuration."
raise ValueError(msg)
@ -141,10 +198,9 @@ class GraphRagConfig(BaseModel):
(Path(self.root_dir) / output.base_dir).resolve()
)
update_index_output: OutputConfig = Field(
update_index_output: StorageConfig = Field(
description="The output configuration for the updated index.",
default=OutputConfig(
type=graphrag_config_defaults.update_index_output.type,
default=StorageConfig(
base_dir=graphrag_config_defaults.update_index_output.base_dir,
),
)
@ -152,7 +208,7 @@ class GraphRagConfig(BaseModel):
def _validate_update_index_output_base_dir(self) -> None:
"""Validate the update index output base directory."""
if self.update_index_output.type == defs.OutputType.file:
if self.update_index_output.type == defs.StorageType.file:
if self.update_index_output.base_dir.strip() == "":
msg = "update_index_output base directory is required for file output. Please rerun `graphrag init` and set the update_index_output configuration."
raise ValueError(msg)
@ -291,6 +347,11 @@ class GraphRagConfig(BaseModel):
raise ValueError(msg)
store.db_uri = str((Path(self.root_dir) / store.db_uri).resolve())
def _validate_factories(self) -> None:
"""Validate the factories used in the configuration."""
self._validate_retry_services()
self._validate_rate_limiter_services()
def get_language_model_config(self, model_id: str) -> LanguageModelConfig:
"""Get a model configuration by ID.
@ -345,9 +406,11 @@ class GraphRagConfig(BaseModel):
self._validate_root_dir()
self._validate_models()
self._validate_input_pattern()
self._validate_input_base_dir()
self._validate_reporting_base_dir()
self._validate_output_base_dir()
self._validate_multi_output_base_dirs()
self._validate_update_index_output_base_dir()
self._validate_vector_store_db_uri()
self._validate_factories()
return self
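Note: _validate_retry_services and _validate_rate_limiter_services only rely on the factories supporting membership tests, keys(), and create(strategy=..., **kwargs). A reduced stand-in for that registry protocol (the real factories live under graphrag.language_model.providers.litellm.services):

from collections.abc import Callable


class RetryFactorySketch:
    """Hypothetical registry mirroring the protocol the validators use."""

    def __init__(self) -> None:
        self._strategies: dict[str, Callable[..., object]] = {
            "native": lambda **kwargs: ("native", kwargs),
            "exponential_backoff": lambda **kwargs: ("exponential_backoff", kwargs),
        }

    def __contains__(self, strategy: str) -> bool:
        return strategy in self._strategies

    def keys(self):
        return self._strategies.keys()

    def create(self, strategy: str, **kwargs):
        return self._strategies[strategy](**kwargs)


factory = RetryFactorySketch()
strategy = "exponential_backoff"
if strategy not in factory:
    msg = f"Strategy '{strategy}' is not registered. Available: {', '.join(factory.keys())}"
    raise ValueError(msg)
_ = factory.create(strategy=strategy, max_retries=10, max_retry_wait=10.0)
print("validated", strategy)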

View File

@ -7,36 +7,23 @@ from pydantic import BaseModel, Field
import graphrag.config.defaults as defs
from graphrag.config.defaults import graphrag_config_defaults
from graphrag.config.enums import InputFileType, InputType
from graphrag.config.enums import InputFileType
from graphrag.config.models.storage_config import StorageConfig
class InputConfig(BaseModel):
"""The default configuration section for Input."""
type: InputType = Field(
description="The input type to use.",
default=graphrag_config_defaults.input.type,
storage: StorageConfig = Field(
description="The storage configuration to use for reading input documents.",
default=StorageConfig(
base_dir=graphrag_config_defaults.input.storage.base_dir,
),
)
file_type: InputFileType = Field(
description="The input file type to use.",
default=graphrag_config_defaults.input.file_type,
)
base_dir: str = Field(
description="The input base directory to use.",
default=graphrag_config_defaults.input.base_dir,
)
connection_string: str | None = Field(
description="The azure blob storage connection string to use.",
default=graphrag_config_defaults.input.connection_string,
)
storage_account_blob_url: str | None = Field(
description="The storage account blob url to use.",
default=graphrag_config_defaults.input.storage_account_blob_url,
)
container_name: str | None = Field(
description="The azure blob storage container name to use.",
default=graphrag_config_defaults.input.container_name,
)
encoding: str = Field(
description="The input file encoding to use.",
default=defs.graphrag_config_defaults.input.encoding,
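Note: InputConfig now nests a StorageConfig instead of carrying its own base_dir and blob connection fields, so input storage is configured the same way as output storage. A reduced pydantic sketch of that nesting (field names trimmed to the essentials, not the full models):

from pydantic import BaseModel, Field


class StorageSketch(BaseModel):
    # Hypothetical, cut-down stand-in for StorageConfig.
    type: str = "file"
    base_dir: str = "output"


class InputSketch(BaseModel):
    # Input storage now lives under a nested model rather than flat fields.
    storage: StorageSketch = Field(default=StorageSketch(base_dir="input"))
    file_type: str = "text"
    encoding: str = "utf-8"


print(InputSketch().storage.base_dir)                                          # input
print(InputSketch(storage={"type": "blob", "base_dir": "docs"}).storage.type)  # blob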

View File

@ -3,6 +3,9 @@
"""Language model configuration."""
import logging
from typing import Literal
import tiktoken
from pydantic import BaseModel, Field, model_validator
@ -12,11 +15,12 @@ from graphrag.config.errors import (
ApiKeyMissingError,
AzureApiBaseMissingError,
AzureApiVersionMissingError,
AzureDeploymentNameMissingError,
ConflictingSettingsError,
)
from graphrag.language_model.factory import ModelFactory
logger = logging.getLogger(__name__)
class LanguageModelConfig(BaseModel):
"""Language model configuration."""
@ -71,8 +75,11 @@ class LanguageModelConfig(BaseModel):
ConflictingSettingsError
If the Azure authentication type conflicts with the model being used.
"""
if self.auth_type == AuthType.AzureManagedIdentity and (
self.type == ModelType.OpenAIChat or self.type == ModelType.OpenAIEmbedding
if (
self.auth_type == AuthType.AzureManagedIdentity
and self.type != ModelType.AzureOpenAIChat
and self.type != ModelType.AzureOpenAIEmbedding
and self.model_provider != "azure" # indicates Litellm + AOI
):
msg = f"auth_type of azure_managed_identity is not supported for model type {self.type}. Please rerun `graphrag init` and set the auth_type to api_key."
raise ConflictingSettingsError(msg)
@ -91,6 +98,35 @@ class LanguageModelConfig(BaseModel):
if not ModelFactory.is_supported_model(self.type):
msg = f"Model type {self.type} is not recognized, must be one of {ModelFactory.get_chat_models() + ModelFactory.get_embedding_models()}."
raise KeyError(msg)
if self.type in [
"openai_chat",
"openai_embedding",
"azure_openai_chat",
"azure_openai_embedding",
]:
msg = f"Model config based on fnllm is deprecated and will be removed in GraphRAG v3, please use {ModelType.Chat} or {ModelType.Embedding} instead to switch to LiteLLM config."
logger.warning(msg)
model_provider: str | None = Field(
description="The model provider to use.",
default=language_model_defaults.model_provider,
)
def _validate_model_provider(self) -> None:
"""Validate the model provider.
Required when using Litellm.
Raises
------
KeyError
If the model provider is not recognized.
"""
if (self.type == ModelType.Chat or self.type == ModelType.Embedding) and (
self.model_provider is None or self.model_provider.strip() == ""
):
msg = f"Model provider must be specified when using type == {self.type}."
raise KeyError(msg)
model: str = Field(description="The LLM model to use.")
encoding_model: str = Field(
@ -101,12 +137,27 @@ class LanguageModelConfig(BaseModel):
def _validate_encoding_model(self) -> None:
"""Validate the encoding model.
The default behavior is to use an encoding model that matches the LLM model.
LiteLLM supports 100+ models and their tokenization, so the encoding model does not need to be
set with the new LiteLLM provider the way it was with the fnllm provider.
Users can still manually specify a tiktoken-based encoding model, in which case it is used
regardless of the LLM model, even for non-OpenAI models.
If not using the LiteLLM provider, the encoding model is set from the LLM model name.
This keeps backward compatibility with the existing fnllm provider until fnllm is removed.
Raises
------
KeyError
If the model name is not recognized.
"""
if self.encoding_model.strip() == "":
if (
self.type != ModelType.Chat
and self.type != ModelType.Embedding
and self.encoding_model.strip() == ""
):
self.encoding_model = tiktoken.encoding_name_for_model(self.model)
api_base: str | None = Field(
@ -127,6 +178,7 @@ class LanguageModelConfig(BaseModel):
if (
self.type == ModelType.AzureOpenAIChat
or self.type == ModelType.AzureOpenAIEmbedding
or self.model_provider == "azure" # indicates Litellm + AOI
) and (self.api_base is None or self.api_base.strip() == ""):
raise AzureApiBaseMissingError(self.type)
@ -148,6 +200,7 @@ class LanguageModelConfig(BaseModel):
if (
self.type == ModelType.AzureOpenAIChat
or self.type == ModelType.AzureOpenAIEmbedding
or self.model_provider == "azure" # indicates Litellm + AOI
) and (self.api_version is None or self.api_version.strip() == ""):
raise AzureApiVersionMissingError(self.type)
@ -169,8 +222,10 @@ class LanguageModelConfig(BaseModel):
if (
self.type == ModelType.AzureOpenAIChat
or self.type == ModelType.AzureOpenAIEmbedding
or self.model_provider == "azure" # indicates Litellm + AOI
) and (self.deployment_name is None or self.deployment_name.strip() == ""):
raise AzureDeploymentNameMissingError(self.type)
msg = f"deployment_name is not set for Azure-hosted model. This will default to your model name ({self.model}). If different, this should be set."
logger.debug(msg)
organization: str | None = Field(
description="The organization to use for the LLM service.",
@ -192,14 +247,63 @@ class LanguageModelConfig(BaseModel):
description="The request timeout to use.",
default=language_model_defaults.request_timeout,
)
tokens_per_minute: int = Field(
tokens_per_minute: int | Literal["auto"] | None = Field(
description="The number of tokens per minute to use for the LLM service.",
default=language_model_defaults.tokens_per_minute,
)
requests_per_minute: int = Field(
def _validate_tokens_per_minute(self) -> None:
"""Validate the tokens per minute.
Raises
------
ValueError
If the tokens per minute is less than 1.
"""
# If the value is a number, check if it is less than 1
if isinstance(self.tokens_per_minute, int) and self.tokens_per_minute < 1:
msg = f"Tokens per minute must be a non zero positive number, 'auto' or null. Suggested value: {language_model_defaults.tokens_per_minute}."
raise ValueError(msg)
if (
(self.type == ModelType.Chat or self.type == ModelType.Embedding)
and self.rate_limit_strategy is not None
and self.tokens_per_minute == "auto"
):
msg = f"tokens_per_minute cannot be set to 'auto' when using type '{self.type}'. Please set it to a positive integer or null to disable."
raise ValueError(msg)
requests_per_minute: int | Literal["auto"] | None = Field(
description="The number of requests per minute to use for the LLM service.",
default=language_model_defaults.requests_per_minute,
)
def _validate_requests_per_minute(self) -> None:
"""Validate the requests per minute.
Raises
------
ValueError
If the requests per minute is less than 1, or 'auto' is used with a model type that does not support it.
"""
# If the value is a number, check if it is less than 1
if isinstance(self.requests_per_minute, int) and self.requests_per_minute < 1:
msg = f"Requests per minute must be a non zero positive number, 'auto' or null. Suggested value: {language_model_defaults.requests_per_minute}."
raise ValueError(msg)
if (
(self.type == ModelType.Chat or self.type == ModelType.Embedding)
and self.rate_limit_strategy is not None
and self.requests_per_minute == "auto"
):
msg = f"requests_per_minute cannot be set to 'auto' when using type '{self.type}'. Please set it to a positive integer or null to disable."
raise ValueError(msg)
rate_limit_strategy: str | None = Field(
description="The rate limit strategy to use for the LLM service.",
default=language_model_defaults.rate_limit_strategy,
)
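A standalone sketch of the value shapes accepted by tokens_per_minute and requests_per_minute, mirroring (in simplified form) the validators above; the helper name and boolean flag are illustrative, not part of the config API:

from typing import Literal

def check_rate_limit(value: int | Literal["auto"] | None, is_litellm_type: bool) -> None:
    # Positive integers, "auto", or None are accepted...
    if isinstance(value, int) and value < 1:
        raise ValueError("must be a positive integer, 'auto', or null")
    # ...but "auto" is rejected for the LiteLLM chat/embedding model types.
    if is_litellm_type and value == "auto":
        raise ValueError("'auto' is not supported for this model type")

check_rate_limit(50_000, is_litellm_type=True)   # fixed limit: ok
check_rate_limit(None, is_litellm_type=True)     # rate limiting disabled: ok
check_rate_limit("auto", is_litellm_type=False)  # ok for the legacy provider types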
retry_strategy: str = Field(
description="The retry strategy to use for the LLM service.",
default=language_model_defaults.retry_strategy,
@ -208,6 +312,19 @@ class LanguageModelConfig(BaseModel):
description="The maximum number of retries to use for the LLM service.",
default=language_model_defaults.max_retries,
)
def _validate_max_retries(self) -> None:
"""Validate the maximum retries.
Raises
------
ValueError
If the maximum retries is less than 1.
"""
if self.max_retries < 1:
msg = f"Maximum retries must be greater than or equal to 1. Suggested value: {language_model_defaults.max_retries}."
raise ValueError(msg)
max_retry_wait: float = Field(
description="The maximum retry wait to use for the LLM service.",
default=language_model_defaults.max_retry_wait,
@ -275,8 +392,12 @@ class LanguageModelConfig(BaseModel):
@model_validator(mode="after")
def _validate_model(self):
self._validate_type()
self._validate_model_provider()
self._validate_auth_type()
self._validate_api_key()
self._validate_tokens_per_minute()
self._validate_requests_per_minute()
self._validate_max_retries()
self._validate_azure_settings()
self._validate_encoding_model()
return self

View File

@ -1,38 +0,0 @@
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License
"""Parameterization settings for the default configuration."""
from pydantic import BaseModel, Field
from graphrag.config.defaults import graphrag_config_defaults
from graphrag.config.enums import OutputType
class OutputConfig(BaseModel):
"""The default configuration section for Output."""
type: OutputType = Field(
description="The output type to use.",
default=graphrag_config_defaults.output.type,
)
base_dir: str = Field(
description="The base directory for the output.",
default=graphrag_config_defaults.output.base_dir,
)
connection_string: str | None = Field(
description="The storage connection string to use.",
default=graphrag_config_defaults.output.connection_string,
)
container_name: str | None = Field(
description="The storage container name to use.",
default=graphrag_config_defaults.output.container_name,
)
storage_account_blob_url: str | None = Field(
description="The storage account blob url to use.",
default=graphrag_config_defaults.output.storage_account_blob_url,
)
cosmosdb_account_url: str | None = Field(
description="The cosmosdb account url to use.",
default=graphrag_config_defaults.output.cosmosdb_account_url,
)

View File

@ -12,7 +12,7 @@ from graphrag.config.enums import ReportingType
class ReportingConfig(BaseModel):
"""The default configuration section for Reporting."""
type: ReportingType = Field(
type: ReportingType | str = Field(
description="The reporting type to use.",
default=graphrag_config_defaults.reporting.type,
)

View File

@ -0,0 +1,52 @@
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License
"""Parameterization settings for the default configuration."""
from pathlib import Path
from pydantic import BaseModel, Field, field_validator
from graphrag.config.defaults import graphrag_config_defaults
from graphrag.config.enums import StorageType
class StorageConfig(BaseModel):
"""The default configuration section for storage."""
type: StorageType | str = Field(
description="The storage type to use.",
default=graphrag_config_defaults.storage.type,
)
base_dir: str = Field(
description="The base directory for the output.",
default=graphrag_config_defaults.storage.base_dir,
)
# Normalize base_dir across operating systems (via Path)
# when using the local file storage type; cloud storage types keep the value as-is.
@field_validator("base_dir", mode="before")
@classmethod
def validate_base_dir(cls, value, info):
"""Ensure that base_dir is a valid filesystem path when using local storage."""
# info.data contains other field values, including 'type'
if info.data.get("type") != StorageType.file:
return value
return str(Path(value))
connection_string: str | None = Field(
description="The storage connection string to use.",
default=graphrag_config_defaults.storage.connection_string,
)
container_name: str | None = Field(
description="The storage container name to use.",
default=graphrag_config_defaults.storage.container_name,
)
storage_account_blob_url: str | None = Field(
description="The storage account blob url to use.",
default=graphrag_config_defaults.storage.storage_account_blob_url,
)
cosmosdb_account_url: str | None = Field(
description="The cosmosdb account url to use.",
default=graphrag_config_defaults.storage.cosmosdb_account_url,
)
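A quick illustration of what the base_dir validator above does for local file storage (paths are examples only):

from pathlib import Path

# For StorageType.file the value is round-tripped through Path, which normalizes
# separators for the current operating system; cloud storage types skip this step.
print(str(Path("output/lancedb")))  # "output/lancedb" on POSIX, "output\lancedb" on Windows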

View File

@ -39,7 +39,7 @@ class SummarizeDescriptionsConfig(BaseModel):
self, root_dir: str, model_config: LanguageModelConfig
) -> dict:
"""Get the resolved description summarization strategy."""
from graphrag.index.operations.summarize_descriptions import (
from graphrag.index.operations.summarize_descriptions.summarize_descriptions import (
SummarizeStrategyType,
)

View File

@ -6,7 +6,6 @@
from pydantic import BaseModel, Field
from graphrag.config.defaults import graphrag_config_defaults
from graphrag.config.enums import TextEmbeddingTarget
from graphrag.config.models.language_model_config import LanguageModelConfig
@ -29,10 +28,6 @@ class TextEmbeddingConfig(BaseModel):
description="The batch max tokens to use.",
default=graphrag_config_defaults.embed_text.batch_max_tokens,
)
target: TextEmbeddingTarget = Field(
description="The target to use. 'all', 'required', 'selected', or 'none'.",
default=graphrag_config_defaults.embed_text.target,
)
names: list[str] = Field(
description="The specific embeddings to perform.",
default=graphrag_config_defaults.embed_text.names,
@ -44,7 +39,7 @@ class TextEmbeddingConfig(BaseModel):
def resolved_strategy(self, model_config: LanguageModelConfig) -> dict:
"""Get the resolved text embedding strategy."""
from graphrag.index.operations.embed_text import (
from graphrag.index.operations.embed_text.embed_text import (
TextEmbedStrategyType,
)

View File

@ -6,7 +6,9 @@
from pydantic import BaseModel, Field, model_validator
from graphrag.config.defaults import vector_store_defaults
from graphrag.vector_stores.factory import VectorStoreType
from graphrag.config.embeddings import all_embeddings
from graphrag.config.enums import VectorStoreType
from graphrag.config.models.vector_store_schema_config import VectorStoreSchemaConfig
class VectorStoreConfig(BaseModel):
@ -85,9 +87,25 @@ class VectorStoreConfig(BaseModel):
default=vector_store_defaults.overwrite,
)
embeddings_schema: dict[str, VectorStoreSchemaConfig] = {}
def _validate_embeddings_schema(self) -> None:
"""Validate the embeddings schema."""
for name in self.embeddings_schema:
if name not in all_embeddings:
msg = f"vector_store.embeddings_schema contains an invalid embedding schema name: {name}. Please update your settings.yaml and select the correct embedding schema names."
raise ValueError(msg)
if self.type == VectorStoreType.CosmosDB:
for schema in self.embeddings_schema.values():
if schema.id_field != "id":
msg = "When using CosmosDB, the id_field in embeddings_schema must be 'id'. Please update your settings.yaml and set the id_field to 'id'."
raise ValueError(msg)
@model_validator(mode="after")
def _validate_model(self):
"""Validate the model."""
self._validate_db_uri()
self._validate_url()
self._validate_embeddings_schema()
return self

View File

@ -0,0 +1,66 @@
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License
"""Parameterization settings for the default configuration."""
import re
from pydantic import BaseModel, Field, model_validator
DEFAULT_VECTOR_SIZE: int = 1536
VALID_IDENTIFIER_REGEX = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
def is_valid_field_name(field: str) -> bool:
"""Check if a field name is valid for CosmosDB."""
return bool(VALID_IDENTIFIER_REGEX.match(field))
class VectorStoreSchemaConfig(BaseModel):
"""The default configuration section for Vector Store Schema."""
id_field: str = Field(
description="The ID field to use.",
default="id",
)
vector_field: str = Field(
description="The vector field to use.",
default="vector",
)
text_field: str = Field(
description="The text field to use.",
default="text",
)
attributes_field: str = Field(
description="The attributes field to use.",
default="attributes",
)
vector_size: int = Field(
description="The vector size to use.",
default=DEFAULT_VECTOR_SIZE,
)
index_name: str | None = Field(description="The index name to use.", default=None)
def _validate_schema(self) -> None:
"""Validate the schema."""
for field in [
self.id_field,
self.vector_field,
self.text_field,
self.attributes_field,
]:
if not is_valid_field_name(field):
msg = f"Unsafe or invalid field name: {field}"
raise ValueError(msg)
@model_validator(mode="after")
def _validate_model(self):
"""Validate the model."""
self._validate_schema()
return self
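A few illustrative checks against the identifier pattern above:

print(is_valid_field_name("vector"))        # True
print(is_valid_field_name("text_field_2"))  # True
print(is_valid_field_name("2vector"))       # False: may not start with a digit
print(is_valid_field_name("drop;table"))    # False: unsafe characters are rejected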

View File

@ -9,17 +9,17 @@ from pathlib import Path
from dotenv import dotenv_values
log = logging.getLogger(__name__)
logger = logging.getLogger(__name__)
def read_dotenv(root: str) -> None:
"""Read a .env file in the given root path."""
env_path = Path(root) / ".env"
if env_path.exists():
log.info("Loading pipeline .env file")
logger.info("Loading pipeline .env file")
env_config = dotenv_values(f"{env_path}")
for key, value in env_config.items():
if key not in os.environ:
os.environ[key] = value or ""
else:
log.info("No .env file found at %s", root)
logger.info("No .env file found at %s", root)

View File

@ -28,6 +28,9 @@ class Community(Named):
relationship_ids: list[str] | None = None
"""List of relationship IDs related to the community (optional)."""
text_unit_ids: list[str] | None = None
"""List of text unit IDs related to the community (optional)."""
covariate_ids: dict[str, list[str]] | None = None
"""Dictionary of different types of covariates related to the community (optional), e.g. claims"""
@ -50,6 +53,7 @@ class Community(Named):
level_key: str = "level",
entities_key: str = "entity_ids",
relationships_key: str = "relationship_ids",
text_units_key: str = "text_unit_ids",
covariates_key: str = "covariate_ids",
parent_key: str = "parent",
children_key: str = "children",
@ -67,6 +71,7 @@ class Community(Named):
short_id=d.get(short_id_key),
entity_ids=d.get(entities_key),
relationship_ids=d.get(relationships_key),
text_unit_ids=d.get(text_units_key),
covariate_ids=d.get(covariates_key),
attributes=d.get(attributes_key),
size=d.get(size_key),

View File

@ -0,0 +1,4 @@
# Copyright (c) 2025 Microsoft Corporation.
# Licensed under the MIT License
"""Factory module."""

View File

@ -0,0 +1,68 @@
# Copyright (c) 2025 Microsoft Corporation.
# Licensed under the MIT License
"""Factory ABC."""
from abc import ABC
from collections.abc import Callable
from typing import Any, ClassVar, Generic, TypeVar
T = TypeVar("T", covariant=True)
class Factory(ABC, Generic[T]):
"""Abstract base class for factories."""
_instance: ClassVar["Factory | None"] = None
def __new__(cls, *args: Any, **kwargs: Any) -> "Factory":
"""Create a new instance of Factory if it does not exist."""
if cls._instance is None:
cls._instance = super().__new__(cls)  # object.__new__ accepts no extra arguments
return cls._instance
def __init__(self):
if not hasattr(self, "_initialized"):
self._services: dict[str, Callable[..., T]] = {}
self._initialized = True
def __contains__(self, strategy: str) -> bool:
"""Check if a strategy is registered."""
return strategy in self._services
def keys(self) -> list[str]:
"""Get a list of registered strategy names."""
return list(self._services.keys())
def register(self, *, strategy: str, service_initializer: Callable[..., T]) -> None:
"""
Register a new service.
Args
----
strategy: The name of the strategy.
service_initializer: A callable that creates an instance of T.
"""
self._services[strategy] = service_initializer
def create(self, *, strategy: str, **kwargs: Any) -> T:
"""
Create a service instance based on the strategy.
Args
----
strategy: The name of the strategy.
**kwargs: Additional arguments to pass to the service initializer.
Returns
-------
An instance of T.
Raises
------
ValueError: If the strategy is not registered.
"""
if strategy not in self._services:
msg = f"Strategy '{strategy}' is not registered."
raise ValueError(msg)
return self._services[strategy](**kwargs)
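A usage sketch of the factory above; the subclass, strategy name, and initializer are hypothetical:

class GreeterFactory(Factory[str]):
    """Singleton factory that produces greeting strings."""

factory = GreeterFactory()
factory.register(strategy="en", service_initializer=lambda name: f"Hello, {name}!")

print("en" in factory)                                  # True
print(factory.keys())                                   # ['en']
print(factory.create(strategy="en", name="GraphRAG"))   # Hello, GraphRAG!
# factory.create(strategy="fr") would raise ValueError: strategy not registered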

View File

@ -10,19 +10,17 @@ import pandas as pd
from graphrag.config.models.input_config import InputConfig
from graphrag.index.input.util import load_files, process_data_columns
from graphrag.logger.base import ProgressLogger
from graphrag.storage.pipeline_storage import PipelineStorage
log = logging.getLogger(__name__)
logger = logging.getLogger(__name__)
async def load_csv(
config: InputConfig,
progress: ProgressLogger | None,
storage: PipelineStorage,
) -> pd.DataFrame:
"""Load csv inputs from a directory."""
log.info("Loading csv files from %s", config.base_dir)
logger.info("Loading csv files from %s", config.storage.base_dir)
async def load_file(path: str, group: dict | None) -> pd.DataFrame:
if group is None:
@ -42,4 +40,4 @@ async def load_csv(
return data
return await load_files(load_file, config, storage, progress)
return await load_files(load_file, config, storage)

View File

@ -5,22 +5,18 @@
import logging
from collections.abc import Awaitable, Callable
from pathlib import Path
from typing import cast
import pandas as pd
from graphrag.config.enums import InputFileType, InputType
from graphrag.config.enums import InputFileType
from graphrag.config.models.input_config import InputConfig
from graphrag.index.input.csv import load_csv
from graphrag.index.input.json import load_json
from graphrag.index.input.text import load_text
from graphrag.logger.base import ProgressLogger
from graphrag.logger.null_progress import NullProgressLogger
from graphrag.storage.blob_pipeline_storage import BlobPipelineStorage
from graphrag.storage.file_pipeline_storage import FilePipelineStorage
from graphrag.storage.pipeline_storage import PipelineStorage
log = logging.getLogger(__name__)
logger = logging.getLogger(__name__)
loaders: dict[str, Callable[..., Awaitable[pd.DataFrame]]] = {
InputFileType.text: load_text,
InputFileType.csv: load_csv,
@ -30,49 +26,15 @@ loaders: dict[str, Callable[..., Awaitable[pd.DataFrame]]] = {
async def create_input(
config: InputConfig,
progress_reporter: ProgressLogger | None = None,
root_dir: str | None = None,
storage: PipelineStorage,
) -> pd.DataFrame:
"""Instantiate input data for a pipeline."""
root_dir = root_dir or ""
log.info("loading input from root_dir=%s", config.base_dir)
progress_reporter = progress_reporter or NullProgressLogger()
match config.type:
case InputType.blob:
log.info("using blob storage input")
if config.container_name is None:
msg = "Container name required for blob storage"
raise ValueError(msg)
if (
config.connection_string is None
and config.storage_account_blob_url is None
):
msg = "Connection string or storage account blob url required for blob storage"
raise ValueError(msg)
storage = BlobPipelineStorage(
connection_string=config.connection_string,
storage_account_blob_url=config.storage_account_blob_url,
container_name=config.container_name,
path_prefix=config.base_dir,
)
case InputType.file:
log.info("using file storage for input")
storage = FilePipelineStorage(
root_dir=str(Path(root_dir) / (config.base_dir or ""))
)
case _:
log.info("using file storage for input")
storage = FilePipelineStorage(
root_dir=str(Path(root_dir) / (config.base_dir or ""))
)
logger.info("loading input from root_dir=%s", config.storage.base_dir)
if config.file_type in loaders:
progress = progress_reporter.child(
f"Loading Input ({config.file_type})", transient=False
)
logger.info("Loading Input %s", config.file_type)
loader = loaders[config.file_type]
result = await loader(config, progress, storage)
result = await loader(config, storage)
# Convert metadata columns to strings and collapse them into a JSON object
if config.metadata:
if all(col in result.columns for col in config.metadata):

View File

@ -10,19 +10,17 @@ import pandas as pd
from graphrag.config.models.input_config import InputConfig
from graphrag.index.input.util import load_files, process_data_columns
from graphrag.logger.base import ProgressLogger
from graphrag.storage.pipeline_storage import PipelineStorage
log = logging.getLogger(__name__)
logger = logging.getLogger(__name__)
async def load_json(
config: InputConfig,
progress: ProgressLogger | None,
storage: PipelineStorage,
) -> pd.DataFrame:
"""Load json inputs from a directory."""
log.info("Loading json files from %s", config.base_dir)
logger.info("Loading json files from %s", config.storage.base_dir)
async def load_file(path: str, group: dict | None) -> pd.DataFrame:
if group is None:
@ -46,4 +44,4 @@ async def load_json(
return data
return await load_files(load_file, config, storage, progress)
return await load_files(load_file, config, storage)

View File

@ -11,15 +11,13 @@ import pandas as pd
from graphrag.config.models.input_config import InputConfig
from graphrag.index.input.util import load_files
from graphrag.index.utils.hashing import gen_sha512_hash
from graphrag.logger.base import ProgressLogger
from graphrag.storage.pipeline_storage import PipelineStorage
log = logging.getLogger(__name__)
logger = logging.getLogger(__name__)
async def load_text(
config: InputConfig,
progress: ProgressLogger | None,
storage: PipelineStorage,
) -> pd.DataFrame:
"""Load text inputs from a directory."""
@ -34,4 +32,4 @@ async def load_text(
new_item["creation_date"] = await storage.get_creation_date(path)
return pd.DataFrame([new_item])
return await load_files(load_file, config, storage, progress)
return await load_files(load_file, config, storage)

View File

@ -11,29 +11,26 @@ import pandas as pd
from graphrag.config.models.input_config import InputConfig
from graphrag.index.utils.hashing import gen_sha512_hash
from graphrag.logger.base import ProgressLogger
from graphrag.storage.pipeline_storage import PipelineStorage
log = logging.getLogger(__name__)
logger = logging.getLogger(__name__)
async def load_files(
loader: Any,
config: InputConfig,
storage: PipelineStorage,
progress: ProgressLogger | None,
) -> pd.DataFrame:
"""Load files from storage and apply a loader function."""
files = list(
storage.find(
re.compile(config.file_pattern),
progress=progress,
file_filter=config.file_filter,
)
)
if len(files) == 0:
msg = f"No {config.file_type} files found in {config.base_dir}"
msg = f"No {config.file_type} files found in {config.storage.base_dir}"
raise ValueError(msg)
files_loaded = []
@ -42,17 +39,17 @@ async def load_files(
try:
files_loaded.append(await loader(file, group))
except Exception as e: # noqa: BLE001 (catching Exception is fine here)
log.warning("Warning! Error loading file %s. Skipping...", file)
log.warning("Error: %s", e)
logger.warning("Warning! Error loading file %s. Skipping...", file)
logger.warning("Error: %s", e)
log.info(
logger.info(
"Found %d %s files, loading %d", len(files), config.file_type, len(files_loaded)
)
result = pd.concat(files_loaded)
total_files_log = (
f"Total number of unfiltered {config.file_type} rows: {len(result)}"
)
log.info(total_files_log)
logger.info(total_files_log)
return result
@ -66,7 +63,7 @@ def process_data_columns(
)
if config.text_column is not None and "text" not in documents.columns:
if config.text_column not in documents.columns:
log.warning(
logger.warning(
"text_column %s not found in csv file %s",
config.text_column,
path,
@ -75,7 +72,7 @@ def process_data_columns(
documents["text"] = documents.apply(lambda x: x[config.text_column], axis=1)
if config.title_column is not None:
if config.title_column not in documents.columns:
log.warning(
logger.warning(
"title_column %s not found in csv file %s",
config.title_column,
path,

View File

@ -15,6 +15,7 @@ from graphrag.index.operations.build_noun_graph.np_extractors.base import (
BaseNounPhraseExtractor,
)
from graphrag.index.utils.derive_from_rows import derive_from_rows
from graphrag.index.utils.graphs import calculate_pmi_edge_weights
from graphrag.index.utils.hashing import gen_sha512_hash
@ -23,12 +24,17 @@ async def build_noun_graph(
text_analyzer: BaseNounPhraseExtractor,
normalize_edge_weights: bool,
num_threads: int = 4,
async_mode: AsyncType = AsyncType.Threaded,
cache: PipelineCache | None = None,
) -> tuple[pd.DataFrame, pd.DataFrame]:
"""Build a noun graph from text units."""
text_units = text_unit_df.loc[:, ["id", "text"]]
nodes_df = await _extract_nodes(
text_units, text_analyzer, num_threads=num_threads, cache=cache
text_units,
text_analyzer,
num_threads=num_threads,
async_mode=async_mode,
cache=cache,
)
edges_df = _extract_edges(nodes_df, normalize_edge_weights=normalize_edge_weights)
return (nodes_df, edges_df)
@ -38,6 +44,7 @@ async def _extract_nodes(
text_unit_df: pd.DataFrame,
text_analyzer: BaseNounPhraseExtractor,
num_threads: int = 4,
async_mode: AsyncType = AsyncType.Threaded,
cache: PipelineCache | None = None,
) -> pd.DataFrame:
"""
@ -63,7 +70,8 @@ async def _extract_nodes(
text_unit_df,
extract,
num_threads=num_threads,
async_type=AsyncType.Threaded,
async_type=async_mode,
progress_msg="extract noun phrases progress: ",
)
noun_node_df = text_unit_df.explode("noun_phrases")
@ -127,52 +135,6 @@ def _extract_edges(
]
if normalize_edge_weights:
# use PMI weight instead of raw weight
grouped_edge_df = _calculate_pmi_edge_weights(nodes_df, grouped_edge_df)
grouped_edge_df = calculate_pmi_edge_weights(nodes_df, grouped_edge_df)
return grouped_edge_df
def _calculate_pmi_edge_weights(
nodes_df: pd.DataFrame,
edges_df: pd.DataFrame,
node_name_col="title",
node_freq_col="frequency",
edge_weight_col="weight",
edge_source_col="source",
edge_target_col="target",
) -> pd.DataFrame:
"""
Calculate pointwise mutual information (PMI) edge weights.
pmi(x,y) = log2(p(x,y) / (p(x)p(y)))
p(x,y) = edge_weight(x,y) / total_edge_weights
p(x) = freq_occurrence(x) / total_freq_occurrences
"""
copied_nodes_df = nodes_df[[node_name_col, node_freq_col]]
total_edge_weights = edges_df[edge_weight_col].sum()
total_freq_occurrences = nodes_df[node_freq_col].sum()
copied_nodes_df["prop_occurrence"] = (
copied_nodes_df[node_freq_col] / total_freq_occurrences
)
copied_nodes_df = copied_nodes_df.loc[:, [node_name_col, "prop_occurrence"]]
edges_df["prop_weight"] = edges_df[edge_weight_col] / total_edge_weights
edges_df = (
edges_df.merge(
copied_nodes_df, left_on=edge_source_col, right_on=node_name_col, how="left"
)
.drop(columns=[node_name_col])
.rename(columns={"prop_occurrence": "source_prop"})
)
edges_df = (
edges_df.merge(
copied_nodes_df, left_on=edge_target_col, right_on=node_name_col, how="left"
)
.drop(columns=[node_name_col])
.rename(columns={"prop_occurrence": "target_prop"})
)
edges_df[edge_weight_col] = edges_df["prop_weight"] * np.log2(
edges_df["prop_weight"] / (edges_df["source_prop"] * edges_df["target_prop"])
)
return edges_df.drop(columns=["prop_weight", "source_prop", "target_prop"])
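A worked instance of the PMI formula from the docstring above; note that the implementation shown scales the result by p(x,y) (prop_weight) before assigning the edge weight. Numbers are illustrative:

import math

p_xy = 4 / 20   # edge_weight(x, y) / total_edge_weights
p_x = 5 / 50    # freq_occurrence(x) / total_freq_occurrences
p_y = 10 / 50   # freq_occurrence(y) / total_freq_occurrences

pmi = math.log2(p_xy / (p_x * p_y))
print(round(pmi, 3))  # 3.322 -> x and y co-occur far more often than chance would predict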

View File

@ -8,7 +8,7 @@ from abc import ABCMeta, abstractmethod
import spacy
log = logging.getLogger(__name__)
logger = logging.getLogger(__name__)
class BaseNounPhraseExtractor(metaclass=ABCMeta):
@ -54,7 +54,7 @@ class BaseNounPhraseExtractor(metaclass=ABCMeta):
return spacy.load(model_name, exclude=exclude)
except OSError:
msg = f"Model `{model_name}` not found. Attempting to download..."
log.info(msg)
logger.info(msg)
from spacy.cli.download import download
download(model_name)

View File

@ -11,7 +11,7 @@ import tiktoken
from graphrag.config.models.chunking_config import ChunkingConfig
from graphrag.index.operations.chunk_text.typing import TextChunk
from graphrag.index.text_splitting.text_splitting import (
Tokenizer,
TokenChunkerOptions,
split_multiple_texts_on_tokens,
)
from graphrag.logger.progress import ProgressTicker
@ -45,7 +45,7 @@ def run_tokens(
encode, decode = get_encoding_fn(encoding_name)
return split_multiple_texts_on_tokens(
input,
Tokenizer(
TokenChunkerOptions(
chunk_overlap=chunk_overlap,
tokens_per_chunk=tokens_per_chunk,
encode=encode,

View File

@ -6,13 +6,14 @@
import logging
import networkx as nx
from graspologic.partition import hierarchical_leiden
from graphrag.index.utils.stable_lcc import stable_largest_connected_component
Communities = list[tuple[int, int, int, list[str]]]
log = logging.getLogger(__name__)
logger = logging.getLogger(__name__)
def cluster_graph(
@ -23,7 +24,7 @@ def cluster_graph(
) -> Communities:
"""Apply a hierarchical clustering algorithm to a graph."""
if len(graph.nodes) == 0:
log.warning("Graph has no nodes")
logger.warning("Graph has no nodes")
return []
node_id_to_community_map, parent_mapping = _compute_leiden_communities(
@ -60,9 +61,6 @@ def _compute_leiden_communities(
seed: int | None = None,
) -> tuple[dict[int, dict[str, int]], dict[int, int]]:
"""Return Leiden root communities and their hierarchy mapping."""
# NOTE: This import is done here to reduce the initial import time of the graphrag package
from graspologic.partition import hierarchical_leiden
if use_lcc:
graph = stable_largest_connected_component(graph)

View File

@ -2,10 +2,3 @@
# Licensed under the MIT License
"""The Indexing Engine text embed package root."""
from graphrag.index.operations.embed_text.embed_text import (
TextEmbedStrategyType,
embed_text,
)
__all__ = ["TextEmbedStrategyType", "embed_text"]

View File

@ -12,12 +12,13 @@ import pandas as pd
from graphrag.cache.pipeline_cache import PipelineCache
from graphrag.callbacks.workflow_callbacks import WorkflowCallbacks
from graphrag.config.embeddings import create_collection_name
from graphrag.config.embeddings import create_index_name
from graphrag.config.models.vector_store_schema_config import VectorStoreSchemaConfig
from graphrag.index.operations.embed_text.strategies.typing import TextEmbeddingStrategy
from graphrag.vector_stores.base import BaseVectorStore, VectorStoreDocument
from graphrag.vector_stores.factory import VectorStoreFactory
log = logging.getLogger(__name__)
logger = logging.getLogger(__name__)
# Per Azure OpenAI Limits
# https://learn.microsoft.com/en-us/azure/ai-services/openai/reference
@ -49,9 +50,9 @@ async def embed_text(
vector_store_config = strategy.get("vector_store")
if vector_store_config:
collection_name = _get_collection_name(vector_store_config, embedding_name)
index_name = _get_index_name(vector_store_config, embedding_name)
vector_store: BaseVectorStore = _create_vector_store(
vector_store_config, collection_name
vector_store_config, index_name, embedding_name
)
vector_store_workflow_config = vector_store_config.get(
embedding_name, vector_store_config
@ -109,10 +110,6 @@ async def _text_embed_with_vector_store(
strategy_exec = load_strategy(strategy_type)
strategy_config = {**strategy}
# if max_retries is not set, inject a dynamically assigned value based on the total number of expected LLM calls to be made
if strategy_config.get("llm") and strategy_config["llm"]["max_retries"] == -1:
strategy_config["llm"]["max_retries"] = len(input)
# Get vector-storage configuration
insert_batch_size: int = (
vector_store_config.get("batch_size") or DEFAULT_EMBEDDING_BATCH_SIZE
@ -145,7 +142,14 @@ async def _text_embed_with_vector_store(
all_results = []
num_total_batches = (input.shape[0] + insert_batch_size - 1) // insert_batch_size
while insert_batch_size * i < input.shape[0]:
logger.info(
"uploading text embeddings batch %d/%d of size %d to vector store",
i + 1,
num_total_batches,
insert_batch_size,
)
batch = input.iloc[insert_batch_size * i : insert_batch_size * (i + 1)]
texts: list[str] = batch[embed_column].to_numpy().tolist()
titles: list[str] = batch[title].to_numpy().tolist()
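The batch counter logged above relies on ceiling division; a quick illustrative check with made-up sizes:

rows, insert_batch_size = 2500, 1024
num_total_batches = (rows + insert_batch_size - 1) // insert_batch_size
print(num_total_batches)  # 3 batches: 1024 + 1024 + 452 rows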
@ -180,27 +184,46 @@ async def _text_embed_with_vector_store(
def _create_vector_store(
vector_store_config: dict, collection_name: str
vector_store_config: dict, index_name: str, embedding_name: str | None = None
) -> BaseVectorStore:
vector_store_type: str = str(vector_store_config.get("type"))
if collection_name:
vector_store_config.update({"collection_name": collection_name})
embeddings_schema: dict[str, VectorStoreSchemaConfig] = vector_store_config.get(
"embeddings_schema", {}
)
single_embedding_config: VectorStoreSchemaConfig = VectorStoreSchemaConfig()
if (
embeddings_schema is not None
and embedding_name is not None
and embedding_name in embeddings_schema
):
raw_config = embeddings_schema[embedding_name]
if isinstance(raw_config, dict):
single_embedding_config = VectorStoreSchemaConfig(**raw_config)
else:
single_embedding_config = raw_config
if single_embedding_config.index_name is None:
single_embedding_config.index_name = index_name
vector_store = VectorStoreFactory().create_vector_store(
vector_store_type, kwargs=vector_store_config
vector_store_schema_config=single_embedding_config,
vector_store_type=vector_store_type,
**vector_store_config,
)
vector_store.connect(**vector_store_config)
return vector_store
def _get_collection_name(vector_store_config: dict, embedding_name: str) -> str:
def _get_index_name(vector_store_config: dict, embedding_name: str) -> str:
container_name = vector_store_config.get("container_name", "default")
collection_name = create_collection_name(container_name, embedding_name)
index_name = create_index_name(container_name, embedding_name)
msg = f"using vector store {vector_store_config.get('type')} with container_name {container_name} for embedding {embedding_name}: {collection_name}"
log.info(msg)
return collection_name
msg = f"using vector store {vector_store_config.get('type')} with container_name {container_name} for embedding {embedding_name}: {index_name}"
logger.info(msg)
return index_name
def load_strategy(strategy: TextEmbedStrategyType) -> TextEmbeddingStrategy:

View File

@ -21,7 +21,9 @@ async def run( # noqa RUF029 async is required for interface
) -> TextEmbeddingResult:
"""Run the Claim extraction chain."""
input = input if isinstance(input, Iterable) else [input]
ticker = progress_ticker(callbacks.progress, len(input))
ticker = progress_ticker(
callbacks.progress, len(input), description="generate embeddings progress: "
)
return TextEmbeddingResult(
embeddings=[_embed_text(cache, text, ticker) for text in input]
)

Some files were not shown because too many files have changed in this diff.