mirror of https://github.com/microsoft/graphrag.git synced 2026-01-14 00:57:23 +08:00

Add Cosmos DB storage/cache option (#1431)

* added cosmosdb constructor and database methods

* added rest of abstract method headers

* added cosmos db container methods

* implemented has and delete methods

* finished implementing abstract class methods

* integrated class into storage factory

* integrated cosmosdb class into cache factory

* added support for new config file fields

* replaced primary key cosmosdb initialization with connection strings

* modified cosmosdb setter to require json

* Fix non-default emitters

* Format

* Ruff

* ruff

* first successful run of cosmosdb indexing

* removed extraneous container_name setting

* require base_dir to be typed as str

* reverted merged changed from closed branch

* removed nested try statement

* readded initial non-parquet emitter fix

* added basic support for parquet emitter using internal conversions

* merged with main and resolved conflicts

* fixed more merge conflicts

* added cosmosdb functionality to query pipeline

* tested query for cosmosdb

* collapsed cosmosdb schema to use minimal containers and databases

* simplified create_database and create_container functions

* ruff fixes and semversioner

* spellcheck and ci fixes

* updated pyproject toml and lock file

* apply fixes after merge from main

* add temporary comments

* refactor cache factory

* refactored storage factory

* minor formatting

* update dictionary

* fix spellcheck typo

* fix default value

* fix pydantic model defaults

* update pydantic models

* fix init_content

* cleanup how factory passes parameters to file storage

* remove unnecessary output file type

* update pydantic model

* cleanup code

* implemented clear method

* fix merge from main

* add test stub for cosmosdb

* regenerate lock file

* modified set method to collapse parquet rows

* modified get method to collapse parquet rows

* updated has and delete methods and docstrings to adhere to new schema

* added prefix helper function

* replaced delimiter for prefixed id

* verified empty tests are passing

* fix merges from main

* add find test

* update cicd step name

* tested querying for new schema

* resolved errors from merge conflicts

* refactored set method to handle cache in new schema

* refactored get method to handle cache in new schema

* force unique ids to be written to cosmos for nodes

* found bug with has and delete methods

* modified has and delete to work with cache in new schema

* fix the merge from main

* minor typo fixes

* update lock file

* spellcheck fix

* fix init function signature

* minor formatting updates

* remove https protocol

* change localhost to 127.0.0.1 address

* update pytest to use bacj engine

* verified cache tests

* improved speed of has function

* resolved pytest error with find function

* added test for child method

* make container_name variable private as _container_name

* minor variable name fix

* cleanup cosmos pytest and make the cosmosdb storage class operations more efficient

* update cicd to use different cosmosdb emulator

* test with http protocol

* added pytest for clear()

* add longer timeout for cosmosdb emulator startup

* revert http connection back to https

* add comments to cicd code for future dev usage

* set to container and database clients to none upon deletion

* ruff changes

* add comments to cicd code

* removed unneeded None statements and ruff fixes

* more ruff fixes

* Update test_run.py

* remove unnecessary call to delete container

* ruff format updates

* Reverted test_run.py

* fix ruff formatter errors

* cleanup variable names to be more consistent

* remove extra semversioner file

* revert pydantic model changes

* revert pydantic model change

* revert pydantic model change

* re-enable inline formatting rule

* update documentation in dev guide

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
Co-authored-by: Josh Bradley <joshbradley@microsoft.com>

2024-12-19 13:43:21 -06:00

5.5 KiB


GraphRAG Development

Requirements

Name	Installation	Purpose
Python 3.10 or 3.11	Download	The library is Python-based.
Poetry	Instructions	Poetry is used for package management and virtualenv management in Python codebases.

Getting Started

Install Dependencies

# install python dependencies
poetry install

Execute the indexing engine

poetry run poe index <...args>

Execute prompt tuning

poetry run poe prompt_tune <...args>

Execute Queries

poetry run poe query <...args>

Repository Structure

An overview of the repository's top-level folder structure is provided below, detailing the overall design and purpose. We leverage a factory design pattern where possible, enabling a variety of implementations for each core component of graphrag.

graphrag
├── api             # library API definitions
├── cache           # cache module supporting several options
│    └─ factory.py  #  └─ main entrypoint to create a cache
├── callbacks       # a collection of commonly used callback functions
├── cli             # library CLI
│    └─ main.py     #  └─ primary CLI entrypoint
├── config          # configuration management
├── index           # indexing engine
│    └─ run/run.py  #  └─ main entrypoint to build an index
├── logger          # logger module supporting several options
│    └─ factory.py  #  └─ main entrypoint to create a logger
├── model           # data model definitions associated with the knowledge graph
├── prompt_tune     # prompt tuning module 
├── prompts         # a collection of all the system prompts used by graphrag
├── query           # query engine
├── storage         # storage module supporting several options
│    └─ factory.py  #  └─ main entrypoint to create/load a storage endpoint
├── utils           # helper functions used throughout the library
└── vector_stores   # vector store module containing a few options
     └─ factory.py  #  └─ main entrypoint to create a vector store

Where appropriate, the factories expose a registration method for users to provide their own custom implementations if desired.
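
As a rough illustration of the factory-with-registration pattern described above, here is a minimal sketch. The names `CacheFactory`, `register`, `create`, and `InMemoryCache` are hypothetical and chosen for this example only; the actual signatures in each `factory.py` may differ, so check the module before relying on them.

```python
from typing import Callable, ClassVar


class InMemoryCache:
    """Toy cache used only for this illustration (not a graphrag class)."""

    def __init__(self, **kwargs: object) -> None:
        self.options = kwargs
        self.data: dict[str, object] = {}


class CacheFactory:
    """Hypothetical factory: maps a type name to a creator callable."""

    _creators: ClassVar[dict[str, Callable[..., object]]] = {}

    @classmethod
    def register(cls, cache_type: str, creator: Callable[..., object]) -> None:
        # Later registrations under the same name replace earlier ones.
        cls._creators[cache_type] = creator

    @classmethod
    def create(cls, cache_type: str, **kwargs: object) -> object:
        try:
            creator = cls._creators[cache_type]
        except KeyError:
            raise ValueError(f"Unknown cache type: {cache_type}") from None
        return creator(**kwargs)


# A user-supplied implementation is registered once, then created by name.
CacheFactory.register("memory", InMemoryCache)
cache = CacheFactory.create("memory", ttl_seconds=60)
```

Keeping the lookup table inside the factory means callers only need a type name from configuration, and new implementations can be added without touching the calling code.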

Versioning

We use semversioner to automate and enforce semantic versioning in the release process. Our CI/CD pipeline checks that all PRs include a JSON file generated by semversioner. When submitting a PR, please run:

poetry run semversioner add-change -t patch -d "<a small sentence describing changes made>."
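
To confirm locally that a change file exists before opening a PR, something like the following sketch may help. It assumes semversioner writes its change files as JSON with `type` and `description` keys under `.semversioner/next-release/`; that layout is an assumption here, and `pending_changes` is a hypothetical helper, not part of semversioner or graphrag.

```python
import json
from pathlib import Path


def pending_changes(repo_root: str = ".") -> list[dict]:
    """Return parsed change files found under .semversioner/next-release.

    The directory layout is an assumption; a missing directory yields [].
    """
    change_dir = Path(repo_root) / ".semversioner" / "next-release"
    return [
        json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(change_dir.glob("*.json"))
    ]


changes = pending_changes()
print(f"{len(changes)} pending change file(s)")
for change in changes:
    print(f'{change.get("type")}: {change.get("description")}')
```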

Azurite

Some unit and smoke tests use Azurite to emulate Azure resources. This can be started by running:

./scripts/start-azurite.sh

or by simply running azurite in the terminal if already installed globally. See the Azurite documentation for more information about how to install and use Azurite.
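
Before running tests that depend on Azurite, it can be useful to check that the emulator is reachable. The sketch below uses only the standard library and assumes Azurite's blob service is on its default address, 127.0.0.1:10000; adjust the host and port if your setup differs. `is_listening` is a hypothetical helper name.

```python
import socket


def is_listening(host: str = "127.0.0.1", port: int = 10000, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


state = "reachable" if is_listening() else "not reachable"
print(f"Azurite blob endpoint on 127.0.0.1:10000 is {state}")
```

This only confirms that something accepts connections on the port; it does not verify that the process is Azurite.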

Lifecycle Scripts

Our Python package utilizes Poetry to manage dependencies and poethepoet to manage custom build scripts.

Available scripts are:

poetry run poe index - Run the Indexing CLI
poetry run poe query - Run the Query CLI
poetry build - This will build a wheel file and other distributable artifacts.
poetry run poe test - This will execute all tests.
poetry run poe test_unit - This will execute unit tests.
poetry run poe test_integration - This will execute integration tests.
poetry run poe test_smoke - This will execute smoke tests.
poetry run poe check - This will perform a suite of static checks across the package, including:
- formatting
- documentation formatting
- linting
- security patterns
- type-checking
poetry run poe fix - This will apply any available auto-fixes to the package. Usually this is just formatting fixes.
poetry run poe fix_unsafe - This will apply any available auto-fixes to the package, including those that may be unsafe.
poetry run poe format - Explicitly run the formatter across the package.

Troubleshooting

"RuntimeError: llvm-config failed executing, please point LLVM_CONFIG to the path for llvm-config" when running poetry install

Make sure llvm-9 and llvm-9-dev are installed:

sudo apt-get install llvm-9 llvm-9-dev

Then add the following to your bashrc:

export LLVM_CONFIG=/usr/bin/llvm-config-9

"numba/_pymodule.h:6:10: fatal error: Python.h: No such file or directory" when running poetry install

Make sure you have python3.10-dev installed or, more generally, python<version>-dev:

sudo apt-get install python3.10-dev
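
To see whether the header is present for the interpreter you are actually using, a quick check along these lines can help. It relies on `sysconfig` reporting the include directory, which is usually where `Python.h` lives; treat the result as a hint, since distributions can lay out headers differently. `python_header_path` is a hypothetical helper name.

```python
import sysconfig
from pathlib import Path


def python_header_path() -> Path:
    """Return where Python.h is expected for the running interpreter."""
    return Path(sysconfig.get_paths()["include"]) / "Python.h"


header = python_header_path()
if header.exists():
    print(f"Found {header}")
else:
    print(f"Missing {header}; install the matching python<version>-dev package")
```

Run it with the same interpreter Poetry uses (for example via `poetry run python`) so the check reflects the environment that performs the build.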

LLM call constantly exceeds TPM, RPM or time limits

GRAPHRAG_LLM_THREAD_COUNT and GRAPHRAG_EMBEDDING_THREAD_COUNT are both set to 50 by default. You can modify these values to reduce concurrency. Please refer to the Configuration Documents.
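
To see why a lower thread count eases rate-limit pressure, the toy simulation below caps the number of simultaneous calls with a semaphore and reports the peak in flight. It is not how graphrag schedules requests; `run_bounded` and `fake_llm_call` are hypothetical names, and the cap of 4 is arbitrary. Only the environment variable name and its default of 50 come from the text above.

```python
import asyncio
import os

# Read the limit from the environment, falling back to the documented default.
limit = int(os.environ.get("GRAPHRAG_LLM_THREAD_COUNT", "50"))


async def run_bounded(n_calls: int, max_concurrent: int) -> int:
    """Run n_calls fake requests; return the peak number in flight at once."""
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def fake_llm_call() -> None:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for a network round trip
            in_flight -= 1

    await asyncio.gather(*(fake_llm_call() for _ in range(n_calls)))
    return peak


# Cap the demo at 4 so the effect is visible even with the default of 50.
peak = asyncio.run(run_bounded(n_calls=20, max_concurrent=min(limit, 4)))
print(f"peak concurrent calls: {peak}")
```

The peak never exceeds the cap, so lowering the cap lowers the number of requests that can reach the LLM endpoint within any given window.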
