graphrag/graphrag/config/models/text_embedding_config.py
Copilot 2030f94eb4
Refactor CacheFactory, StorageFactory, and VectorStoreFactory to use consistent registration patterns and add custom vector store documentation (#2006)
* Initial plan

* Refactor VectorStoreFactory to use registration functionality like StorageFactory

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Fix linting issues in VectorStoreFactory refactoring

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Remove backward compatibility support from VectorStoreFactory and StorageFactory

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Run ruff check --fix and ruff format, add semversioner file

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* ruff formatting fixes

* Fix pytest errors in storage factory tests by updating PipelineStorage interface implementation

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* ruff formatting fixes

* update storage factory design

* Refactor CacheFactory to use registration functionality like StorageFactory

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* revert copilot changes

* fix copilot changes

* update comments

* Fix failing pytest compatibility for factory tests

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* update class instantiation issue

* ruff fixes

* fix pytest

* add default value

* ruff formatting changes

* ruff fixes

* revert minor changes

* cleanup cache factory

* Update CacheFactory tests to match consistent factory pattern

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* update pytest thresholds

* adjust threshold levels

* Add custom vector store implementation notebook

Create comprehensive notebook demonstrating how to implement and register custom vector stores with GraphRAG as a plug-and-play framework. Includes:

- Complete implementation of SimpleInMemoryVectorStore
- Registration with VectorStoreFactory
- Testing and validation examples
- Configuration examples for GraphRAG settings
- Advanced features and best practices
- Production considerations checklist

The notebook provides a complete walkthrough for developers to understand and implement their own vector store backends.

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* remove sample notebook for now

* update tests

* fix cache pytests

* add pandas-stub to dev dependencies

* disable warning check for well known key

* skip tests when running on ubuntu

* add documentation for custom vector store implementations

* ignore ruff findings in notebooks

* fix merge breakages

* speedup CLI import statements

* remove unnecessary import statements in init file

* Add str type option on storage/cache type

* Fix store name

* Add LoggerFactory

* Fix up logging setup across CLI/API

* Add LoggerFactory test

* Fix err message

* Semver

* Remove enums from factory methods

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>
Co-authored-by: Josh Bradley <joshbradley@microsoft.com>
Co-authored-by: Nathan Evans <github@talkswithnumbers.com>
2025-08-28 13:53:07 -07:00

53 lines
1.9 KiB
Python

# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License

"""Parameterization settings for the default configuration."""

from pydantic import BaseModel, Field

from graphrag.config.defaults import graphrag_config_defaults
from graphrag.config.models.language_model_config import LanguageModelConfig


class TextEmbeddingConfig(BaseModel):
    """Configuration section for text embeddings."""

    model_id: str = Field(
        description="The model ID to use for text embeddings.",
        default=graphrag_config_defaults.embed_text.model_id,
    )
    vector_store_id: str = Field(
        description="The vector store ID to use for text embeddings.",
        default=graphrag_config_defaults.embed_text.vector_store_id,
    )
    batch_size: int = Field(
        description="The batch size to use.",
        default=graphrag_config_defaults.embed_text.batch_size,
    )
    batch_max_tokens: int = Field(
        description="The batch max tokens to use.",
        default=graphrag_config_defaults.embed_text.batch_max_tokens,
    )
    names: list[str] = Field(
        description="The specific embeddings to perform.",
        default=graphrag_config_defaults.embed_text.names,
    )
    strategy: dict | None = Field(
        description="The override strategy to use.",
        default=graphrag_config_defaults.embed_text.strategy,
    )

    def resolved_strategy(self, model_config: LanguageModelConfig) -> dict:
        """Get the resolved text embedding strategy."""
        from graphrag.index.operations.embed_text.embed_text import (
            TextEmbedStrategyType,
        )

        return self.strategy or {
            "type": TextEmbedStrategyType.openai,
            "llm": model_config.model_dump(),
            "num_threads": model_config.concurrent_requests,
            "batch_size": self.batch_size,
            "batch_max_tokens": self.batch_max_tokens,
        }
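
As a rough illustration of the fallback in `resolved_strategy` above, the sketch below reproduces the `self.strategy or {...}` pattern with stand-in dataclasses so that it runs without graphrag or pydantic installed. `StubModelConfig`, `StubEmbeddingConfig`, the plain string `"openai"` (used in place of `TextEmbedStrategyType.openai`), and every default value are assumptions made for illustration only; they are not taken from `graphrag_config_defaults`.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class StubModelConfig:
    """Hypothetical stand-in for LanguageModelConfig (not the real class)."""

    model: str = "example-embedding-model"  # arbitrary illustrative value
    concurrent_requests: int = 4  # arbitrary illustrative value

    def model_dump(self) -> dict:
        # Named after pydantic's BaseModel.model_dump(); here it just
        # converts the dataclass fields to a plain dict.
        return asdict(self)


@dataclass
class StubEmbeddingConfig:
    """Hypothetical stand-in for TextEmbeddingConfig."""

    batch_size: int = 16  # arbitrary illustrative value
    batch_max_tokens: int = 8191  # arbitrary illustrative value
    strategy: Optional[dict] = None

    def resolved_strategy(self, model_config: StubModelConfig) -> dict:
        # Same shape as the method above: return the override if it is
        # truthy, otherwise build a default strategy from the settings.
        return self.strategy or {
            "type": "openai",  # placeholder for TextEmbedStrategyType.openai
            "llm": model_config.model_dump(),
            "num_threads": model_config.concurrent_requests,
            "batch_size": self.batch_size,
            "batch_max_tokens": self.batch_max_tokens,
        }


model = StubModelConfig()

# No override: the default strategy is assembled from the config values.
default = StubEmbeddingConfig().resolved_strategy(model)

# Non-empty override: returned as-is, nothing is merged into it.
override = StubEmbeddingConfig(strategy={"type": "mock"}).resolved_strategy(model)

# Empty dict override: falsy in Python, so the default is used instead.
empty = StubEmbeddingConfig(strategy={}).resolved_strategy(model)
```

Because the method relies on `or`, an empty dict is treated the same as `None`, and a non-empty override replaces the default wholesale rather than being merged with it. An override therefore has to carry every key the embedding strategy needs.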