graphrag

mirror of https://github.com/microsoft/graphrag.git synced 2026-01-14 09:07:20 +08:00

Author	SHA1	Message	Date
Nathan Evans	66c2cfb3ce	Support JSON input files (#1777 ) * Add csv loader tests * Add test loader tests * Add json input support * Remove temp path constraint * Reuse loader code * Semver * Set file pattern automatically based on type, if empty * Remove pattern from smoke test config * Spelling --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2025-03-10 14:04:07 -07:00
Nathan Evans	bd06d8b4f0	Context property bag ("state") (#1774 ) * Add pipeline state property bag to run context * Move state creation out of context util * Move callbacks into PipelineRunContext * Semver * Rename state.json to context.json to avoid confusion with stats.json * Expand smoke test row count * Add util to create storage and cache	2025-02-28 09:31:48 -08:00
Nathan Evans	e40476153d	Speed up smoke tests (#1736 ) * Move verb tests to regular CI * Clean up env vars * Update smoke runtime expectations * Rework artifact assertions * Fix plural in name * remove redundant artifact len check * Remove redundant artifact len check * Adjust graph output expectations * Update community expectations * Include all workflow output * Adjust text unit expectations * Adjust assertions per dataset * Fix test config param name * Update nan allowed for optional model fields --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2025-02-25 13:24:35 -08:00
Nathan Evans	96219a2182	Register workflows (#1691 ) * Add workflow registration * Add ability to mutate config by workflows * Separate graph finalization * Separate graph pruning * Semver * Update tests * Update smoke tests * Fix iterrows on create_graph * Remove prune_graph from llm construction * Update test data * Remove prune_graph from smoke tests	2025-02-14 13:21:31 -08:00
Josh Bradley	f14cda2b6d	Improve default llm retry logic to be more optimized (#1701 )	2025-02-13 16:56:37 -05:00
Nathan Evans	c02ab0984a	Streamline workflows (#1674 ) * Remove create_final_nodes * Rename final entity output to "entities" * Remove duplicate code from graph extraction * Rename create_final_relationships output to "relationships" * Rename create_final_communities output to "communities" * Combine compute_communities and create_final_communities * Rename create_final_covariates output to "covariates" * Rename create_final_community_reports output to "community_reports" * Rename create_final_text_units output to "text_units" * Rename create_final_documents output to "documents" * Remove transient snapshots config * Move create_final_entities to finalize_entities operation * Move create_final_relationships flow to finalize_relationships operation * Reuse some community report functions * Collapse most of graph and text unit-based report generation * Unify schemas files * Move community reports extractor * Move NLP report prompt to prompts folder * Fix a few pandas warnings * Rename embeddings config to embed_text * Rename claim_extraction config to extract_claims * Remove nltk from standard graph extraction * Fix verb tests * Fix extract graph config naming * Fix moved file reference * Create v1-to-v2 migration notebook * Semver * Fix smoke test artifact count * Raise tpm/rpm on smoke tests * Update drift settings for smoke tests * Reuse project directory var in api notebook * Format * Format	2025-02-07 11:11:03 -08:00
Alonso Guevara	0805924a35	Fix/drift n depth (#1676 ) * Fix n_depth param * Semver * Change smoke tests params for drift * Reduce log printing for expected exceptions	2025-02-05 17:22:34 -06:00
Derek Worthen	94bd2bb816	Require explicit azure auth settings when using AOI. (#1665 ) * Require explicit azure auth settings when using AOI. - Must set LanguageModel.azure_auth_type to either "api_key" or "managed_identity" when using AOI. * Fix smoke tests * Use general auth_type property instead of azure_auth_type * Remove unused error type * Update validation * Update validation comment	2025-01-29 12:28:47 -08:00
Derek Worthen	eeee84e9d9	Add vector store id reference to embeddings config. (#1662 )	2025-01-28 10:46:41 -08:00
KennyZhang1	1bbce33f42	Multi-index querying for API layer (#1644 ) * added multi-global-query function header * ported over code for merging dataframes * added connection to global streaming api function * added function header for update context helper * implemented and incorporated update_context function * Updated to make sure 'parent' column in final_communities gets incremented for multi index. * first cut at multi_local_search function * several minor changes and fixes * Updated multi index local search. * Cleaned up code. * fixed lambda function ruff errors * fixed more ruff errors * moved query api helpers to util file * moved index api helpers to util file * merged in code left out of conflict * changed GraphRagConfig object to support lists of vector stores * Updated with fixes for multi_local_search. * Minor updates. * Minor updates. * Updates for ruff check. * Minor updates. * removed redundant vector_store_configs arg * ruff formatting changes * semversioner * Minor fix. * spellcheck fixes * ruff * test fix for cicd errors * another test fix * added explicit typing for ci tests * added dict type check for vector_store during indexing * more ruff fixes * moved type check * Removed streaming. Added multi drift and basic searches. * Formatting changes. * Updates for pyright. * Update for ruff. * Ruff formatted. 
* first cut at fixing vector store typing errors * got multi local search working with new config * ruff and test fixes * added fix for embeddings type error * renamed multi index api functions * ruff * convert config model to dict[VectorStoreConfig] * modified tests to support new vector_store model * ruff fixes * changed some test setups to match new model * changed ci/cd settings files to match new structure * Fix stderror check * fixed bug in vector_store_config validation * ruff * add database_name field to vectorstoreconfig * removed print statements * small refactoring for PR comments * modified default config in test * modified vector store config unit test --------- Co-authored-by: dorbaker <dorbaker@microsoft.com> Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2025-01-27 17:26:38 -05:00
Alonso Guevara	6b33977360	Add smoke tests for drift (#1658 )	2025-01-24 12:31:37 -06:00
Derek Worthen	c644338bae	Refactor config (#1593 ) * Refactor config - Add new ModelConfig to represent LLM settings - Combines LLMParameters, ParallelizationParameters, encoding_model, and async_mode - Add top level models config that is a list of available LLM ModelConfigs - Remove LLMConfig inheritance and delete LLMConfig - Replace the inheritance with a model_id reference to the ModelConfig listed in the top level models config - Remove all fallbacks and hydration logic from create_graphrag_config - This removes the automatic env variable overrides - Support env variables within config files using Templating - This requires "$" to be escaped with extra "$" so ".\\.txt$" becomes ".\\.txt$$" - Update init content to initialize new config file with the ModelConfig structure * Use dict of ModelConfig instead of list * Add model validations and unit tests * Fix ruff checks * Add semversioner change * Fix unit tests * validate root_dir in pydantic model * Rename ModelConfig to LanguageModelConfig * Rename ModelConfigMissingError to LanguageModelConfigMissingError * Add validation for unexpected API keys * Allow skipping pydantic validation for testing/mocking purposes. 
* Add default lm configs to verb tests * smoke test * remove config from flows to fix llm arg mapping * Fix embedding llm arg mapping * Remove timestamp from smoke test outputs * Remove unused "subworkflows" smoke test properties * Add models to smoke test configs * Update smoke test output path * Send logs to logs folder * Fix output path * Fix csv test file pattern * Update placeholder * Format * Instantiate default model configs * Fix unit tests for config defaults * Fix migration notebook * Remove create_pipeline_config * Remove several unused config models * Remove indexing embedding and input configs * Move embeddings function to config * Remove skip_workflows * Remove skip embeddings in favor of explicit naming * fix unit test spelling mistake * self.models[model_id] is already a language model. Remove redundant casting. * update validation errors to instruct users to rerun graphrag init * instantiate LanguageModelConfigs with validation * skip validation in unit tests * update verb tests to use default model settings instead of skipping validation * test using llm settings * cleanup verb tests * remove unsafe default model config * remove the ability to skip pydantic validation * remove None union types when default values are set * move vector_store from embeddings to top level of config and delete resolve_paths * update vector store settings * fix vector store and smoke tests * fix serializing vector_store settings * fix vector_store usage * fix vector_store type * support cli overrides for loading graphrag config * rename storage to output * Add --force flag to init * Remove run_id and resume, fix Drift config assignment * Ruff --------- Co-authored-by: Nathan Evans <github@talkswithnumbers.com> Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2025-01-21 17:52:06 -06:00
Nathan Evans	47adfe16f0	Fix DRIFT search on Azure AI Search (#1645 ) * Add vector field to retrievable fields for Azure AI Search * Add DRIFT and Basic search to smoke tests * Semver * Format * Remove DRIFT smoke test for now (brittle)	2025-01-21 17:28:46 -06:00
Nathan Evans	c1c09bab80	Flow cleanup (#1510 ) * Move snapshots out of flows into verbs * Move degree compute out of extract_graph * Move entity/relationship df merging into extract * Move "title" to extraction source * Move text_unit_ids agg closer to extraction * Move data definition * Update test data * Semver * Update smoke tests * Fix empty degree field and update smoke tests and verb data * Move extractors (#1516) * Consolidate graph embedding and umap * Consolidate claim extraction * Consolidate graph extractor * Move graph utils * Move summarizers * Semver --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com> * Fix syntax typo --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-12-18 18:07:44 -08:00
Nathan Evans	1d68af308b	Community workflow (#1495 ) * Create separate communities workflow * Add test for new workflow * Rename workflows * Collapse subflows into parents * Rename flows, reuse variables * Semver * Fix integration test * Fix smoke tests * Fix megapipeline format * Rename missed files --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-12-11 15:41:16 -06:00
Nathan Evans	c8c354e357	Artifact cleanup (#1341 ) * Add source documents for verb tests * Remove entity_type erroneous column * Add new test data * Remove source/target degree columns * Remove top_level_node_id * Remove chunk column configs * Rename "chunk" to "text" * Rename "chunk" to "text" in base * Re-map document input to use base text units * Revert base text units as final documents dep * Update test data * Split/rename node source_id * Drop node size (dup of degree) * Drop document_ids from covariates * Remove unused document_ids from models * Remove n_tokens from covariate table * Fix missed document_ids delete * Wire base text units to final documents * Rename relationship rank as combined_degree * Add rank as first-class property to Relationship * Remove split_text operation * Fix relationships test parquet * Update test parquets * Add entity ids to community table * Remove stored graph embedding columns * Format * Semver * Fix JSON typo * Spelling * Rename lancedb * Sort lancedb * Fix unit test * Fix test to account for changing period * Update tests for separate embeddings * Format * Better assertion printing * Fix unit test for windows * Rename document.raw_content -> document.text * Remove read_documents function * Remove unused document summary from model * Remove unused imports * Format * Add new snapshots to default init * Use util to construct embeddings collection name * Align inc index model with branch changes * Update data and tests for int ids * Clean up embedding locs * Switch entity "name" to "title" for consistency * Fix short_id -> human_readable_id defaults * Format * Rework community IDs * Fix community size compute * Fix unit tests * Fix report read * Pare down nodes table output * Fix unit test * Fix merge * Fix community loading * Format * Fix community id report extraction * Update tests * Consistent short IDs and ordering * Update ordering and tests * Update incremental for new nodes model * Guard document columns loc * Match 
column ordering * Fix document guard * Update smoke tests * Fill NA on community extract * Logging for smoke test debug * Add parquet schema details doc * Fix community hierarchy guard * Use better empty hierarchy guard * Back-compat shims * Semver * Fix warning * Format * Remove default fallback * Reuse key	2024-11-13 15:11:19 -08:00
Nathan Evans	634e3ed62a	Transient entity graph (#1349 ) * Make base_entity_graph transient * Add transient snapshots * Semver * Fix unit test * Fix smoke tests	2024-11-04 17:23:29 -08:00
gaudyb	17658c5df8	New workflow to generate embeddings in a single workflow (#1296 ) * New workflow to generate embeddings in a single workflow * New workflow to generate embeddings in a single workflow * version change * clean tests without any embeddings references * clean tests without any embeddings references * remove code * feedback implemented * changes in logic * feedback implemented * store in table bug fixed * smoke test for generate_text_embeddings workflow * smoke test fix * add generate_text_embeddings to the list of transient workflows * smoke tests * fix * ruff formatting updates * fix * smoke test fixed * smoke test fixed * fix lancedb import * smoke test fix * ignore sorting * smoke test fixed * smoke test fixed * check smoke test * smoke test fixed * change config for vector store * format fix * vector store changes * revert debug profile back to empty filepath * merge conflict solved * merge conflict solved * format fixed * format fixed * fix return dataframe * snapshot fix * format fix * embeddings param implemented * validation fixes * fix map * fix map * fix properties * config updates * smoke test fixed * settings change * Update collection config and rework back-compat * Replace . with - for embedding store --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com> Co-authored-by: Josh Bradley <joshbradley@microsoft.com> Co-authored-by: Nathan Evans <github@talkswithnumbers.com>	2024-11-01 15:01:35 -07:00
Josh Bradley	0cc79b9cf7	Add backwards compatibility patch for vector store (#1334 )	2024-10-29 14:54:08 -04:00
Josh Bradley	d6e6f5c077	Convert CLI to Typer app (#1305 )	2024-10-24 14:22:32 -04:00
Nathan Evans	94f1e62e5c	Rework workflow architecture (#1311 ) * Rename pipeline_storage file * Add runtime storage option to context * Fix import * Switch to memory storage for runtime * Infra for workflow runtime storage * Migrate base_text_units to runtime storage * Fix comment * Semver * Remove whitespace * Remove subflow smoke tests and ignore transient artifacts * Remove entity graph from transient list (not yet implemented) * Increase smoke runtime allotment for create_base_entity_graph * Revert format fix * Remove noqa	2024-10-24 10:20:03 -07:00
Alonso Guevara	8a6d4e66fe	DRIFT Search (#1285 ) * drift search * args for drift global query in local search * accept drift context in search base * optionally parse embeddings from df when creating CommunityReport * abstract class for drift context * pathing for drift config * drift config * add defs for drift config * formatting * capture generated tokens in token count * semversion * Formatting and ruff * Some algorithmic refactors * Ruff * Format * Use asdict() * Address comments * Update smoke tests * Update smoke tests * Update smoke tests part 2 --------- Co-authored-by: Julian Whiting <j2whitin@gmail.com>	2024-10-21 17:22:11 -06:00
KennyZhang1	e0840a2dc4	Fix vector store logic and refactor audience parameter (#1259 )	2024-10-21 16:56:56 -04:00
Nathan Evans	ce5b1207e0	Collapse graph documents workflows (#1284 ) * Copy base documents logic into final documents * Delete create_base_documents * Combine graph creation under create_base_entity_graph * Delete collapsed workflows * Migrate most graph internals to nx.Graph * Fix None edge case * Semver * Remove comment typo * Fix smoke tests	2024-10-15 13:58:58 -06:00
Nathan Evans	9070ea5c3c	Collapse create base extracted entities (#1235 ) * Set up base assertions * Replace entity_extract * Finish collapsing workflow * Semver * Update smoke tests	2024-09-30 17:32:56 -07:00
Nathan Evans	5220bb7ecc	Collapse create base entity graph (#1233 ) * Collapse create_base_entity_graph * Format/typing * Semver * Fix smoke tests * Simplify assignment	2024-09-30 15:39:42 -07:00
Nathan Evans	00d5e77568	Collapse create final community reports (#1227 ) * Remove extraneous param * Add community report mocking assertions * Collapse primary report generation * Collapse embeddings * Format * Semver * Remove extraneous check * Move option set	2024-09-30 10:46:07 -07:00
Nathan Evans	ce71bcf7fb	Collapse create final entities (#1220 ) * Collapse create_final_entities * Update smoke tests * Semver * Remove prints * Update embedding assertions	2024-09-25 17:35:44 -07:00
Nathan Evans	3217013019	Revisit create final text units (#1216 ) * Add embeddings to collapsed subflow * Semver * Fix smoke tests	2024-09-25 16:55:27 -07:00
Nathan Evans	73e709b686	Collapse create final covariates (#1215 ) * Add covariate test * Add detailed mock assertions * Collapse create_final_covariates * Delete unused doc_id field * Semver * Update smoke test * Remove unused subject/object type columns	2024-09-25 16:30:22 -07:00
Nathan Evans	f518c8b80b	Collapse relationship embeddings (#1199 ) * Merge text_embed into a single relationships subflow * Update smoke tests * Semver * Spelling	2024-09-24 15:03:26 -07:00
Nathan Evans	1755afbdec	Collapse create base text units (#1178 ) * Collapse non-attribute verbs * Include document_column_attributes in collapse * Remove merge_override verb * Semver * Setup initial test and config * Collapse create_base_text_units * Semver * Spelling * Fix smoke tests * Address PR comments --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-09-23 16:55:53 -07:00
Nathan Evans	fbc483e4e5	Collapse create base documents (#1176 ) * Collapse non-attribute verbs * Include document_column_attributes in collapse * Remove merge_override verb * Semver * Clean up some df/tests	2024-09-23 13:24:06 -07:00
Nathan Evans	f8ab1b30dc	Collapse create_final_nodes (#1171 ) * Collapse create_final_nodes * Update smoke tests * Typo --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-09-20 13:48:56 -07:00
Nathan Evans	ae094bb144	Collapse create final relationships (#1158 ) * Collapse pre/post embedding workflows * Semver * Fix smoke tests --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-09-19 17:38:01 -06:00
Derek Worthen	3b09df6e07	Migrate towards using static output directories (#1113 ) * Migrate towards using static output directories - Fixes load_config eagerly resolving directories. Directories are only resolved when the output directories are local. - Add support for `--output` and `--reporting` flags for index CLI. To achieve previous output structure `index --output run1/artifacts --reports run1/reports`. - Use static output directories when initializing a new project. - Maintains backward compatibility for those using timestamp outputs locally. * fix smoke tests * update query cli to work with static directories * remove eager path resolution from load_config. Support CLI overrides that can be resolved. * add docs and output logs/artifacts to same directory * use match statement * switch back to if statement --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-09-18 17:36:50 -06:00
Nathan Evans	aa5b426f1d	Collapse final communities workflow (#1150 ) * Collapse create_final_communities * Semver * Spellcheck * Clean up filtering * Add space in title * Format * Cleanup imports and format * Spruce up the tests * Update dictionary.txt * Spellcheck --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-09-17 17:04:42 -07:00
Nathan Evans	a473265580	Collapse verbs: create_final_text_units (#1143 ) * Load default config in verb tests * Load proper workflow config * Collapse text unit pre-embedding steps * Format * Update smoke tests * Semver * Format * Merge join* subflows into create_final_text_units * Remove join_text_units_to_covariate_ids * Format * Remove join_text_units_to_entity_ids * Remove join_text_units_to_relationship_ids * Clean up merges and aggregations * Remove unnecessary cast	2024-09-17 10:32:25 -07:00
Nathan Evans	d22c0e7836	Covariate collapse (#1142 ) * Setup basic verb test runner * Replace join_text_units_to_entity_ids with subflow * Update comments * Replace join_text_units_to_relationship_ids subflow * Roll in final select * Reuse assertion util * Small fix + format * Format/typing * Semver * Format/typing * Semver * Revert format changes * Fix smoke test subworkflow count * Edit subworkflows for another smoke test * Update test parquets for covariates * Collapse covariate join * Rework subtasks for per-flow customization * Format * Semver * Fix smoke test	2024-09-16 12:35:45 -07:00
Nathan Evans	2de302ff0d	Verb merge nre1 (#1140 ) * Setup basic verb test runner * Replace join_text_units_to_entity_ids with subflow * Update comments * Replace join_text_units_to_relationship_ids subflow * Roll in final select * Reuse assertion util * Small fix + format * Format/typing * Semver * Format/typing * Semver * Revert format changes * Fix smoke test subworkflow count * Edit subworkflows for another smoke test	2024-09-16 12:10:29 -07:00
Alonso Guevara	0b7c5a6ae9	Add cast check on schema validation for community reports (#932 ) * Add support for both float and int on schema validation for community report generation * Cast instead of type check * Add missing file * Add prompt with ints to smoke tests * Fix unit tests * Fix unit tests	2024-08-14 16:40:47 -06:00
Alonso Guevara	c451aa0093	Update smoke tests (#861 ) * Run smoke tests on 4o * Shorten dulce for smoke tests * Update secrets for consistency	2024-08-08 13:07:44 -06:00
Alonso Guevara	81b81cf60b	Initial Release	2024-07-01 15:25:30 -06:00

43 Commits