graphrag

mirror of https://github.com/microsoft/graphrag.git synced 2026-01-14 09:07:20 +08:00

Author	SHA1	Message	Date
Nathan Evans	a35cb12741	Remove datashaper strip code (#1581 ) Remove datashaper	2025-01-03 13:59:26 -08:00
dependabot[bot]	58f646a019	Bump ruff from 0.8.4 to 0.8.5 (#1579 ) * Bump ruff from 0.8.4 to 0.8.5 Bumps [ruff](https://github.com/astral-sh/ruff) from 0.8.4 to 0.8.5. - [Release notes](https://github.com/astral-sh/ruff/releases) - [Changelog](https://github.com/astral-sh/ruff/blob/main/CHANGELOG.md) - [Commits](https://github.com/astral-sh/ruff/compare/0.8.4...0.8.5) --- updated-dependencies: - dependency-name: ruff dependency-type: direct:development update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> * Fix ruff * Semver * Another ruff --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2025-01-02 17:45:52 -06:00
Derek Worthen	80367be018	Remove config input models (#1570 ) * Remove config input models * remove unit tests related to config input models * add semversioner change * Merge branch 'main' into config-remove-input-models	2025-01-02 15:25:10 -08:00
gaudyb	185f513ca7	Basic search implementation (#1563 ) * basic search implementation * basic streaming functionality * format check * check fix * release change * Chore/gleanings any encoding (#1569) * Make claims and entities independent of encoding * Semver * Change semver release type --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2025-01-02 13:49:11 -06:00
Nathan Evans	a2647da473	Simplify flow config (#1554 ) * Flatten compute_communities config * Remove cluster strategy type * Flatten create_base_text_units config * Move cluster seed to config default, leave as None in functions * Remove "prechunked" logic * Remove hard-coded encoding model * Remove unused variables * Strongly type embed_config * Simplify layout_graph config * Semver * Fix integration test * Fix config unit tests: ignore new config defaults * Remove pipeline integ test	2024-12-27 16:38:36 -08:00
KennyZhang1	8368b12532	Add Cosmos DB storage/cache option (#1431 ) * added cosmosdb constructor and database methods * added rest of abstract method headers * added cosmos db container methods * implemented has and delete methods * finished implementing abstract class methods * integrated class into storage factory * integrated cosmosdb class into cache factory * added support for new config file fields * replaced primary key cosmosdb initialization with connection strings * modified cosmosdb setter to require json * Fix non-default emitters * Format * Ruff * ruff * first successful run of cosmosdb indexing * removed extraneous container_name setting * require base_dir to be typed as str * reverted merged changed from closed branch * removed nested try statement * readded initial non-parquet emitter fix * added basic support for parquet emitter using internal conversions * merged with main and resolved conflicts * fixed more merge conflicts * added cosmosdb functionality to query pipeline * tested query for cosmosdb * collapsed cosmosdb schema to use minimal containers and databases * simplified create_database and create_container functions * ruff fixes and semversioner * spellcheck and ci fixes * updated pyproject toml and lock file * apply fixes after merge from main * add temporary comments * refactor cache factory * refactored storage factory * minor formatting * update dictionary * fix spellcheck typo * fix default value * fix pydantic model defaults * update pydantic models * fix init_content * cleanup how factory passes parameters to file storage * remove unnecessary output file type * update pydantic model * cleanup code * implemented clear method * fix merge from main * add test stub for cosmosdb * regenerate lock file * modified set method to collapse parquet rows * modified get method to collapse parquet rows * updated has and delete methods and docstrings to adhere to new schema * added prefix helper function * replaced delimiter for prefixed id * 
verified empty tests are passing * fix merges from main * add find test * update cicd step name * tested querying for new schema * resolved errors from merge conflicts * refactored set method to handle cache in new schema * refactored get method to handle cache in new schema * force unique ids to be written to cosmos for nodes * found bug with has and delete methods * modified has and delete to work with cache in new schema * fix the merge from main * minor typo fixes * update lock file * spellcheck fix * fix init function signature * minor formatting updates * remove https protocol * change localhost to 127.0.0.1 address * update pytest to use bacj engine * verified cache tests * improved speed of has function * resolved pytest error with find function * added test for child method * make container_name variable private as _container_name * minor variable name fix * cleanup cosmos pytest and make the cosmosdb storage class operations more efficient * update cicd to use different cosmosdb emulator * test with http protocol * added pytest for clear() * add longer timeout for cosmosdb emulator startup * revert http connection back to https * add comments to cicd code for future dev usage * set to container and database clients to none upon deletion * ruff changes * add comments to cicd code * removed unneeded None statements and ruff fixes * more ruff fixes * Update test_run.py * remove unnecessary call to delete container * ruff format updates * Reverted test_run.py * fix ruff formatter errors * cleanup variable names to be more consistent * remove extra semversioner file * revert pydantic model changes * revert pydantic model change * revert pydantic model change * re-enable inline formatting rule * update documentation in dev guide --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com> Co-authored-by: Josh Bradley <joshbradley@microsoft.com>	2024-12-19 13:43:21 -06:00
Nathan Evans	c1c09bab80	Flow cleanup (#1510 ) * Move snapshots out of flows into verbs * Move degree compute out of extract_graph * Move entity/relationship df merging into extract * Move "title" to extraction source * Move text_unit_ids agg closer to extraction * Move data definition * Update test data * Semver * Update smoke tests * Fix empty degree field and update smoke tests and verb data * Move extractors (#1516) * Consolidate graph embedding and umap * Consolidate claim extraction * Consolidate graph extractor * Move graph utils * Move summarizers * Semver --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com> * Fix syntax typo --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-12-18 18:07:44 -08:00
Nathan Evans	d0543d1fd6	Move extractors (#1516 ) * Consolidate graph embedding and umap * Consolidate claim extraction * Consolidate graph extractor * Move graph utils * Move summarizers * Semver --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-12-18 16:21:41 -08:00
Nathan Evans	1d68af308b	Community workflow (#1495 ) * Create separate communities workflow * Add test for new workflow * Rename workflows * Collapse subflows into parents * Rename flows, reuse variables * Semver * Fix integration test * Fix smoke tests * Fix megapipeline format * Rename missed files --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-12-11 15:41:16 -06:00
Josh Bradley	823342188d	Cleanup factory methods (#1482 ) * cleanup factory methods to have similar design pattern across codebase * add semversioner file * cleanup logging factory * update developer guide * add comment * typo fix * cleanup reporter terminology * renmae reporter to logger * fix comments * update comment * instantiate factory classes correctly and update index api callback parameter --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-12-10 16:11:11 -06:00
Alonso Guevara	04405803db	Add Parent to communities in data model (#1491 ) * Add Parent to communities in data model * Semver * Pyright * Update docs * Use leiden cluster parent id * Format	2024-12-10 14:38:11 -06:00
Alonso Guevara	1c3b0f34c3	Chore/lib updates (#1477 ) * Update dependencies and fix issues * Format * Semver * Fix Pyright * Pyright * More Pyright * Pyright	2024-12-06 14:08:24 -06:00
Chris Trevino	5ff2d3c76d	Remove graphrag.llm, replace with fnllm (#1315 ) * add fnllm; remove llm folder * remove llm unit tests * update imports * update imports * formatting * enable autosave * update mockllm * update community reports extractor * move most llm usage to fnllm * update type issues * fix unit tests * type updates * update dictionary * semver * update llm construction, get integration tests working * load from llmparameters model * move ruff settings to ruff.toml * add gitattributes file * ignore ruff.toml spelling * update .gitattributes * update gitignore * update config construction * update prompt var usage * add cache adapter * use cache adapter in embeddings calls * update embedding strategy * add fnllm * add pytest-dotenv * fix some verb tests * get verbtests running * update ruff.toml for vscode * enable ruff native server in vscode * update artifact inspecting code * remove local-test update * use string.replace instead of string.format in community reprots etxractor * bump timeout * revert ruff.toml, vscode settings for another pr * revert cspell config * revert gitignore * remove json-repair, update fnllm * use fnllm generic type interfaces * update load_llm to use target models * consolidate chat parameters * add 'extra_attributes' prop to community report response * formatting * update fnllm * formatting * formatting * Add defaults to some llm params to avoid null on params hash * Formatting --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com> Co-authored-by: Josh Bradley <joshbradley@microsoft.com>	2024-12-05 18:07:47 -06:00
Alonso Guevara	d43124e576	Refactor Create Final Community reports to simplify code (#1456 ) * Optimize prep claims * Optimize community hierarchy restore * Partial optimization of prepare_community_reports * More optimization code * Fix context string generation * Filter community -1 * Fix cache, add more optimization fixes * Fix local search community ids * Cleanup * Format * Semver * Remove perf counter * Unused import * Format * Fix edge addition to reports * Add edge by edge context creation * Re-org of the optimization code * Format * Ruff * Some Ruff fixes * More pyright * More pyright * Pyright * Pyright * Update tests	2024-12-05 17:13:05 -06:00
KennyZhang1	10f84c91eb	Replace md5 hash (#1470 ) * switched hashing function helper to sha256 * refactored references to hashing util * semversioner * switched from sha256 to sha512 * new semversioner * updated tests/verbs/data folder * generated fresh parquet files in data folder * moved ignore flag	2024-12-05 13:24:35 -06:00
Nathan Evans	d17dfd01f9	Graph collapse (#1464 ) * Refactor graph creation * Semver * Spellcheck * Update integ pipeline * Fix cast * Improve pandas chaining * Cleaner apply * Use list comprehensions --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-12-05 11:57:26 -06:00
Josh Bradley	dad2176b3c	Miscellaneous code cleanup procedures (#1452 )	2024-11-27 13:27:43 -05:00
Josh Bradley	22a57d14c7	Improve CLI speed with lazy imports (#1319 )	2024-11-15 19:41:10 -05:00
Nathan Evans	9b4f24ebce	First cut at config cleanup (#1411 ) * Firsst cut at config cleanup * Reorder top nav * Add query prompts to tuning page * Remove dynamic notebook from nav * Add more thorough yml config descriptions in docs * Further clean out the config * Semver * Add new blog post * Emphasize yaml * Clarify output * Fix unit test * Fix bullet nesting	2024-11-15 14:33:26 -08:00
Nathan Evans	51912b2e03	Move prompts (#1404 ) * Move indexing prompts to root * Move query prompts to root * Export query prompts during init * Extract general knowledge prompt * Load query prompts from disk * Semver * Fix unit tests	2024-11-14 10:45:37 -08:00
Nathan Evans	c8c354e357	Artifact cleanup (#1341 ) * Add source documents for verb tests * Remove entity_type erroneous column * Add new test data * Remove source/target degree columns * Remove top_level_node_id * Remove chunk column configs * Rename "chunk" to "text" * Rename "chunk" to "text" in base * Re-map document input to use base text units * Revert base text units as final documents dep * Update test data * Split/rename node source_id * Drop node size (dup of degree) * Drop document_ids from covariates * Remove unused document_ids from models * Remove n_tokens from covariate table * Fix missed document_ids delete * Wire base text units to final documents * Rename relationship rank as combined_degree * Add rank as first-class property to Relationship * Remove split_text operation * Fix relationships test parquet * Update test parquets * Add entity ids to community table * Remove stored graph embedding columns * Format * Semver * Fix JSON typo * Spelling * Rename lancedb * Sort lancedb * Fix unit test * Fix test to account for changing period * Update tests for separate embeddings * Format * Better assertion printing * Fix unit test for windows * Rename document.raw_content -> document.text * Remove read_documents function * Remove unused document summary from model * Remove unused imports * Format * Add new snapshots to default init * Use util to construct embeddings collection name * Align inc index model with branch changes * Update data and tests for int ids * Clean up embedding locs * Switch entity "name" to "title" for consistency * Fix short_id -> human_readable_id defaults * Format * Rework community IDs * Fix community size compute * Fix unit tests * Fix report read * Pare down nodes table output * Fix unit test * Fix merge * Fix community loading * Format * Fix community id report extraction * Update tests * Consistent short IDs and ordering * Update ordering and tests * Update incremental for new nodes model * Guard document columns loc * Match 
column ordering * Fix document guard * Update smoke tests * Fill NA on community extract * Logging for smoke test debug * Add parquet schema details doc * Fix community hierarchy guard * Use better empty hierarchy guard * Back-compat shims * Semver * Fix warning * Format * Remove default fallback * Reuse key	2024-11-13 15:11:19 -08:00
Alonso Guevara	d9f985ae52	Drift Search CLI, API, Docs and Example Notebook (#1348 ) * Drift CLI and backwards compat * Adding DRIFT Cli, Docs and example notebook * Update tests and fix ruff * Format * Small cleanup * Fix smoke tests * Update notebook * Oopsie fix * Delete duplicate img	2024-11-05 12:05:19 -06:00
Nathan Evans	634e3ed62a	Transient entity graph (#1349 ) * Make base_entity_graph transient * Add transient snapshots * Semver * Fix unit test * Fix smoke tests	2024-11-04 17:23:29 -08:00
gaudyb	17658c5df8	New workflow to generate embeddings in a single workflow (#1296 ) * New workflow to generate embeddings in a single workflow * New workflow to generate embeddings in a single workflow * version change * clean tests without any embeddings references * clean tests without any embeddings references * remove code * feedback implemented * changes in logic * feedback implemented * store in table bug fixed * smoke test for generate_text_embeddings workflow * smoke test fix * add generate_text_embeddings to the list of transient workflows * smoke tests * fix * ruff formatting updates * fix * smoke test fixed * smoke test fixed * fix lancedb import * smoke test fix * ignore sorting * smoke test fixed * smoke test fixed * check smoke test * smoke test fixed * change config for vector store * format fix * vector store changes * revert debug profile back to empty filepath * merge conflict solved * merge conflict solved * format fixed * format fixed * fix return dataframe * snapshot fix * format fix * embeddings param implemented * validation fixes * fix map * fix map * fix properties * config updates * smoke test fixed * settings change * Update collection config and rework back-compat * Repalce . with - for embedding store --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com> Co-authored-by: Josh Bradley <joshbradley@microsoft.com> Co-authored-by: Nathan Evans <github@talkswithnumbers.com>	2024-11-01 15:01:35 -07:00
Alonso Guevara	7235c6faf5	Add Incremental Indexing v1 (#1318 ) * Create entypoint for cli and api (#1067) * Add cli and api entrypoints for update index * Semver * Update docs * Run tests on feature branch main * Better /main handling in tests * Incremental indexing/file delta (#1123) * Calculate new inputs and deleted inputs on update * Semver * Clear ruff checks * Fix pyright * Fix PyRight * Ruff again * Update relationships after inc index (#1236) * Collapse create final community reports (#1227) * Remove extraneous param * Add community report mocking assertions * Collapse primary report generation * Collapse embeddings * Format * Semver * Remove extraneous check * Move option set * Collapse create base entity graph (#1233) * Collapse create_base_entity_graph * Format/typing * Semver * Fix smoke tests * Simplify assignment * Collapse create summarized entities (#1237) * Collapse entity summarize * Semver * Collapse create base extracted entities (#1235) * Set up base assertions * Replace entity_extract * Finish collapsing workflow * Semver * Update snoke tests * Incremental indexing/update final text units (#1241) * Update final text units * Format * Address comments * Add v1 community merge using time period (#1257) * Add naive community merge using time period * formatting * Query fixes * Add descriptions from merged_entities * Add summarization and embeddings * Use iso format * Ruff * Pyright and smoke tests * Pyright * Pyright * Update parquet for verb tests * Fix smoke tests * Remove sorting * Update smoke tests * Smoke tests * Smoke tests * Updated verb test to ack for latest changes on covariates * Add config for incremental index + Bug fixes (#1317) * Add config for incremental index + Bug fixes * Ruff * Fix smoke tests * Semversioner * Small refactor * Remove unused file * Ruff * Update verb tests inputs * Update verb tests inputs --------- Co-authored-by: Nathan Evans <github@talkswithnumbers.com>	2024-10-30 11:59:44 -06:00
Josh Bradley	0cc79b9cf7	Add backwards compatibility patch for vector store (#1334 )	2024-10-29 14:54:08 -04:00
Josh Bradley	083de12bcf	Auto-generate CLI doc pages (#1325 )	2024-10-25 19:00:24 -04:00
Josh Bradley	d6e6f5c077	Convert CLI to Typer app (#1305 )	2024-10-24 14:22:32 -04:00
Nathan Evans	94f1e62e5c	Rework workflow architecture (#1311 ) * Rename pipeline_storage file * Add runtime storage option to context * Fix import * Switch to memory storage for runtime * Infra for workflow runtime storage * Migrate base_text_units to runtime storage * Fix comment * Semver * Remove whitespace * Remove subflow smoke tests and ignore transient artifacts * Remove entity graph from transient list (not yet implemented) * Increase smoke runtime allotment for create_base_entity_graph * Revert format fix * Remove noqa	2024-10-24 10:20:03 -07:00
Alonso Guevara	8a6d4e66fe	DRIFT Search (#1285 ) * drift search * args for drift global query in local search * accept drift context in search base * optionally parse embeddings from df when creating CommunityReport * abstract class for drift context * pathing for drift config * drift config * add defs for drift config * formatting * capture generated tokens in token count * semversion * Formatting and ruff * Some algorithmic refactors * Ruff * Format * Use asdict() * Address comments * Update smoke tests * Update smoke tests * Update smoke tests part 2 --------- Co-authored-by: Julian Whiting <j2whitin@gmail.com>	2024-10-21 17:22:11 -06:00
KennyZhang1	e0840a2dc4	Fix vector store logic and refactor audience parameter (#1259 )	2024-10-21 16:56:56 -04:00
Matthieu Maitre	6aae386b30	Perf optimizations in map_query_to_entities() (#1276 ) * Address perf issue in map_query_to_entities() * Add semver --------- Co-authored-by: Matthieu Maitre <mmaitre@microsoft.com> Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-10-21 12:03:48 -06:00
Nathan Evans	1f70d42572	Empty workflow returns (#1291 ) * Skip emitting empty dataframes * Semver * Better empty df check	2024-10-17 09:25:36 -07:00
Nathan Evans	ce5b1207e0	Collapse graph documents workflows (#1284 ) * Copy base documents logic into final documents * Delete create_base_documents * Combine graph creation under create_base_entity_graph * Delete collapsed workflows * Migrate most graph internals to nx.Graph * Fix None edge case * Semver * Remove comment typo * Fix smoke tests	2024-10-15 13:58:58 -06:00
Andres Morales	fc9895f793	Replace current docs by mkdocs (#1263 ) * Replace docs by mkdocs-material * Fix markdown * Fix verions in gh-pages workflow * remove whitespaces * add semver * Add build docs check on python-ci * Fix command in index cli * Spellcheck * Spellcheck * remove docsite paths * clear outputs from notebook * remove dependabot npm for docsite * remove more docsite left overs * execute notebooks * Update notebooks * update poetry lock * Remove notebook build from ci * Revert dep update * Navigation tabs * Fix stylesheet * add kwds to dictionary * Turn on notebook execution * Update gitignore * Add MSR Blog posts * spellcheck * Accessibility Changes --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-10-11 13:39:03 -06:00
Nathan Evans	61b3d6d56a	Migrate helper verbs (#1248 ) * Remove genid * Move snapshot_rows * Move snapshot * Delete spread_json * Delete unzip * Delete zip * Move unpack_graph * Move compute_edge_combined_degree * Delete create_graph * Delete concat * Delete text replace * Delete text_translate * Move text_split * Inline aggregate override * Move cluster_graph * Move merge_graphs * Semver * Move text_chunk * Move layout_graph and fix some __init__s * Move extract_covariates * Rename text_split -> split_text * Move extract_entities * Move summarize_descriptions * Rename text_chunk -> chunk_text * Move community report creation * Remove verb-level packing operators * Streamline some naming * Streamline param name/order * Move mock LLM data to tests * Fixed missed rename * Update some strategy refs * Rename run_gi * Inject mock responses into integ test config	2024-10-09 13:46:44 -07:00
Nathan Evans	f5c5876dde	Reorganize flows (#1240 ) * Extract base docs and entity graph * Move extracted entities and text units * Move communities and community reports * Move covariates and final documents * Move entities, nodes, relationships * Move text_units and summarized entities * Assert all snapshot null cases * Remove disabled steps util * Remove incorrect use of input "others" * Convert text_embed_df to just return the embeddings, not update the df * Convert snapshot functions to noops * Semver * Remove lingering covariates_enabled param * Name consistency * Syntax cleanup	2024-10-02 08:57:08 -07:00
Nathan Evans	9070ea5c3c	Collapse create base extracted entities (#1235 ) * Set up base assertions * Replace entity_extract * Finish collapsing workflow * Semver * Update snoke tests	2024-09-30 17:32:56 -07:00
Nathan Evans	630679f8e3	Collapse create summarized entities (#1237 ) * Collapse entity summarize * Semver	2024-09-30 17:17:44 -07:00
Nathan Evans	5220bb7ecc	Collapse create base entity graph (#1233 ) * Collapse create_base_entity_graph * Format/typing * Semver * Fix smoke tests * Simplify assignment	2024-09-30 15:39:42 -07:00
Nathan Evans	00d5e77568	Collapse create final community reports (#1227 ) * Remove extraneous param * Add community report mocking assertions * Collapse primary report generation * Collapse embeddings * Format * Semver * Remove extraneous check * Move option set	2024-09-30 10:46:07 -07:00
Nathan Evans	ce71bcf7fb	Collapse create final entities (#1220 ) * Collapse create_final_entities * Update smoke tests * Semver * Remove prints * Update embedding assertions	2024-09-25 17:35:44 -07:00
Nathan Evans	3217013019	Revisit create final text units (#1216 ) * Add embeddings to collapsed subflow * Semver * Fix smoke tests	2024-09-25 16:55:27 -07:00
Nathan Evans	73e709b686	Collapse create final covariates (#1215 ) * Add covariate test * Add detailed mock assertions * Collapse create_final_covariates * Delete unused doc_id field * Semver * Update smoke test * Remove unused subject/object type columns	2024-09-25 16:30:22 -07:00
Nathan Evans	14750f4d37	Collapse create final documents (#1217 ) * Collapse create_final_documents * Semver	2024-09-25 15:50:46 -07:00
Nathan Evans	f518c8b80b	Collapse relationship embeddings (#1199 ) * Merge text_embed into a single relationships subflow * Update smoke tests * Semver * Spelling	2024-09-24 15:03:26 -07:00
Nathan Evans	1755afbdec	Collapse create base text units (#1178 ) * Collapse non-attribute verbs * Include document_column_attributes in collapse * Remove merge_override verb * Semver * Setup initial test and config * Collapse create_base_text_units * Semver * Spelling * Fix smoke tests * Addres PR comments --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-09-23 16:55:53 -07:00
Nathan Evans	fbc483e4e5	Collapse create base documents (#1176 ) * Collapse non-attribute verbs * Include document_column_attributes in collapse * Remove merge_override verb * Semver * Clean up some df/tests	2024-09-23 13:24:06 -07:00
Nathan Evans	f8ab1b30dc	Collapse create_final_nodes (#1171 ) * Collapse create_final_nodes * Update smoke tests * Typo --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-09-20 13:48:56 -07:00
Nathan Evans	ae094bb144	Collapse create final relationships (#1158 ) * Collapse pre/post embedding workflows * Semver * Fix smoke tests --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-09-19 17:38:01 -06:00

77 Commits