graphrag

mirror of https://github.com/microsoft/graphrag.git synced 2026-01-14 09:07:20 +08:00

Author	SHA1	Message	Date
Alonso Guevara	c90166ca32	Add Parquet as part of the default emitters when not present (#1407 ) Add Parquet as part of the default emitters when not pressent	2024-11-14 13:04:19 -06:00
Nathan Evans	51912b2e03	Move prompts (#1404 ) * Move indexing prompts to root * Move query prompts to root * Export query prompts during init * Extract general knowledge prompt * Load query prompts from disk * Semver * Fix unit tests	2024-11-14 10:45:37 -08:00
Nathan Evans	c8c354e357	Artifact cleanup (#1341 ) * Add source documents for verb tests * Remove entity_type erroneous column * Add new test data * Remove source/target degree columns * Remove top_level_node_id * Remove chunk column configs * Rename "chunk" to "text" * Rename "chunk" to "text" in base * Re-map document input to use base text units * Revert base text units as final documents dep * Update test data * Split/rename node source_id * Drop node size (dup of degree) * Drop document_ids from covariates * Remove unused document_ids from models * Remove n_tokens from covariate table * Fix missed document_ids delete * Wire base text units to final documents * Rename relationship rank as combined_degree * Add rank as first-class property to Relationship * Remove split_text operation * Fix relationships test parquet * Update test parquets * Add entity ids to community table * Remove stored graph embedding columns * Format * Semver * Fix JSON typo * Spelling * Rename lancedb * Sort lancedb * Fix unit test * Fix test to account for changing period * Update tests for separate embeddings * Format * Better assertion printing * Fix unit test for windows * Rename document.raw_content -> document.text * Remove read_documents function * Remove unused document summary from model * Remove unused imports * Format * Add new snapshots to default init * Use util to construct embeddings collection name * Align inc index model with branch changes * Update data and tests for int ids * Clean up embedding locs * Switch entity "name" to "title" for consistency * Fix short_id -> human_readable_id defaults * Format * Rework community IDs * Fix community size compute * Fix unit tests * Fix report read * Pare down nodes table output * Fix unit test * Fix merge * Fix community loading * Format * Fix community id report extraction * Update tests * Consistent short IDs and ordering * Update ordering and tests * Update incremental for new nodes model * Guard document columns loc * Match column ordering * Fix document guard * Update smoke tests * Fill NA on community extract * Logging for smoke test debug * Add parquet schema details doc * Fix community hierarchy guard * Use better empty hierarchy guard * Back-compat shims * Semver * Fix warning * Format * Remove default fallback * Reuse key	2024-11-13 15:11:19 -08:00
Alonso Guevara	e53422366d	Implement dynamic community selection for global search (#1396 ) * update gitignore * add dynamic community sleection to updated main branch * update SearchResult to record output_tokens. * update search result * dynamic search working * format * add llm_calls_categories and prompt_tokens and output_tokens cate * update * formatting * log drift search output and prompt tokens separately * update global_search.ipynb. update operate dulce dataset and add create_final_communities. update dynamic community selection init * add .ipynb back to cspell.config.yaml * format * add notebook example on dynamic search * rearrange * update gitignore * format code * code format * code format * fix default variable --------- Co-authored-by: Bryan Li <bryanlimy@gmail.com>	2024-11-11 16:45:07 -08:00
Alonso Guevara	ba50caab4d	Release v0.4.1 (#1387 ) * Release v0.4.1 * Spellcheck	2024-11-08 17:59:57 -06:00
Alonso Guevara	20c120288b	Feat/update cli (#1376 ) * Add update cli option with default storage * Semver * Semver * Pyright * Format	2024-11-07 06:59:10 -06:00
Kylin	baa261c8e9	[bugfix]Fix query error with --streaming (#1368 ) * fix streaming output error * add semversioner --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-11-06 17:49:06 -06:00
Alonso Guevara	3d79de96d1	Raise error on empty deltas for incremental indexing (#1375 ) * Raise error on empty deltas for incremental indexing * Format	2024-11-06 17:33:35 -06:00
Alonso Guevara	1661672569	Fix optional covariates check in incremental indexing (#1374 ) * Fix optional covariates check in incremental indexing * Oopsie fix	2024-11-06 17:22:11 -06:00
Josh Bradley	a8ccded83c	Fix file path issue in the viz guide (#1372 ) * Fix a file paths issue in the viz guide. * fix formatting	2024-11-06 14:42:07 -08:00
Alonso Guevara	2047c1561c	Fix styling and misalignment on drift docs (#1373 )	2024-11-06 16:29:53 -06:00
Josh Bradley	0394b55086	Update CI/CD - skip running unit tests on documentation-only PRs (#1371 )	2024-11-06 14:19:21 -05:00
Josh Bradley	9762f33c1a	Add visualization guide (#1340 )	2024-11-06 14:06:50 -05:00
Alonso Guevara	a6d9b0ce3d	Release v0.4.0 (#1361 ) * Release v0.4.0 * Missing change track	2024-11-05 18:44:07 -06:00
Alonso Guevara	635c21109f	Fix Community ID loading for DRIFT search over existing indexes (#1360 )	2024-11-05 18:21:36 -06:00
Alonso Guevara	80c0c7bdd1	Update Incremental Indexing to new embeddings workflow (#1359 )	2024-11-05 16:54:02 -06:00
Alonso Guevara	83bd5cefe5	Fix content embedding container name (#1358 )	2024-11-05 15:56:32 -06:00
Alonso Guevara	1557ce34f9	Fix init defaults for vector store and img in drift docs (#1357 ) * Fix init defaults for vector store and img in drift docs * Adde more doc * Spellcheck * Remove example	2024-11-05 14:14:17 -06:00
Alonso Guevara	d9f985ae52	Drift Search CLI, API, Docs and Example Notebook (#1348 ) * Drift CLI and backwards compat * Adding DRIFT Cli, Docs and example notebook * Update tests and fix ruff * Format * Small cleanup * Fix smoke tests * Update notebook * Oopsie fix * Delete duplicate img	2024-11-05 12:05:19 -06:00
Gabriel Nieves-Ponce	68dfceef21	Updated the variable names within the for-loop to differentiate betwe… (#1356 ) * Updated the variable names within the for-loop to differentiate between them and the original title variable used in the dataframe. This avoids corrupting the original column-name defined in the title variable. * Semver and formart --------- Co-authored-by: Gabriel Nieves-Ponce <gnievesponce@microsoft.com> Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-11-05 11:45:29 -06:00
Nathan Evans	634e3ed62a	Transient entity graph (#1349 ) * Make base_entity_graph transient * Add transient snapshots * Semver * Fix unit test * Fix smoke tests	2024-11-04 17:23:29 -08:00
gaudyb	17658c5df8	New workflow to generate embeddings in a single workflow (#1296 ) * New workflow to generate embeddings in a single workflow * New workflow to generate embeddings in a single workflow * version change * clean tests without any embeddings references * clean tests without any embeddings references * remove code * feedback implemented * changes in logic * feedback implemented * store in table bug fixed * smoke test for generate_text_embeddings workflow * smoke test fix * add generate_text_embeddings to the list of transient workflows * smoke tests * fix * ruff formatting updates * fix * smoke test fixed * smoke test fixed * fix lancedb import * smoke test fix * ignore sorting * smoke test fixed * smoke test fixed * check smoke test * smoke test fixed * change config for vector store * format fix * vector store changes * revert debug profile back to empty filepath * merge conflict solved * merge conflict solved * format fixed * format fixed * fix return dataframe * snapshot fix * format fix * embeddings param implemented * validation fixes * fix map * fix map * fix properties * config updates * smoke test fixed * settings change * Update collection config and rework back-compat * Repalce . with - for embedding store --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com> Co-authored-by: Josh Bradley <joshbradley@microsoft.com> Co-authored-by: Nathan Evans <github@talkswithnumbers.com>	2024-11-01 15:01:35 -07:00
Chris Trevino	8302920ac8	move mkdocs-typer to devdeps (#1331 ) * move mkdocs-typer to devdeps * add .gitattributes for toml parsing issues on Windows CI * bump timeout --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-10-30 14:49:30 -07:00
Alonso Guevara	7235c6faf5	Add Incremental Indexing v1 (#1318 ) * Create entypoint for cli and api (#1067) * Add cli and api entrypoints for update index * Semver * Update docs * Run tests on feature branch main * Better /main handling in tests * Incremental indexing/file delta (#1123) * Calculate new inputs and deleted inputs on update * Semver * Clear ruff checks * Fix pyright * Fix PyRight * Ruff again * Update relationships after inc index (#1236) * Collapse create final community reports (#1227) * Remove extraneous param * Add community report mocking assertions * Collapse primary report generation * Collapse embeddings * Format * Semver * Remove extraneous check * Move option set * Collapse create base entity graph (#1233) * Collapse create_base_entity_graph * Format/typing * Semver * Fix smoke tests * Simplify assignment * Collapse create summarized entities (#1237) * Collapse entity summarize * Semver * Collapse create base extracted entities (#1235) * Set up base assertions * Replace entity_extract * Finish collapsing workflow * Semver * Update snoke tests * Incremental indexing/update final text units (#1241) * Update final text units * Format * Address comments * Add v1 community merge using time period (#1257) * Add naive community merge using time period * formatting * Query fixes * Add descriptions from merged_entities * Add summarization and embeddings * Use iso format * Ruff * Pyright and smoke tests * Pyright * Pyright * Update parquet for verb tests * Fix smoke tests * Remove sorting * Update smoke tests * Smoke tests * Smoke tests * Updated verb test to ack for latest changes on covariates * Add config for incremental index + Bug fixes (#1317) * Add config for incremental index + Bug fixes * Ruff * Fix smoke tests * Semversioner * Small refactor * Remove unused file * Ruff * Update verb tests inputs * Update verb tests inputs --------- Co-authored-by: Nathan Evans <github@talkswithnumbers.com>	2024-10-30 11:59:44 -06:00
Josh Bradley	0cc79b9cf7	Add backwards compatibility patch for vector store (#1334 )	2024-10-29 14:54:08 -04:00
Alonso Guevara	83026bdb26	Remove duplicated entried from relationships and nodes (#1333 )	2024-10-29 00:56:07 -04:00
Josh Bradley	083de12bcf	Auto-generate CLI doc pages (#1325 )	2024-10-25 19:00:24 -04:00
Josh Bradley	d6e6f5c077	Convert CLI to Typer app (#1305 )	2024-10-24 14:22:32 -04:00
Nathan Evans	94f1e62e5c	Rework workflow architecture (#1311 ) * Rename pipeline_storage file * Add runtime storage option to context * Fix import * Switch to memory storage for runtime * Infra for workflow runtime storage * Migrate base_text_units to runtime storage * Fix comment * Semver * Remove whitespace * Remove subflow smoke tests and ignore transient artifacts * Remove entity graph from transient list (not yet implemented) * Increase smoke runtime allotment for create_base_entity_graph * Revert format fix * Remove noqa	2024-10-24 10:20:03 -07:00
Alonso Guevara	ac09e0a740	Feature/optimize count relationships (#1312 ) * refactor build text unit context for better performance * Further optimization and styling * Remove TODO --------- Co-authored-by: Brad Firesheets <v-bradleyf@microsoft.com> Co-authored-by: bfirems <162185685+bfirems@users.noreply.github.com> Co-authored-by: Josh Bradley <joshbradley@microsoft.com>	2024-10-23 12:03:57 -06:00
Josh Bradley	3df6f8c65b	Allow ci/cd to skip draft PRs (#1314 )	2024-10-23 12:46:00 -04:00
Alonso Guevara	77e77775ad	Fix drift search edge cases over small input sets (#1310 ) * Fix edge cases over small input sets * Ruff	2024-10-22 16:24:41 -06:00
JunHo Kim (김준호)	8d8c67d503	fix typo. Update documentation URLs for consistency (#1298 ) Update documentation URLs for consistency Revised links in documentation files to remove the "posts" subdirectory for consistency and correctness. Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-10-21 17:24:17 -06:00
Alonso Guevara	8a6d4e66fe	DRIFT Search (#1285 ) * drift search * args for drift global query in local search * accept drift context in search base * optionally parse embeddings from df when creating CommunityReport * abstract class for drift context * pathing for drift config * drift config * add defs for drift config * formatting * capture generated tokens in token count * semversion * Formatting and ruff * Some algorithmic refactors * Ruff * Format * Use asdict() * Address comments * Update smoke tests * Update smoke tests * Update smoke tests part 2 --------- Co-authored-by: Julian Whiting <j2whitin@gmail.com>	2024-10-21 17:22:11 -06:00
KennyZhang1	e0840a2dc4	Fix vector store logic and refactor audience parameter (#1259 )	2024-10-21 16:56:56 -04:00
Matthieu Maitre	6aae386b30	Perf optimizations in map_query_to_entities() (#1276 ) * Address perf issue in map_query_to_entities() * Add semver --------- Co-authored-by: Matthieu Maitre <mmaitre@microsoft.com> Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-10-21 12:03:48 -06:00
Nathan Evans	1f70d42572	Empty workflow returns (#1291 ) * Skip emitting empty dataframes * Semver * Better empty df check	2024-10-17 09:25:36 -07:00
Andres Morales	fc502ee029	Fix cookie consent script missing (#1292 )	2024-10-17 09:44:14 -06:00
Nathan Evans	ce5b1207e0	Collapse graph documents workflows (#1284 ) * Copy base documents logic into final documents * Delete create_base_documents * Combine graph creation under create_base_entity_graph * Delete collapsed workflows * Migrate most graph internals to nx.Graph * Fix None edge case * Semver * Remove comment typo * Fix smoke tests	2024-10-15 13:58:58 -06:00
Andres Morales	137a5cd550	Fix/docs auto prompt img (#1283 ) * Fix auto prompt tuning image path	2024-10-14 09:02:31 -06:00
Alonso Guevara	cb052a742f	Dependency updates (#1272 ) * Dependency updates * Pyright update	2024-10-11 18:06:11 -06:00
Andres Morales	fc9895f793	Replace current docs by mkdocs (#1263 ) * Replace docs by mkdocs-material * Fix markdown * Fix verions in gh-pages workflow * remove whitespaces * add semver * Add build docs check on python-ci * Fix command in index cli * Spellcheck * Spellcheck * remove docsite paths * clear outputs from notebook * remove dependabot npm for docsite * remove more docsite left overs * execute notebooks * Update notebooks * update poetry lock * Remove notebook build from ci * Revert dep update * Navigation tabs * Fix stylesheet * add kwds to dictionary * Turn on notebook execution * Update gitignore * Add MSR Blog posts * spellcheck * Accessibility Changes --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-10-11 13:39:03 -06:00
Josh Bradley	d9a005c9b8	Reorganize python package structure (#1214 )	2024-10-10 17:01:42 -04:00
9prodhi	ce8749bd19	Fix: Add await to LLM execution for async handling (#1206 ) Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-10-09 17:26:28 -06:00
Sumit K Bhuttan	cd4f1fa9ba	Adding fix per comment on Issue-692 (#1255 ) Co-authored-by: Josh Bradley <joshbradley@microsoft.com> Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-10-09 17:09:17 -06:00
Alonso Guevara	9fa6b91684	Chore/community context clean (#1262 ) * Update community_context.py to check conversation_history_context's value For the following code (line 90 - 96), conversation_history_context is concatenated with community_context, but the case where conversation_history_context is empty("") has not been considered. When conversation_history_context is empty (""), concatenation should not be performed, as it would result in community_context or each element in community_context having an extra "\n\n". Therefore, by introducing a context_prefix to check the state of conversation_history_context, concatenation can be handled appropriately. When conversation_history_context is empty (""), the following code will use "" for concatenation. When conversation_history_context is not empty (""), the functionality will be similar to the previous code. * Format and semver * Code cleanup --------- Co-authored-by: ZeyuTeng96 <96521059+ZeyuTeng96@users.noreply.github.com>	2024-10-09 17:01:54 -06:00
JunHo Kim (김준호)	d4a0a590f4	Change config.json references to settings.json in the configuration document. (#1221 ) Updated the configuration documentation to reflect the default filename for configuration file. Default config files are `["settings.yaml", "settings.yml", "settings.json"]` `ce71bcf7fb/graphrag/config/config_file_loader.py (L15)` Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-10-09 15:20:18 -06:00
JunHo Kim (김준호)	d66901e67e	Update description of GRAPHRAG_CACHE_BASE_DIR in env_vars.md (#1213 ) * Update description of GRAPHRAG_CACHE_BASE_DIR in env_vars.md Clarified that `GRAPHRAG_CACHE_BASE_DIR` refers to the base directory path for cache files rather than reporting outputs. This improves the accuracy of the documentation and helps users understand the correct usage of this environment variable. * Update description of `GRAPHRAG_CACHE_BASE_DIR` Simplified the description of `GRAPHRAG_CACHE_BASE_DIR` to make it clearer. Changed "base directory path" to "base path" for conciseness. --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-10-09 15:16:50 -06:00
Nathan Evans	61b3d6d56a	Migrate helper verbs (#1248 ) * Remove genid * Move snapshot_rows * Move snapshot * Delete spread_json * Delete unzip * Delete zip * Move unpack_graph * Move compute_edge_combined_degree * Delete create_graph * Delete concat * Delete text replace * Delete text_translate * Move text_split * Inline aggregate override * Move cluster_graph * Move merge_graphs * Semver * Move text_chunk * Move layout_graph and fix some __init__s * Move extract_covariates * Rename text_split -> split_text * Move extract_entities * Move summarize_descriptions * Rename text_chunk -> chunk_text * Move community report creation * Remove verb-level packing operators * Streamline some naming * Streamline param name/order * Move mock LLM data to tests * Fixed missed rename * Update some strategy refs * Rename run_gi * Inject mock responses into integ test config	2024-10-09 13:46:44 -07:00
Nathan Evans	718d1ef441	Migrate embedding operations (#1242 ) * Move text_embed to verb-less operation * Move embed_graph to verb-less operation * Return embeddings from embed_graph instead of modifying df * Semver * Use config existence instead of bool for graph embedding * Send clustering strategy directly	2024-10-03 16:01:39 -07:00


248 Commits