graphrag

mirror of https://github.com/microsoft/graphrag.git synced 2026-01-14 09:07:20 +08:00

Author	SHA1	Message	Date
Nathan Evans	710fdad6f0	Input factory (#2168) * Update input factory to match other factories * Move input config alongside input readers * Move file pattern logic into InputReader * Set encoding default * Clean up optional column configs * Combine structured data extraction * Remove pandas from input loading * Throw if empty documents * Add json lines (jsonl) input support * Store raw data * Fix merge imports * Move metadata handling entirely to chunking * Nicer automatic title * Typo * Add get_property utility for nested dictionary access with dot notation * Update structured_file_reader to use get_property utility * Extract input module into new graphrag-input monorepo package - Create new graphrag-input package with input loading utilities - Move InputConfig, InputFileType, InputReader, TextDocument, and file readers (CSV, JSON, JSONL, Text) - Add get_property utility for nested dictionary access with dot notation - Include hashing utility for document ID generation - Update all imports throughout codebase to use graphrag_input - Add package to workspace configuration and release tasks - Remove old graphrag.index.input module * Rename ChunkResult to TextChunk and add transformer support - Rename chunk_result.py to text_chunk.py with ChunkResult -> TextChunk - Add 'original' field to TextChunk to track pre-transform text - Add optional transform callback to chunker.chunk() method - Add add_metadata transformer for prepending metadata to chunks - Update create_chunk_results to apply transforms and populate original - Update sentence_chunker and token_chunker with transform support - Refactor create_base_text_units to use new transformer pattern - Rename pluck_metadata to get/collect methods on TextDocument * Back-compat comment * Align input config type name with other factory configs * Add MarkItDown support * Remove pattern default from MarkItDown reader * Remove plugins flag (implicit disabled) * Format * Update verb tests * Separate storage from input config * Add empty objects for NaN raw_data * Fix smoke tests * Fix BOM in csv smoke * Format	2026-01-12 12:47:57 -08:00
Nathan Evans	8fd7730067	Chunker factory (#2156) * Delete NoopTextSplitter * Delete unused check_token_limit * Add base chunking factory and migrate workflow to use it * Split apart chunker module * Co-locate chunking/splitting * Collapse token splitting functionality into one class/function * Restore create_base_text_units parameterization * Move Tokenizer base class to common package * Move pre-pending into chunkers * Streamline config * Fix defaults construction * Add prepending tests * Remove chunk_size_includes_metadata config * Revert ChunkingDocument interface * Move metadata prepending to a util * Move Tokenizer back to GR core * Fix tokenizer removal from chunker * Set defaults for chunking config * Move chunking to monorepo package * Format * Typo * Add ChunkResult model * Streamline chunking config * Add missing version updates for graphrag_chunking	2026-01-06 15:39:44 -08:00
Derek Worthen	e0cce31f54	Graphrag config (#2119) * Add load_config to graphrag-common package.	2025-11-10 07:57:03 -08:00
Nathan Evans	ae1f5e1811	Nov 2025 housekeeping (#2120) * Remove gensim sideload * Split CI build/type checks from unit tests * Thorough review of docs to align with v3 * Format * Fix version * Fix type	2025-11-06 10:03:22 -08:00
Nathan Evans	1bb9fa8e13	Unified factory (#2105) * Simplify Factory interface * Migrate CacheFactory to standard base class * Migrate LoggerFactory to standard base class * Migrate StorageFactory to standard base class * Migrate VectorStoreFactory to standard base class * Update vector store example notebook * Delete notebook outputs * Move default providers into factories * Move retry/limit tests into integ * Split language model factories * Set smoke test tpm/rpm * Fix factory integ tests * Add method to smoke test, switch text to 'fast' * Fix text smoke config for fast workflow * Add new workflows to text smoke test * Convert input readers to a proper factory * Remove covariates from fast smoke test * Update docs for input factory * Bump smoke runtime * Even longer runtime * min-csv timeout * Remove unnecessary lambdas	2025-10-20 12:05:27 -07:00
Nathan Evans	4364d678dd	Merge branch 'main' into v3/main	2025-10-06 12:41:12 -07:00
Nathan Evans	7f996cf584	Docs/2.6.0 (#2070) * Add basic search to overview * Add info on input documents DataFrame * Add info on factories to docs * Add consumption warning and switch to "christmas" for folder name * Add logger to factories list * Add litellm docs. (#2058) * Fix version for input docs * Spelling --------- Co-authored-by: Derek Worthen <worthend.derek@gmail.com>	2025-09-23 14:48:28 -07:00
Nathan Evans	97704ab933	Remove text unit grouping (#2052) * Remove text unit group_by_columns * Semver * Fix default token split test * Fix models in config test samples * Fix token length in context sort test * Fix document sort	2025-09-09 16:04:04 -07:00
Nathan Evans	429e1b1f9e	Remove graph embedding and UMAP (#2048) * Remove umap/layout operation * Remove graph embedding * Bump unified-search to GR 2.5.0 * Remove graph vis from unified-search	2025-09-09 15:35:43 -07:00
Copilot	7c28c70d5c	Switch from Poetry to uv for package management (#2008) * Initial plan * Switch from Poetry to uv for package management Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * Clean up build artifacts and update gitignore Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * remove build artifacts * remove hardcoded version string * fix calls to pip in cicd * Update gh-pages.yml workflow to use uv instead of Poetry Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * ruff formatting fixes * update cicd workflow with latest uv action * fix command to retrieve package version * update development instructions * remove Poetry references * Replace deprecated azuright action with npm-based Azurite installation Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * skip api version check for azurite * add semversioner file * update more changes from switching to UV * Migrate unified-search-app from Poetry to uv package management Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * minor typo update * minor Dockerfile update * update cicd thresholds * update pytest thresholds * ruff fixes * ruff fixes * remove legacy npm settings that no longer apply * Update Unified Search App Readme --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> Co-authored-by: Josh Bradley <joshbradley@microsoft.com> Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2025-08-13 18:57:25 -06:00
Nathan Evans	27c6de846f	Update docs for 2.0+ (#1984) * Update docs * Fix prompt links	2025-06-23 13:49:47 -07:00
Nathan Evans	25bbae8642	Docs: Add models page (#1842) * Add models page * Update config docs for new params * Spelling * Add comment on CoT with o-series * Add notes about managed identity * Update the viz guide * Spruce up the getting started wording * Capitalization * Add BYOG page * More BYOG edits * Update dictionary * Change example model name	2025-04-28 17:36:08 -07:00
Nathan Evans	ddc6541ab6	Add docs page about input formats (#1784) * Add docs page about input formats * Add json example * Spelling	2025-03-11 17:37:46 -07:00
Nathan Evans	bcb74789f1	Next release docs (#1627) * Wordind updates * Update yam lconfig and add notes to "deprecated" env * Add basic search section * Update versioning docs * Minor edits for clarity * Update init command * Update init to add --force in docs * Add NLP extraction params * Move vector_store to root * Add workflows to config * Add FastGraphRAG docs * add metadata column changes * Added documentation for multi index search. * Minor fixes. * Add config and table renames * Update migration notebook and comments to specify v1 * Add frequency to entity table docs * add new chunking options for metadata * Update output docs * Minor edits and cleanup * Add model ids to search configs * Spruce up migration notebook * Lint/format multi-index notebook * SpaCy model note * Update SpaCy footnote * Updated multi_index_search.ipynb to remove ruff errors. * add spacy to dictionary --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com> Co-authored-by: Dayenne Souza <ddesouza@microsoft.com> Co-authored-by: dorbaker <dorbaker@microsoft.com>	2025-03-03 14:46:00 -08:00
Nathan Evans	0e7d22bfb0	Jan documentation updates (#1612) * Update workflow docs * Docs cleanup	2025-01-10 11:36:27 -08:00
Nathan Evans	a35cb12741	Remove datashaper strip code (#1581) Remove datashaper	2025-01-03 13:59:26 -08:00
Alonso Guevara	04405803db	Add Parent to communities in data model (#1491) * Add Parent to communities in data model * Semver * Pyright * Update docs * Use leiden cluster parent id * Format	2024-12-10 14:38:11 -06:00
Josh Bradley	dad2176b3c	Miscellaneous code cleanup procedures (#1452)	2024-11-27 13:27:43 -05:00
Nathan Evans	425dbc60e3	Docs update (#1408) * Fix footer contrast * Fix broken links * Remove a few unneeded examples * Point python API example to the whole folder * Convert schema bullets to tables	2024-11-14 21:26:29 -06:00
Nathan Evans	c8c354e357	Artifact cleanup (#1341) * Add source documents for verb tests * Remove entity_type erroneous column * Add new test data * Remove source/target degree columns * Remove top_level_node_id * Remove chunk column configs * Rename "chunk" to "text" * Rename "chunk" to "text" in base * Re-map document input to use base text units * Revert base text units as final documents dep * Update test data * Split/rename node source_id * Drop node size (dup of degree) * Drop document_ids from covariates * Remove unused document_ids from models * Remove n_tokens from covariate table * Fix missed document_ids delete * Wire base text units to final documents * Rename relationship rank as combined_degree * Add rank as first-class property to Relationship * Remove split_text operation * Fix relationships test parquet * Update test parquets * Add entity ids to community table * Remove stored graph embedding columns * Format * Semver * Fix JSON typo * Spelling * Rename lancedb * Sort lancedb * Fix unit test * Fix test to account for changing period * Update tests for separate embeddings * Format * Better assertion printing * Fix unit test for windows * Rename document.raw_content -> document.text * Remove read_documents function * Remove unused document summary from model * Remove unused imports * Format * Add new snapshots to default init * Use util to construct embeddings collection name * Align inc index model with branch changes * Update data and tests for int ids * Clean up embedding locs * Switch entity "name" to "title" for consistency * Fix short_id -> human_readable_id defaults * Format * Rework community IDs * Fix community size compute * Fix unit tests * Fix report read * Pare down nodes table output * Fix unit test * Fix merge * Fix community loading * Format * Fix community id report extraction * Update tests * Consistent short IDs and ordering * Update ordering and tests * Update incremental for new nodes model * Guard document columns loc * Match column ordering * Fix document guard * Update smoke tests * Fill NA on community extract * Logging for smoke test debug * Add parquet schema details doc * Fix community hierarchy guard * Use better empty hierarchy guard * Back-compat shims * Semver * Fix warning * Format * Remove default fallback * Reuse key	2024-11-13 15:11:19 -08:00
Josh Bradley	083de12bcf	Auto-generate CLI doc pages (#1325)	2024-10-25 19:00:24 -04:00
Josh Bradley	d6e6f5c077	Convert CLI to Typer app (#1305)	2024-10-24 14:22:32 -04:00
Andres Morales	fc9895f793	Replace current docs by mkdocs (#1263) * Replace docs by mkdocs-material * Fix markdown * Fix verions in gh-pages workflow * remove whitespaces * add semver * Add build docs check on python-ci * Fix command in index cli * Spellcheck * Spellcheck * remove docsite paths * clear outputs from notebook * remove dependabot npm for docsite * remove more docsite left overs * execute notebooks * Update notebooks * update poetry lock * Remove notebook build from ci * Revert dep update * Navigation tabs * Fix stylesheet * add kwds to dictionary * Turn on notebook execution * Update gitignore * Add MSR Blog posts * spellcheck * Accessibility Changes --------- Co-authored-by: Alonso Guevara <alonsog@microsoft.com>	2024-10-11 13:39:03 -06:00

23 Commits