* Move indexing prompts to root
* Move query prompts to root
* Export query prompts during init
* Extract general knowledge prompt
* Load query prompts from disk
* Semver
* Fix unit tests
* Add source documents for verb tests
* Remove entity_type erroneous column
* Add new test data
* Remove source/target degree columns
* Remove top_level_node_id
* Remove chunk column configs
* Rename "chunk" to "text"
* Rename "chunk" to "text" in base
* Re-map document input to use base text units
* Revert base text units as final documents dep
* Update test data
* Split/rename node source_id
* Drop node size (dup of degree)
* Drop document_ids from covariates
* Remove unused document_ids from models
* Remove n_tokens from covariate table
* Fix missed document_ids delete
* Wire base text units to final documents
* Rename relationship rank as combined_degree
* Add rank as first-class property to Relationship
* Remove split_text operation
* Fix relationships test parquet
* Update test parquets
* Add entity ids to community table
* Remove stored graph embedding columns
* Format
* Semver
* Fix JSON typo
* Spelling
* Rename lancedb
* Sort lancedb
* Fix unit test
* Fix test to account for changing period
* Update tests for separate embeddings
* Format
* Better assertion printing
* Fix unit test for windows
* Rename document.raw_content -> document.text
* Remove read_documents function
* Remove unused document summary from model
* Remove unused imports
* Format
* Add new snapshots to default init
* Use util to construct embeddings collection name
* Align inc index model with branch changes
* Update data and tests for int ids
* Clean up embedding locs
* Switch entity "name" to "title" for consistency
* Fix short_id -> human_readable_id defaults
* Format
* Rework community IDs
* Fix community size compute
* Fix unit tests
* Fix report read
* Pare down nodes table output
* Fix unit test
* Fix merge
* Fix community loading
* Format
* Fix community id report extraction
* Update tests
* Consistent short IDs and ordering
* Update ordering and tests
* Update incremental for new nodes model
* Guard document columns loc
* Match column ordering
* Fix document guard
* Update smoke tests
* Fill NA on community extract
* Logging for smoke test debug
* Add parquet schema details doc
* Fix community hierarchy guard
* Use better empty hierarchy guard
* Back-compat shims
* Semver
* Fix warning
* Format
* Remove default fallback
* Reuse key
* update gitignore
* add dynamic community selection to updated main branch
* update SearchResult to record output_tokens.
* update search result
* dynamic search working
* format
* add llm_calls_categories and prompt_tokens and output_tokens cate
* update
* formatting
* log drift search output and prompt tokens separately
* update global_search.ipynb. update operate dulce dataset and add create_final_communities. update dynamic community selection init
* add .ipynb back to cspell.config.yaml
* format
* add notebook example on dynamic search
* rearrange
* update gitignore
* format code
* code format
* code format
* fix default variable
---------
Co-authored-by: Bryan Li <bryanlimy@gmail.com>
* Renamed the variables inside the for-loop so they are distinct from the original `title` variable used with the dataframe. This avoids overwriting the column name stored in `title`.
* Semver and format
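
The loop-variable rename described above can be illustrated with a minimal sketch. It is not the code from the commit: the function names, the `row_title` name, and the list-of-dicts stand-in for a dataframe are all assumptions made for illustration.

```python
# Hypothetical illustration only: plain dicts stand in for dataframe rows,
# and "title" is assumed to be the variable holding the column name.

def collect_titles_buggy(rows, title="title"):
    collected = []
    for row in rows:
        # Reusing `title` overwrites the column name with a cell value,
        # so the lookup on the next row uses the wrong key.
        title = row[title]
        collected.append(title)
    return collected


def collect_titles_fixed(rows, title="title"):
    collected = []
    for row in rows:
        # A distinct name keeps the column name in `title` intact.
        row_title = row[title]
        collected.append(row_title)
    return collected


rows = [{"title": "Alpha"}, {"title": "Beta"}]
fixed_titles = collect_titles_fixed(rows)
```

With two or more rows, the buggy variant fails on the second lookup because `title` no longer holds the column name.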
---------
Co-authored-by: Gabriel Nieves-Ponce <gnievesponce@microsoft.com>
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
* New workflow to generate embeddings in a single workflow
* New workflow to generate embeddings in a single workflow
* version change
* clean tests without any embeddings references
* clean tests without any embeddings references
* remove code
* feedback implemented
* changes in logic
* feedback implemented
* store in table bug fixed
* smoke test for generate_text_embeddings workflow
* smoke test fix
* add generate_text_embeddings to the list of transient workflows
* smoke tests
* fix
* ruff formatting updates
* fix
* smoke test fixed
* smoke test fixed
* fix lancedb import
* smoke test fix
* ignore sorting
* smoke test fixed
* smoke test fixed
* check smoke test
* smoke test fixed
* change config for vector store
* format fix
* vector store changes
* revert debug profile back to empty filepath
* merge conflict solved
* merge conflict solved
* format fixed
* format fixed
* fix return dataframe
* snapshot fix
* format fix
* embeddings param implemented
* validation fixes
* fix map
* fix map
* fix properties
* config updates
* smoke test fixed
* settings change
* Update collection config and rework back-compat
* Replace . with - for embedding store
---------
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
Co-authored-by: Josh Bradley <joshbradley@microsoft.com>
Co-authored-by: Nathan Evans <github@talkswithnumbers.com>
* move mkdocs-typer to devdeps
* add .gitattributes for toml parsing issues on Windows CI
* bump timeout
---------
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
* refactor build text unit context for better performance
* Further optimization and styling
* Remove TODO
---------
Co-authored-by: Brad Firesheets <v-bradleyf@microsoft.com>
Co-authored-by: bfirems <162185685+bfirems@users.noreply.github.com>
Co-authored-by: Josh Bradley <joshbradley@microsoft.com>
Update documentation URLs for consistency
Revised links in documentation files to remove the "posts" subdirectory for consistency and correctness.
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
* Update community_context.py to check conversation_history_context's value
In the existing code (lines 90-96), conversation_history_context is concatenated with community_context, but the case where conversation_history_context is empty ("") is not handled. When it is empty, the concatenation should be skipped, because otherwise community_context, or each element of community_context, gains an extra "\n\n".
This change introduces a context_prefix that checks conversation_history_context before concatenating. When conversation_history_context is empty (""), "" is used for the concatenation, so nothing extra is added. When it is non-empty, the behavior is the same as in the previous code.
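
A minimal sketch of the context_prefix idea, not the actual contents of community_context.py. It assumes the history is prepended to the community context with a "\n\n" separator, and the function name `apply_context_prefix` is hypothetical.

```python
from typing import List, Union


def apply_context_prefix(
    conversation_history_context: str,
    community_context: Union[str, List[str]],
) -> Union[str, List[str]]:
    # Assumption: the history is prepended, followed by a blank line.
    # An empty history yields an empty prefix, so no stray "\n\n" is added.
    context_prefix = (
        f"{conversation_history_context}\n\n" if conversation_history_context else ""
    )
    if isinstance(community_context, list):
        # Apply the same prefix to each element of the list.
        return [f"{context_prefix}{context}" for context in community_context]
    return f"{context_prefix}{community_context}"


no_history = apply_context_prefix("", "report A")
with_history = apply_context_prefix("user: hi", ["report A", "report B"])
```

Computing the prefix once keeps the string and list cases consistent and leaves the non-empty case unchanged.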
* Format and semver
* Code cleanup
---------
Co-authored-by: ZeyuTeng96 <96521059+ZeyuTeng96@users.noreply.github.com>
Updated the configuration documentation to reflect the default filenames for the configuration file.
Default config files are `["settings.yaml", "settings.yml", "settings.json"]`
ce71bcf7fb/graphrag/config/config_file_loader.py (L15)
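
As a rough sketch of how a list of default filenames like the one above might be used, the snippet below looks for the first matching file in a directory. The function name `find_default_config` and the first-match-wins search order are assumptions for illustration. This is not the code in config_file_loader.py.

```python
from pathlib import Path
from typing import Optional

# Filenames taken from the commit message above.
DEFAULT_CONFIG_FILES = ["settings.yaml", "settings.yml", "settings.json"]


def find_default_config(root: Path) -> Optional[Path]:
    # Assumption: candidates are checked in list order and the first hit wins.
    for name in DEFAULT_CONFIG_FILES:
        candidate = root / name
        if candidate.is_file():
            return candidate
    return None
```

For example, `find_default_config(Path("."))` returns the path of the first default config file in the current directory, or `None` if none of the three exist.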
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
* Update description of GRAPHRAG_CACHE_BASE_DIR in env_vars.md
Clarified that `GRAPHRAG_CACHE_BASE_DIR` refers to the base directory path for cache files rather than reporting outputs. This improves the accuracy of the documentation and helps users understand the correct usage of this environment variable.
* Update description of `GRAPHRAG_CACHE_BASE_DIR`
Simplified the description of `GRAPHRAG_CACHE_BASE_DIR` to make it clearer. Changed "base directory path" to "base path" for conciseness.
---------
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
* Move text_embed to verb-less operation
* Move embed_graph to verb-less operation
* Return embeddings from embed_graph instead of modifying df
* Semver
* Use config existence instead of bool for graph embedding
* Send clustering strategy directly