# Indexing Dataflow

## The GraphRAG Knowledge Model

The knowledge model is a specification for data outputs that conform to our data-model definition. You can find these definitions in the python/graphrag/graphrag/model folder within the GraphRAG repository. The following entity types are provided. The fields described here are the ones that are text-embedded by default.

- `Document` - An input document into the system. These represent either individual rows in a CSV or individual .txt files.
- `TextUnit` - A chunk of text to analyze. The size of these chunks, their overlap, and whether they adhere to any data boundaries are configurable.
- `Entity` - An entity extracted from a TextUnit. These represent people, places, events, or some other entity-model that you provide.
- `Relationship` - A relationship between two entities.
- `Covariate` - Extracted claim information, which contains statements about entities which may be time-bound.
- `Community` - Once the graph of entities and relationships is built, we perform hierarchical community detection on them to create a clustering structure.
- `Community Report` - The contents of each community are summarized into a generated report, useful for human reading and downstream search.
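
To make the shape of these types more concrete, here is a minimal sketch of three of them as Python dataclasses. The class layouts, the `text_unit_ids` and `document_ids` fields, and the sample values are assumptions made for illustration; they are not the actual definitions in the GraphRAG repository. Only the `title`, `type`, `description`, `source`, and `target` fields are taken from the descriptions on this page.

```python
from dataclasses import dataclass, field


@dataclass
class TextUnit:
    """A chunk of text to analyze (hypothetical layout)."""
    id: str
    text: str
    document_ids: list[str] = field(default_factory=list)


@dataclass
class Entity:
    """An entity extracted from one or more TextUnits (hypothetical layout)."""
    title: str
    type: str
    description: str
    text_unit_ids: list[str] = field(default_factory=list)


@dataclass
class Relationship:
    """A relationship between two entities, referenced by title (hypothetical layout)."""
    source: str
    target: str
    description: str
    text_unit_ids: list[str] = field(default_factory=list)


unit = TextUnit(id="tu-1", text="Ada Lovelace worked with Charles Babbage.", document_ids=["doc-1"])
ada = Entity(title="Ada Lovelace", type="person", description="A mathematician.", text_unit_ids=[unit.id])
babbage = Entity(title="Charles Babbage", type="person", description="An inventor.", text_unit_ids=[unit.id])
worked_with = Relationship(
    source=ada.title,
    target=babbage.title,
    description="Ada Lovelace worked with Charles Babbage.",
    text_unit_ids=[unit.id],
)
```

Keeping the source TextUnit ids on each extracted item is what lets a downstream application trace a concept back to the text it came from.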

## The Default Configuration Workflow

Let's take a look at how the default-configuration workflow transforms text documents into the _GraphRAG Knowledge Model_. This page gives a general overview of the major steps in this process. To fully configure this workflow, check out the [configuration](../config/overview.md) documentation.

```mermaid
---
title: Dataflow Overview
---
flowchart TB
    subgraph phase1[Phase 1: Compose TextUnits]
    documents[Documents] --> chunk[Chunk]
    chunk --> textUnits[Text Units]
    end
    subgraph phase2[Phase 2: Document Processing]
    documents --> link_to_text_units[Link to TextUnits]
    textUnits --> link_to_text_units
    link_to_text_units --> document_outputs[Documents Table]
    end
    subgraph phase3[Phase 3: Graph Extraction]
    textUnits --> graph_extract[Entity & Relationship Extraction]
    graph_extract --> graph_summarize[Entity & Relationship Summarization]
    graph_summarize --> claim_extraction[Claim Extraction]
    claim_extraction --> graph_outputs[Graph Tables]
    end
    subgraph phase4[Phase 4: Graph Augmentation]
    graph_outputs --> community_detect[Community Detection]
    community_detect --> community_outputs[Communities Table]
    end
    subgraph phase5[Phase 5: Community Summarization]
    community_outputs --> summarized_communities[Community Summarization]
    summarized_communities --> community_report_outputs[Community Reports Table]
    end
    subgraph phase6[Phase 6: Text Embedding]
    textUnits --> text_embed[Text Embedding]
    graph_outputs --> description_embed[Description Embedding]
    community_report_outputs --> content_embed[Content Embedding]
    end
```

## Phase 1: Compose TextUnits

The first phase of the default-configuration workflow is to transform input documents into _TextUnits_. A _TextUnit_ is a chunk of text that is used for our graph extraction techniques. TextUnits are also used as source references by extracted knowledge items, providing breadcrumbs and provenance from concepts back to their original source text.

The chunk size (counted in tokens) is user-configurable. By default this is set to 1200 tokens. Larger chunks result in lower-fidelity output and less meaningful reference texts; however, using larger chunks can result in much faster processing time.
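
As a rough illustration of token-based chunking, the sketch below slides a fixed-size window over a token list. It is not the GraphRAG chunker: the function name `chunk_tokens` is hypothetical, whitespace splitting stands in for a real tokenizer, and the default overlap of 100 tokens is an assumption (only the 1200-token default size comes from the text above).

```python
def chunk_tokens(tokens: list[str], size: int = 1200, overlap: int = 100) -> list[list[str]]:
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in the range [0, size)")

    step = size - overlap  # how far the window advances each time
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the window already reached the end of the input
    return chunks


# Whitespace splitting is a stand-in for a real tokenizer (assumption).
tokens = "the quick brown fox jumps over the lazy dog".split()
chunks = chunk_tokens(tokens, size=4, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

With a small window the trade-off is easy to see: smaller chunks produce more TextUnits (and more extraction calls), while larger chunks produce fewer, coarser ones.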

```mermaid
---
title: Documents into Text Chunks
---
flowchart LR
    doc1[Document 1] --> tu1[TextUnit 1]
    doc1 --> tu2[TextUnit 2]
    doc2[Document 2] --> tu3[TextUnit 3]
    doc2 --> tu4[TextUnit 4]
```

## Phase 2: Document Processing

In this phase of the workflow, we create the _Documents_ table for the knowledge model. Final documents are not used directly in GraphRAG, but this step links them to their constituent text units for provenance in your own applications.

```mermaid
---
title: Document Processing
---
flowchart LR
    aug[Augment] --> dp[Link to TextUnits] --> dg[Documents Table]
```

### Link to TextUnits

In this step, we link each document to the text units that were created in the first phase. This allows us to understand which documents are related to which text units and vice versa.
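
The linking step can be pictured as building a lookup in both directions. The following is a minimal sketch using plain dictionaries; the record layout (`id`, `document_id`) and the function name `link_documents` are assumptions for illustration, not the actual table schema.

```python
from collections import defaultdict

# Hypothetical text-unit records; each one remembers the document it came from.
text_units = [
    {"id": "tu-1", "document_id": "doc-1"},
    {"id": "tu-2", "document_id": "doc-1"},
    {"id": "tu-3", "document_id": "doc-2"},
]


def link_documents(units: list[dict]) -> tuple[dict[str, list[str]], dict[str, str]]:
    """Return (document id -> text-unit ids, text-unit id -> document id)."""
    doc_to_units: dict[str, list[str]] = defaultdict(list)
    unit_to_doc: dict[str, str] = {}
    for unit in units:
        doc_to_units[unit["document_id"]].append(unit["id"])
        unit_to_doc[unit["id"]] = unit["document_id"]
    return dict(doc_to_units), unit_to_doc


doc_to_units, unit_to_doc = link_documents(text_units)
```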

### Documents Table

At this point, we can export the **Documents** table into the Knowledge Model.

## Phase 3: Graph Extraction

In this phase, we analyze each text unit and extract our graph primitives: _Entities_, _Relationships_, and _Claims_.
Entities and Relationships are extracted together in our _extract_graph_ workflow, and claims are extracted in our _extract_claims_ workflow. Results are then combined and passed to the following phases of the pipeline.

```mermaid
---
title: Graph Extraction
---
flowchart LR
    tu[TextUnit] --> ge[Graph Extraction] --> gs[Graph Summarization]
    tu --> ce[Claim Extraction]
```

> Note: if you are using the [FastGraphRAG](https://microsoft.github.io/graphrag/index/methods/#fastgraphrag) option, entity and relationship extraction will be performed using NLP to conserve LLM resources, and claim extraction will always be skipped.

### Entity & Relationship Extraction

In this first step of graph extraction, we process each text unit to extract entities and relationships out of the raw text using the LLM. The output of this step is a subgraph-per-TextUnit containing a list of **entities** with a _title_, _type_, and _description_, and a list of **relationships** with a _source_, _target_, and _description_.

These subgraphs are then merged: any entities with the same _title_ and _type_ are merged by creating an array of their descriptions. Similarly, any relationships with the same _source_ and _target_ are merged by creating an array of their descriptions.
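
The merge rule described above can be sketched as grouping by key and collecting descriptions. This is an illustrative version over plain dictionaries; the subgraph layout and the function name `merge_subgraphs` are assumptions, and the real workflow operates on tables rather than Python dicts.

```python
from collections import defaultdict


def merge_subgraphs(subgraphs: list[dict]) -> tuple[dict, dict]:
    """Merge per-TextUnit subgraphs, collecting descriptions into lists."""
    entities: dict[tuple[str, str], list[str]] = defaultdict(list)
    relationships: dict[tuple[str, str], list[str]] = defaultdict(list)
    for subgraph in subgraphs:
        for entity in subgraph["entities"]:
            # Entities are keyed by (title, type).
            entities[(entity["title"], entity["type"])].append(entity["description"])
        for rel in subgraph["relationships"]:
            # Relationships are keyed by (source, target).
            relationships[(rel["source"], rel["target"])].append(rel["description"])
    return dict(entities), dict(relationships)


# Two hypothetical subgraphs, one per TextUnit.
subgraphs = [
    {
        "entities": [{"title": "ADA", "type": "PERSON", "description": "A mathematician."}],
        "relationships": [{"source": "ADA", "target": "BABBAGE", "description": "Collaborated."}],
    },
    {
        "entities": [{"title": "ADA", "type": "PERSON", "description": "Wrote about the engine."}],
        "relationships": [{"source": "ADA", "target": "BABBAGE", "description": "Exchanged letters."}],
    },
]

merged_entities, merged_relationships = merge_subgraphs(subgraphs)
```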

### Entity & Relationship Summarization

Now that we have a graph of entities and relationships, each with a list of descriptions, we can summarize these lists into a single description per entity and relationship. This is done by asking the LLM for a short summary that captures all of the distinct information from each description. This allows all of our entities and relationships to have a single concise description.
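
A minimal sketch of this step is shown below, with the LLM passed in as a plain callable so the example runs without any model. The function name `summarize_descriptions`, the prompt wording, the de-duplication, and the shortcut that skips the LLM when only one description exists are all assumptions of this sketch, not the documented behavior of the workflow.

```python
from typing import Callable


def summarize_descriptions(descriptions: list[str], llm: Callable[[str], str]) -> str:
    """Collapse a list of descriptions into one, delegating to `llm` when needed."""
    # Drop empty strings and exact duplicates while preserving order.
    unique = list(dict.fromkeys(d.strip() for d in descriptions if d.strip()))
    if not unique:
        return ""
    if len(unique) == 1:
        return unique[0]  # nothing to reconcile (shortcut assumed for this sketch)
    prompt = "Write a short summary covering all distinct information below:\n"
    prompt += "\n".join(f"- {d}" for d in unique)
    return llm(prompt)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call: reports how many bullet lines it received."""
    return f"SUMMARY({prompt.count(chr(10) + '- ')})"


summary = summarize_descriptions(
    ["A mathematician.", "Wrote about the engine.", "A mathematician."], fake_llm
)
single = summarize_descriptions(["A mathematician."], fake_llm)
```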

### Claim Extraction (optional)

Finally, as an independent workflow, we extract claims from the source TextUnits. These claims represent positive factual statements with an evaluated status and time-bounds. These get exported as a primary artifact called **Covariates**.

Note: claim extraction is _optional_ and turned off by default. This is because claim extraction generally requires prompt tuning to be useful.

## Phase 4: Graph Augmentation

Now that we have a usable graph of entities and relationships, we want to understand their community structure. This gives us an explicit way of understanding the organization of our graph.

```mermaid
---
title: Graph Augmentation
---
flowchart LR
    cd[Leiden Hierarchical Community Detection] --> ag[Graph Tables]
```

### Community Detection

In this step, we generate a hierarchy of entity communities using the Hierarchical Leiden Algorithm. This method applies recursive community clustering to our graph until we reach a community-size threshold. This allows us to understand the community structure of our graph and provides a way to navigate and summarize the graph at different levels of granularity.
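
The sketch below shows only the recursive shape of this process: partition, record the communities at the current level, then recurse into any community that is still larger than the threshold. It does not implement Leiden. The `partition` argument is a placeholder for a real clustering routine, and the `split_in_half` stand-in ignores edges entirely; both names, like `hierarchical_communities`, are hypothetical.

```python
from typing import Callable

Partition = Callable[[list[str]], list[list[str]]]


def hierarchical_communities(
    nodes: list[str], partition: Partition, max_size: int, level: int = 0
) -> list[tuple[int, list[str]]]:
    """Return (level, members) pairs, recursing into communities above max_size."""
    results: list[tuple[int, list[str]]] = []
    for members in partition(nodes):
        results.append((level, members))
        # Recurse only if the community is too large and the split made progress.
        if max_size < len(members) < len(nodes):
            results.extend(hierarchical_communities(members, partition, max_size, level + 1))
    return results


def split_in_half(nodes: list[str]) -> list[list[str]]:
    """Placeholder partitioner (NOT Leiden): cuts the node list into two halves."""
    if len(nodes) < 2:
        return [nodes]
    mid = len(nodes) // 2
    return [nodes[:mid], nodes[mid:]]


hierarchy = hierarchical_communities(list("abcdefgh"), split_in_half, max_size=2)
for level, members in hierarchy:
    print(level, members)
```

Each level of the resulting hierarchy corresponds to one level of granularity at which the graph can later be summarized.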

### Graph Tables

Once our graph augmentation steps are complete, the final **Entities**, **Relationships**, and **Communities** tables are exported.

## Phase 5: Community Summarization

```mermaid
---
title: Community Summarization
---
flowchart LR
    sc[Generate Community Reports] --> ss[Summarize Community Reports] --> co[Community Reports Table]
```

At this point, we have a functional graph of entities and relationships and a hierarchy of communities for the entities.

Now we want to build on the communities data and generate reports for each community. This gives us a high-level understanding of the graph at several levels of granularity. For example, if community A is the top-level community, we'll get a report about the entire graph. If the community is lower-level, we'll get a report about a local cluster.

### Generate Community Reports

In this step, we generate a summary of each community using the LLM. This will allow us to understand the distinct information contained within each community and provide a scoped understanding of the graph, from either a high-level or a low-level perspective. These reports contain an executive overview and reference the key entities, relationships, and claims within the community sub-structure.
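
Before the LLM can write a report, the contents of a community have to be gathered into some textual context. The sketch below shows one plausible way to do that; the function name `build_report_context`, the record layouts, and the output format are assumptions for illustration and do not reflect the actual prompt used by GraphRAG.

```python
def build_report_context(
    community: dict,
    entities: dict[str, str],
    relationships: dict[tuple[str, str], str],
) -> str:
    """Collect the entities and internal relationships of one community as text."""
    members = set(community["entity_titles"])
    lines = [f"Community {community['id']} (level {community['level']})", "Entities:"]
    for title in sorted(members):
        lines.append(f"- {title}: {entities[title]}")
    lines.append("Relationships:")
    for (source, target), description in sorted(relationships.items()):
        # Keep only relationships whose endpoints are both inside the community.
        if source in members and target in members:
            lines.append(f"- {source} -> {target}: {description}")
    return "\n".join(lines)


# Hypothetical sample data.
entities = {"ADA": "A mathematician.", "BABBAGE": "An inventor.", "TURING": "A logician."}
relationships = {
    ("ADA", "BABBAGE"): "Collaborated on the engine.",
    ("BABBAGE", "TURING"): "Influenced later work.",
}
community = {"id": "c-0", "level": 1, "entity_titles": ["ADA", "BABBAGE"]}

context = build_report_context(community, entities, relationships)
print(context)
```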

### Summarize Community Reports

In this step, each _community report_ is then summarized via the LLM for shorthand use.

### Community Reports Table

At this point, some bookkeeping work is performed and we export the **Community Reports** table.

## Phase 6: Text Embedding

For all artifacts that require downstream vector search, we generate text embeddings as a final step. These embeddings are written directly to a configured vector store. By default we embed entity descriptions, text unit text, and community report text.
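
The following toy example shows the general pattern of embedding text and writing it to a store that supports similarity search. Everything in it is a stand-in: the bag-of-words `embed` function replaces a real embedding model, `InMemoryVectorStore` is a hypothetical class rather than one of the supported vector stores, and the key prefixes are made up for the example.

```python
import math
from collections import Counter


def embed(text: str) -> dict[str, float]:
    """Toy embedding: a unit-length bag-of-words vector (stand-in for a real model)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {token: c / norm for token, c in counts.items()} if norm else {}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(weight * b.get(token, 0.0) for token, weight in a.items())


class InMemoryVectorStore:
    """Hypothetical vector store that keeps embeddings in a dictionary."""

    def __init__(self) -> None:
        self._rows: dict[str, dict[str, float]] = {}

    def upsert(self, key: str, text: str) -> None:
        self._rows[key] = embed(text)

    def search(self, query: str, k: int = 1) -> list[str]:
        q = embed(query)
        ranked = sorted(self._rows, key=lambda key: cosine(q, self._rows[key]), reverse=True)
        return ranked[:k]


store = InMemoryVectorStore()
store.upsert("entity:ada", "Ada Lovelace was a mathematician")
store.upsert("text_unit:tu-1", "Charles Babbage designed the analytical engine")
store.upsert("report:c-0", "This community covers early computing pioneers")

top = store.search("analytical engine")
```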

```mermaid
---
title: Text Embedding Workflows
---
flowchart LR
    textUnits[Text Units] --> text_embed[Text Embedding]
    graph_outputs[Graph Tables] --> description_embed[Description Embedding]
    community_report_outputs[Community Reports] --> content_embed[Content Embedding]
```