From 92161583f169f9e97755071df042ec9acc8b6bb0 Mon Sep 17 00:00:00 2001
From: natoverse
Date: Fri, 10 Jan 2025 19:37:43 +0000
Subject: [PATCH] Deploying to gh-pages from @ microsoft/graphrag@0e7d22bfb0cbd08c8d66b34da763d38dfe4a719c 🚀

---
 developing/index.html                       |   1 +
 examples_notebooks/global_search/index.html |   2 +-
 .../index.html                              |   2 +-
 index/default_dataflow/index.html           | 241 ++++++++++--------
 index/overview/index.html                   |  10 +-
 search/search_index.json                    |   2 +-
 6 files changed, 135 insertions(+), 123 deletions(-)

diff --git a/developing/index.html b/developing/index.html
index f905f905..eaf7d6d5 100644
--- a/developing/index.html
+++ b/developing/index.html
@@ -1434,6 +1434,7 @@
  • poetry run poe test_unit - This will execute unit tests.
  • poetry run poe test_integration - This will execute integration tests.
  • poetry run poe test_smoke - This will execute smoke tests.
  • + poetry run poe test_verbs - This will execute tests of the basic workflows.
  • poetry run poe check - This will perform a suite of static checks across the package, including:
  • formatting
  • documentation formatting
diff --git a/examples_notebooks/global_search/index.html b/examples_notebooks/global_search/index.html
index 39a07c96..26ccb5ad 100644
--- a/examples_notebooks/global_search/index.html
+++ b/examples_notebooks/global_search/index.html
@@ -2248,7 +2248,7 @@ report_df.head()
     ---------------------------------------------------------------------------
     AttributeError                            Traceback (most recent call last)
    -/tmp/ipykernel_2138/1512985616.py in ?()
    +/tmp/ipykernel_2065/1512985616.py in ?()
           2 entity_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_TABLE}.parquet")
           3 report_df = pd.read_parquet(f"{INPUT_DIR}/{COMMUNITY_REPORT_TABLE}.parquet")
           4 entity_embedding_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_EMBEDDING_TABLE}.parquet")
    diff --git a/examples_notebooks/global_search_with_dynamic_community_selection/index.html b/examples_notebooks/global_search_with_dynamic_community_selection/index.html
    index e23f197c..68c321de 100644
    --- a/examples_notebooks/global_search_with_dynamic_community_selection/index.html
    +++ b/examples_notebooks/global_search_with_dynamic_community_selection/index.html
    @@ -2156,7 +2156,7 @@ report_df.head()
     
     ---------------------------------------------------------------------------
     AttributeError                            Traceback (most recent call last)
    -/tmp/ipykernel_2168/2760368953.py in ?()
    +/tmp/ipykernel_2098/2760368953.py in ?()
           2 entity_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_TABLE}.parquet")
           3 report_df = pd.read_parquet(f"{INPUT_DIR}/{COMMUNITY_REPORT_TABLE}.parquet")
           4 entity_embedding_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_EMBEDDING_TABLE}.parquet")
    diff --git a/index/default_dataflow/index.html b/index/default_dataflow/index.html
    index 5bd5f263..6b44381e 100644
    --- a/index/default_dataflow/index.html
    +++ b/index/default_dataflow/index.html
    @@ -685,9 +685,9 @@
     
             
               
-Claim Extraction & Emission
+Claim Extraction (optional)
-Graph Embedding
-Graph Tables Emission
+Graph Tables
-Community Embedding
-Community Tables Emission
+Community Reports Table
-Document Embedding
-Documents Table Emission
+Documents Table
-Phase 6: Network Visualization
+Phase 6: Network Visualization (optional)
+Phase 7: Text Embedding
  • Document - An input document into the system. These represent either individual rows in a CSV file or individual .txt files.
  • TextUnit - A chunk of text to analyze. The size of these chunks, their overlap, and whether they adhere to any data boundaries may be configured below. A common use case is to set CHUNK_BY_COLUMNS to id so that there is a 1-to-many relationship between documents and TextUnits instead of a many-to-many.
  • Entity - An entity extracted from a TextUnit. These represent people, places, events, or some other entity-model that you provide.
  • -Relationship - A relationship between two entities. These are generated from the covariates.
  • +Relationship - A relationship between two entities.
  • Covariate - Extracted claim information, which contains statements about entities which may be time-bound.
  • Community - Once the graph of entities and relationships is built, we perform hierarchical community detection on them to create a clustering structure.
  • Community Report - The contents of each community are summarized into a generated report, useful for human reading and downstream search.
@@ -1887,7 +1899,7 @@
 title: Dataflow Overview
 flowchart TB
     subgraph phase1[Phase 1: Compose TextUnits]
         documents[Documents] --> chunk[Chunk]
-        chunk --> embed[Embed] --> textUnits[Text Units]
+        chunk --> textUnits[Text Units]
     end
@@ -1897,32 +1909,31 @@ flowchart TB
     subgraph phase2[Phase 2: Graph Extraction]
         textUnits --> graph_extract[Entity & Relationship Extraction]
     end
     subgraph phase3[Phase 3: Graph Augmentation]
         graph_outputs --> community_detect[Community Detection]
-        community_detect --> graph_embed[Graph Embedding]
-        graph_embed --> augmented_graph[Augmented Graph Tables]
+        community_detect --> community_outputs[Communities Table]
     end
     subgraph phase4[Phase 4: Community Summarization]
-        augmented_graph --> summarized_communities[Community Summarization]
-        summarized_communities --> embed_communities[Community Embedding]
-        embed_communities --> community_outputs[Community Tables]
+        community_outputs --> summarized_communities[Community Summarization]
+        summarized_communities --> community_report_outputs[Community Reports Table]
     end
     subgraph phase5[Phase 5: Document Processing]
         documents --> link_to_text_units[Link to TextUnits]
         textUnits --> link_to_text_units
-        link_to_text_units --> embed_documents[Document Embedding]
-        embed_documents --> document_graph[Document Graph Creation]
-        document_graph --> document_outputs[Document Tables]
+        link_to_text_units --> document_outputs[Documents Table]
     end
     subgraph phase6[Phase 6: Network Visualization]
-        document_outputs --> umap_docs[Umap Documents]
-        augmented_graph --> umap_entities[Umap Entities]
-        umap_docs --> combine_nodes[Nodes Table]
-        umap_entities --> combine_nodes
+        graph_outputs --> graph_embed[Graph Embedding]
+        graph_embed --> umap_entities[Umap Entities]
+        umap_entities --> combine_nodes[Final Nodes]
+    end
+    subgraph phase7[Phase 7: Text Embeddings]
+        textUnits --> text_embed[Text Embedding]
+        graph_outputs --> description_embed[Description Embedding]
+        community_report_outputs --> content_embed[Content Embedding]
     end

    Phase 1: Compose TextUnits

-The first phase of the default-configuration workflow is to transform input documents into TextUnits. A TextUnit is a chunk of text that is used for our graph extraction techniques. They are also used as source-references by extracted knowledge items in order to empower breadcrumbs and provenance by concepts back to their original source tex.
+The first phase of the default-configuration workflow is to transform input documents into TextUnits. A TextUnit is a chunk of text that is used for our graph extraction techniques. They are also used as source-references by extracted knowledge items in order to empower breadcrumbs and provenance by concepts back to their original source text.

The chunk size (counted in tokens) is user-configurable. By default this is set to 300 tokens, although we've had positive experience with 1200-token chunks using a single "glean" step. (A "glean" step is a follow-on extraction.) Larger chunks result in lower-fidelity output and less meaningful reference texts; however, using larger chunks can result in much faster processing time.

The group-by configuration is also user-configurable. By default, we align our chunks to document boundaries, meaning that there is a strict 1-to-many relationship between Documents and TextUnits. In rare cases, this can be turned into a many-to-many relationship. This is useful when the documents are very short and we need several of them to compose a meaningful analysis unit (e.g. Tweets or a chat log).
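As a sketch, the chunk-size and group-by behavior described above maps to the chunking section of a GraphRAG settings file. The key names below follow the configuration documentation but may differ between versions, so treat them as assumptions rather than an exact schema:

```yaml
chunks:
  size: 1200          # tokens per chunk; the default described above is 300
  overlap: 100        # tokens shared between adjacent chunks
  group_by_columns:   # align chunks to document boundaries
    - id              # yields a 1-to-many Document -> TextUnit relationship
```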

-Each of these text-units are text-embedded and passed into the next phase of the pipeline.

    ---
     title: Documents into Text Chunks
     ---
    @@ -1942,65 +1953,71 @@ flowchart LR
         tu[TextUnit] --> ge[Graph Extraction] --> gs[Graph Summarization]
         tu --> ce[Claim Extraction]

    Entity & Relationship Extraction

-In this first step of graph extraction, we process each text-unit in order to extract entities and relationships out of the raw text using the LLM. The output of this step is a subgraph-per-TextUnit containing a list of entities with a name, type, and description, and a list of relationships with a source, target, and description.
-These subgraphs are merged together - any entities with the same name and type are merged by creating an array of their descriptions. Similarly, any relationships with the same source and target are merged by creating an array of their descriptions.
+In this first step of graph extraction, we process each text-unit in order to extract entities and relationships out of the raw text using the LLM. The output of this step is a subgraph-per-TextUnit containing a list of entities with a title, type, and description, and a list of relationships with a source, target, and description.
+These subgraphs are merged together - any entities with the same title and type are merged by creating an array of their descriptions. Similarly, any relationships with the same source and target are merged by creating an array of their descriptions.
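A minimal sketch of this merge step, using hypothetical record shapes rather than the actual GraphRAG data structures:

```python
from collections import defaultdict

def merge_entities(subgraph_entities):
    """Merge per-TextUnit entity lists: entities sharing (title, type)
    become one entity carrying an array of all observed descriptions."""
    merged = defaultdict(list)
    for entities in subgraph_entities:
        for e in entities:
            merged[(e["title"], e["type"])].append(e["description"])
    return [
        {"title": title, "type": etype, "descriptions": descs}
        for (title, etype), descs in merged.items()
    ]

# Two per-TextUnit subgraphs that both mention the same entity
chunk_a = [{"title": "Scrooge", "type": "PERSON",
            "description": "A miserly money-lender."}]
chunk_b = [{"title": "Scrooge", "type": "PERSON",
            "description": "Ebenezer Scrooge, visited by three spirits."}]
merged = merge_entities([chunk_a, chunk_b])  # one entity, two descriptions
```

Relationships are merged the same way, keyed on (source, target) instead of (title, type).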

    Entity & Relationship Summarization

    Now that we have a graph of entities and relationships, each with a list of descriptions, we can summarize these lists into a single description per entity and relationship. This is done by asking the LLM for a short summary that captures all of the distinct information from each description. This allows all of our entities and relationships to have a single concise description.
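A sketch of how the per-entity description lists might be folded into a single summarization request; the prompt text here is illustrative, not the shipped GraphRAG prompt:

```python
def build_summary_prompt(title, descriptions):
    """Assemble a summarization prompt from an entity's description array."""
    joined = "\n".join(f"- {d}" for d in descriptions)
    return (
        f"Given the following descriptions of '{title}', write one concise "
        f"summary that preserves all distinct information:\n{joined}"
    )

prompt = build_summary_prompt(
    "Scrooge",
    ["A miserly money-lender.", "Visited by three spirits on Christmas Eve."],
)
# The prompt is then sent to the LLM; its response becomes the entity's
# single consolidated description.
```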

-Claim Extraction & Emission
+Claim Extraction (optional)

    Finally, as an independent workflow, we extract claims from the source TextUnits. These claims represent positive factual statements with an evaluated status and time-bounds. These get exported as a primary artifact called Covariates.

    Note: claim extraction is optional and turned off by default. This is because claim extraction generally requires prompt tuning to be useful.
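To illustrate the shape of an extracted claim, here is a hypothetical Covariate record; the field names are assumptions for illustration, not the exact exported schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Covariate:
    """Hypothetical claim record: a factual statement about an entity,
    with an evaluated status and optional time bounds."""
    subject_id: str                   # entity the claim is about
    claim_type: str
    status: str                       # e.g. "TRUE", "FALSE", "SUSPECTED"
    description: str
    start_date: Optional[str] = None  # time bounds, if stated in the text
    end_date: Optional[str] = None
    source_text_unit: str = ""        # provenance back to the TextUnit

claim = Covariate(
    subject_id="COMPANY A",
    claim_type="regulatory violation",
    status="SUSPECTED",
    description="Company A was fined for anti-competitive practices in 2022.",
    start_date="2022-01-01",
    end_date="2022-12-31",
    source_text_unit="tu-042",
)
```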

    Phase 3: Graph Augmentation

-Now that we have a usable graph of entities and relationships, we want to understand their community structure and augment the graph with additional information. This is done in two steps: Community Detection and Graph Embedding. These give us explicit (communities) and implicit (embeddings) ways of understanding the topological structure of our graph.
+Now that we have a usable graph of entities and relationships, we want to understand their community structure. This gives us an explicit way of understanding the topological structure of our graph.

    ---
     title: Graph Augmentation
     ---
     flowchart LR
    -    cd[Leiden Hierarchical Community Detection] --> ge[Node2Vec Graph Embedding] --> ag[Graph Table Emission]
+    cd[Leiden Hierarchical Community Detection] --> ag[Graph Tables]

    Community Detection

    In this step, we generate a hierarchy of entity communities using the Hierarchical Leiden Algorithm. This method will apply a recursive community-clustering to our graph until we reach a community-size threshold. This will allow us to understand the community structure of our graph and provide a way to navigate and summarize the graph at different levels of granularity.
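A toy sketch of the recursive size-threshold idea behind the hierarchy. The real pipeline uses the Hierarchical Leiden algorithm; here the "cluster" step is simply connected components after dropping the weakest edges, so this only illustrates the recursion, not Leiden itself:

```python
def connected_components(nodes, edges):
    """BFS connected components; edges is a list of (u, v, weight)."""
    adj = {n: set() for n in nodes}
    for u, v, _ in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for n in nodes:
        if n in seen:
            continue
        comp, queue = set(), [n]
        while queue:
            cur = queue.pop()
            if cur in seen:
                continue
            seen.add(cur)
            comp.add(cur)
            queue.extend(adj[cur] - seen)
        comps.append(comp)
    return comps

def recursive_cluster(nodes, edges, max_size, level=0):
    """Recursively split any community larger than max_size by removing
    the weakest edges, recording the hierarchy level of each community."""
    result = []
    for comp in connected_components(nodes, edges):
        sub_edges = [e for e in edges if e[0] in comp and e[1] in comp]
        if len(comp) <= max_size or not sub_edges:
            result.append((level, comp))  # small enough (or unsplittable)
        else:
            min_w = min(w for _, _, w in sub_edges)
            kept = [e for e in sub_edges if e[2] > min_w]
            result.extend(recursive_cluster(comp, kept, max_size, level + 1))
    return result

nodes = ["a", "b", "c", "d"]
edges = [("a", "b", 5), ("b", "c", 1), ("c", "d", 5)]
communities = recursive_cluster(nodes, edges, max_size=2)
# One level-0 community of size 4 splits into two level-1 communities.
```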

-Graph Embedding

-In this step, we generate a vector representation of our graph using the Node2Vec algorithm. This will allow us to understand the implicit structure of our graph and provide an additional vector-space in which to search for related concepts during our query phase.

-Graph Tables Emission

-Once our graph augmentation steps are complete, the final Entities and Relationships tables are exported after their text fields are text-embedded.
+Graph Tables

+Once our graph augmentation steps are complete, the final Entities, Relationships, and Communities tables are exported.

    Phase 4: Community Summarization

    ---
     title: Community Summarization
     ---
     flowchart LR
-    sc[Generate Community Reports] --> ss[Summarize Community Reports] --> ce[Community Embedding] --> co[Community Tables Emission]
+    sc[Generate Community Reports] --> ss[Summarize Community Reports] --> co[Community Reports Table]

-At this point, we have a functional graph of entities and relationships, a hierarchy of communities for the entities, as well as node2vec embeddings.
+At this point, we have a functional graph of entities and relationships and a hierarchy of communities for the entities.

    Now we want to build on the communities data and generate reports for each community. This gives us a high-level understanding of the graph at several points of graph granularity. For example, if community A is the top-level community, we'll get a report about the entire graph. If the community is lower-level, we'll get a report about a local cluster.

    Generate Community Reports

    In this step, we generate a summary of each community using the LLM. This will allow us to understand the distinct information contained within each community and provide a scoped understanding of the graph, from either a high-level or a low-level perspective. These reports contain an executive overview and reference the key entities, relationships, and claims within the community sub-structure.

    Summarize Community Reports

    In this step, each community report is then summarized via the LLM for shorthand use.

-Community Embedding

-In this step, we generate a vector representation of our communities by generating text embeddings of the community report, the community report summary, and the title of the community report.

-Community Tables Emission

-At this point, some bookkeeping work is performed and we export the Communities and CommunityReports tables.
+Community Reports Table

+At this point, some bookkeeping work is performed and we export the Community Reports table.

    Phase 5: Document Processing

    In this phase of the workflow, we create the Documents table for the knowledge model.

    ---
     title: Document Processing
     ---
     flowchart LR
    -    aug[Augment] --> dp[Link to TextUnits] --> de[Avg. Embedding] --> dg[Document Table Emission]
+    aug[Augment] --> dp[Link to TextUnits] --> dg[Documents Table]

    Augment with Columns (CSV Only)

    If the workflow is operating on CSV data, you may configure your workflow to add additional fields to Documents output. These fields should exist on the incoming CSV tables. Details about configuring this can be found in the configuration documentation.

    In this step, we link each document to the text-units that were created in the first phase. This allows us to understand which documents are related to which text-units and vice-versa.
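A minimal sketch of this linking step, using plain dicts in place of the pipeline's parquet tables (the record shapes here are illustrative assumptions):

```python
text_units = [
    {"id": "tu-1", "document_ids": ["doc-a"]},
    {"id": "tu-2", "document_ids": ["doc-a"]},
    {"id": "tu-3", "document_ids": ["doc-b"]},
]

def link_documents(text_units):
    """Invert the TextUnit -> Document references into a
    Document -> [TextUnit id] mapping."""
    docs = {}
    for tu in text_units:
        for doc_id in tu["document_ids"]:
            docs.setdefault(doc_id, []).append(tu["id"])
    return docs

documents = link_documents(text_units)
# documents now answers both directions: each TextUnit already names its
# documents, and each document now lists its TextUnits.
```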

-Document Embedding

-In this step, we generate a vector representation of our documents using an average embedding of document slices. We re-chunk documents without overlapping chunks, and then generate an embedding for each chunk. We create an average of these chunks weighted by token-count and use this as the document embedding. This will allow us to understand the implicit relationship between documents, and will help us generate a network representation of our documents.

-Documents Table Emission
+Documents Table

At this point, we can export the Documents table into the knowledge model.

-Phase 6: Network Visualization
+Phase 6: Network Visualization (optional)

    In this phase of the workflow, we perform some steps to support network visualization of our high-dimensional vector spaces within our existing graphs. At this point there are two logical graphs at play: the Entity-Relationship graph and the Document graph.

    ---
     title: Network Visualization Workflows
     ---
     flowchart LR
-    nv[Umap Documents] --> ne[Umap Entities] --> ng[Nodes Table Emission]
+    ag[Graph Table] --> ge[Node2Vec Graph Embedding] --> ne[Umap Entities] --> ng[Nodes Table]

-For each of the logical graphs, we perform a UMAP dimensionality reduction to generate a 2D representation of the graph. This will allow us to visualize the graph in a 2D space and understand the relationships between the nodes in the graph. The UMAP embeddings are then exported as a table of Nodes. The rows of this table include a discriminator indicating whether the node is a document or an entity, and the UMAP coordinates.

+Graph Embedding

+In this step, we generate a vector representation of our graph using the Node2Vec algorithm. This will allow us to understand the implicit structure of our graph and provide an additional vector-space in which to search for related concepts during our query phase.
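Node2Vec learns node vectors by running biased random walks over the graph and feeding the resulting node sequences to a word2vec-style skip-gram model. A simplified sketch of the walk-generation half, using uniform steps rather than the p/q-biased steps of full Node2Vec:

```python
import random

def random_walks(adj, walk_length, walks_per_node, seed=42):
    """Generate uniform random walks over an adjacency dict.
    Full Node2Vec additionally biases each step with its
    return (p) and in-out (q) parameters."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    walks = []
    for start in adj:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adj[walk[-1]]
                if not neighbors:
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
walks = random_walks(adj, walk_length=4, walks_per_node=2)
# These sequences would then be embedded by a skip-gram model,
# giving each node a vector close to its frequent walk-neighbors.
```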

+Dimensionality Reduction

+For each of the logical graphs, we perform a UMAP dimensionality reduction to generate a 2D representation of the graph. This will allow us to visualize the graph in a 2D space and understand the relationships between the nodes in the graph. The UMAP embeddings are then exported as a table of Nodes. The rows of this table include the UMAP dimensions as x/y coordinates.

+Phase 7: Text Embedding

+For all artifacts that require downstream vector search, we generate text embeddings as a final step. These embeddings are written directly to a configured vector store. By default we embed entity descriptions, text unit text, and community report text.

+---
    +title: Text Embedding Workflows
    +---
    +flowchart LR
    +    textUnits[Text Units] --> text_embed[Text Embedding]
    +    graph_outputs[Graph Tables] --> description_embed[Description Embedding]
    +    community_report_outputs[Community Reports] --> content_embed[Content Embedding]
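A sketch of this final embedding pass, with a stand-in embedding function and a dict-backed "vector store"; the real pipeline calls a configured embedding model and vector database, and the field and key names below are assumptions:

```python
def fake_embed(text):
    """Stand-in embedder: real pipelines call an embedding model API."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]

def embed_artifacts(entities, text_units, reports, store):
    """Embed the three default targets named above: entity descriptions,
    text unit text, and community report content."""
    for e in entities:
        store[("entity.description", e["id"])] = fake_embed(e["description"])
    for tu in text_units:
        store[("text_unit.text", tu["id"])] = fake_embed(tu["text"])
    for r in reports:
        store[("community.full_content", r["id"])] = fake_embed(r["full_content"])

store = {}
embed_artifacts(
    entities=[{"id": "e1", "description": "A miserly money-lender."}],
    text_units=[{"id": "tu1", "text": "Marley was dead, to begin with."}],
    reports=[{"id": "c1", "full_content": "A community of Dickens characters."}],
    store=store,
)
```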
diff --git a/index/overview/index.html b/index/overview/index.html
index 1337908a..c358f034 100644
--- a/index/overview/index.html
+++ b/index/overview/index.html
@@ -1564,22 +1564,16 @@
  • embed entities into a graph vector space
  • embed text chunks into a textual vector space
-The outputs of the pipeline can be stored in a variety of formats, including JSON and Parquet - or they can be handled manually via the Python API.
+The outputs of the pipeline are stored as Parquet tables by default, and embeddings are written to your configured vector store.

    Getting Started

    Requirements

    See the requirements section in Get Started for details on setting up a development environment.

-The Indexing Engine can be used in either a default configuration mode or with a custom pipeline.
-To configure GraphRAG, see the configuration documentation.
+To configure GraphRAG, see the configuration documentation. After you have a config file, you can run the pipeline using the CLI or the Python API.

    Usage

    CLI

    # Via Poetry
     poetry run poe cli --root <data_root> # default config mode
    -poetry run poe cli --config your_pipeline.yml # custom config mode
    -
    -# Via Node
    -yarn run:index --root <data_root> # default config mode
    -yarn run:index --config your_pipeline.yml # custom config mode
     

    Python API

Please see the examples folder for a handful of functional pipelines illustrating how to create and run a pipeline via a custom settings.yml or through custom Python scripts.

diff --git a/search/search_index.json b/search/search_index.json
index 2641d067..ba9df3ac 100644
--- a/search/search_index.json
+++ b/search/search_index.json
@@ -1 +1 @@
-{"config": {"lang": ["en"], "separator": "[\\s\\-]+", "pipeline": ["stopWordFilter"]}, "docs": [{"location": "", "title": "Welcome to GraphRAG", "text": "

👉 Microsoft Research Blog Post 👉 GraphRAG Accelerator 👉 GraphRAG Arxiv

    Figure 1: An LLM-generated knowledge graph built using GPT-4 Turbo.

GraphRAG is a structured, hierarchical approach to Retrieval Augmented Generation (RAG), as opposed to naive semantic-search approaches using plain text snippets. The GraphRAG process involves extracting a knowledge graph out of raw text, building a community hierarchy, generating summaries for these communities, and then leveraging these structures when performing RAG-based tasks.

To learn more about GraphRAG and how it can be used to enhance your LLM's ability to reason about your private data, please visit the Microsoft Research Blog Post.

    "}, {"location": "#solution-accelerator", "title": "Solution Accelerator \ud83d\ude80", "text": "

    To quickstart the GraphRAG system we recommend trying the Solution Accelerator package. This provides a user-friendly end-to-end experience with Azure resources.

    "}, {"location": "#get-started-with-graphrag", "title": "Get Started with GraphRAG \ud83d\ude80", "text": "

    To start using GraphRAG, check out the Get Started guide. For a deeper dive into the main sub-systems, please visit the docpages for the Indexer and Query packages.

    "}, {"location": "#graphrag-vs-baseline-rag", "title": "GraphRAG vs Baseline RAG \ud83d\udd0d", "text": "

Retrieval-Augmented Generation (RAG) is a technique to improve LLM outputs using real-world information. This technique is an important part of most LLM-based tools and the majority of RAG approaches use vector similarity as the search technique, which we call Baseline RAG. GraphRAG uses knowledge graphs to provide substantial improvements in question-and-answer performance when reasoning about complex information. RAG techniques have shown promise in helping LLMs to reason about private datasets - data that the LLM is not trained on and has never seen before, such as an enterprise's proprietary research, business documents, or communications. Baseline RAG was created to help solve this problem, but we observe situations where baseline RAG performs very poorly. For example:

To address this, the tech community is working to develop methods that extend and enhance RAG. Microsoft Research's new approach, GraphRAG, uses LLMs to create a knowledge graph based on an input corpus. This graph, along with community summaries and graph machine learning outputs, are used to augment prompts at query time. GraphRAG shows substantial improvement in answering the two classes of questions described above, demonstrating intelligence or mastery that outperforms other approaches previously applied to private datasets.

    "}, {"location": "#the-graphrag-process", "title": "The GraphRAG Process \ud83e\udd16", "text": "

    GraphRAG builds upon our prior research and tooling using graph machine learning. The basic steps of the GraphRAG process are as follows:

    "}, {"location": "#index", "title": "Index", "text": ""}, {"location": "#query", "title": "Query", "text": "

    At query time, these structures are used to provide materials for the LLM context window when answering a question. The primary query modes are:

    "}, {"location": "#prompt-tuning", "title": "Prompt Tuning", "text": "

Using GraphRAG with your data out of the box may not yield the best possible results. We strongly recommend fine-tuning your prompts by following the Prompt Tuning Guide in our documentation.

    "}, {"location": "blog_posts/", "title": "Microsoft Research Blog", "text": "