mirror of
https://github.com/microsoft/graphrag.git
synced 2026-02-14 15:04:29 +08:00
Deploying to gh-pages from @ microsoft/graphrag@27c6de846f 🚀
parent
a15bd461fd
commit
9ae919afbc
@ -1684,7 +1684,7 @@
<p>As of version 1.3, GraphRAG no longer supports a full complement of pre-built environment variables. Instead, we support variable replacement within the <a href="../yaml/">settings.yml file</a> so you can specify any environment variables you like.</p>
<p>The only standard environment variable we expect, and include in the default settings.yml, is <code>GRAPHRAG_API_KEY</code>. If you are already using a number of the previous GRAPHRAG_* environment variables, you can insert them with template syntax into settings.yml and they will be adopted.</p>
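<p>As a rough illustration of what variable replacement means here, the sketch below substitutes environment values into a settings fragment using Python's standard <code>string.Template</code>. The <code>${NAME}</code> placeholder form, the <code>MY_MODEL_NAME</code> variable, and the <code>substitute_env</code> helper are assumptions made for the example; this is not GraphRAG's own loader.</p>

```python
import os
from string import Template

# Hypothetical settings fragment; the keys and MY_MODEL_NAME are illustrative only.
SETTINGS_TEXT = "api_key: ${GRAPHRAG_API_KEY}\nmodel: ${MY_MODEL_NAME}\n"


def substitute_env(text, env=None):
    """Replace ${NAME} placeholders with values from env (defaults to os.environ).

    Raises KeyError if a placeholder has no matching variable, so a missing
    value fails loudly instead of silently leaving the placeholder in place.
    """
    values = os.environ if env is None else env
    return Template(text).substitute(values)


# An explicit mapping stands in for the real environment in this example.
resolved = substitute_env(
    SETTINGS_TEXT,
    {"GRAPHRAG_API_KEY": "sk-example", "MY_MODEL_NAME": "example-model"},
)
print(resolved)
```

<p>Failing on a missing variable is a design choice for the sketch: a placeholder that silently survives into the final config is usually harder to diagnose than an early error.</p>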
<blockquote>
<p><strong>The environment variables below are documented as an aid for migration, but they WILL NOT be read unless you use template syntax in your settings.yml.</strong></p>
<p><strong>The environment variables below are documented as an aid for migration, but they WILL NOT be read unless you use template syntax in your settings.yml. We also WILL NOT be updating this page as the main config object changes.</strong></p>
</blockquote>
<hr />
<h3 id="text-embeddings-customization">Text-Embeddings Customization</h3>
@ -1558,7 +1558,7 @@
<h4 id="fields">Fields</h4>
<ul>
<li><code>api_key</code> <strong>str</strong> - The OpenAI API key to use.</li>
<li><code>auth_type</code> <strong>api_key|managed_identity</strong> - Indicate how you want to authenticate requests.</li>
<li><code>auth_type</code> <strong>api_key|azure_managed_identity</strong> - Indicate how you want to authenticate requests.</li>
<li><code>type</code> <strong>openai_chat|azure_openai_chat|openai_embedding|azure_openai_embedding|mock_chat|mock_embeddings</strong> - The type of LLM to use.</li>
<li><code>model</code> <strong>str</strong> - The model name.</li>
<li><code>encoding_model</code> <strong>str</strong> - The text encoding model to use. Default is to use the encoding model aligned with the language model (i.e., it is retrieved from tiktoken if unset).</li>
@ -1589,15 +1589,17 @@
</ul>
<h2 id="input-files-and-chunking">Input Files and Chunking</h2>
<h3 id="input">input</h3>
<p>Our pipeline can ingest .csv, .txt, or .json data from an input folder. See the <a href="../../index/inputs/">inputs page</a> for more details and examples.</p>
<p>Our pipeline can ingest .csv, .txt, or .json data from an input location. See the <a href="../../index/inputs/">inputs page</a> for more details and examples.</p>
<h4 id="fields_1">Fields</h4>
<ul>
<li><code>type</code> <strong>file|blob</strong> - The input type to use. Default=<code>file</code></li>
<li><code>storage</code> <strong>StorageConfig</strong></li>
<li><code>type</code> <strong>file|blob|cosmosdb</strong> - The storage type to use. Default=<code>file</code></li>
<li><code>base_dir</code> <strong>str</strong> - The base directory to write output artifacts to, relative to the root.</li>
<li><code>connection_string</code> <strong>str</strong> - (blob/cosmosdb only) The Azure Storage connection string.</li>
<li><code>container_name</code> <strong>str</strong> - (blob/cosmosdb only) The Azure Storage container name.</li>
<li><code>storage_account_blob_url</code> <strong>str</strong> - (blob only) The storage account blob URL to use.</li>
<li><code>cosmosdb_account_blob_url</code> <strong>str</strong> - (cosmosdb only) The CosmosDB account blob URL to use.</li>
<li><code>file_type</code> <strong>text|csv|json</strong> - The type of input data to load. Default is <code>text</code></li>
<li><code>base_dir</code> <strong>str</strong> - The base directory to read input from, relative to the root.</li>
<li><code>connection_string</code> <strong>str</strong> - (blob only) The Azure Storage connection string.</li>
<li><code>storage_account_blob_url</code> <strong>str</strong> - The storage account blob URL to use.</li>
<li><code>container_name</code> <strong>str</strong> - (blob only) The Azure Storage container name.</li>
<li><code>encoding</code> <strong>str</strong> - The encoding of the input file. Default is <code>utf-8</code></li>
<li><code>file_pattern</code> <strong>str</strong> - A regex to match input files. Default is <code>.*\.csv$</code>, <code>.*\.txt$</code>, or <code>.*\.json$</code> depending on the specified <code>file_type</code>, but you can customize it if needed.</li>
<li><code>file_filter</code> <strong>dict</strong> - Key/value pairs to filter. Default is None.</li>
@ -1590,8 +1590,7 @@
<h2 id="requirements">Requirements</h2>
<p><a href="https://www.python.org/downloads/">Python 3.10-3.12</a></p>
<p>To get started with the GraphRAG system, you have a few options:</p>
<p>👉 <a href="https://github.com/Azure-Samples/graphrag-accelerator">Use the GraphRAG Accelerator solution</a> <br/>
👉 <a href="https://pypi.org/project/graphrag/">Install from pypi</a>. <br/>
<p>👉 <a href="https://pypi.org/project/graphrag/">Install from pypi</a>. <br/>
👉 <a href="../developing/">Use it from source</a><br/></p>
<p>The following is a simple end-to-end example for using the GraphRAG system, using the install from pypi option.</p>
<p>It shows how to use the system to index some text, and then use the indexed data to answer questions about the documents.</p>
21
index.html
@ -501,15 +501,6 @@
</label>
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
<li class="md-nav__item">
<a href="#solution-accelerator" class="md-nav__link">
<span class="md-ellipsis">
Solution Accelerator 🚀
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#get-started-with-graphrag" class="md-nav__link">
<span class="md-ellipsis">
@ -1625,15 +1616,6 @@
</label>
<ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
<li class="md-nav__item">
<a href="#solution-accelerator" class="md-nav__link">
<span class="md-ellipsis">
Solution Accelerator 🚀
</span>
</a>
</li>
<li class="md-nav__item">
<a href="#get-started-with-graphrag" class="md-nav__link">
<span class="md-ellipsis">
@ -1724,7 +1706,6 @@
<h1 id="welcome-to-graphrag">Welcome to GraphRAG</h1>
<p>👉 <a href="https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/">Microsoft Research Blog Post</a> <br/>
👉 <a href="https://github.com/Azure-Samples/graphrag-accelerator">GraphRAG Accelerator</a> <br/>
👉 <a href="https://arxiv.org/pdf/2404.16130">GraphRAG Arxiv</a></p>
<p align="center">
<img src="img/GraphRag-Figure1.jpg" alt="Figure 1: LLM-generated knowledge graph built from a private dataset using GPT-4 Turbo." width="450" align="center" />
@ -1736,8 +1717,6 @@ Figure 1: An LLM-generated knowledge graph built using GPT-4 Turbo.
<p>GraphRAG is a structured, hierarchical approach to Retrieval Augmented Generation (RAG), as opposed to naive semantic-search
approaches using plain text snippets. The GraphRAG process involves extracting a knowledge graph out of raw text, building a community hierarchy, generating summaries for these communities, and then leveraging these structures when performing RAG-based tasks.</p>
<p>To learn more about GraphRAG and how it can be used to enhance your language model's ability to reason about your private data, please visit the <a href="https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/">Microsoft Research Blog Post</a>.</p>
<h2 id="solution-accelerator">Solution Accelerator 🚀</h2>
<p>To quickstart the GraphRAG system we recommend trying the <a href="https://github.com/Azure-Samples/graphrag-accelerator">Solution Accelerator</a> package. This provides a user-friendly end-to-end experience with Azure resources.</p>
<h2 id="get-started-with-graphrag">Get Started with GraphRAG 🚀</h2>
<p>To start using GraphRAG, check out the <a href="get_started/"><em>Get Started</em></a> guide.
For a deeper dive into the main sub-systems, please visit the docpages for the <a href="index/overview/">Indexer</a> and <a href="query/overview/">Query</a> packages.</p>
@ -1786,14 +1786,13 @@
<div class="highlight"><pre><span></span><code><a id="__codelineno-1-1" name="__codelineno-1-1" href="#__codelineno-1-1"></a><span class="nt">workflows</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">[</span><span class="nv">create_communities</span><span class="p p-Indicator">,</span><span class="w"> </span><span class="nv">create_community_reports</span><span class="p p-Indicator">,</span><span class="w"> </span><span class="nv">generate_text_embeddings</span><span class="p p-Indicator">]</span>
</code></pre></div>
<h3 id="fastgraphrag">FastGraphRAG</h3>
<p><a href="../methods/#fastgraphrag">FastGraphRAG</a> uses text_units for the community reports instead of the entity and relationship descriptions. If your graph is sourced in such a way that it does not have descriptions, this might be a useful alternative. In this case, you would update your workflows list to include the text variant:</p>
<p><a href="../methods/#fastgraphrag">FastGraphRAG</a> uses text_units for the community reports instead of the entity and relationship descriptions. If your graph is sourced in such a way that it does not have descriptions, this might be a useful alternative. In this case, you would update your workflows list to include the text variant of the community reports workflow:</p>
<div class="highlight"><pre><span></span><code><a id="__codelineno-2-1" name="__codelineno-2-1" href="#__codelineno-2-1"></a><span class="nt">workflows</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">[</span><span class="nv">create_communities</span><span class="p p-Indicator">,</span><span class="w"> </span><span class="nv">create_community_reports_text</span><span class="p p-Indicator">,</span><span class="w"> </span><span class="nv">generate_text_embeddings</span><span class="p p-Indicator">]</span>
</code></pre></div>
<p>This method requires that your entities and relationships tables have valid links to a list of text_unit_ids. Also note that <code>generate_text_embeddings</code> is still only required if you are doing searches other than Global Search.</p>
<h2 id="setup">Setup</h2>
<p>Putting it all together:</p>
<ul>
<li><code>input</code>: GraphRAG does require an input document set, even if you don't need us to process it. You can create an input folder and drop a dummy.txt document in there to work around this.</li>
<li><code>output</code>: Create an output folder and put your entities and relationships (and optionally text_units) parquet files in it.</li>
<li>Update your config as noted above to only run the workflows subset you need.</li>
<li>Run <code>graphrag index --root &lt;your project root&gt;</code></li>
@ -1631,7 +1631,7 @@
<li>relationship extraction: LLM is prompted to describe the relationship between each pair of entities in each text unit.</li>
<li>entity summarization: LLM is prompted to combine the descriptions for every instance of an entity found across the text units into a single summary.</li>
<li>relationship summarization: LLM is prompted to combine the descriptions for every instance of a relationship found across the text units into a single summary.</li>
<li>claim extraction (optiona): LLM is prompted to extract and describe claims from each text unit.</li>
<li>claim extraction (optional): LLM is prompted to extract and describe claims from each text unit.</li>
<li>community report generation: entity and relationship descriptions (and optionally claims) for each community are collected and used to prompt the LLM to generate a summary report.</li>
</ul>
<p><code>graphrag index --method standard</code>. This is the default method, so the method param can actually be omitted.</p>
@ -1642,7 +1642,7 @@
<li>relationship extraction: relationships are defined as text unit co-occurrence between entity pairs. There is no description.</li>
<li>entity summarization: not necessary.</li>
<li>relationship summarization: not necessary.</li>
<li>claim extraction (optiona): unused.</li>
<li>claim extraction (optional): unused.</li>
<li>community report generation: The direct text unit content containing each entity noun phrase is collected and used to prompt the LLM to generate a summary report.</li>
</ul>
<p><code>graphrag index --method fast</code></p>
@ -1652,7 +1652,7 @@
<p>This package requires SpaCy models to function correctly. If the required model is not installed, the package will automatically download and install it the first time it is used.</p>
<p>You can install it manually by running <code>python -m spacy download &lt;model_name&gt;</code>, for example <code>python -m spacy download en_core_web_md</code>.</p>
<h2 id="choosing-a-method">Choosing a Method</h2>
<p>Standard GraphRAG provides a rich description of real-world entities and relationships, but is more expensive than FastGraphRAG. We estimate graph extraction to constitute roughly 75% of indexing cost. FastGraphRAG is therefore much cheaper, but the tradeoff is that the extracted graph is less directly relevant for use outside of GraphRAG, and the graph tends to be quite a bit noisier. If high fidelity entities and graph exploration are important to your use case, we recommend staying with traditional GraphRAG. If your use case is primarily aimed at summary questions using global search, FastGraphRAG is a reasonable and cheaper alternative.</p>
<p>Standard GraphRAG provides a rich description of real-world entities and relationships, but is more expensive than FastGraphRAG. We estimate graph extraction to constitute roughly 75% of indexing cost. FastGraphRAG is therefore much cheaper, but the tradeoff is that the extracted graph is less directly relevant for use outside of GraphRAG, and the graph tends to be quite a bit noisier. If high fidelity entities and graph exploration are important to your use case, we recommend staying with traditional GraphRAG. If your use case is primarily aimed at summary questions using global search, FastGraphRAG provides high quality summarization at much less LLM cost.</p>
@ -1971,7 +1971,7 @@ We provide a means for you to do this by allowing you to specify a custom prompt
<p>Each of these prompts may be overridden by writing a custom prompt file in plaintext. We use token-replacements in the form of <code>{token_name}</code>, and the descriptions for the available tokens can be found below.</p>
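<p>To make the <code>{token_name}</code> convention concrete, here is a minimal sketch of how a custom prompt file's contents could be filled in. The <code>render_prompt</code> helper and the sample prompt text are hypothetical and only illustrate the idea; GraphRAG's own replacement logic may differ.</p>

```python
def render_prompt(template, tokens):
    """Fill {token_name} placeholders in a plaintext prompt.

    Plain string replacement is used instead of str.format so that any
    other braces in the prompt (for example, a JSON sample) are left alone.
    """
    rendered = template
    for name, value in tokens.items():
        rendered = rendered.replace("{" + name + "}", str(value))
    return rendered


# Hypothetical custom prompt using the {input_text} token described below.
custom_prompt = "Extract the entities mentioned in the text.\nText: {input_text}\n"
print(render_prompt(custom_prompt, {"input_text": "Alice met Bob in Paris."}))
```

<p>Tokens that are not supplied are left untouched in the output, which makes a misspelled token name easy to spot when reviewing the rendered prompt.</p>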
<h2 id="indexing-prompts">Indexing Prompts</h2>
<h3 id="entityrelationship-extraction">Entity/Relationship Extraction</h3>
<p><a href="http://github.com/microsoft/graphrag/blob/main/graphrag/prompts/index/entity_extraction.py">Prompt Source</a></p>
<p><a href="http://github.com/microsoft/graphrag/blob/main/graphrag/prompts/index/extract_graph.py">Prompt Source</a></p>
<h4 id="tokens">Tokens</h4>
<ul>
<li><strong>{input_text}</strong> - The input text to be processed.</li>
@ -1988,7 +1988,7 @@ We provide a means for you to do this by allowing you to specify a custom prompt
<li><strong>{description_list}</strong> - A list of descriptions for the entity or relationship.</li>
</ul>
<h3 id="claim-extraction">Claim Extraction</h3>
<p><a href="http://github.com/microsoft/graphrag/blob/main/graphrag/prompts/index/claim_extraction.py">Prompt Source</a></p>
<p><a href="http://github.com/microsoft/graphrag/blob/main/graphrag/prompts/index/extract_claims.py">Prompt Source</a></p>
<h4 id="tokens_2">Tokens</h4>
<ul>
<li><strong>{input_text}</strong> - The input text to be processed.</li>
File diff suppressed because one or more lines are too long
BIN
sitemap.xml.gz
Binary file not shown.