mirror of
https://github.com/microsoft/graphrag.git
synced 2026-02-19 01:05:50 +08:00
* Initial plan for issue * Implement standard logging module and integrate with existing loggers Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * Add test cases and improve documentation for standard logging Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * Apply ruff formatting and add semversioner file for logging improvements Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * Remove custom logger classes and refactor to use standard logging only Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * Apply ruff formatting to resolve CI/CD test failures Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * Add semversioner file and fix linting issues Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * ruff fixes * fix spelling error * Remove StandardProgressLogger and refactor to use standard logging Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * Remove LoggerFactory and custom loggers, refactor to use standard logging Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * Fix pyright error: use logger.info() instead of calling logger as function in cosmosdb_pipeline_storage.py Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * ruff fixes * Remove deprecated logger files that were marked as deprecated placeholders Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * Replace custom get_logger with standard Python logging Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * Fix linting issues found by ruff check --fix Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * apply ruff check fixes * add word to dictionary * Fix type checker error in ModelManager.__new__ method Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * Refactor multiple logging.getLogger() calls to use single 
logger per file Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * Remove progress_logger parameter from build_index() and logger parameter from generate_indexing_prompts() Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * Remove logger parameter from run_pipeline and standardize logger naming Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * Replace logger parameter with log_level parameter in CLI commands Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * Fix import ordering in notebook files to pass poetry poe check Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * Remove --logger parameter from smoke test command Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * Fix Windows CI/CD issue with log file cleanup in tests Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * Add StreamHandler to root logger in __main__.py for CLI logging Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * Only add StreamHandler if root logger doesn't have existing StreamHandler Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * Fix import ordering in notebook files to pass ruff checks Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * Replace logging.StreamHandler with colorlog.StreamHandler for colorized log output Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * Regenerate poetry.lock file after adding colorlog dependency Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * Fix import ordering in notebook files to pass ruff checks Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * move printing of dataframes to debug level * remove colorlog for now * Refactor workflow callbacks to inherit from logging.Handler Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * Fix 
linting issues in workflow callback handlers Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * Fix pyright type errors in blob and file workflow callbacks Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * Refactor pipeline logging to use pure logging.Handler subclasses Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * Rename workflow callback classes to workflow logger classes and move to logger directory Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * update dictionary * apply ruff fixes * fix function name * simplify logger code * update * Remove error, warning, and log methods from WorkflowCallbacks and replace with standard logging Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * ruff fixes * Fix pyright errors by removing WorkflowCallbacks from strategy type signatures Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * Remove ConsoleWorkflowLogger and apply consistent formatter to all handlers Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * apply ruff fixes * Refactor pipeline_logger.py to use standard FileHandler and remove FileWorkflowLogger Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * Remove conditional azure import checks from blob_workflow_logger.py Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * Fix pyright type checking errors in mock_provider.py and utils.py Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * Run ruff check --fix to fix import ordering in notebooks Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * Merge configure_logging and create_pipeline_logger into init_loggers function Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * Remove configure_logging and create_pipeline_logger functions, replace all usage with init_loggers Co-authored-by: 
jgbradley1 <654554+jgbradley1@users.noreply.github.com> * apply ruff fixes * cleanup unused code * Update init_loggers to accept GraphRagConfig instead of ReportingConfig Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * apply ruff check fixes * Fix test failures by providing valid GraphRagConfig with required model configurations Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * apply ruff fixes * remove logging_workflow_callback * cleanup logging messages * Add logging to track progress of pandas DataFrame apply operation in create_base_text_units Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * cleanup logger logic throughout codebase * update * more cleanup of old loggers * small logger cleanup * final code cleanup and added loggers to query * add verbose logging to query * minor code cleanup * Fix broken unit tests for chunk_text and standard_logging Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * apply ruff fixes * Fix test_chunk_text by mocking progress_ticker function instead of ProgressTicker class Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> * remove unnecessary logger * remove rich and fix type annotation * revert test formatting changes my by copilot * promote graphrag logs to root logger * add correct semversioner file * revert change to file * revert formatting changes that have no effect * fix changes after merge with main * revert unnecessary copilot changes * remove whitespace * cleanup docstring * simplify some logic with less code * update poetry lock file * ruff fixes --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com> Co-authored-by: Josh Bradley <joshbradley@microsoft.com>
182 lines
4.7 KiB
Python
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License

from unittest import mock
from unittest.mock import ANY, Mock

import pandas as pd
import pytest

from graphrag.config.enums import ChunkStrategyType
from graphrag.index.operations.chunk_text.chunk_text import (
    _get_num_total,
    chunk_text,
    load_strategy,
    run_strategy,
)
from graphrag.index.operations.chunk_text.typing import (
    TextChunk,
)


def test_get_num_total_default():
    output = pd.DataFrame({"column": ["a", "b", "c"]})

    total = _get_num_total(output, "column")
    assert total == 3


def test_get_num_total_array():
    output = pd.DataFrame({"column": [["a", "b", "c"], ["x", "y"]]})

    total = _get_num_total(output, "column")
    assert total == 5


def test_load_strategy_tokens():
    strategy_type = ChunkStrategyType.tokens

    strategy_loaded = load_strategy(strategy_type)

    assert strategy_loaded.__name__ == "run_tokens"


def test_load_strategy_sentence():
    strategy_type = ChunkStrategyType.sentence

    strategy_loaded = load_strategy(strategy_type)

    assert strategy_loaded.__name__ == "run_sentences"


def test_load_strategy_none():
    # Pass the enum class itself (not a member) so that no strategy matches.
    strategy_type = ChunkStrategyType

    with pytest.raises(
        ValueError, match="Unknown strategy: <enum 'ChunkStrategyType'>"
    ):
        load_strategy(strategy_type)  # type: ignore


def test_run_strategy_str():
    input = "text test for run strategy"
    config = Mock()
    tick = Mock()
    strategy_mocked = Mock()

    strategy_mocked.return_value = [
        TextChunk(
            text_chunk="text test for run strategy",
            source_doc_indices=[0],
        )
    ]

    runned = run_strategy(strategy_mocked, input, config, tick)
    assert runned == ["text test for run strategy"]


def test_run_strategy_arr_str():
    input = ["text test for run strategy", "use for strategy"]
    config = Mock()
    tick = Mock()
    strategy_mocked = Mock()

    strategy_mocked.return_value = [
        TextChunk(
            text_chunk="text test for run strategy", source_doc_indices=[0], n_tokens=5
        ),
        TextChunk(text_chunk="use for strategy", source_doc_indices=[1], n_tokens=3),
    ]

    expected = [
        "text test for run strategy",
        "use for strategy",
    ]

    runned = run_strategy(strategy_mocked, input, config, tick)
    assert runned == expected


def test_run_strategy_arr_tuple():
    input = [("text test for run strategy", "3"), ("use for strategy", "5")]
    config = Mock()
    tick = Mock()
    strategy_mocked = Mock()

    strategy_mocked.return_value = [
        TextChunk(
            text_chunk="text test for run strategy", source_doc_indices=[0], n_tokens=5
        ),
        TextChunk(text_chunk="use for strategy", source_doc_indices=[1], n_tokens=3),
    ]

    expected = [
        (
            ["text test for run strategy"],
            "text test for run strategy",
            5,
        ),
        (
            ["use for strategy"],
            "use for strategy",
            3,
        ),
    ]

    runned = run_strategy(strategy_mocked, input, config, tick)
    assert runned == expected


def test_run_strategy_arr_tuple_same_doc():
    input = [("text test for run strategy", "3"), ("use for strategy", "5")]
    config = Mock()
    tick = Mock()
    strategy_mocked = Mock()

    strategy_mocked.return_value = [
        TextChunk(
            text_chunk="text test for run strategy", source_doc_indices=[0], n_tokens=5
        ),
        TextChunk(text_chunk="use for strategy", source_doc_indices=[0], n_tokens=3),
    ]

    expected = [
        (
            ["text test for run strategy"],
            "text test for run strategy",
            5,
        ),
        (
            ["text test for run strategy"],
            "use for strategy",
            3,
        ),
    ]

    runned = run_strategy(strategy_mocked, input, config, tick)
    assert runned == expected


# Stacked patches are applied bottom-up, so the mock arguments arrive in the
# reverse of the decorator order.
@mock.patch("graphrag.index.operations.chunk_text.chunk_text.load_strategy")
@mock.patch("graphrag.index.operations.chunk_text.chunk_text.run_strategy")
@mock.patch("graphrag.index.operations.chunk_text.chunk_text.progress_ticker")
def test_chunk_text(mock_progress_ticker, mock_run_strategy, mock_load_strategy):
    input_data = pd.DataFrame({"name": ["The Shining"]})
    column = "name"
    size = 10
    overlap = 2
    encoding_model = "model"
    strategy = ChunkStrategyType.sentence
    callbacks = Mock()
    callbacks.progress = Mock()

    mock_load_strategy.return_value = Mock()
    mock_progress_ticker.return_value = Mock()

    chunk_text(input_data, column, size, overlap, encoding_model, strategy, callbacks)

    mock_run_strategy.assert_called_with(
        mock_load_strategy(), "The Shining", ANY, mock_progress_ticker.return_value
    )