graphrag/tests/unit/indexing/operations/chunk_text/test_chunk_text.py
Copilot e84df28e64
Improve internal logging functionality by using Python's standard logging module (#1956)
* Initial plan for issue

* Implement standard logging module and integrate with existing loggers

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Add test cases and improve documentation for standard logging

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Apply ruff formatting and add semversioner file for logging improvements

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Remove custom logger classes and refactor to use standard logging only

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Apply ruff formatting to resolve CI/CD test failures

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Add semversioner file and fix linting issues

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* ruff fixes

* fix spelling error

* Remove StandardProgressLogger and refactor to use standard logging

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Remove LoggerFactory and custom loggers, refactor to use standard logging

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Fix pyright error: use logger.info() instead of calling logger as function in cosmosdb_pipeline_storage.py

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* ruff fixes

* Remove deprecated logger files that were marked as deprecated placeholders

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Replace custom get_logger with standard Python logging

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Fix linting issues found by ruff check --fix

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* apply ruff check fixes

* add word to dictionary

* Fix type checker error in ModelManager.__new__ method

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Refactor multiple logging.getLogger() calls to use single logger per file

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Remove progress_logger parameter from build_index() and logger parameter from generate_indexing_prompts()

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Remove logger parameter from run_pipeline and standardize logger naming

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Replace logger parameter with log_level parameter in CLI commands

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Fix import ordering in notebook files to pass poetry poe check

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Remove --logger parameter from smoke test command

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Fix Windows CI/CD issue with log file cleanup in tests

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Add StreamHandler to root logger in __main__.py for CLI logging

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Only add StreamHandler if root logger doesn't have existing StreamHandler

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Fix import ordering in notebook files to pass ruff checks

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Replace logging.StreamHandler with colorlog.StreamHandler for colorized log output

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Regenerate poetry.lock file after adding colorlog dependency

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Fix import ordering in notebook files to pass ruff checks

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* move printing of dataframes to debug level

* remove colorlog for now

* Refactor workflow callbacks to inherit from logging.Handler

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Fix linting issues in workflow callback handlers

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Fix pyright type errors in blob and file workflow callbacks

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Refactor pipeline logging to use pure logging.Handler subclasses

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Rename workflow callback classes to workflow logger classes and move to logger directory

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* update dictionary

* apply ruff fixes

* fix function name

* simplify logger code

* update

* Remove error, warning, and log methods from WorkflowCallbacks and replace with standard logging

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* ruff fixes

* Fix pyright errors by removing WorkflowCallbacks from strategy type signatures

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Remove ConsoleWorkflowLogger and apply consistent formatter to all handlers

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* apply ruff fixes

* Refactor pipeline_logger.py to use standard FileHandler and remove FileWorkflowLogger

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Remove conditional azure import checks from blob_workflow_logger.py

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Fix pyright type checking errors in mock_provider.py and utils.py

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Run ruff check --fix to fix import ordering in notebooks

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Merge configure_logging and create_pipeline_logger into init_loggers function

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* Remove configure_logging and create_pipeline_logger functions, replace all usage with init_loggers

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* apply ruff fixes

* cleanup unused code

* Update init_loggers to accept GraphRagConfig instead of ReportingConfig

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* apply ruff check fixes

* Fix test failures by providing valid GraphRagConfig with required model configurations

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* apply ruff fixes

* remove logging_workflow_callback

* cleanup logging messages

* Add logging to track progress of pandas DataFrame apply operation in create_base_text_units

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* cleanup logger logic throughout codebase

* update

* more cleanup of old loggers

* small logger cleanup

* final code cleanup and added loggers to query

* add verbose logging to query

* minor code cleanup

* Fix broken unit tests for chunk_text and standard_logging

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* apply ruff fixes

* Fix test_chunk_text by mocking progress_ticker function instead of ProgressTicker class

Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>

* remove unnecessary logger

* remove rich and fix type annotation

* revert test formatting changes made by copilot

* promote graphrag logs to root logger

* add correct semversioner file

* revert change to file

* revert formatting changes that have no effect

* fix changes after merge with main

* revert unnecessary copilot changes

* remove whitespace

* cleanup docstring

* simplify some logic with less code

* update poetry lock file

* ruff fixes

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: jgbradley1 <654554+jgbradley1@users.noreply.github.com>
Co-authored-by: Josh Bradley <joshbradley@microsoft.com>
2025-07-09 18:29:03 -06:00
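
Two items from the commit message above, "use single logger per file" and "Only add StreamHandler if root logger doesn't have existing StreamHandler", can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the code from the commit: `ensure_stream_handler` and the logger name `chunk_text_demo` are hypothetical names introduced here, and the format string is only an example.

```python
import logging
import sys

# One module-level logger per file, named after the module.
logger = logging.getLogger(__name__)


def ensure_stream_handler(target: logging.Logger, level: int = logging.INFO) -> bool:
    """Attach a console handler to `target` unless it already has one.

    Hypothetical helper for illustration. Returns True if a handler was added.
    """
    for handler in target.handlers:
        # FileHandler subclasses StreamHandler, so exclude it explicitly.
        if isinstance(handler, logging.StreamHandler) and not isinstance(
            handler, logging.FileHandler
        ):
            return False
    console = logging.StreamHandler(sys.stderr)
    console.setFormatter(
        logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
    )
    target.addHandler(console)
    target.setLevel(level)
    return True


# Usage: the second call is a no-op, so repeated setup does not duplicate output.
demo_logger = logging.getLogger("chunk_text_demo")  # hypothetical logger name
ensure_stream_handler(demo_logger)
ensure_stream_handler(demo_logger)
demo_logger.info("console handler attached once")
```

Guarding on existing handlers keeps repeated initialization (for example, from several CLI entry points) from printing each record more than once.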

# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License

from unittest import mock
from unittest.mock import ANY, Mock

import pandas as pd
import pytest

from graphrag.config.enums import ChunkStrategyType
from graphrag.index.operations.chunk_text.chunk_text import (
    _get_num_total,
    chunk_text,
    load_strategy,
    run_strategy,
)
from graphrag.index.operations.chunk_text.typing import (
    TextChunk,
)


def test_get_num_total_default():
    output = pd.DataFrame({"column": ["a", "b", "c"]})

    total = _get_num_total(output, "column")
    assert total == 3


def test_get_num_total_array():
    output = pd.DataFrame({"column": [["a", "b", "c"], ["x", "y"]]})

    total = _get_num_total(output, "column")
    assert total == 5


def test_load_strategy_tokens():
    strategy_type = ChunkStrategyType.tokens

    strategy_loaded = load_strategy(strategy_type)

    assert strategy_loaded.__name__ == "run_tokens"


def test_load_strategy_sentence():
    strategy_type = ChunkStrategyType.sentence

    strategy_loaded = load_strategy(strategy_type)

    assert strategy_loaded.__name__ == "run_sentences"


def test_load_strategy_none():
    strategy_type = ChunkStrategyType

    with pytest.raises(
        ValueError, match="Unknown strategy: <enum 'ChunkStrategyType'>"
    ):
        load_strategy(strategy_type)  # type: ignore


def test_run_strategy_str():
    input = "text test for run strategy"
    config = Mock()
    tick = Mock()
    strategy_mocked = Mock()

    strategy_mocked.return_value = [
        TextChunk(
            text_chunk="text test for run strategy",
            source_doc_indices=[0],
        )
    ]

    runned = run_strategy(strategy_mocked, input, config, tick)
    assert runned == ["text test for run strategy"]


def test_run_strategy_arr_str():
    input = ["text test for run strategy", "use for strategy"]
    config = Mock()
    tick = Mock()
    strategy_mocked = Mock()

    strategy_mocked.return_value = [
        TextChunk(
            text_chunk="text test for run strategy", source_doc_indices=[0], n_tokens=5
        ),
        TextChunk(text_chunk="use for strategy", source_doc_indices=[1], n_tokens=3),
    ]

    expected = [
        "text test for run strategy",
        "use for strategy",
    ]

    runned = run_strategy(strategy_mocked, input, config, tick)
    assert runned == expected


def test_run_strategy_arr_tuple():
    input = [("text test for run strategy", "3"), ("use for strategy", "5")]
    config = Mock()
    tick = Mock()
    strategy_mocked = Mock()

    strategy_mocked.return_value = [
        TextChunk(
            text_chunk="text test for run strategy", source_doc_indices=[0], n_tokens=5
        ),
        TextChunk(text_chunk="use for strategy", source_doc_indices=[1], n_tokens=3),
    ]

    expected = [
        (
            ["text test for run strategy"],
            "text test for run strategy",
            5,
        ),
        (
            ["use for strategy"],
            "use for strategy",
            3,
        ),
    ]

    runned = run_strategy(strategy_mocked, input, config, tick)
    assert runned == expected


def test_run_strategy_arr_tuple_same_doc():
    input = [("text test for run strategy", "3"), ("use for strategy", "5")]
    config = Mock()
    tick = Mock()
    strategy_mocked = Mock()

    strategy_mocked.return_value = [
        TextChunk(
            text_chunk="text test for run strategy", source_doc_indices=[0], n_tokens=5
        ),
        TextChunk(text_chunk="use for strategy", source_doc_indices=[0], n_tokens=3),
    ]

    expected = [
        (
            ["text test for run strategy"],
            "text test for run strategy",
            5,
        ),
        (
            ["text test for run strategy"],
            "use for strategy",
            3,
        ),
    ]

    runned = run_strategy(strategy_mocked, input, config, tick)
    assert runned == expected


@mock.patch("graphrag.index.operations.chunk_text.chunk_text.load_strategy")
@mock.patch("graphrag.index.operations.chunk_text.chunk_text.run_strategy")
@mock.patch("graphrag.index.operations.chunk_text.chunk_text.progress_ticker")
def test_chunk_text(mock_progress_ticker, mock_run_strategy, mock_load_strategy):
    input_data = pd.DataFrame({"name": ["The Shining"]})
    column = "name"
    size = 10
    overlap = 2
    encoding_model = "model"
    strategy = ChunkStrategyType.sentence
    callbacks = Mock()
    callbacks.progress = Mock()

    mock_load_strategy.return_value = Mock()
    mock_progress_ticker.return_value = Mock()

    chunk_text(input_data, column, size, overlap, encoding_model, strategy, callbacks)

    mock_run_strategy.assert_called_with(
        mock_load_strategy(), "The Shining", ANY, mock_progress_ticker.return_value
    )
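
The argument order of `test_chunk_text` follows from how stacked `mock.patch` decorators are applied: the decorator closest to the function is applied first and supplies the first mock argument, which is why `mock_progress_ticker` comes first and `mock_load_strategy` last. The sketch below shows the same rule with standard-library targets only; `demo` and the patched `os` functions are chosen purely for illustration and are not part of the test file.

```python
import os
from unittest import mock


@mock.patch("os.getcwd")  # outermost decorator: supplies the last mock argument
@mock.patch("os.getpid")  # innermost decorator: supplies the first mock argument
def demo(mock_getpid, mock_getcwd):
    # Each mock replaces the real function only while demo() is running.
    mock_getpid.return_value = 1234
    mock_getcwd.return_value = "/tmp/demo"
    return os.getpid(), os.getcwd()


result = demo()
print(result)  # (1234, '/tmp/demo')
```

Because the patches are undone when the decorated function returns, `os.getpid()` and `os.getcwd()` behave normally again afterwards.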