* Update input factory to match other factories
* Move input config alongside input readers
* Move file pattern logic into InputReader
* Set encoding default
* Clean up optional column configs
* Combine structured data extraction
* Remove pandas from input loading
* Throw if empty documents
* Add json lines (jsonl) input support
* Store raw data
* Fix merge imports
* Move metadata handling entirely to chunking
* Nicer automatic title
* Typo
* Add get_property utility for nested dictionary access with dot notation
* Update structured_file_reader to use get_property utility
* Extract input module into new graphrag-input monorepo package
  - Create new graphrag-input package with input loading utilities
  - Move InputConfig, InputFileType, InputReader, TextDocument, and file readers (CSV, JSON, JSONL, Text)
  - Add get_property utility for nested dictionary access with dot notation
  - Include hashing utility for document ID generation
  - Update all imports throughout codebase to use graphrag_input
  - Add package to workspace configuration and release tasks
  - Remove old graphrag.index.input module
* Rename ChunkResult to TextChunk and add transformer support
  - Rename chunk_result.py to text_chunk.py with ChunkResult -> TextChunk
  - Add 'original' field to TextChunk to track pre-transform text
  - Add optional transform callback to chunker.chunk() method
  - Add add_metadata transformer for prepending metadata to chunks
  - Update create_chunk_results to apply transforms and populate original
  - Update sentence_chunker and token_chunker with transform support
  - Refactor create_base_text_units to use new transformer pattern
  - Rename pluck_metadata to get/collect methods on TextDocument
* Back-compat comment
* Align input config type name with other factory configs
* Add MarkItDown support
* Remove pattern default from MarkItDown reader
* Remove plugins flag (implicitly disabled)
* Format
* Update verb tests
* Separate storage from input config
* Add empty objects for NaN raw_data
* Fix smoke tests
* Fix BOM in csv smoke
* Format
* Delete NoopTextSplitter
* Delete unused check_token_limit
* Add base chunking factory and migrate workflow to use it
* Split apart chunker module
* Co-locate chunking/splitting
* Collapse token splitting functionality into one class/function
* Restore create_base_text_units parameterization
* Move Tokenizer base class to common package
* Move pre-pending into chunkers
* Streamline config
* Fix defaults construction
* Add prepending tests
* Remove chunk_size_includes_metadata config
* Revert ChunkingDocument interface
* Move metadata prepending to a util
* Move Tokenizer back to GR core
* Fix tokenizer removal from chunker
* Set defaults for chunking config
* Move chunking to monorepo package
* Format
* Typo
* Add ChunkResult model
* Streamline chunking config
* Add missing version updates for graphrag_chunking
* Simplify Factory interface
* Migrate CacheFactory to standard base class
* Migrate LoggerFactory to standard base class
* Migrate StorageFactory to standard base class
* Migrate VectorStoreFactory to standard base class
* Update vector store example notebook
* Delete notebook outputs
* Move default providers into factories
* Move retry/limit tests into integ
* Split language model factories
* Set smoke test tpm/rpm
* Fix factory integ tests
* Add method to smoke test, switch text to 'fast'
* Fix text smoke config for fast workflow
* Add new workflows to text smoke test
* Convert input readers to a proper factory
* Remove covariates from fast smoke test
* Update docs for input factory
* Bump smoke runtime
* Even longer runtime
* min-csv timeout
* Remove unnecessary lambdas
* Add basic search to overview
* Add info on input documents DataFrame
* Add info on factories to docs
* Add consumption warning and switch to "christmas" for folder name
* Add logger to factories list
* Add LiteLLM docs (#2058)
* Fix version for input docs
* Spelling
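
The get_property utility listed above is described only as "nested dictionary access with dot notation". A minimal sketch of that idea follows, for illustration; the signature, the `default` parameter, and the KeyError behavior are assumptions and may not match the actual helper in graphrag-input.

```python
from typing import Any

# Sentinel so that None can be passed as a legitimate default value.
_MISSING = object()


def get_property(data: dict[str, Any], path: str, default: Any = _MISSING) -> Any:
    """Look up a nested value with a dot-separated path such as "a.b.c".

    Hypothetical signature: the real utility may differ.
    """
    current: Any = data
    for key in path.split("."):
        # Stop as soon as the path leaves the dictionary structure.
        if not isinstance(current, dict) or key not in current:
            if default is _MISSING:
                raise KeyError(path)
            return default
        current = current[key]
    return current


record = {"title": "Report", "meta": {"author": {"name": "Ada"}}}
author_name = get_property(record, "meta.author.name")
editor_name = get_property(record, "meta.editor.name", default=None)
```

A structured file reader could use a helper like this to pull a configured column (for example a nested title or text field) out of each JSON record.
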
---------
Co-authored-by: Derek Worthen <worthend.derek@gmail.com>
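
The TextChunk changes above (an 'original' field, an optional transform callback, and an add_metadata transformer) can be sketched roughly as below. This is an illustration under stated assumptions: the field names other than `original`, the `chunk` function, and the `add_metadata` parameters are hypothetical, and a fixed-size character splitter stands in for the real sentence and token chunkers.

```python
from __future__ import annotations

from collections.abc import Callable
from dataclasses import dataclass


@dataclass
class TextChunk:
    """Hypothetical stand-in for the renamed chunk model."""

    original: str  # text before any transform was applied
    text: str  # text after the transform (same as original if none)
    index: int  # position of the chunk within the document


def add_metadata(
    metadata: dict[str, str], delimiter: str = ": ", line_delimiter: str = "\n"
) -> Callable[[str], str]:
    """Build a transform that prepends "key: value" lines to chunk text."""
    header = line_delimiter.join(f"{k}{delimiter}{v}" for k, v in metadata.items())

    def transform(text: str) -> str:
        return f"{header}{line_delimiter}{text}" if header else text

    return transform


def chunk(
    text: str, size: int, transform: Callable[[str], str] | None = None
) -> list[TextChunk]:
    """Split text into fixed-size character windows (assumed splitter),
    applying the optional transform while keeping the untransformed text."""
    pieces = [text[i : i + size] for i in range(0, len(text), size)]
    return [
        TextChunk(original=p, text=transform(p) if transform else p, index=i)
        for i, p in enumerate(pieces)
    ]


chunks = chunk("abcdefgh", size=4, transform=add_metadata({"title": "Demo"}))
```

Keeping `original` alongside the transformed text means downstream steps can still recover the raw chunk after metadata has been prepended.
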
* Remove text unit group_by_columns
* Semver
* Fix default token split test
* Fix models in config test samples
* Fix token length in context sort test
* Fix document sort
* Add models page
* Update config docs for new params
* Spelling
* Add comment on CoT with o-series
* Add notes about managed identity
* Update the viz guide
* Spruce up the getting started wording
* Capitalization
* Add BYOG page
* More BYOG edits
* Update dictionary
* Change example model name
* Fix footer contrast
* Fix broken links
* Remove a few unneeded examples
* Point python API example to the whole folder
* Convert schema bullets to tables
* Add source documents for verb tests
* Remove erroneous entity_type column
* Add new test data
* Remove source/target degree columns
* Remove top_level_node_id
* Remove chunk column configs
* Rename "chunk" to "text"
* Rename "chunk" to "text" in base
* Re-map document input to use base text units
* Revert base text units as final documents dependency
* Update test data
* Split/rename node source_id
* Drop node size (dup of degree)
* Drop document_ids from covariates
* Remove unused document_ids from models
* Remove n_tokens from covariate table
* Fix missed document_ids delete
* Wire base text units to final documents
* Rename relationship rank as combined_degree
* Add rank as first-class property to Relationship
* Remove split_text operation
* Fix relationships test parquet
* Update test parquets
* Add entity ids to community table
* Remove stored graph embedding columns
* Format
* Semver
* Fix JSON typo
* Spelling
* Rename lancedb
* Sort lancedb
* Fix unit test
* Fix test to account for changing period
* Update tests for separate embeddings
* Format
* Better assertion printing
* Fix unit test for windows
* Rename document.raw_content -> document.text
* Remove read_documents function
* Remove unused document summary from model
* Remove unused imports
* Format
* Add new snapshots to default init
* Use util to construct embeddings collection name
* Align incremental index model with branch changes
* Update data and tests for int ids
* Clean up embedding locs
* Switch entity "name" to "title" for consistency
* Fix short_id -> human_readable_id defaults
* Format
* Rework community IDs
* Fix community size compute
* Fix unit tests
* Fix report read
* Pare down nodes table output
* Fix unit test
* Fix merge
* Fix community loading
* Format
* Fix community id report extraction
* Update tests
* Consistent short IDs and ordering
* Update ordering and tests
* Update incremental for new nodes model
* Guard document columns loc
* Match column ordering
* Fix document guard
* Update smoke tests
* Fill NA on community extract
* Logging for smoke test debug
* Add parquet schema details doc
* Fix community hierarchy guard
* Use better empty hierarchy guard
* Back-compat shims
* Semver
* Fix warning
* Format
* Remove default fallback
* Reuse key
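
Several entries above touch degree columns (removing source/target degree, renaming relationship rank to combined_degree). The log does not define combined_degree; the sketch below assumes it is the sum of the degrees of a relationship's two endpoints. The function name and the edge representation are hypothetical.

```python
from collections import Counter


def combined_degrees(edges: list[tuple[str, str]]) -> dict[tuple[str, str], int]:
    """Return, per edge, the sum of the degrees of its two endpoints.

    Assumption: this is what "combined_degree" means; not confirmed by the log.
    """
    degree: Counter[str] = Counter()
    for source, target in edges:
        # Each edge contributes one to the degree of both endpoints.
        degree[source] += 1
        degree[target] += 1
    return {(s, t): degree[s] + degree[t] for s, t in edges}


edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
result = combined_degrees(edges)
```

Under this reading, separate source and target degree columns are redundant because both can be recomputed from node degree.
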