From 0c8b5221b4a0c41f08b8233a7e1285317402411e Mon Sep 17 00:00:00 2001
From: Venky <23023424+venkywonka@users.noreply.github.com>
Date: Mon, 9 Feb 2026 20:29:58 -0800
Subject: [PATCH] [TRTC-264][doc] Add CLAUDE.md and AGENTS.md (#11358)

Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
---
 .github/CODEOWNERS   |   2 +
 .gitignore           |   1 +
 AGENTS.md            | 148 +++++++++++++++++++++++++++++++++++++++++++
 CLAUDE.md            |   2 +
 CODING_GUIDELINES.md |  10 +++
 5 files changed, 163 insertions(+)
 create mode 100644 AGENTS.md
 create mode 100644 CLAUDE.md

diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
index bbfd95dca3..85e6f933b1 100644
--- a/.github/CODEOWNERS
+++ b/.github/CODEOWNERS
@@ -31,6 +31,8 @@
 /CONTAINER_SOURCE.md @NVIDIA/trt-llm-doc-owners
 /CONTRIBUTING.md @NVIDIA/trt-llm-doc-owners
 /README.md @NVIDIA/trt-llm-doc-owners
+/CLAUDE.md @NVIDIA/trt-llm-doc-owners
+/AGENTS.md @NVIDIA/trt-llm-doc-owners

 ## Examples
 /examples @NVIDIA/trt-llm-doc-owners
diff --git a/.gitignore b/.gitignore
index 1f1d89079d..a02853fef0 100644
--- a/.gitignore
+++ b/.gitignore
@@ -89,6 +89,7 @@ cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_cubin.cpp
 /examples/layer_wise_benchmarks/profiles/

 # User config files
+CLAUDE.local.md
 CMakeUserPresets.json
 compile_commands.json
 *.bin
diff --git a/AGENTS.md b/AGENTS.md
new file mode 100644
index 0000000000..fdbc4456f6
--- /dev/null
+++ b/AGENTS.md
@@ -0,0 +1,148 @@
# AGENTS.md

TensorRT-LLM: open-source library for optimized LLM inference on NVIDIA GPUs.
Python and C++ codebase supporting TensorRT engine-based and PyTorch-based execution paths.

> If a `CLAUDE.local.md` file exists alongside this file, read and respect it — it contains developer-specific overrides that supplement this shared guidance.

## Rules (Read First)

**CRITICAL (YOU MUST):**
- Read and follow `CODING_GUIDELINES.md` for ALL code changes (C++ and Python)
- NVIDIA copyright header on ALL new files (update the year on modified files)
- `git commit -s` (DCO sign-off required). Never attribute AI tools in the sign-off line
- `pre-commit` hooks run on commit — if files are modified by hooks, re-stage and commit again
- PR title format: `[JIRA/NVBUG/None][type] description` (e.g., `[TRTLLM-5516][perf] optimize cuda graph padding`)
- Python imports: `from package.subpackage import module` (never `from module import Class`)
- Set the `LLM_MODELS_ROOT` env var when running tests that need model weights

## Common Commands

| Task | Command |
|------|---------|
| Unit tests | `pytest tests/unittest/` |
| Specific test | `pytest tests/unittest/llmapi/test_llm_args.py` |
| Pattern match | `pytest tests/unittest -k "test_llm_args"` |
| Integration tests | `LLM_MODELS_ROOT=/path/to/models pytest tests/integration/defs/...` |
| Serve model | `trtllm-serve --model <model> --port 8000` |
| Serve with config | `trtllm-serve --model <model> --config config.yaml` |
| Benchmark | `trtllm-bench --model <model> --dataset_path <dataset>` |
| Find CI stage for test | `python scripts/test_to_stage_mapping.py --tests "test_name"` |

### Installation & Build

Building TensorRT-LLM requires Docker and may involve compiling C++ components.
See [build from source](docs/source/installation/build-from-source-linux.md) for full instructions,
or [pip install](docs/source/installation/linux.md) for pre-built wheels.
For container images, see [NGC containers](docs/source/installation/containers.md).

### Reference Configs

`examples/configs/database/` contains 170+ Pareto-optimized serving configurations
across multiple models, GPUs, ISL/OSL combinations, and concurrency levels.
Use these as starting points for deployment and benchmarking rather than hand-tuning parameters.
See [deployment guides](docs/source/deployment-guide/) for model-specific walkthroughs.

## Architecture

See [architecture diagram](.github/tava_architecture_diagram.md) for the full Mermaid diagram.

### Backends

| Backend | Status | Entry Point | Key Path |
|---------|--------|-------------|----------|
| **PyTorch** | Default | `LLM(backend="pytorch")` | `_torch/pyexecutor/` → `PyExecutor` → PyTorch Engine |
| **AutoDeploy** | Beta | `LLM(backend="_autodeploy")` | `_torch/auto_deploy/` → `ADExecutor` → graph transforms + torch.export |
| **TensorRT** | Legacy | `LLM(backend="tensorrt")` | `builder.py` → `trtllm.Executor` → TensorRT Engine |

### Shared C++ Core (via Nanobind)

Both the PyTorch and TensorRT backends share these C++ components:
- **Scheduling pipeline**: Scheduler → BatchManager (in-flight batching) → KV Cache Manager
- **Decoding pipeline**: Decoder (token generation orchestration) → Sampling

### Request Flow
```text
HuggingFace Model → LLM API → Executor (PyTorch/AutoDeploy/TensorRT)
  → Scheduler → Model Forward → Decoder → Sampling → Generated Tokens
```

### Serving
- `trtllm-serve`: OpenAI-compatible REST + gRPC server; supports all backends
- **Disaggregated serving**: separates prefill (context) and decode (generation) across GPUs
  - KV cache exchange via NIXL (default), UCX, or MPI

## Key Files

| File | Role |
|------|------|
| `tensorrt_llm/llmapi/llm.py` | Main API entry point |
| `tensorrt_llm/llmapi/llm_args.py` | Complete configuration schema (Pydantic) |
| `tensorrt_llm/llmapi/llm_utils.py` | Model loading, model-specific default overrides |
| `tensorrt_llm/models/modeling_utils.py` | Base classes for all models (`PretrainedConfig`, `PretrainedModel`) |
| `tensorrt_llm/executor/executor.py` | Execution abstraction (`GenerationExecutor`) |
| `tensorrt_llm/models/automodel.py` | Auto-discovery and model registry |

## Design Patterns

| Pattern | Key Points |
|---------|------------|
| **Config hierarchy** | `LlmArgs` → `TrtLlmArgs` / `TorchLlmArgs`, model-specific defaults override generics, Pydantic validation |
| **Model architecture** | Each model: `Config` (inherits `PretrainedConfig`) + `ForCausalLM` (inherits `PretrainedModel`) |
| **Model defaults** | Architecture-specific overrides in `llm_utils.py` (attention kernels, quant, spec decoding, cache) |
| **Distributed execution** | Tensor/pipeline parallelism via the `Mapping` class, multiple backends (MPI, Ray, RPC) |
| **Auto-discovery** | Models self-register via `automodel.py`, resolved by the HF config `architectures` field |

## Anti-Patterns / Gotchas

- **Pre-commit modifies files in place** — if hooks fail, files are already modified. Re-stage (`git add`) and commit again.
- **Protected APIs exist** — changes to LLM API signatures will fail the `tests/api_stability` tests. Get code-owner review.
- **Integration tests need GPUs + models** — always set `LLM_MODELS_ROOT` and ensure GPU access. Unit tests don't.
- **Copyright year** — update to the current year when modifying existing files; add the full header to new files.
- **Avoid broad exception handling** — catch specific exceptions, not bare `except:` (see `CODING_GUIDELINES.md`).
- **Python import style is enforced** — `from package.subpackage import module`, never `from module import Class`. Pre-commit will not catch this.
- **One concern per PR** — avoid scope creep. If a PR touches unrelated areas, split it.

## Development Workflow

1. Set up the build environment (see [installation docs](docs/source/installation/))
2. Make changes following `CODING_GUIDELINES.md`
3. Test locally with `pytest`
4. Submit a PR:
   - PR title format: `[JIRA/NVBUG/None][type] description` (e.g., `[TRTLLM-5516][perf] optimize cuda graph padding`)
   - Sign commits with DCO (`git commit -s`)
   - Target `main` unless fixing a release-branch bug
   - See `CONTRIBUTING.md` for full PR policies

## CI / Testing

See the [CI overview](docs/source/developer-guide/ci-overview.md) for full details.

| Layer | Location | Notes |
|-------|----------|-------|
| Unit tests | `tests/unittest/` | Run in pre-merge CI; some tests require a GPU |
| API stability | `tests/api_stability/` | Protects committed API signatures |
| Integration tests | `tests/integration/defs/` | Requires GPU + `LLM_MODELS_ROOT` |
| Test lists | `tests/integration/test_lists/test-db/` | Per-GPU YAML files (`l0_a10.yml`, `l0_h100.yml`, etc.) |
| Test waivers | `tests/integration/test_lists/waives.txt` | Skip known-failing tests with NVBug links |
| Performance | See the [benchmarking guide](docs/source/developer-guide/perf-benchmarking.md) | `trtllm-bench` and `trtllm-serve` benchmarks |

## Key Documentation

| Topic | Path |
|-------|------|
| Architecture overview | `docs/source/developer-guide/overview.md` |
| PyTorch backend | `docs/source/torch/arch_overview.md` |
| Adding a new model | `docs/source/torch/adding_new_model.md` |
| AutoDeploy | `docs/source/features/auto_deploy/auto-deploy.md` |
| Disaggregated serving | `docs/source/features/disagg-serving.md` |
| Speculative decoding | `docs/source/features/speculative-decoding.md` |
| Quantization | `docs/source/features/quantization.md` |
| Parallelism strategies | `docs/source/features/parallel-strategy.md` |
| KV cache | `docs/source/features/kvcache.md` |
| API change guidelines | `docs/source/developer-guide/api-change.md` |
| Feature compatibility matrix | `docs/source/features/feature-combination-matrix.md` |
| Supported models | `docs/source/models/supported-models.md` |
| Deployment guides | `docs/source/deployment-guide/` |
| Examples & customization | `docs/source/examples/` |
| Performance analysis | `docs/source/developer-guide/perf-analysis.md` |
diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000000..40c6cf85d5
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,2 @@
+# In ./CLAUDE.md
+@AGENTS.md
diff --git a/CODING_GUIDELINES.md b/CODING_GUIDELINES.md
index 70f0c1bfbe..f9302ff2a7 100644
--- a/CODING_GUIDELINES.md
+++ b/CODING_GUIDELINES.md
@@ -495,6 +495,16 @@ else:
   - Example: `trtllm-serve --model <model> --config config.yaml` (preferred)
   - Avoid: `trtllm-serve --model <model> --extra_llm_api_options config.yaml`

+## AI Coding Agent Guidance
+
+This repository includes configuration files for AI coding agents (Claude Code, Cursor, Codex, Copilot, etc.):
+
+- **`AGENTS.md`** — Shared project context, rules, architecture pointers, and commands. Checked into git.
+- **`CLAUDE.md`** — Simple `@AGENTS.md` import indirection for Claude Code.
+- **`CLAUDE.local.md`** — Personal developer overrides (gitignored). Create this file for your own preferences, local paths, or domain-specific context without affecting the shared config.
+
+**Keeping `AGENTS.md` up to date**: If you change workflows, commands, architecture, or conventions that would benefit all developers and AI agents, update `AGENTS.md` in the same PR. It should evolve at the pace of the code.
+
 ## NVIDIA Copyright

 1. All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification. The following block of text should be prepended to the top of all files. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
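
---

Reviewer note: the "Auto-discovery" pattern that the new AGENTS.md points at (models self-registering and being resolved from the HF config `architectures` field) can be illustrated with a small, self-contained sketch. All names here (`MODEL_REGISTRY`, `register`, `from_hf_config`) are hypothetical stand-ins, not the actual `automodel.py` API:

```python
# Illustrative sketch only — hypothetical names, not the real automodel.py API.
# Shows the self-registration pattern AGENTS.md describes: each model class
# registers itself under its HF "architectures" name, and a factory resolves
# the class from a config dict.

MODEL_REGISTRY: dict[str, type] = {}


def register(architecture: str):
    """Class decorator that records a model class under an architecture name."""
    def wrap(cls):
        MODEL_REGISTRY[architecture] = cls
        return cls
    return wrap


@register("LlamaForCausalLM")
class LlamaForCausalLM:
    def __init__(self, config: dict):
        self.config = config


def from_hf_config(hf_config: dict):
    """Resolve a registered model class from the HF config's 'architectures' field."""
    arch = hf_config["architectures"][0]
    try:
        cls = MODEL_REGISTRY[arch]
    except KeyError:
        raise ValueError(f"Unsupported architecture: {arch}") from None
    return cls(hf_config)


model = from_hf_config({"architectures": ["LlamaForCausalLM"], "hidden_size": 4096})
print(type(model).__name__)  # → LlamaForCausalLM
```

In the real codebase the registry lives in `tensorrt_llm/models/automodel.py` and the model base classes come from `modeling_utils.py`; the sketch only captures the shape of the lookup, which is why unsupported `architectures` values fail fast with an explicit error.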