[None][feat] add fp4 gemm + allreduce (#9729)

Signed-off-by: benzh Signed-off-by: benzh-2025
[None][test] Unwaive qwen3 next test case. (#9877)
2026-01-13 22:18:36 +08:00 · 2026-01-13 21:11:13 +08:00 · 2026-01-13 20:42:31 +08:00 · 2026-01-13 19:17:03 +08:00 · 2026-01-13 12:01:20 +01:00 · 2026-01-13 04:31:27 -05:00
6693 changed files with 280896 additions and 96315 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -12,3 +12,5 @@ tests/integration/test_input_files/*.jpg filter=lfs diff=lfs merge=lfs -text
 docs/source/blogs/media/tech_blog10_baseline_performance_detail.png filter=lfs diff=lfs merge=lfs -text
 docs/source/blogs/media/tech_blog10_full_strategy_performance.png filter=lfs diff=lfs merge=lfs -text
 docs/source/blogs/media/tech_blog10_context_wait_performance.png  filter=lfs diff=lfs merge=lfs -text
+cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/kernelMetaInfo_cubin.cpp filter=lfs diff=lfs merge=lfs -text
+cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/cubin/xqa_kernel_cubin.cpp filter=lfs diff=lfs merge=lfs -text
--- a/.github/CODEOWNERS
+++ b/.github/CODEOWNERS
@@ -1,17 +1,36 @@
 # This file defines code ownership rules for the repository.

+## TensorRT-LLM QA
+### Integration Tests
+/tests/integration/test_lists/qa @NVIDIA/trt-llm-qa
+/tests/integration/defs/examples/test_ray.py @NVIDIA/trt-llm-qa-function
+/tests/integration/defs/examples/test_redrafter.py @NVIDIA/trt-llm-qa-function
+/tests/integration/defs/accuracy @NVIDIA/trt-llm-qa-function
+/tests/integration/defs/stress_test @NVIDIA/trt-llm-qa-function
+/tests/integration/defs/triton_server @NVIDIA/trt-llm-qa-function
+/tests/integration/defs/test_e2e.py @NVIDIA/trt-llm-qa-function
+/tests/integration/defs/disaggregated @NVIDIA/trt-llm-qa-serving
+/tests/integration/defs/sysinfo @NVIDIA/trt-llm-qa-perf
+/tests/integration/defs/perf @NVIDIA/trt-llm-qa-perf
+/tests/integration/defs/perf/disagg @NVIDIA/trt-llm-qa-serving

 ## TensorRT-LLM Infra
 ### CI
 /jenkins @NVIDIA/trt-llm-ci-infra-devs @NVIDIA/trt-llm-infra-devs
 ### Setup
 /docker @NVIDIA/trt-llm-setup-infra-devs @NVIDIA/trt-llm-infra-devs
+/.pre-commit-config.yaml @NVIDIA/trt-llm-setup-infra-devs @NVIDIA/trt-llm-infra-devs
 ### Github workflows
 /.github @NVIDIA/trt-llm-gh-workflows-infra-devs @NVIDIA/trt-llm-infra-devs
 /.coderabbit.yaml @NVIDIA/trt-llm-gh-workflows-infra-devs @NVIDIA/trt-llm-infra-devs

 ## TensorRT-LLM - Docs
 /docs @NVIDIA/trt-llm-doc-owners
+/CODING_GUIDELINES.md @NVIDIA/trt-llm-doc-owners
+/CODE_OF_CONDUCT.md @NVIDIA/trt-llm-doc-owners
+/CONTAINER_SOURCE.md @NVIDIA/trt-llm-doc-owners
+/CONTRIBUTING.md @NVIDIA/trt-llm-doc-owners
+/README.md @NVIDIA/trt-llm-doc-owners

 ## Examples
 /examples @NVIDIA/trt-llm-doc-owners
@@ -151,6 +170,23 @@ docs/source/performance/perf-benchmarking.md @NVIDIA/trtllm-bench-reviewers
 /cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.h @NVIDIA/trt-llm-disagg-devs
 /tensorrt_llm/serve/openai_disagg_server.py @NVIDIA/trt-llm-disagg-devs

+## TensorRT-LLM - KV Cache Manager
+/cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp @NVIDIA/trt-llm-kv-cache-manager-devs
+/cpp/tensorrt_llm/batch_manager/kvCacheEventManager.cpp @NVIDIA/trt-llm-kv-cache-manager-devs
+/cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp @NVIDIA/trt-llm-kv-cache-manager-devs
+/cpp/tensorrt_llm/batch_manager/evictionPolicy.cpp @NVIDIA/trt-llm-kv-cache-manager-devs
+/cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h @NVIDIA/trt-llm-kv-cache-manager-devs
+/cpp/include/tensorrt_llm/batch_manager/kvCacheEventManager.h @NVIDIA/trt-llm-kv-cache-manager-devs
+/cpp/include/tensorrt_llm/batch_manager/kvCacheTransferManager.h @NVIDIA/trt-llm-kv-cache-manager-devs
+/cpp/include/tensorrt_llm/batch_manager/evictionPolicy.h @NVIDIA/trt-llm-kv-cache-manager-devs
+/cpp/tensorrt_llm/batch_manager/allocateKvCache.cpp @NVIDIA/trt-llm-kv-cache-manager-devs
+/cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp @NVIDIA/trt-llm-kv-cache-manager-devs
+/cpp/tests/unit_tests/batch_manager/kvCacheUtilsTest.cpp @NVIDIA/trt-llm-kv-cache-manager-devs
+/tensorrt_llm/_torch/pyexecutor/resource_manager.py @NVIDIA/trt-llm-kv-cache-manager-devs
+/cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.h @NVIDIA/trt-llm-kv-cache-manager-devs
+/cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp @NVIDIA/trt-llm-kv-cache-manager-devs
+/cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.h @NVIDIA/trt-llm-kv-cache-manager-devs
+/cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp @NVIDIA/trt-llm-kv-cache-manager-devs

 # The rule below requires that any PR modifying public APIs must be approved by at least one member
 # of the NVIDIA/trt-llm-committed-api-review-committee or NVIDIA/trt-llm-noncommitted-api-review-committee team.
@@ -165,6 +201,7 @@ docs/source/performance/perf-benchmarking.md @NVIDIA/trtllm-bench-reviewers
 ## and license compliance when adding, removing, or changing versions of dependencies.
 ### License Files
 /LICENSE @NVIDIA/trt-llm-oss-compliance
+/ATTRIBUTIONS-*.md @NVIDIA/trt-llm-oss-compliance
 /jenkins/license_cpp.json @NVIDIA/trt-llm-ci-infra-devs @NVIDIA/trt-llm-infra-devs @NVIDIA/trt-llm-oss-compliance

 ### Python Dependency Management
--- a/.github/workflows/auto-assign.yml
+++ b/.github/workflows/auto-assign.yml
@@ -11,10 +11,10 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
-        uses: actions/checkout@v2
+        uses: actions/checkout@v6

      - name: Get assignee
-        uses: actions/github-script@v6
+        uses: actions/github-script@v8
        id: get-assignee
        with:
          github-token: ${{secrets.GITHUB_TOKEN}}
--- a/.github/workflows/auto-close-inactive-issues.yml
+++ b/.github/workflows/auto-close-inactive-issues.yml
@@ -14,7 +14,7 @@ jobs:
      pull-requests: write

    steps:
-      - uses: actions/stale@v9
+      - uses: actions/stale@v10
        with:
          repo-token: ${{ secrets.GITHUB_TOKEN }}
          stale-issue-message: 'Issue has not received an update in over 14 days. Adding stale label.'
--- a/.github/workflows/blossom-ci.yml
+++ b/.github/workflows/blossom-ci.yml
@@ -40,8 +40,315 @@ jobs:
        startsWith(github.event.comment.body, '/bot skip --comment') ||
        startsWith(github.event.comment.body, '/bot reuse-pipeline') ||
        startsWith(github.event.comment.body, '/bot kill')) && contains(
-        fromJson('["byshiue","chuangz0","funatiq","hypdeb","jdemouth-nvidia","joyang-nv","lowsfer","Tabrizian","yweng0828","Shixiaowei02","MartinMarciniszyn","schetlur-nv","dcampora","pcastonguay","Naveassaf","lfr-0531","nekorobov","PerkzZheng","kaiyux","nv-guomingz","LinPoly","thorjohnsen","jiahanc","latency1024","tburt-nv","zeroepoch","chzblych","niukuo","ZhanruiSunCh","EmmaQiaoCh","yiqingy0","achartier","suyoggupta","amukkara","mk-nvidia","QiJune","lucaslie","davidmlw","hlu1","nvzhou","syuoni","NVGaryJi","symphonylyh","hello-11","zongfeijing","Jackch-NV","jinyangyuan-nvidia","LarryXFly","crazydemo","jaedeok-nvidia","wm2012011492","rosenrodt","zhuoyao1012","xinhe-nv","Yuening-wa","Shunkangz","zhengd-nv","yibinl-nvidia","StanleySun639","KingsleyLiu-NV","kxdc","yingcanw","BestJuly","ChristinaZ","bobboli","xueweilnvidia","kunlunl","cherichy","lucifer1004","Autumn1998","litaotju","peaceh-nv","liji-nv","SimengLiu-nv","yuxianq","yechank-nvidia","vallis-neria","DylanChen-NV","Tracin","zhhuang-nv","ISEEKYAN","xupinjie","tongyuantongyu","laikhtewari","zhuolingwang","dominicshanshan","jershi425","shifangx","StudyingShao","Superjomn","dongjiyingdjy","guangyunh-nv","wili-65535","tiffany940107","DanBlanaru","mikeiovine","djns99","ruodil","xiaoweiw-nv","xuwchen","bashimao","yizhang-nv","hyukn","nvpohanh","yuki-666","juney-nvidia","barry-delaney","Kefeng-Duan","MinaHuai","yilin-void","jhaotingc","jmydurant","katec846","CarstyYou","Njuapp","Jie-Fang","nvbrantz","inocsin","ruoqianguo","chenfeiz0326","ming-wei","eopXD","longlee0622","dongfengy","georgeliu95","evezhier","rakib-hasan","shangz-ai","JyChang012","wangsiping1997","yuanjings-nvda","tomeras91","roikoren755","amirkl94","shaharmor98","danielafrimi","amitz-nv","hijkzzz","rzilberstein-nvidia","dc3671","hchings","yuhengxnv","dongxuy04","qiaoxj07","omera-nv","DomBrown","brb-nv","FrankD412","yuhsuan-t","Fridah-nv","a-mccarthy","HuiGao-NV","alexmsettle","meenchen","sugunav14","cjluo-nv","kyleliang-nv","chang-l","WeiHaocheng","qixiang-99","BatshevaBlack","ebarilanM","xmchen1987","lingjiew","heyuhhh","netanel-haber","jiefangz-nv","wyw1267","yunruis","sklevtsov-nvidia","jgangani","pamelap-nvidia","ixlmar","GalSha","Dido0o0","rabiel","nvzhihanj","milesial","fzmu727","zackyoray","RoeyAzran1992","viraatc","v-shobhit","yuanjingx87","uchihatmtkinu","nvrohanv","vegaluisjose","qsang-nv","ChunhuanLin","timlee0212","venkywonka","zbpatel","tijyojwad","shyeh25","zihaok","nv-yilinf","ttyio","farazkh80","yuantailing","JennyLiu-nv","moraxu","IzzyPutterman","nvchenghaoz","nvxuanyuc","poweiw","stnie","zhanga5","nzmora-nvidia","greg-kwasniewski1","linda-stadter","Tom-Zheng","vanshilshah97","ixlmar","MatthiasKohl","Wanli-Jiang", "arekay", "davidclark-nv", "2ez4bz", "tcherckez-nvidia", "MrGeva", "galagam", "limin2021", "dhansen-nvidia","talorabr","kanghui0204","wu6u3tw","hvagadia","xavier-nvidia","raayandhar","dbari","nvjullin","elvischenv","zhenhuaw-me","weireweire","yifeizhang-c","jiaganc","ziyixiong-nv","FelixXidddd","JunyiXu-nv","bo-nv","zerollzeng","RayenTian","ameynaik-hub","raymochen","shuyixiong","johncalesp","leslie-fang25","reasonsolo","zhou-yuxin","vadiklyutiy","yali-arch","NVShreyas","h-guo18","pengbowang-nv","lancelly","heyuhhh","mayani-nv","flin3500","sunnyqgg","kris1025", "karljang", "ajrasane", "jthomson04", "fredricz-20070104", "aalanwyr", "samuellees", "nvamyt", "jinzh-nvidia", "zheyuf", "yumin066", "sychen52", "xxi-nv", "barneuman", "xuanzic", "yufeiwu-nv", "richardhuo-nv", "dcaox", "tshmilnvidia", "anish-shanbhag", "zhangcl", "timothygao8710", "jthomson04", "faradawn", "govind-ramnarayan","Boreas618","baize97","jieli-matrix","qiangxu1996","atrifex","mlefeb01","Wong4j","JadoTu"]'),
-        github.actor)
+        fromJson('[
+        "2ez4bz",
+        "a-mccarthy",
+        "aalanwyr",
+        "achartier",
+        "ajrasane",
+        "alexmsettle",
+        "ameynaik-hub",
+        "amirkl94",
+        "amitz-nv",
+        "amukkara",
+        "anish-shanbhag",
+        "arekay",
+        "arysef",
+        "atrifex",
+        "Autumn1998",
+        "baize97",
+        "barneuman",
+        "barry-delaney",
+        "bashimao",
+        "BatshevaBlack",
+        "benzh-2025",
+        "BestJuly",
+        "bo-nv",
+        "bobboli",
+        "Boreas618",
+        "brb-nv",
+        "byshiue",
+        "CarstyYou",
+        "chang-l",
+        "chenfeiz0326",
+        "cherichy",
+        "cheshirekow",
+        "ChristinaZ",
+        "chuangz0",
+        "ChunhuanLin",
+        "chzblych",
+        "cjluo-nv",
+        "crazydemo",
+        "DanBlanaru",
+        "danielafrimi",
+        "davidclark-nv",
+        "davidmlw",
+        "dbari",
+        "dc3671",
+        "dcampora",
+        "dcaox",
+        "dhansen-nvidia",
+        "Dido0o0",
+        "djns99",
+        "DomBrown",
+        "dominicshanshan",
+        "dongfengy",
+        "dongjiyingdjy",
+        "dongxuy04",
+        "DylanChen-NV",
+        "ebarilanM",
+        "elvischenv",
+        "EmmaQiaoCh",
+        "eopXD",
+        "evezhier",
+        "faradawn",
+        "farazkh80",
+        "FelixXidddd",
+        "flin3500",
+        "FrankD412",
+        "fredricz-20070104",
+        "Fridah-nv",
+        "funatiq",
+        "fzmu727",
+        "galagam",
+        "GalSha",
+        "georgeliu95",
+        "govind-ramnarayan",
+        "greg-kwasniewski1",
+        "guangyunh-nv",
+        "h-guo18",
+        "hchings",
+        "hello-11",
+        "heyuhhh",
+        "hijkzzz",
+        "hlu1",
+        "hnover-nv",
+        "HuiGao-NV",
+        "hvagadia",
+        "hypdeb",
+        "hyukn",
+        "inocsin",
+        "ISEEKYAN",
+        "ixlmar",
+        "IzzyPutterman",
+        "Jackch-NV",
+        "JadoTu",
+        "jaedeok-nvidia",
+        "jdemouth-nvidia",
+        "JennyLiu-nv",
+        "jershi425",
+        "jgangani",
+        "jhaotingc",
+        "jiaganc",
+        "jiahanc",
+        "Jie-Fang",
+        "jiefangz-nv",
+        "jieli-matrix",
+        "jinyangyuan-nvidia",
+        "jinzh-nvidia",
+        "jmydurant",
+        "johncalesp",
+        "joyang-nv",
+        "jthomson04",
+        "juney-nvidia",
+        "JunyiXu-nv",
+        "JyChang012",
+        "kaiyux",
+        "kanghui0204",
+        "karljang",
+        "karthikvetrivel",
+        "katec846",
+        "Kefeng-Duan",
+        "KingsleyLiu-NV",
+        "kris1025",
+        "kunlunl",
+        "kxdc",
+        "kyleliang-nv",
+        "laikhtewari",
+        "lancelly",
+        "LarryXFly",
+        "latency1024",
+        "leslie-fang25",
+        "lfr-0531",
+        "liji-nv",
+        "limin2021",
+        "linda-stadter",
+        "lingjiew",
+        "LinPoly",
+        "litaotju",
+        "liyuhannnnn",
+        "lkomali",
+        "longlee0622",
+        "lowsfer",
+        "lucaslie",
+        "lucifer1004",
+        "MartinMarciniszyn",
+        "MatthiasKohl",
+        "mayani-nv",
+        "meenchen",
+        "mikeiovine",
+        "milesial",
+        "MinaHuai",
+        "ming-wei",
+        "mk-nvidia",
+        "mlefeb01",
+        "moraxu",
+        "MrGeva",
+        "mzweilz",
+        "Naveassaf",
+        "nekorobov",
+        "netanel-haber",
+        "niukuo",
+        "Njuapp",
+        "nv-guomingz",
+        "nv-lschneider",
+        "nv-yilinf",
+        "nvamyt",
+        "nvbrantz",
+        "nvchenghaoz",
+        "NVGaryJi",
+        "nvjullin",
+        "nvpohanh",
+        "nvrohanv",
+        "NVShreyas",
+        "nvxuanyuc",
+        "nvyocox",
+        "nvzhihanj",
+        "nvzhou",
+        "nzmora-nvidia",
+        "omera-nv",
+        "pamelap-nvidia",
+        "pcastonguay",
+        "pcicotti",
+        "pdrake-nv",
+        "peaceh-nv",
+        "pengbowang-nv",
+        "PerkzZheng",
+        "poweiw",
+        "qiangxu1996",
+        "qiaoxj07",
+        "QiJune",
+        "qixiang-99",
+        "qsang-nv",
+        "raayandhar",
+        "rabiel",
+        "rakib-hasan",
+        "RayenTian",
+        "raymochen",
+        "reasonsolo",
+        "richardhuo-nv",
+        "RoeyAzran1992",
+        "roikoren755",
+        "rosenrodt",
+        "rosong11",
+        "ruodil",
+        "ruoqianguo",
+        "rzilberstein-nvidia",
+        "samuellees",
+        "schetlur-nv",
+        "shaharmor98",
+        "shangz-ai",
+        "sherry-1001",
+        "shifangx",
+        "Shixiaowei02",
+        "Shunkangz",
+        "shuyixiong",
+        "shyeh25",
+        "SimengLiu-nv",
+        "sklevtsov-nvidia",
+        "StanleySun639",
+        "stnie",
+        "StudyingShao",
+        "sugunav14",
+        "sunnyqgg",
+        "Superjomn",
+        "suyoggupta",
+        "sychen52",
+        "symphonylyh",
+        "syuoni",
+        "Tabrizian",
+        "talorabr",
+        "taylor-yb-lee",
+        "tburt-nv",
+        "tcherckez-nvidia",
+        "thorjohnsen",
+        "tiffany940107",
+        "tijyojwad",
+        "timlee0212",
+        "timothygao8710",
+        "Tom-Zheng",
+        "tomeras91",
+        "tongyuantongyu",
+        "Tracin",
+        "tshmilnvidia",
+        "ttyio",
+        "uchihatmtkinu",
+        "v-shobhit",
+        "vadiklyutiy",
+        "vallis-neria",
+        "vanshilshah97",
+        "vegaluisjose",
+        "venkywonka",
+        "viraatc",
+        "wangsiping1997",
+        "Wanli-Jiang",
+        "WeiHaocheng",
+        "weireweire",
+        "wenmingw",
+        "wili-65535",
+        "wm2012011492",
+        "Wong4j",
+        "wu6u3tw",
+        "wyw1267",
+        "xavier-nvidia",
+        "xiaoweiw-nv",
+        "xinhe-nv",
+        "xmchen1987",
+        "xuanzic",
+        "xueweilnvidia",
+        "xupinjie",
+        "xuwchen",
+        "xxi-nv",
+        "yali-arch",
+        "yechank-nvidia",
+        "yibinl-nvidia",
+        "yifeizhang-c",
+        "yihwang-nv",
+        "yilin-void",
+        "yingcanw",
+        "yingguo-trt",
+        "yiqingy0",
+        "yizhang-nv",
+        "yuanjings-nvda",
+        "yuanjingx87",
+        "yuantailing",
+        "Yuening-wa",
+        "yufeiwu-nv",
+        "yuhengxnv",
+        "yuhsuan-t",
+        "yuki-666",
+        "yumin066",
+        "yunruis",
+        "yuxianq",
+        "yweng0828",
+        "zackyoray",
+        "zbpatel",
+        "zeroepoch",
+        "zerollzeng",
+        "zhanga5",
+        "zhangcl",
+        "ZhanruiSunCh",
+        "zhengd-nv",
+        "zhenhuaw-me",
+        "zheyuf",
+        "zhhuang-nv",
+        "zhou-yuxin",
+        "zhuolingwang",
+        "zhuoyao1012",
+        "zihaok",
+        "ziyixiong-nv",
+        "zongfeijing"
+        ]'), github.actor)
    steps:
      - name: Check if comment is issued by authorized person
        run: blossom-ci
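
The `if:` condition in blossom-ci.yml admits a comment only when it starts with a recognized `/bot` command and `github.actor` appears in the allowlist. A minimal sketch of that gate outside of GitHub Actions follows. `isAuthorizedBotCommand` and `BOT_PREFIXES` are hypothetical names; the prefix list holds only the three prefixes visible in this hunk (the real condition lists more above it), and the sample allowlist is a short excerpt. The case-insensitive comparisons assume that the `startsWith` and `contains` expression functions ignore case.

```javascript
// Hypothetical restatement of the workflow's `if:` gate; not part of the commit.
// Only the three command prefixes visible in this hunk are listed here.
const BOT_PREFIXES = ['/bot skip --comment', '/bot reuse-pipeline', '/bot kill'];

function isAuthorizedBotCommand(commentBody, actor, allowedUsers) {
  // Assumption: GitHub expression functions compare strings case-insensitively,
  // so both sides are lower-cased before comparing.
  const body = commentBody.toLowerCase();
  const isBotCommand = BOT_PREFIXES.some((prefix) => body.startsWith(prefix));
  const allowed = new Set(allowedUsers.map((user) => user.toLowerCase()));
  return isBotCommand && allowed.has(actor.toLowerCase());
}

// Short excerpt of the allowlist from the diff, for illustration only.
const allowlist = ['2ez4bz', 'a-mccarthy', 'benzh-2025'];
console.log(isAuthorizedBotCommand('/bot kill', 'benzh-2025', allowlist));   // true
console.log(isAuthorizedBotCommand('/bot kill', 'someone-else', allowlist)); // false
```

Keeping the allowlist one name per line and sorted, as the new version of the file does, makes additions show up as single-line diffs.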
--- a/.github/workflows/bot-command.yml
+++ b/.github/workflows/bot-command.yml
@@ -36,7 +36,7 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Add bot help comment
-        uses: actions/github-script@v6
+        uses: actions/github-script@v8
        with:
          script: |
            const helpMessage = "" +
--- a/.github/workflows/l0-test.yml
+++ b/.github/workflows/l0-test.yml
@@ -34,7 +34,7 @@ jobs:
    if: github.event_name == 'workflow_dispatch'
    steps:
      - name: Update commit status
-        uses: actions/github-script@v6
+        uses: actions/github-script@v8
        with:
          script: |
            state = 'pending'
@@ -60,7 +60,7 @@ jobs:
        with:
          paths: results/**/results*.xml
      - name: Update commit status
-        uses: actions/github-script@v6
+        uses: actions/github-script@v8
        with:
          script: |
            github.rest.repos.createCommitStatus({
--- a/.github/workflows/label_community_pr.yml
+++ b/.github/workflows/label_community_pr.yml
@@ -17,10 +17,10 @@ jobs:
    if: github.repository == 'NVIDIA/TensorRT-LLM'
    steps:
      - name: Checkout repository
-        uses: actions/checkout@v3
+        uses: actions/checkout@v6

      - name: Set up Python
-        uses: actions/setup-python@v3
+        uses: actions/setup-python@v6
        with:
          python-version: '3.x'

--- a/.github/workflows/label_issue.yml
+++ b/.github/workflows/label_issue.yml
@@ -13,12 +13,11 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout private action repository
-        uses: actions/checkout@v4
+        uses: actions/checkout@v6
        with:
-          repository: poweiw/goggles_action
+          repository: NVIDIA/goggles_action
          path: ./.github/actions/goggles_action # local path to store the action
-          token: ${{ secrets.GOGGLES_ACTION_REPO_TOKEN}} # token to access poweiw/goggles_action
-          ref: v1.2.1
+          ref: v1.3.0

      - name: AI Label Issue
        uses: ./.github/actions/goggles_action/actions/llm_label
--- a/.github/workflows/pr-check.yml
+++ b/.github/workflows/pr-check.yml
@@ -59,10 +59,10 @@ jobs:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
-        uses: actions/checkout@v4
+        uses: actions/checkout@v6

      - name: Set up Python
-        uses: actions/setup-python@v5
+        uses: actions/setup-python@v6
        with:
          python-version: '3.10'

--- a/.github/workflows/precommit-check.yml
+++ b/.github/workflows/precommit-check.yml
@@ -29,11 +29,11 @@ jobs:
    name: Pre-commit Check
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@v6
        with:
          ref: ${{ github.event_name == 'workflow_dispatch' && github.event.inputs.ref || github.ref }}

-      - uses: actions/setup-python@v5
+      - uses: actions/setup-python@v6
        with:
          python-version: '3.12'
          cache: 'pip'
--- a/.github/workflows/waiting_for_feedback.yml
+++ b/.github/workflows/waiting_for_feedback.yml
@@ -0,0 +1,127 @@
+name: Manage Waiting for Feedback Label
+
+on:
+  issue_comment:
+    types: [created]
+  pull_request_review_comment:
+    types: [created]
+
+permissions:
+  issues: write
+  pull-requests: write
+
+jobs:
+  manage-waiting-for-feedback:
+    runs-on: ubuntu-latest
+    if: github.repository == 'NVIDIA/TensorRT-LLM'
+    steps:
+      - name: Check membership and manage label
+        uses: actions/github-script@v8
+        with:
+          script: |
+            const commenter = context.payload.comment.user.login;
+            const commenterType = context.payload.comment.user.type;
+            const label = 'waiting for feedback';
+
+            // Ignore bots and CI accounts
+            const ignoredAccounts = ['tensorrt-cicd'];
+            if (commenterType === 'Bot' || ignoredAccounts.includes(commenter)) {
+              console.log(`Ignoring comment from ${commenter} (type: ${commenterType}). Skipping.`);
+              return;
+            }
+
+            // Handle both issue_comment and pull_request_review_comment events
+            // context.issue.number is only available for issue_comment events
+            const issueNumber = context.issue?.number || context.payload.pull_request?.number;
+            const issue = context.payload.issue || context.payload.pull_request;
+            const author = issue?.user?.login;
+            const isAuthor = (commenter === author);
+
+            if (!issueNumber) {
+              console.log('Could not determine issue/PR number. Skipping.');
+              return;
+            }
+
+            console.log(`Comment by ${commenter} on #${issueNumber} (author: ${author})`);
+            const owner = context.repo.owner;
+            const repo = context.repo.repo;
+
+            // Check if commenter is repository member
+            let isMember = false;
+            try {
+              await github.rest.repos.checkCollaborator({
+                owner,
+                repo,
+                username: commenter
+              });
+              isMember = true;
+            } catch (error) {
+              if (error.status === 404) {
+                isMember = false;
+              } else if (error.status === 302) {
+                console.log(`Cannot determine membership for ${commenter} (insufficient token permissions)`);
+                return;
+              } else {
+                console.error(`Error checking membership: ${error.message}`);
+                throw error;
+              }
+            }
+
+            // Logic:
+            // - Author responds → remove label (feedback provided)
+            // - NVIDIA non-author comments → add label (team is waiting for response)
+            // - External non-author comments → remove label (someone provided feedback)
+
+            if (isAuthor) {
+              // Author responded - remove 'waiting for feedback' label
+              console.log(`${commenter} is the author. Removing '${label}' label if present.`);
+
+              try {
+                await github.rest.issues.removeLabel({
+                  owner: context.repo.owner,
+                  repo: context.repo.repo,
+                  issue_number: issueNumber,
+                  name: label
+                });
+                console.log(`Successfully removed '${label}' label from #${issueNumber}`);
+              } catch (error) {
+                if (error.status === 404) {
+                  console.log(`Label '${label}' was not present on #${issueNumber}. No action needed.`);
+                } else {
+                  throw error;
+                }
+              }
+
+            } else if (isMember) {
+              // NVIDIA non-author commented - add 'waiting for feedback' label
+              console.log(`${commenter} is an NVIDIA member (not author). Adding '${label}' label.`);
+
+              await github.rest.issues.addLabels({
+                owner: context.repo.owner,
+                repo: context.repo.repo,
+                issue_number: issueNumber,
+                labels: [label]
+              });
+
+              console.log(`Successfully added '${label}' label to #${issueNumber}`);
+
+            } else {
+              // External non-author commented - remove 'waiting for feedback' label
+              console.log(`${commenter} is external (not author). Removing '${label}' label if present.`);
+
+              try {
+                await github.rest.issues.removeLabel({
+                  owner: context.repo.owner,
+                  repo: context.repo.repo,
+                  issue_number: issueNumber,
+                  name: label
+                });
+                console.log(`Successfully removed '${label}' label from #${issueNumber}`);
+              } catch (error) {
+                if (error.status === 404) {
+                  console.log(`Label '${label}' was not present on #${issueNumber}. No action needed.`);
+                } else {
+                  throw error;
+                }
+              }
+            }
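
The label rules in the new waiting_for_feedback.yml workflow reduce to a small decision function. The sketch below restates them without any GitHub API calls. `decideLabelAction` and its return values (`'skip'`, `'add'`, `'remove'`) are hypothetical names chosen for illustration, and `isMember` is assumed to have been resolved beforehand (the workflow does this with `checkCollaborator`).

```javascript
// Hypothetical helper mirroring the workflow's branching; not part of the commit.
// Returns 'skip', 'add', or 'remove' for the 'waiting for feedback' label.
function decideLabelAction({
  commenter,
  commenterType,
  author,
  isMember,
  ignoredAccounts = ['tensorrt-cicd'], // same CI account the workflow ignores
}) {
  // Bots and CI accounts never change the label.
  if (commenterType === 'Bot' || ignoredAccounts.includes(commenter)) {
    return 'skip';
  }
  // The author replied, so feedback has been provided.
  if (commenter === author) {
    return 'remove';
  }
  // A non-author collaborator is waiting on the author; any other
  // non-author comment counts as feedback and clears the label.
  return isMember ? 'add' : 'remove';
}

console.log(decideLabelAction({
  commenter: 'reviewer', commenterType: 'User', author: 'reporter', isMember: true,
})); // add
```

Because the author check comes before the membership check, a collaborator commenting on their own issue removes the label rather than adding it, which matches the order of the branches in the workflow script.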
--- a/.gitignore
+++ b/.gitignore
@@ -40,6 +40,8 @@ tensorrt_llm/libs
 tensorrt_llm/bindings.*.so
 tensorrt_llm/bindings.pyi
 tensorrt_llm/bindings/**/*.pyi
+tensorrt_llm/tensorrt_llm_transfer_agent_binding.*.so
+tensorrt_llm/tensorrt_llm_transfer_agent_binding.pyi
 tensorrt_llm/deep_ep/
 tensorrt_llm/deep_ep_cpp_tllm.*.so
 tensorrt_llm/deep_ep_cpp_tllm.pyi
@@ -50,16 +52,20 @@ tensorrt_llm/pg_utils_bindings.*.so
 tensorrt_llm/flash_mla/
 tensorrt_llm/flash_mla_cpp_tllm.*.so
 tensorrt_llm/flash_mla_cpp_tllm.pyi
+tensorrt_llm/scripts
 *docs/cpp_docs*
 *docs/source/_cpp_gen*
 docs/source/**/*.rst
 !docs/source/examples/index.rst
+!docs/source/deployment-guide/config_table.rst
+!docs/source/_includes/note_sections.rst
 *.swp

 # Testing
 .coverage.*
 results_trt/
 llm-test-workspace/
+ad-test-workspace/

 # build/debug
 *.safetensors
@@ -71,7 +77,10 @@ llm-test-workspace/
 cpp/include/tensorrt_llm/executor/version.h
 cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fmha_v2_cu/
 cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_cubin.h
+cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/cubin/fmha_cubin.cpp
 .devcontainer/.env
+/examples/layer_wise_benchmarks/autotuner_cache/
+/examples/layer_wise_benchmarks/profiles/

 # User config files
 CMakeUserPresets.json
@@ -84,3 +93,6 @@ compile_commands.json
 # Enroot sqsh files
 enroot/sw-tensorrt-docker+*.sqsh
 enroot/tensorrt_llm.devel.sqsh
+
+# MacOSX Files
+.DS_Store
--- a/.gitmodules
+++ b/.gitmodules
@@ -1,35 +0,0 @@
-[submodule "3rdparty/cutlass"]
-	path = 3rdparty/cutlass
-	url = https://github.com/NVIDIA/cutlass.git
-[submodule "3rdparty/json"]
-	path = 3rdparty/json
-	url = https://github.com/nlohmann/json.git
-[submodule "3rdparty/cxxopts"]
-	path = 3rdparty/cxxopts
-	url = https://github.com/jarro2783/cxxopts
-	branch = v3.1.1
-[submodule "3rdparty/NVTX"]
-	path = 3rdparty/NVTX
-	url = https://github.com/NVIDIA/NVTX.git
-[submodule "3rdparty/ucxx"]
-	path = 3rdparty/ucxx
-	url = https://github.com/rapidsai/ucxx.git
-[submodule "3rdparty/pybind11"]
-	path = 3rdparty/pybind11
-	url = https://github.com/pybind/pybind11.git
-[submodule "3rdparty/xgrammar"]
-	path = 3rdparty/xgrammar
-	url = https://github.com/mlc-ai/xgrammar.git
-[submodule "3rdparty/nanobind"]
-	path = 3rdparty/nanobind
-	url = https://github.com/wjakob/nanobind
-[submodule "3rdparty/cppzmq"]
-	path = 3rdparty/cppzmq
-	url = https://github.com/zeromq/cppzmq.git
-[submodule "3rdparty/DeepGEMM"]
-	path = 3rdparty/DeepGEMM
-	url = https://github.com/ruoqianguo/DeepGEMM.git
-	branch = swapab_sm100
-[submodule "3rdparty/flash-mla"]
-	path = 3rdparty/flash-mla
-	url = https://github.com/deepseek-ai/FlashMLA.git
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -83,7 +83,6 @@ common-files: &common_files |
        examples/infinitebench/compute_scores.py |
        examples/infinitebench/construct_synthetic_dataset.py |
        examples/infinitebench/eval_utils.py |
-        examples/layer_wise_benchmarks/run_single.py |
        examples/llm-api/_tensorrt_engine/llm_eagle_decoding.py |
        examples/llm-api/_tensorrt_engine/llm_eagle2_decoding.py |
        examples/llm-api/_tensorrt_engine/llm_inference_customize.py |
@@ -811,7 +810,6 @@ common-files: &common_files |
        tensorrt_llm/serve/tool_parser/utils.py |
        tensorrt_llm/tools/__init__.py |
        tensorrt_llm/tools/importlib_utils.py |
-        tensorrt_llm/tools/layer_wise_benchmarks/deepseekv3_runner.py |
        tensorrt_llm/tools/multimodal_builder.py |
        tensorrt_llm/tools/onnx_utils.py |
        tensorrt_llm/tools/plugin_gen/__init__.py |
@@ -1061,7 +1059,6 @@ common-files: &common_files |
        tests/unittest/_torch/thop/parallel/test_logits_bitmask_op.py |
        tests/unittest/_torch/thop/parallel/test_mamba_conv1d_op.py |
        tests/unittest/_torch/thop/parallel/test_mamba2_chunk_ss_update.py |
-        tests/unittest/_torch/thop/parallel/test_moe.py |
        tests/unittest/_torch/thop/parallel/test_noaux_tc.py |
        tests/unittest/_torch/thop/parallel/test_scaled_mm.py |
        tests/unittest/_torch/thop/parallel/test_selective_scan_op.py |
@@ -1073,6 +1070,7 @@ common-files: &common_files |
        tests/unittest/_torch/thop/parallel/test_weight_only_quant_gemm.py |
        tests/unittest/_torch/thop/parallel/test_weight_only_quant_linear.py |
        tests/unittest/_torch/thop/serial/test_moe_alltoall.py |
+        tests/unittest/_torch/thop/serial/test_moe.py |
        tests/unittest/api_stability/api_stability_core.py |
        tests/unittest/api_stability/test_llm_api.py |
        tests/unittest/bindings/binding_test_utils.py |
@@ -1188,7 +1186,6 @@ common-files: &common_files |
        tests/unittest/tools/plugin_gen/test_core.py |
        tests/unittest/tools/plugin_gen/test_plugin_gen.py |
        tests/unittest/tools/plugin_gen/test_shape_infer.py |
-        tests/unittest/tools/test_layer_wise_benchmarks.py |
        tests/unittest/tools/test_prepare_dataset.py |
        tests/unittest/tools/test_test_to_stage_mapping.py |
        tests/unittest/trt/__init__.py |
@@ -1389,7 +1386,7 @@ repos:
    -   id: yapf
        files: *common_files
 -   repo: https://github.com/pre-commit/pre-commit-hooks
-    rev: v4.1.0
+    rev: v6.0.0
    hooks:
    -   id: check-added-large-files
        exclude: |
@@ -1398,6 +1395,8 @@ repos:
    -   id: check-symlinks
    -   id: detect-private-key
    -   id: end-of-file-fixer
+        exclude: |
+            (?x)^(.*cubin.cpp | .*cubin.h)$
    -   id: check-yaml
        args: [--allow-multiple-documents]
        exclude: ".*/gitlab/.*.yml"
@@ -1442,7 +1441,7 @@ repos:
        additional_dependencies:
        - tomli
        # add ignore words list
-        args: ["-L", "Mor,ans,thirdparty", "--skip", "ATTRIBUTIONS-*.md,*.svg", "--skip", "security_scanning/*"]
+        args: ["-L", "Mor,ans,thirdparty,subtiles", "--skip", "ATTRIBUTIONS-*.md,*.svg", "--skip", "security_scanning/*"]
 -   repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.9.4
    hooks:
@@ -1463,6 +1462,11 @@ repos:
        entry: ./scripts/format_test_list.py
        language: script
        files: tests/integration/test_lists/.*\.txt$
+    -   id: waive list check
+        name: Checks for duplicated test items in waives.txt
+        entry: ./scripts/check_test_list.py --check-duplicate-waives
+        language: script
+        pass_filenames: false
    -   id: DCO check
        name: Checks the commit message for a developer certificate of origin signature
        entry: ./scripts/dco_check.py
--- a/3rdparty/CMakeLists.txt
+++ b/3rdparty/CMakeLists.txt
@@ -0,0 +1,118 @@
+include(ExternalProject)
+include(FetchContent)
+
+if(DEFINED ENV{GITHUB_MIRROR} AND NOT "$ENV{GITHUB_MIRROR}" STREQUAL "")
+  set(github_base_url "$ENV{GITHUB_MIRROR}")
+else()
+  set(github_base_url "https://github.com")
+endif()
+
+FetchContent_Declare(
+  cppzmq
+  GIT_REPOSITORY https://github.com/zeromq/cppzmq
+  GIT_TAG v4.10.0 # c94c20743ed7d4aa37835a5c46567ab0790d4acc
+  GIT_SHALLOW TRUE
+  # NOTE: TensorRT-LLM only uses the headers
+  SOURCE_SUBDIR
+  dont-add-this-project-with-add-subdirectory)
+
+FetchContent_Declare(
+  cutlass
+  GIT_REPOSITORY https://github.com/NVIDIA/cutlass
+  GIT_TAG v4.3.0 # e67e63c331d6e4b729047c95cf6b92c8454cba89
+  GIT_SHALLOW TRUE
+  SOURCE_SUBDIR
+  dont-add-this-project-with-add-subdirectory)
+
+FetchContent_Declare(
+  cxxopts
+  GIT_REPOSITORY https://github.com/jarro2783/cxxopts
+  GIT_TAG v3.1.1 # eb787304d67ec22f7c3a184ee8b4c481d04357fd
+  GIT_SHALLOW TRUE)
+
+set(deep_ep_commit 5be51b228a7c82dbdb213ea58e77bffd12b38af8)
+set_property(GLOBAL PROPERTY DEEP_EP_COMMIT "${deep_ep_commit}")
+FetchContent_Declare(
+  deep_ep_download
+  URL ${github_base_url}/deepseek-ai/DeepEP/archive/${deep_ep_commit}.tar.gz)
+
+FetchContent_Declare(
+  deepgemm
+  GIT_REPOSITORY https://github.com/deepseek-ai/DeepGEMM
+  GIT_TAG 4ff3f54d9b7ed3129e4f36f9871232ea7ecab86b # nv_dev branch
+  GIT_SUBMODULES_RECURSE
+  ON
+  SOURCE_SUBDIR
+  dont-add-this-project-with-add-subdirectory)
+
+FetchContent_Declare(
+  eigen
+  GIT_REPOSITORY https://github.com/libeigen/eigen
+  GIT_TAG 3.4.0
+  GIT_SHALLOW TRUE)
+
+FetchContent_Declare(
+  flashmla
+  GIT_REPOSITORY https://github.com/deepseek-ai/FlashMLA.git
+  GIT_TAG 1408756a88e52a25196b759eaf8db89d2b51b5a1
+  GIT_SUBMODULES_RECURSE
+  ON
+  SOURCE_SUBDIR
+  dont-add-this-project-with-add-subdirectory)
+
+FetchContent_Declare(
+  googlebenchmark
+  GIT_REPOSITORY https://github.com/google/benchmark
+  GIT_TAG v1.8.3
+  GIT_SHALLOW TRUE)
+
+FetchContent_Declare(
+  googletest
+  GIT_REPOSITORY https://github.com/google/googletest
+  GIT_TAG v1.15.2
+  GIT_SHALLOW TRUE)
+
+FetchContent_Declare(
+  json
+  GIT_REPOSITORY https://github.com/nlohmann/json
+  GIT_TAG v3.12.0 # 55f93686c01528224f448c19128836e7df245f72
+  GIT_SHALLOW TRUE
+  SOURCE_SUBDIR
+  dont-add-this-project-with-add-subdirectory)
+
+FetchContent_Declare(
+  nanobind
+  GIT_REPOSITORY https://github.com/wjakob/nanobind
+  GIT_TAG a0ed2587f1089ef7657e2ed49ad6756b01c74e9f)
+
+FetchContent_Declare(
+  nvtx
+  GIT_REPOSITORY https://github.com/NVIDIA/NVTX
+  GIT_TAG v3.1.0-c-cpp # a1ceb0677f67371ed29a2b1c022794f077db5fe7
+  GIT_SHALLOW TRUE
+  # NOTE: TensorRT-LLM only uses the headers
+  SOURCE_SUBDIR
+  dont-add-this-project-with-add-subdirectory)
+
+FetchContent_Declare(
+  pybind11
+  GIT_REPOSITORY https://github.com/pybind/pybind11
+  GIT_TAG f99ffd7e03001810a3e722bf48ad1a9e08415d7d)
+
+FetchContent_Declare(
+  ucxx
+  GIT_REPOSITORY https://github.com/rapidsai/ucxx
+  GIT_TAG 16eaa57c8d98c8ef54d666a2d2b11e76cfa565f5
+  # NOTE: See the notes in cpp/CMakeLists.txt, where this project is built at
+  # configure time and then included via find_package
+  SOURCE_SUBDIR
+  dont-add-this-project-with-add-subdirectory)
+
+FetchContent_Declare(
+  xgrammar
+  GIT_REPOSITORY https://github.com/mlc-ai/xgrammar
+  GIT_TAG v0.1.25 # e4e816f5f0fe39f5b1601a17a4552307fa3b70ff
+  GIT_SHALLOW TRUE
+  # NOTE: TensorRT-LLM only uses the headers
+  SOURCE_SUBDIR
+  dont-add-this-project-with-add-subdirectory)
--- a/3rdparty/DeepGEMM
+++ b/3rdparty/DeepGEMM
@ -1 +0,0 @@
-Subproject commit 9fa5965e265e27995f539e0dd73a06351a8a9eaf
--- a/3rdparty/NVTX
+++ b/3rdparty/NVTX
@ -1 +0,0 @@
-Subproject commit a1ceb0677f67371ed29a2b1c022794f077db5fe7
--- a/3rdparty/README.md
+++ b/3rdparty/README.md
@ -0,0 +1,13 @@
+# Adding new third-party Dependencies
+
+The markdown files in this directory contain playbooks for how to add new
+third-party dependencies. Please see the document that matches the kind of
+dependency you want to add:
+
+* For C++ dependencies compiled into the extension modules via the cmake build
+  and re-distributed with the wheel [see here][1]
+* For python dependencies declared via wheel metadata and installed in the
+  container via pip [see here][2]
+
+[1]: cpp-thirdparty.md
+[2]: py-thirdparty.md
--- a/3rdparty/cpp-thirdparty.md
+++ b/3rdparty/cpp-thirdparty.md
@ -0,0 +1,337 @@
+# Adding new C++ Dependencies
+
+## Step 1: Make the package available to the build
+
+First, decide if you must install the package in the container or if you
+may defer fetching until the build phase. In general, *prefer to fetch
+packages during the build phase*. You may be required to install
+packages into the container, however, if there is a runtime component
+(e.g. shared objects) that cannot be reasonably distributed with the
+wheel.
+
+### Install in the container
+
+#### OS packages via the OS package manager (e.g. apt, dnf)
+
+Add your package to one of the existing shell scripts used by the docker build
+under [docker/common/][1]. Find the location where the package manager is
+invoked, and add the name of your package there.
+
+NOTE: Internal compliance tooling will automatically detect the
+installation of this package and fetch sources using the source-fetching
+facilities of the OS package manager.
+
+[1]: https://github.com/NVIDIA/TensorRT-LLM/tree/main/docker/common
+
+#### Python Packages via pip
+
+If it makes sense, add your package to one of the existing shell scripts used by
+the docker build under [docker/common/][2]. Grep for "pip3 install" to see
+existing invocations. If none of the existing shell scripts make sense, add a
+new shell script to install your package and then invoke that script in
+Dockerfile.multi.
+
+NOTE: If the new python package you are adding has a compiled component (e.g. a
+python extension module), you must coordinate with the [Security Team][20] to
+ensure that the source for this component is managed correctly.
+
+[2]: https://github.com/NVIDIA/TensorRT-LLM/tree/main/docker/common
+
+#### Tarball packages via HTTP/FTP
+
+Invoke `wget` in a shell script which is called from the docker build file.
+When it makes sense, please prefer to extend an existing script in
+[docker/common/][3] rather than creating a new one. If you are downloading a
+binary package, you must also download the source package that produced that
+binary.
+
+Ensure that the source package is copied to /third-party-source and retained
+after all cleanup within the docker image layer.
+
+[3]: https://github.com/NVIDIA/TensorRT-LLM/tree/main/docker/common
+
+### Fetch during the build
+
+#### Python Packages via pip
+
+Add an entry to [requirements-dev.txt][4].
+The package will be installed by build\_wheel.py during virtual
+environment initialization prior to configuring the build with cmake.
+Include a comment indicating the intended usage of the package.
+
+[4]: https://github.com/NVIDIA/TensorRT-LLM/blob/main/requirements-dev.txt
+
+**Example:**
+
+`requirements-dev.txt`:
+
+``` requirements.txt
+# my-package is needed by <feature> where it is used for <reason>
+my-package==1.2.24
+```
+
+#### C/C++ Packages via conan
+
+Add a new entry to [conandata.yml][6] indicating the package version for the
+dependency you are adding. Include a yaml comment indicating the intended usage
+of the package. Then add a new invocation of `self.requires()` within the
+`def requirements(self)` method of [conanfile.py][7], referencing the version
+you added to conandata.
+
+[6]: https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/conandata.yml
+[7]: https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/conanfile.py
+
+**Example:**
+
+`conandata.yml`:
+
+```.yml
+# my_dependency is needed by <feature> where it is used for <reason>
+my_dependency: 1.2.24+1
+```
+
+`conanfile.py`:
+
+```.py
+def requirements(self):
+    ...
+    my_dependency_version = self.conandata["my_dependency"]
+    self.requires(f"my_dependency/{my_dependency_version}")
+```
+
+#### Source integration via CMake
+
+If you have a package you need to build from source, then use CMake
+[FetchContent][8] or [ExternalProject][9] to fetch the package sources and
+integrate it with the build. See the details in the next section.
+
+[8]: https://cmake.org/cmake/help/latest/module/FetchContent.html
+[9]: https://cmake.org/cmake/help/latest/module/ExternalProject.html#id1
+
+#### git Submodule - Don't Use
+
+Please *avoid use of git-submodule*. If, for some reason, the CMake integrations
+described below don't work and git-submodule is absolutely required, please add
+the submodule under the 3rdparty directory.
+
+**Rationale:**
+
+For a source-code dependency distributed via git,
+FetchContent/ExternalProject and git submodules both ultimately contain
+the same referential information (repository URL, commit sha) and, at
+the end of the day, do the same things. However,
+FetchContent/ExternalProject have the following advantages:
+
+1.  The git operations happen during the build and are interleaved with the rest
+    of the build processing, rather than requiring an additional step managed
+    outside of CMake.
+
+2.  The fetch, patch, and build steps for the sub project are individually named
+    in the build, so any failures are more clearly identified.
+
+3.  The build state is better contained within the build tree where it is less
+    prone to interference by development actions.
+
+4.  For source code that is modified, FetchContent/ExternalProject can manage
+    application of the patches making it clear what modifications are present.
+
+5.  The build does not have to make assumptions about the version control
+    configuration of the source tree, which may be incorrect because the tree
+    is bind-mounted in a container. For example, running
+    `git submodule update --init` inside a container will corrupt the git
+    configuration outside the container if the source tree is a git worktree.
+
+6.  External project references and their patches are collected under a
+    narrower surface, rather than being spread across different tools. This
+    makes it easier to track third-party dependencies as well as to recognize
+    them during code review.
+
+**Example (only if a submodule is unavoidable):**
+
+``` bash
+git submodule add https://github.com/some-organization/some-project.git 3rdparty/some-project
+```
+
+
+## Step 2: Integrate the package
+
+There are many ways to integrate a package with the build through cmake.
+
+### find\_package for binary packages
+
+For binary packages (os-provided via apt-get or yum, or conan-provided), prefer
+the use of [find\_package][10] to integrate the package into the build. Conan
+will generate a find-script for packages that don't already come with a CMake
+configuration file, and the conan-specific logic is provided through the
+conan-generated toolchain already used in our build.
+
+For any packages which do not have provided find modules (either built-in, or
+available from conan), please implement one in [cpp/cmake/modules][11]. Please
+do not add "direct" invocations of `find_library` / `add_library` / `find_file`
+/ `find_path` outside of a find module for the package.
+
+Please add invocations of `find_package` directly in the root CMake file.
+
+[10]: https://cmake.org/cmake/help/latest/command/find_package.html
+[11]: https://github.com/NVIDIA/TensorRT-LLM/tree/main/cpp/cmake/modules
+
+**Example:**
+
+cpp/CMakeLists.txt
+
+```.cmake
+find_package(NIXL)
+```
+
+cpp/cmake/modules/FindNIXL.cmake
+```.cmake
+...
+    find_library(
+      NIXL_LIBRARY nixl
+      HINTS
+        ${NIXL_ROOT}/lib/${NIXL_TARGET_ARCH}
+        ${NIXL_ROOT}/lib64)
+...
+    add_library(NIXL::nixl SHARED IMPORTED)
+    set_target_properties(
+      NIXL::nixl
+      PROPERTIES
+        INTERFACE_INCLUDE_DIRECTORIES ${NIXL_INCLUDE_DIR}
+        IMPORTED_LOCATION ${NIXL_LIBRARY}
+                          ${NIXL_BUILD_LIBRARY}
+                          ${SERDES_LIBRARY}
+    )
+```
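+
+As a rough sketch of how a complete find module can be laid out, the following
+uses a hypothetical package named `Foo` (the names `FindFoo.cmake`, `FOO_ROOT`,
+`foo.h`, and `Foo::foo` are illustrative assumptions and do not exist in the
+repository). It relies only on standard CMake commands:
+
+```.cmake
+# cpp/cmake/modules/FindFoo.cmake (hypothetical)
+include(FindPackageHandleStandardArgs)
+
+# Locate the header and the library; FOO_ROOT is an assumed hint variable.
+find_path(FOO_INCLUDE_DIR foo.h HINTS ${FOO_ROOT}/include)
+find_library(FOO_LIBRARY foo HINTS ${FOO_ROOT}/lib ${FOO_ROOT}/lib64)
+
+# Sets Foo_FOUND and reports a standard message when something is missing.
+find_package_handle_standard_args(
+  Foo REQUIRED_VARS FOO_LIBRARY FOO_INCLUDE_DIR)
+
+# Expose an imported target so consumers never touch the raw variables.
+if(Foo_FOUND AND NOT TARGET Foo::foo)
+  add_library(Foo::foo UNKNOWN IMPORTED)
+  set_target_properties(
+    Foo::foo
+    PROPERTIES IMPORTED_LOCATION "${FOO_LIBRARY}"
+               INTERFACE_INCLUDE_DIRECTORIES "${FOO_INCLUDE_DIR}")
+endif()
+```
+
+A consumer would then call `find_package(Foo)` in the root CMake file and link
+with `target_link_libraries(<target> PRIVATE Foo::foo)`.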
+
+### FetchContent for source packages with compatible cmake builds
+
+For source packages that have a compatible cmake build (e.g. where
+add\_subdirectory will work correctly), please use [FetchContent][12] to
+download the sources and integrate them into the build. Please add new
+invocations of FetchContent\_Declare in [3rdparty/CMakeLists.txt][13]. Add new
+invocations of FetchContent\_MakeAvailable wherever it makes sense in the build
+where you are integrating it, but prefer the root listfile for that build
+([cpp/CMakeLists.txt][14] for the primary build).
+
+CODEOWNERS for this file will consist of PLC reviewers who verify that
+third-party license compliance strategies are being followed.
+
+If the dependency you are adding has modified sources, please do the
+following:
+
+1.  Create a repository on gitlab to mirror the upstream source files. If the
+    upstream is also in git, please use the gitlab "mirror" repository option.
+    Otherwise, please use branches/tags to help identify the upstream source
+    versions.
+
+2.  Track nvidia changes in a branch. Use a linear (trunk-based) development
+    strategy. Use meaningful, concise commit message subjects and
+    comprehensive commit messages for the changes applied.
+
+3.  Use `git format-patch <upstream-commit>...HEAD` to create a list of
+    patches, one file per commit.
+
+4.  Add your patches under 3rdparty/patches/\<package-name\>.
+
+5.  Use CMake's [PATCH\_COMMAND][15] option to apply the patches during the
+    build process.
+
+[12]: https://cmake.org/cmake/help/latest/module/FetchContent.html
+[13]: https://github.com/NVIDIA/TensorRT-LLM/blob/main/3rdparty/CMakeLists.txt
+[14]: https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/CMakeLists.txt
+[15]: https://cmake.org/cmake/help/latest/module/ExternalProject.html#patch-step-options
+
+**Example:**
+
+3rdparty/CMakeLists.txt
+
+```.cmake
+FetchContent_Declare(
+  pybind11
+  GIT_REPOSITORY https://github.com/pybind/pybind11.git
+  GIT_TAG        f99ffd7e03001810a3e722bf48ad1a9e08415d7d
+)
+```
+
+cpp/CMakeLists.txt
+
+```.cmake
+FetchContent_MakeAvailable(pybind11)
+```
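+
+For a dependency with modified sources, the patch steps listed above might be
+wired up roughly as follows. This is a sketch only: `some_project`, its URL, the
+commit placeholder, and the patch file name are hypothetical, and `git apply` is
+just one possible patch tool.
+
+```.cmake
+# 3rdparty/CMakeLists.txt (hypothetical entry)
+FetchContent_Declare(
+  some_project
+  GIT_REPOSITORY https://github.com/some-organization/some-project.git
+  # Placeholder: replace with the upstream commit the patches were made against.
+  GIT_TAG <upstream-commit-sha>
+  # Apply a patch produced by `git format-patch` (assumed file name).
+  PATCH_COMMAND
+    git apply
+    ${CMAKE_CURRENT_LIST_DIR}/patches/some_project/0001-example-change.patch)
+```
+
+Note that CMake may re-run the patch step when the project is reconfigured, so
+the patch command should tolerate sources that were already patched.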
+
+### ExternalProject
+
+If the package you are adding doesn't support FetchContent (e.g. if it's not
+built by CMake or if its CMake configuration doesn't nest well), then please use
+[ExternalProject][16]. In this case that project's build system will be invoked
+as a build step of the primary build system. Note that, unless both the primary
+and child build systems are GNU Make, they will not share a job server and will
+independently schedule parallelism (e.g. -j flags).
+
+[16]: https://cmake.org/cmake/help/latest/module/ExternalProject.html#id1
+
+**Example:**
+
+```.cmake
+ExternalProject_Add(
+  nvshmem_project
+  URL https://developer.download.nvidia.com/compute/nvshmem/redist/libnvshmem/linux-x86_64/libnvshmem-linux-x86_64-3.2.5_cuda12-archive.tar.xz
+  URL_HASH ${NVSHMEM_URL_HASH}
+  PATCH_COMMAND patch -p1 --forward --batch -i
+                ${DEEP_EP_SOURCE_DIR}/third-party/nvshmem.patch
+  ...
+  CMAKE_CACHE_ARGS
+    -DCMAKE_C_COMPILER:STRING=${CMAKE_C_COMPILER}
+    -DCMAKE_C_COMPILER_LAUNCHER:STRING=${CMAKE_C_COMPILER_LAUNCHER}
+  ...
+  BINARY_DIR ${CMAKE_CURRENT_BINARY_DIR}/nvshmem-build
+  BUILD_BYPRODUCTS
+    ${CMAKE_CURRENT_BINARY_DIR}/nvshmem-build/src/lib/libnvshmem.a
+)
+add_library(nvshmem_project::nvshmem STATIC IMPORTED)
+add_dependencies(nvshmem_project::nvshmem nvshmem_project)
+...
+set_target_properties(
+  nvshmem_project::nvshmem
+  PROPERTIES IMPORTED_LOCATION
+             ${CMAKE_CURRENT_BINARY_DIR}/nvshmem-build/src/lib/libnvshmem.a
+             INTERFACE_INCLUDE_DIRECTORIES
+             ${CMAKE_CURRENT_BINARY_DIR}/nvshmem-build/src/include)
+```
+
+## Step 3: Update third-party attributions and license tracking
+
+1.  Clone the dependency source code to an NVIDIA-controlled repository. The
+    consumed commit must be stored as-received (ensure the consumed commit-sha
+    is present in the clone). For sources available via git (or git-adaptable)
+    SCM, mirror the repository in the [oss-components][18] gitlab project.
+
+2.  Collect the license text of the consumed commit.
+
+3.  If the license does not include a copyright notice, collect any copyright
+    notices that were originally published with the dependency (these may be on
+    individual file levels, in metadata files, or in packaging control files).
+
+4.  Add the license and copyright notices to the ATTRIBUTIONS-CPP-x86\_64.md and
+    ATTRIBUTIONS-CPP-aarch64.md files.
+
+CODEOWNERS for ATTRIBUTIONS-CPP-\*.md are members of the PLC team, and modifying
+these files will signal to reviewers that they are verifying that your change
+follows the process in this document.
+
+[18]: https://gitlab.com/nvidia/tensorrt-llm/oss-components
+
+## Step 4: File a Jira ticket if you need help from the Security team
+
+This step is optional; follow it only if you need assistance from the Security
+team.
+
+File a Jira ticket using the issue template [TRTLLM-8383][19] to request
+inclusion of this new dependency and initiate license and/or security review.
+The Security Team will triage and assign the ticket.
+
+If you don’t have access to the Jira project, please email the [Security
+Team][20].
+
+
+[19]: https://jirasw.nvidia.com/browse/TRTLLM-8383
+[20]: mailto:TensorRT-LLM-Security@nvidia.com
--- a/3rdparty/cppzmq
+++ b/3rdparty/cppzmq
@ -1 +0,0 @@
-Subproject commit c94c20743ed7d4aa37835a5c46567ab0790d4acc
--- a/3rdparty/cutlass
+++ b/3rdparty/cutlass
@ -1 +0,0 @@
-Subproject commit f3fde58372d33e9a5650ba7b80fc48b3b49d40c8
--- a/3rdparty/cxxopts
+++ b/3rdparty/cxxopts
@ -1 +0,0 @@
-Subproject commit eb787304d67ec22f7c3a184ee8b4c481d04357fd
--- a/3rdparty/flash-mla
+++ b/3rdparty/flash-mla
@ -1 +0,0 @@
-Subproject commit 1408756a88e52a25196b759eaf8db89d2b51b5a1
--- a/3rdparty/json
+++ b/3rdparty/json
@ -1 +0,0 @@
-Subproject commit 55f93686c01528224f448c19128836e7df245f72
--- a/3rdparty/nanobind
+++ b/3rdparty/nanobind
@ -1 +0,0 @@
-Subproject commit a0ed2587f1089ef7657e2ed49ad6756b01c74e9f
--- a/3rdparty/py-thirdparty.md
+++ b/3rdparty/py-thirdparty.md
@ -0,0 +1,69 @@
+# Adding new python dependencies via pip
+
+If you add a new python dependency and that dependency will be installed in
+(and, thus, distributed with) the container, please follow this process.
+
+## Third-party packages without modification
+
+If the package you wish to add does not require modification, then please follow
+these steps:
+
+1. Add your new dependency to one of the "pip install" invocations among the
+   scripts in docker/common/. If none of the existing ones make sense, then
+   add a new script to install your package and add a new line to
+   Dockerfile.multi to run your script.
+2. Update ATTRIBUTIONS-Python.md to include all new dependencies. Note that this
+   must cover the transitive closure of all dependencies. The dependency you
+   added may have pulled in new transitive dependencies and we must ensure all
+   are attributed in this file.
+3. Verify that your newly added package is listed in the compliance reports and
+   that sources are pulled via the compliance tooling.
+
+## Third-party packages with modification
+
+If you wish to depend on a package with nvidia-contributed modifications that
+haven't been upstreamed then please follow these steps:
+
+1. File an OSRB request to fork/contribute to a 3rd party open source package.
+   https://confluence.nvidia.com/display/OSS/Contribution+to+Open+Source
+2. Clone the original repository to a new public nvidia-controlled location
+   (e.g. https://gitlab.com/nvidia/tensorrt-llm/oss-components/)
+3. Register this new repository under nspec
+4. Make modifications in that public repository. Ensure that the clone
+   repository clearly indicates the software license via /LICENSE.txt in the
+   root of the repository. Ensure that this file contains a copyright statement
+   indicating copyright held by the original author(s) and Nvidia.
+5. Publish the modified package to pypi under a new name (e.g. nvidia-<package>)
+6. Add your new dependency to one of the "pip install" invocations among the
+   scripts in docker/common/. If none of the existing ones make sense, then
+   add a new script to install your package and add a new line to
+   Dockerfile.multi to run your script.
+7. Update ATTRIBUTIONS-Python.md to include all new dependencies. Note that this
+   must cover the transitive closure of all dependencies. The dependency you
+   added may have pulled in new transitive dependencies and we must ensure all
+   are attributed in this file.
+8. Verify that your newly added package is listed in the compliance reports and
+   that sources are pulled via the compliance tooling.
+
+Notes:
+* For pip/uv-installed versions of TensorRT-LLM, the modified package will be
+  installed as a transitive dependency by the package manager
+* For the container distribution of TensorRT-LLM, the modified package will be
+  pre-installed from the same pypi location via pip
+
+## Individual third-party sources with modification
+
+If you wish to integrate third-party source files with nvidia-contributed
+modifications that haven't been upstreamed then please follow these steps:
+
+1. File an OSRB request to use open source:
+   https://confluence.nvidia.com/display/OSS/So+you+want+to+use+open+source+in+your+product
+2. Clone the original repository to a new nvidia-controlled location
+   (e.g. https://gitlab.com/nvidia/tensorrt-llm/oss-components/)
+3. Make modifications in that repository on a branch so that the versions
+   "as-used" can be easily found and the diff against upstream easily viewed.
+4. Copy the desired source files into the TensorRT-LLM repository.
+5. Update ATTRIBUTIONS-Python.md to include attribution for the source files
+   you have added. Note the terms of the license on the original repository
+   and see the examples already in the file to understand what needs to be
+   stated.
--- a/3rdparty/pybind11
+++ b/3rdparty/pybind11
@ -1 +0,0 @@
-Subproject commit f99ffd7e03001810a3e722bf48ad1a9e08415d7d
--- a/3rdparty/ucxx
+++ b/3rdparty/ucxx
@ -1 +0,0 @@
-Subproject commit 16eaa57c8d98c8ef54d666a2d2b11e76cfa565f5
--- a/3rdparty/xgrammar
+++ b/3rdparty/xgrammar
@ -1 +0,0 @@
-Subproject commit e4e816f5f0fe39f5b1601a17a4552307fa3b70ff
--- a/ATTRIBUTIONS-CPP-aarch64.md
+++ b/ATTRIBUTIONS-CPP-aarch64.md
@ -14889,6 +14889,24 @@ Chen, Tianqi

 ```

+## Mooncake
+
+- **Repository URL**: https://github.com/kvcache-ai/Mooncake
+- **License URL**: https://github.com/kvcache-ai/Mooncake/blob/main/LICENSE-APACHE
+- **License name**: Apache 2.0
+
+### Authors
+
+© Copyright 2025, Mooncake Team.
+Copyright (c) Meta Platforms, Inc. and affiliates.
+Copyright 2024 KVCache.AI
+Ruoyu Qin
+Zheming Li
+Weiran He
+Mingxing Zhang
+Yongwei Wu
+Weimin Zheng
+Xinran Xu
 ## flashinfer

 ### License Text
--- a/ATTRIBUTIONS-CPP-x86_64.md
+++ b/ATTRIBUTIONS-CPP-x86_64.md
@ -14697,6 +14697,24 @@ Chen, Tianqi

 ```

+## Mooncake
+
+- **Repository URL**: https://github.com/kvcache-ai/Mooncake
+- **License URL**: https://github.com/kvcache-ai/Mooncake/blob/main/LICENSE-APACHE
+- **License name**: Apache 2.0
+
+### Authors
+
+© Copyright 2025, Mooncake Team.
+Copyright (c) Meta Platforms, Inc. and affiliates.
+Copyright 2024 KVCache.AI
+Ruoyu Qin
+Zheming Li
+Weiran He
+Mingxing Zhang
+Yongwei Wu
+Weimin Zheng
+Xinran Xu
 ## flashinfer

 ### License Text
--- a/ATTRIBUTIONS-Python.md
+++ b/ATTRIBUTIONS-Python.md
--- a/CODING_GUIDELINES.md
+++ b/CODING_GUIDELINES.md
@ -487,9 +487,17 @@ else:
    f.read()
 ```

+## Documentation Guidelines
+
+#### CLI Options in Documentation
+1. When documenting CLI commands for `trtllm-serve`, `trtllm-bench`, `trtllm-eval`, or similar tools, prefer using `--config` over `--extra_llm_api_options` for specifying configuration files.
+   - `--config` is the preferred, shorter alias for configuration file options.
+   - Example: `trtllm-serve --model <model_path> --config config.yaml` (preferred)
+   - Avoid: `trtllm-serve --model <model_path> --extra_llm_api_options config.yaml`
+
 ## NVIDIA Copyright

-1. All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year.  The following block of text should be prepended to the top of all files.  This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
+1. All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification.  The following block of text should be prepended to the top of all files.  This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
 ```cpp
 /*
 * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
--- a/CONTAINER_SOURCE.md
+++ b/CONTAINER_SOURCE.md
@ -0,0 +1,8 @@
+# Container Source Notices
+
+A `NOTICES.txt` file containing a link to the open source archive for a given container can be found at `/` in both the `release` and `devel` images.
+
+Generally, source archives for each image and its tags can be found at the below links:
+
+* [TensorRT-LLM Release](https://opensource.nvidia.com/oss/teams/nvidia/release/index.html)
+* [TensorRT-LLM Develop](https://opensource.nvidia.com/oss/teams/nvidia/devel/index.html)
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@ -8,33 +8,9 @@

 ## Coding Guidelines

-* Coding style for TensorRT-LLM can be found [in this document](CODING_GUIDELINES.md).
+TensorRT-LLM Coding Style can be found [in this document](CODING_GUIDELINES.md).

-* All contributed C++ code should be formatted following the rules in TensorRT-LLM's [clang-format](.clang-format) file. The recommended version is clang-format>=14.0.
-
-* Changes can be formatted with the following command:
-
-  ```bash
-  # Commit ID is optional - if unspecified, run format on staged changes.
-  git-clang-format --style file [commit ID/reference]
-  ```
-
-* All contributed Python code should be formatted using the `black` Python package. The recommended version is `black>=23.0`
-
-* Changes can be formatted with the following command:
-
-  ```bash
-  git diff --name-only | grep "*.py" | xargs black -l 120
-  ```
-
-* Try to keep pull requests (PRs) as concise as possible:
-  * Avoid committing commented-out code.
-  * Wherever possible, each PR should address a single concern. If there are several otherwise-unrelated things that should be fixed to reach a desired endpoint, our recommendation is to open several PRs and indicate the dependencies in the description. The more complex the changes are in a single PR, the more time it will take to review those changes.
-
-## Coding Style
-
-We use `pre-commit` for automatic code formatting and validation. Install the `pre-commit` package in your local
-Python environment.
+We use `pre-commit` for automatic code formatting and validation. Install the `pre-commit` package in your local Python environment.

 ```bash
 pip install pre-commit
@ -73,6 +49,9 @@ mdformat.................................................................Passed

 If any files were modified by this hook, you will need to stage and commit them again.

+In addition, please try to keep pull requests (PRs) as concise as possible:
+* Avoid committing commented-out code.
+* Wherever possible, each PR should address a single concern. If there are several otherwise-unrelated things that should be fixed to reach a desired endpoint, our recommendation is to open several PRs and indicate the dependencies in the description. The more complex the changes are in a single PR, the more time it will take to review those changes.

 ## Pull Requests

--- a/132
+++ b/132
@ -1,3 +1,84 @@
+Copyright (c) 2011-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+This project is licensed under the Apache 2.0 license, whose full license text is available below.
+
+This project contains portions of code that are based on or derived from
+other open source projects, which may have different licenses whose text
+is available below.
+
+All modifications and additions to other projects are licensed under the
+Apache License 2.0 unless otherwise specified. Please refer to the individual
+file headers for specific copyright and license information.
+
+Below is a list of other projects that have portions contained by this project:
+
+--------------------------------------------------------------------------------
+causal-conv1d
+--------------------------------------------------------------------------------
+Original Source: https://github.com/Dao-AILab/causal-conv1d
+Copyright (c) 2024, Tri Dao.
+Licensed under the BSD 3-Clause License
+
+--------------------------------------------------------------------------------
+flash-linear-attention
+--------------------------------------------------------------------------------
+Original Source: https://github.com/fla-org/flash-linear-attention
+Copyright (c) 2023-2025 Songlin Yang
+Licensed under the MIT License
+
+--------------------------------------------------------------------------------
+InstructEval
+--------------------------------------------------------------------------------
+Original Source: https://github.com/declare-lab/instruct-eval
+Copyright (c) 2020 Dan Hendrycks
+Copyright (c) 2023 Deep Cognition and Language Research (DeCLaRe) Lab
+Licensed under the MIT License
+
+--------------------------------------------------------------------------------
+Mamba
+--------------------------------------------------------------------------------
+Original Source: https://github.com/state-spaces/mamba
+Copyright 2023 Tri Dao, Albert Gu
+Licensed under the Apache License 2.0
+
+--------------------------------------------------------------------------------
+SGLang
+--------------------------------------------------------------------------------
+Original Source: https://github.com/sgl-project/sglang
+Copyright contributors to the SGLang project
+Licensed under the Apache License 2.0
+
+--------------------------------------------------------------------------------
+Text Generation Inference
+--------------------------------------------------------------------------------
+Original Source: https://github.com/huggingface/text-generation-inference
+Copyright 2022 Hugging Face
+Licensed under the Apache License 2.0
+
+--------------------------------------------------------------------------------
+Transformers
+--------------------------------------------------------------------------------
+Original Source: https://github.com/huggingface/transformers
+Copyright 2018 The HuggingFace Team
+Licensed under the Apache License 2.0
+
+--------------------------------------------------------------------------------
+XGrammar
+--------------------------------------------------------------------------------
+Original Source: https://github.com/mlc-ai/xgrammar
+Copyright (c) 2024 by XGrammar Contributors
+Licensed under the Apache License 2.0
+
+--------------------------------------------------------------------------------
+vLLM
+--------------------------------------------------------------------------------
+Original Source: https://github.com/vllm-project/vllm
+Copyright contributors to the vLLM project
+Licensed under the Apache License 2.0
+
+================================================================================
+                              Apache 2.0 LICENSE
+================================================================================

                                 Apache License
                           Version 2.0, January 2004
@ -200,3 +281,54 @@
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
+
+================================================================================
+                              MIT LICENSE
+================================================================================
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
+
+================================================================================
+                              BSD 3-Clause License
+================================================================================
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+
+* Redistributions of source code must retain the above copyright notice, this
+  list of conditions and the following disclaimer.
+
+* Redistributions in binary form must reproduce the above copyright notice,
+  this list of conditions and the following disclaimer in the documentation
+  and/or other materials provided with the distribution.
+
+* Neither the name of the copyright holder nor the names of its
+  contributors may be used to endorse or promote products derived from
+  this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
--- a/README.md
+++ b/README.md
@ -9,11 +9,11 @@ state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.<
 [![python](https://img.shields.io/badge/python-3.12-green)](https://www.python.org/downloads/release/python-3123/)
 [![python](https://img.shields.io/badge/python-3.10-green)](https://www.python.org/downloads/release/python-31012/)
 [![cuda](https://img.shields.io/badge/cuda-13.0.0-green)](https://developer.nvidia.com/cuda-downloads)
-[![trt](https://img.shields.io/badge/TRT-10.13.2-green)](https://developer.nvidia.com/tensorrt)
-[![version](https://img.shields.io/badge/release-1.2.0rc2-green)](./tensorrt_llm/version.py)
-[![license](https://img.shields.io/badge/license-Apache%202-blue)](./LICENSE)
+[![torch](https://img.shields.io/badge/torch-2.9.0-green)](https://pytorch.org)
+[![version](https://img.shields.io/badge/release-1.2.0rc8-green)](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/version.py)
+[![license](https://img.shields.io/badge/license-Apache%202-blue)](https://github.com/NVIDIA/TensorRT-LLM/blob/main/LICENSE)

-[Architecture](./docs/source/torch/arch_overview.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Performance](./docs/source/performance/perf-overview.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Examples](https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Documentation](https://nvidia.github.io/TensorRT-LLM/)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Roadmap](https://github.com/NVIDIA/TensorRT-LLM/issues?q=is%3Aissue%20state%3Aopen%20label%3Aroadmap)
+[Architecture](https://nvidia.github.io/TensorRT-LLM/developer-guide/overview.html)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Performance](https://nvidia.github.io/TensorRT-LLM/developer-guide/perf-overview.html)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Examples](https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Documentation](https://nvidia.github.io/TensorRT-LLM/)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Roadmap](https://github.com/NVIDIA/TensorRT-LLM/issues?q=is%3Aissue%20state%3Aopen%20label%3Aroadmap)

 ---
 <div align="left">
@ -21,40 +21,40 @@ state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.<
 ## Tech Blogs

 * [10/13] Scaling Expert Parallelism in TensorRT LLM (Part 3: Pushing the Performance Boundary)
-✨ [➡️ link](./docs/source/blogs/tech_blog/blog14_Scaling_Expert_Parallelism_in_TensorRT-LLM_part3.md)
+✨ [➡️ link](https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog14_Scaling_Expert_Parallelism_in_TensorRT-LLM_part3.html)

 * [09/26] Inference Time Compute Implementation in TensorRT LLM
-✨ [➡️ link](./docs/source/blogs/tech_blog/blog13_Inference_Time_Compute_Implementation_in_TensorRT-LLM.md)
+✨ [➡️ link](https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog13_Inference_Time_Compute_Implementation_in_TensorRT-LLM.html)

 * [09/19] Combining Guided Decoding and Speculative Decoding: Making CPU and GPU Cooperate Seamlessly
-✨ [➡️ link](./docs/source/blogs/tech_blog/blog12_Combining_Guided_Decoding_and_Speculative_Decoding.md)
+✨ [➡️ link](https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog12_Combining_Guided_Decoding_and_Speculative_Decoding.html)

 * [08/29] ADP Balance Strategy
-✨ [➡️ link](./docs/source/blogs/tech_blog/blog10_ADP_Balance_Strategy.md)
+✨ [➡️ link](https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog10_ADP_Balance_Strategy.html)

 * [08/05] Running a High-Performance GPT-OSS-120B Inference Server with TensorRT LLM
-✨ [➡️ link](./docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md)
+✨ [➡️ link](https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.html)

 * [08/01] Scaling Expert Parallelism in TensorRT LLM (Part 2: Performance Status and Optimization)
-✨ [➡️ link](./docs/source/blogs/tech_blog/blog8_Scaling_Expert_Parallelism_in_TensorRT-LLM_part2.md)
+✨ [➡️ link](https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog8_Scaling_Expert_Parallelism_in_TensorRT-LLM_part2.html)

 * [07/26] N-Gram Speculative Decoding in TensorRT LLM
-✨ [➡️ link](./docs/source/blogs/tech_blog/blog7_NGram_performance_Analysis_And_Auto_Enablement.md)
+✨ [➡️ link](https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog7_NGram_performance_Analysis_And_Auto_Enablement.html)

 * [06/19] Disaggregated Serving in TensorRT LLM
-✨ [➡️ link](./docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md)
+✨ [➡️ link](https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.html)

 * [06/05] Scaling Expert Parallelism in TensorRT LLM (Part 1: Design and Implementation of Large-scale EP)
-✨ [➡️ link](./docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md)
+✨ [➡️ link](https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.html)

 * [05/30] Optimizing DeepSeek R1 Throughput on NVIDIA Blackwell GPUs: A Deep Dive for Developers
-✨ [➡️ link](./docs/source/blogs/tech_blog/blog3_Optimizing_DeepSeek_R1_Throughput_on_NVIDIA_Blackwell_GPUs.md)
+✨ [➡️ link](https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog3_Optimizing_DeepSeek_R1_Throughput_on_NVIDIA_Blackwell_GPUs.html)

 * [05/23] DeepSeek R1 MTP Implementation and Optimization
-✨ [➡️ link](./docs/source/blogs/tech_blog/blog2_DeepSeek_R1_MTP_Implementation_and_Optimization.md)
+✨ [➡️ link](https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog2_DeepSeek_R1_MTP_Implementation_and_Optimization.html)

 * [05/16] Pushing Latency Boundaries: Optimizing DeepSeek-R1 Performance on NVIDIA B200 GPUs
-✨ [➡️ link](./docs/source/blogs/tech_blog/blog1_Pushing_Latency_Boundaries_Optimizing_DeepSeek-R1_Performance_on_NVIDIA_B200_GPUs.md)
+✨ [➡️ link](https://nvidia.github.io/TensorRT-LLM/blogs/tech_blog/blog1_Pushing_Latency_Boundaries_Optimizing_DeepSeek-R1_Performance_on_NVIDIA_B200_GPUs.html)

 ## Latest News
 * [08/05] 🌟 TensorRT LLM delivers Day-0 support for OpenAI's latest open-weights models: GPT-OSS-120B [➡️ link](https://huggingface.co/openai/gpt-oss-120b) and GPT-OSS-20B [➡️ link](https://huggingface.co/openai/gpt-oss-20b)
@ -63,11 +63,11 @@ state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.<
 * [05/22] Blackwell Breaks the 1,000 TPS/User Barrier With Meta’s Llama 4 Maverick
 ✨ [➡️ link](https://developer.nvidia.com/blog/blackwell-breaks-the-1000-tps-user-barrier-with-metas-llama-4-maverick/)
 * [04/10] TensorRT LLM DeepSeek R1 performance benchmarking best practices now published.
-✨ [➡️ link](./docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md)
+✨ [➡️ link](https://nvidia.github.io/TensorRT-LLM/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.html)

 * [04/05] TensorRT LLM can run Llama 4 at over 40,000 tokens per second on B200 GPUs!

-![L4_perf](./docs/source/media/l4_launch_perf.png)
+![L4_perf](https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/docs/source/media/l4_launch_perf.png)


 * [03/22] TensorRT LLM is now fully open-source, with developments moved to GitHub!
@ -164,7 +164,7 @@ state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.<
 [➡️ link](https://www.bentoml.com/blog/tuning-tensor-rt-llm-for-optimal-serving-with-bentoml)


-* [2024/08/20] 🏎️SDXL with #TensorRT Model Optimizer ⏱️⚡ 🏁 cache diffusion 🏁 quantization aware training 🏁 QLoRA 🏁 #Python 3.12
+* [2024/08/20] 🏎️SDXL with #Model Optimizer ⏱️⚡ 🏁 cache diffusion 🏁 quantization aware training 🏁 QLoRA 🏁 #Python 3.12
 [➡️ link](https://developer.nvidia.com/blog/nvidia-tensorrt-model-optimizer-v0-15-boosts-inference-performance-and-expands-model-support/)

 * [2024/08/13] 🐍 DIY Code Completion with #Mamba ⚡ #TensorRT #LLM for speed 🤖 NIM for ease ☁️ deploy anywhere
@ -209,7 +209,7 @@ Technical Deep Dive for serious coders ✅+99% compression ✅1 set of weights
 * [2024/05/21] ✨@modal_labs has the codes for serverless @AIatMeta Llama 3 on #TensorRT #LLM ✨👀 📚 Marvelous Modal Manual:
 Serverless TensorRT LLM (LLaMA 3 8B) | Modal Docs [➡️ link](https://modal.com/docs/examples/trtllm_llama)

-* [2024/05/08] NVIDIA TensorRT Model Optimizer -- the newest member of the #TensorRT ecosystem is a library of post-training and training-in-the-loop model optimization techniques ✅quantization ✅sparsity ✅QAT [➡️ blog](https://developer.nvidia.com/blog/accelerate-generative-ai-inference-performance-with-nvidia-tensorrt-model-optimizer-now-publicly-available/)
+* [2024/05/08] NVIDIA Model Optimizer -- the newest member of the #TensorRT ecosystem is a library of post-training and training-in-the-loop model optimization techniques ✅quantization ✅sparsity ✅QAT [➡️ blog](https://developer.nvidia.com/blog/accelerate-generative-ai-inference-performance-with-nvidia-tensorrt-model-optimizer-now-publicly-available/)

 * [2024/05/07] 🦙🦙🦙 24,000 tokens per second 🛫Meta Llama 3 takes off with #TensorRT #LLM 📚[➡️ link](https://blogs.nvidia.com/blog/meta-llama3-inference-acceleration/)

@ -230,7 +230,7 @@ Serverless TensorRT LLM (LLaMA 3 8B) | Modal Docs [➡️ link](https://modal.co

 TensorRT LLM is an open-sourced library for optimizing Large Language Model (LLM) inference. It provides state-of-the-art optimizations, including custom attention kernels, inflight batching, paged KV caching, quantization (FP8, [FP4](https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/), INT4 [AWQ](https://arxiv.org/abs/2306.00978), INT8 [SmoothQuant](https://arxiv.org/abs/2211.10438), ...), speculative decoding, and much more, to perform inference efficiently on NVIDIA GPUs.

-[Architected on PyTorch](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/torch/arch_overview.md), TensorRT LLM provides a high-level Python [LLM API](https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html#llm-api) that supports a wide range of inference setups - from single-GPU to multi-GPU or multi-node deployments. It includes built-in support for various parallelism strategies and advanced features. The LLM API integrates seamlessly with the broader inference ecosystem, including NVIDIA [Dynamo](https://github.com/ai-dynamo/dynamo) and the [Triton Inference Server](https://github.com/triton-inference-server/server).
+[Architected on PyTorch](https://github.com/NVIDIA/TensorRT-LLM/blob/release/1.1/docs/source/developer-guide/overview.md), TensorRT LLM provides a high-level Python [LLM API](https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html#llm-api) that supports a wide range of inference setups - from single-GPU to multi-GPU or multi-node deployments. It includes built-in support for various parallelism strategies and advanced features. The LLM API integrates seamlessly with the broader inference ecosystem, including NVIDIA [Dynamo](https://github.com/ai-dynamo/dynamo) and the [Triton Inference Server](https://github.com/triton-inference-server/server).

 TensorRT LLM is designed to be modular and easy to modify. Its PyTorch-native architecture allows developers to experiment with the runtime or extend functionality. Several popular models are also pre-defined and can be customized using [native PyTorch code](./tensorrt_llm/_torch/models/modeling_deepseekv3.py), making it easy to adapt the system to specific needs.

--- a/benchmarks/cpp/CMakeLists.txt
+++ b/benchmarks/cpp/CMakeLists.txt
@ -20,8 +20,8 @@ set(TOP_LEVEL_DIR "${PROJECT_SOURCE_DIR}/..")
 add_custom_target(benchmarks)

 if(NOT TARGET cxxopts::cxxopts)
-  set(CXXOPTS_SRC_DIR ${PROJECT_SOURCE_DIR}/../3rdparty/cxxopts)
-  add_subdirectory(${CXXOPTS_SRC_DIR} ${CMAKE_CURRENT_BINARY_DIR}/cxxopts)
+  add_subdirectory(${CMAKE_BINARY_DIR}/_deps/cxxopts-src
+                   ${CMAKE_CURRENT_BINARY_DIR}/cxxopts)
 endif()

 function(add_benchmark test_name test_src)
--- a/benchmarks/cpp/prepare_dataset.py
+++ b/benchmarks/cpp/prepare_dataset.py
@ -49,7 +49,7 @@ class RootArgs(BaseModel):
        return self


-@click.group()
+@click.group(deprecated=True)
@click.option(
    "--tokenizer",
    required=True,
--- a/benchmarks/cpp/utils/utils.cpp
+++ b/benchmarks/cpp/utils/utils.cpp
@ -1,6 +1,7 @@

 /*
- * SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION &
+ * AFFILIATES. All rights reserved.
 * SPDX-License-Identifier: Apache-2.0
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
@ -17,13 +18,16 @@
 */

 #include "utils.h"
+#include "tensorrt_llm/common/config.h"
 #include "tensorrt_llm/common/logger.h"
 #include <random>

 #include <filesystem>
 #include <fstream>

-namespace tensorrt_llm::benchmark
+TRTLLM_NAMESPACE_BEGIN
+
+namespace benchmark
 {

 std::vector<std::vector<SizeType32>> parseVectorOfVectors(std::string const& input)
@ -98,7 +102,8 @@ Samples parseWorkloadJson(
    if (samples.size() < maxNumSamples)
    {
        TLLM_LOG_WARNING(
-            "Dataset size %zu is smaller than given max_num_samples %d, max_num_samples will be ignored.\n",
+            "Dataset size %zu is smaller than given max_num_samples "
+            "%d, max_num_samples will be ignored.\n",
            samples.size(), maxNumSamples);
    }
    return samples;
@ -160,4 +165,6 @@ std::ostream& operator<<(std::ostream& os, RecordBwMetric const& metric)
    return os;
 }

-} // namespace tensorrt_llm::benchmark
+} // namespace benchmark
+
+TRTLLM_NAMESPACE_END
--- a/benchmarks/cpp/utils/utils.h
+++ b/benchmarks/cpp/utils/utils.h
@ -16,6 +16,7 @@
 * limitations under the License.
 */

+#include "tensorrt_llm/common/config.h"
 #include "tensorrt_llm/executor/executor.h"

 #include <cstdint>
@ -29,7 +30,9 @@

 #pragma once

-namespace tensorrt_llm::benchmark
+TRTLLM_NAMESPACE_BEGIN
+
+namespace benchmark
 {

 // using namespace tensorrt_llm::batch_manager;
@ -237,4 +240,6 @@ std::vector<double> generateRandomExponentialValues(int count, float lambda, int

 std::vector<double> computeTimeDelays(BenchmarkParams const& benchmarkParams, int numDelays);

-} // namespace tensorrt_llm::benchmark
+} // namespace benchmark
+
+TRTLLM_NAMESPACE_END
--- a/constraints.txt
+++ b/constraints.txt
@ -1,2 +1,5 @@
 # These vulnerabilities were inherited from the base image (pytorch:25.10-py3) and should be removed when the base image
 # is updated.
+# WAR against https://github.com/advisories/GHSA-gm62-xv2j-4w53
+# WAR against https://github.com/advisories/GHSA-2xpw-w6gg-jr37
+urllib3>=2.6.0
--- a/cpp/CMakeLists.txt
+++ b/cpp/CMakeLists.txt
@ -68,6 +68,7 @@ option(USING_OSS_CUTLASS_MOE_GEMM "Using open sourced Cutlass moe gemm kernel"
       ON)
 option(USING_OSS_CUTLASS_ALLREDUCE_GEMM
       "Using open sourced Cutlass AR gemm kernel" ON)
+option(SKIP_SOFTMAX_STAT "Enable Statistics of Skip-Softmax" OFF)

 message(STATUS "ENABLE_NVSHMEM is ${ENABLE_NVSHMEM}")

@ -243,15 +244,35 @@ set(TRT_LIB TensorRT::NvInfer)
 get_filename_component(TRT_LLM_ROOT_DIR ${CMAKE_CURRENT_SOURCE_DIR} PATH)

 set(3RDPARTY_DIR ${TRT_LLM_ROOT_DIR}/3rdparty)
+add_subdirectory(${3RDPARTY_DIR} 3rdparty)
+
 if(BINDING_TYPE STREQUAL "pybind"
   OR BUILD_DEEP_EP
   OR BUILD_DEEP_GEMM)
-  add_subdirectory(${3RDPARTY_DIR}/pybind11
-                   ${CMAKE_CURRENT_BINARY_DIR}/pybind11)
+  FetchContent_MakeAvailable(pybind11)
+  include_directories(${CMAKE_BINARY_DIR}/_deps/pybind11-src/include)
 endif()
 if(BINDING_TYPE STREQUAL "nanobind")
-  add_subdirectory(${3RDPARTY_DIR}/nanobind
-                   ${CMAKE_CURRENT_BINARY_DIR}/nanobind)
+  FetchContent_MakeAvailable(nanobind)
+  include_directories(${CMAKE_BINARY_DIR}/_deps/nanobind-src/include)
+endif()
+
+FetchContent_MakeAvailable(cutlass cxxopts flashmla json xgrammar)
+
+if(ENABLE_UCX)
+  FetchContent_MakeAvailable(cppzmq ucxx)
+endif()
+
+if(NOT NVTX_DISABLE)
+  FetchContent_MakeAvailable(nvtx)
+endif()
+
+if(BUILD_DEEP_GEMM)
+  FetchContent_MakeAvailable(deepgemm)
+endif()
+
+if(NOT NVTX_DISABLE)
+  set(maybe_nvtx_includedir ${CMAKE_BINARY_DIR}/_deps/nvtx-src/include)
 endif()

 # include as system to suppress warnings
@ -261,18 +282,10 @@ include_directories(
  ${CUDAToolkit_INCLUDE_DIRS}/cccl
  ${CUDNN_ROOT_DIR}/include
  $<TARGET_PROPERTY:TensorRT::NvInfer,INTERFACE_INCLUDE_DIRECTORIES>
-  ${3RDPARTY_DIR}/cutlass/include
-  ${3RDPARTY_DIR}/cutlass/tools/util/include
-  ${3RDPARTY_DIR}/NVTX/include
-  ${3RDPARTY_DIR}/json/include)
-if(BINDING_TYPE STREQUAL "pybind"
-   OR BUILD_DEEP_EP
-   OR BUILD_DEEP_GEMM)
-  include_directories(${3RDPARTY_DIR}/pybind11/include)
-endif()
-if(BINDING_TYPE STREQUAL "nanobind")
-  include_directories(${3RDPARTY_DIR}/nanobind/include)
-endif()
+  ${maybe_nvtx_includedir}
+  ${CMAKE_BINARY_DIR}/_deps/cutlass-src/include
+  ${CMAKE_BINARY_DIR}/_deps/cutlass-src/tools/util/include
+  ${CMAKE_BINARY_DIR}/_deps/json-src/include)

 if(${CUDAToolkit_VERSION} VERSION_GREATER_EQUAL "11")
  add_definitions("-DENABLE_BF16")
@ -348,6 +361,11 @@ else()
                          $<$<COMPILE_LANGUAGE:CUDA>:ENABLE_NVSHMEM=0>)
 endif()

+if(SKIP_SOFTMAX_STAT)
+  add_compile_definitions("SKIP_SOFTMAX_STAT")
+  message(STATUS "SKIP_SOFTMAX_STAT is enabled")
+endif()
+
 # Fix linking issue with TRT 10, the detailed description about `--mcmodel` can
 # be found in
 # https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html#index-mcmodel_003dmedium-1
@ -486,7 +504,7 @@ print(os.path.dirname(torch.__file__),end='');"
  endif()
  list(APPEND CMAKE_PREFIX_PATH ${TORCH_DIR})
  set(USE_SYSTEM_NVTX ON)
-  set(nvtx3_dir ${3RDPARTY_DIR}/NVTX/include)
+  set(nvtx3_dir ${CMAKE_BINARY_DIR}/_deps/nvtx-src/include)
  set(CMAKE_CUDA_ARCHITECTURES_BACKUP ${CMAKE_CUDA_ARCHITECTURES})
  find_package(Torch REQUIRED)
  set(CMAKE_CUDA_ARCHITECTURES ${CMAKE_CUDA_ARCHITECTURES_BACKUP})
@ -538,14 +556,15 @@ if(ENABLE_UCX)
  if(NOT ${ucx_FOUND})
    set(ENABLE_UCX 0)
  else()
+    set(ucxx_source_dir ${CMAKE_BINARY_DIR}/_deps/ucxx-src)
    if(DEFINED ENV{GITHUB_MIRROR} AND NOT "$ENV{GITHUB_MIRROR}" STREQUAL "")
-      if(EXISTS "${3RDPARTY_DIR}/ucxx/fetch_rapids.cmake")
-        file(READ "${3RDPARTY_DIR}/ucxx/fetch_rapids.cmake" FILE_CONTENTS)
+      if(EXISTS "${ucxx_source_dir}/fetch_rapids.cmake")
+        file(READ "${ucxx_source_dir}/fetch_rapids.cmake" FILE_CONTENTS)
        string(
          REPLACE "https://raw.githubusercontent.com/rapidsai/rapids-cmake"
                  "$ENV{GITHUB_MIRROR}/rapidsai/rapids-cmake/raw/refs/heads"
                  FILE_CONTENTS "${FILE_CONTENTS}")
-        file(WRITE "${3RDPARTY_DIR}/ucxx/fetch_rapids.cmake" "${FILE_CONTENTS}")
+        file(WRITE "${ucxx_source_dir}/fetch_rapids.cmake" "${FILE_CONTENTS}")
        message(WARNING "Replace UCXX fetch_rapids.cmake with internal mirror")
      endif()
    endif()
@ -556,13 +575,13 @@ if(ENABLE_UCX)
    execute_process(
      COMMAND
        ${CMAKE_COMMAND} -E env LIB_BUILD_DIR=${CMAKE_BINARY_DIR}/ucxx/build
-        ${3RDPARTY_DIR}/ucxx/build.sh libucxx -n
+        ${ucxx_source_dir}/build.sh libucxx -n
        --cmake-args=\"-DBUILD_SHARED_LIBS=OFF
        -DCMAKE_CXX_FLAGS=-D_GLIBCXX_USE_CXX11_ABI=${USE_CXX11_ABI}\"
      OUTPUT_VARIABLE UCXX_BUILD_OUTPUT
      RESULT_VARIABLE UCXX_BUILD_RESULT)
    if(UCXX_BUILD_RESULT)
-      message(${UCXX_BUILD_OUTPUT})
+      message("ucxx build: ${UCXX_BUILD_OUTPUT}")
      message(FATAL_ERROR "ucxx build failed")
    endif()
    find_package(ucxx REQUIRED PATHS ${CMAKE_BINARY_DIR}/ucxx/build
--- a/cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
+++ b/cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
@ -269,7 +269,8 @@ private:
    std::unique_ptr<executor::kv_cache::CacheState> mCacheState;
    std::unique_ptr<executor::kv_cache::ConnectionManager> mManager;
    std::optional<executor::CacheTransceiverConfig> mCacheTransceiverConfig;
-    std::unique_ptr<kv_cache_manager::CacheTransBufferManager> mCacheTransBufferManager;
+    std::vector<std::unique_ptr<kv_cache_manager::CacheTransBufferManager>> mCacheTransBufferManagers;
+    std::vector<kv_cache_manager::CacheTransBufferManager*> mCacheTransBufferManagerPtrs;
    // library handle to the communicator related features,
    // this is used to defer dependency resolution until needed.
    static std::mutex mDllMutex;
--- a/cpp/include/tensorrt_llm/batch_manager/decoderBuffers.h
+++ b/cpp/include/tensorrt_llm/batch_manager/decoderBuffers.h
@ -38,6 +38,7 @@ class DecoderInputBuffers
 public:
    using SizeType32 = runtime::SizeType32;
    using TensorPtr = runtime::ITensor::SharedPtr;
+    using TensorConstPtr = runtime::ITensor::SharedConstPtr;

    explicit DecoderInputBuffers(
        SizeType32 maxBatchSize, SizeType32 maxDecoderSteps, runtime::BufferManager const& manager);
@ -60,13 +61,22 @@ public:
    //! Requests considered in decoder forward
    RequestVector decoderRequests;

+    //! Logits of decoder requests
+    std::vector<TensorPtr> decoderLogits;
+
+    //! Maximum number of decoding steps of decoder requests.
+    //! This is only more than 1 for external draft tokens speculative decoding.
+    SizeType32 maxDecoderSteps{1};
+
    //! Batch slots for all decoder steps, [maxDecoderSteps][maxBatchSize]
    std::vector<TensorPtr> forwardBatchSlots;

-    //! Logits of decoder requests
-    std::vector<TensorPtr> logits;
+    //! Logits for requests in forwardBatchSlots (in the same order).
+    //! [maxDecoderSteps][batchSize][1, beamWidth, vocabSizePadded], on gpu
+    std::vector<std::vector<TensorConstPtr>> batchLogits;

-    //! Logits for speculative decoding (Medusa)
+    //! Logits for speculative decoding (Medusa).
+    //! The vector is sparse, only slots in forwardBatchSlots are used.
    //! [maxBatchSize][maxAcceptedDraftTokensPerStep][maxDraftTokens + 1, vocabSizePadded]
    std::vector<std::vector<runtime::ITensor::SharedPtr>> predictedDraftLogits;
 };
--- a/cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
+++ b/cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
@ -78,9 +78,7 @@ using VecUniqueTokens = tensorrt_llm::runtime::VecUniqueTokens;
 using LoraTaskIdType = tensorrt_llm::runtime::LoraTaskIdType;
 using BlocksPerWindow = std::map<SizeType32, std::tuple<SizeType32, SizeType32>>;
 using CacheSaltIDType = tensorrt_llm::runtime::CacheSaltIDType;
-
-// Type alias for multimodal hash key (hash array + start offset)
-using MmKey = std::pair<std::array<uint8_t, 32>, SizeType32>;
+using MmKey = tensorrt_llm::executor::MmKey;

 template <typename T>
 using OptionalRef = tensorrt_llm::common::OptionalRef<T>;
@ -325,6 +323,8 @@ public:

    size_t getHash() const;

+    std::vector<MmKey> getExtraKeys() const;
+
 private:
    // Linear ID of block independent of pool
    IdType mBlockId;
@ -380,6 +380,7 @@ public:
        , mBeamWidth(beamWidth)
        , mKvCacheRetentionConfig(std::move(kvCacheRetentionConfig))
        , mNumFrontBlocksRemoved(0)
+        , mCurrentPrepopulatedPromptLen(std::numeric_limits<SizeType32>::max())
    {
        auto const numWindowSizes = windowSizeToMetadata.size();
        mCacheBlockIds.reserve(numWindowSizes);
@ -500,6 +501,20 @@ public:
        return mKvCacheRetentionConfig.getDirectory();
    }

+    [[nodiscard]] SizeType32 getCurrentPrepopulatedPromptLen() const
+    {
+        return mCurrentPrepopulatedPromptLen;
+    }
+
+    void setCurrentPrepopulatedPromptLen(SizeType32 currentPrepopulatedPromptLen)
+    {
+        TLLM_CHECK_WITH_INFO(currentPrepopulatedPromptLen <= mCurrentPrepopulatedPromptLen,
+            "currentPrepopulatedPromptLen must be updated non-increasingly due to the "
+            "assumption that smaller window sizes have shorter or equal "
+            "currentPrepopulatedPromptLen in WindowSizeManager::loadOrAllocateBlocks.");
+        mCurrentPrepopulatedPromptLen = currentPrepopulatedPromptLen;
+    }
+
 private:
    // Request id of the sequence
    LlmRequest::RequestIdType mRequestId;
@ -517,6 +532,8 @@ private:
    SizeType32 mNumFrontBlocksRemoved;
    // Set of used blocks by the sequence
    std::set<KVCacheBlock::IdType> mUsedBlocks;
+    // Current prepopulated prompt length
+    SizeType32 mCurrentPrepopulatedPromptLen;
 };

 // attach metadata to a pool pointer
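
Review note on the hunk above: the new setter only accepts values that do not exceed the stored one, and the member starts at `std::numeric_limits<SizeType32>::max()` so the first update always passes. The following is a minimal standalone sketch of that invariant, not the `GenerationRequest` code itself. It assumes `SizeType32` is a 32-bit signed integer and swaps `TLLM_CHECK_WITH_INFO` for a thrown `std::invalid_argument`; `PrepopulatedLenTracker` and `rejects` are hypothetical names used only for illustration.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// Stand-in for runtime::SizeType32 (assumed here to be a 32-bit signed integer).
using SizeType32 = std::int32_t;

// Hypothetical reduced model of the member added to GenerationRequest.
class PrepopulatedLenTracker
{
public:
    [[nodiscard]] SizeType32 get() const
    {
        return mLen;
    }

    // Accepts only values that do not exceed the current one.
    // The real code uses TLLM_CHECK_WITH_INFO; an exception stands in for it here.
    void set(SizeType32 len)
    {
        if (len > mLen)
        {
            throw std::invalid_argument("prepopulated prompt length must not increase");
        }
        mLen = len;
    }

private:
    // Starts at the maximum so that the first update always passes the check.
    SizeType32 mLen{std::numeric_limits<SizeType32>::max()};
};

// Helper for illustration: returns true if the setter rejects the given value.
inline bool rejects(PrepopulatedLenTracker& tracker, SizeType32 len)
{
    try
    {
        tracker.set(len);
        return false;
    }
    catch (std::invalid_argument const&)
    {
        return true;
    }
}
```

Under these assumptions, updating with 64, then 64 again, then 32 succeeds, while a later update to 48 is rejected and leaves the stored value at 32.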
@ -595,6 +612,21 @@ public:

    ~WindowBlockManager();

+    [[nodiscard]] bool isEnableIndexerKCache() const
+    {
+        return mEnableIndexerKCache;
+    }
+
+    [[nodiscard]] SizeType32 getIndexerKCacheQuantBlockSize() const
+    {
+        return mIndexerKCacheQuantBlockSize;
+    }
+
+    [[nodiscard]] SizeType32 getIndexerKCacheIndexHeadDim() const
+    {
+        return mIndexerKCacheIndexHeadDim;
+    }
+
    void allocatePools(bool useUvm);

    void releasePools();
@ -616,7 +648,7 @@ public:

    void replaceSharedBlock(GenerationRequest& sequence, SizeType32 blockIdx);

-    [[nodiscard]] std::optional<KVCacheBlock::IdType> storeBlocksForReuse(
+    [[nodiscard]] std::vector<KVCacheBlock::IdType> storeBlocksForReuse(
        GenerationRequest& sequence, OptionalRef<LlmRequest const> llmRequest, bool pinBlocks = false);

    void storeNewBlock(GenerationRequest& sequence, OptionalRef<LlmRequest const> llmRequest);
@ -809,6 +841,9 @@ public:
        return mBufferManager;
    }

+    //! \brief Sync internal streams used by transfer manager with buffer manager stream
+    void syncTransferManagerWithBufferManager();
+
    //! \brief Perform per-request bookkeeping
    void refreshBlocks();

@ -818,8 +853,8 @@ public:
    //! \param blockKeys Key of each block.
    //! \param blockIds Id of each block.
    //! \param pinBlocks If true, increment ref count for blocks while storing (pin on store).
-    //! \return Pair of (num blocks stored for reuse, id of the last block stored if any).
-    [[nodiscard]] std::pair<SizeType32, std::optional<KVCacheBlock::IdType>> storeBlocks(
+    //! \return Pair of (num blocks stored for reuse, vector of pinned block IDs).
+    [[nodiscard]] std::pair<SizeType32, std::vector<KVCacheBlock::IdType>> storeBlocks(
        std::vector<BlockKey> const& blockKeys, std::vector<KVCacheBlock::IdType> const& blockIds,
        bool pinBlocks = false);

@ -851,8 +886,8 @@ public:

    [[nodiscard]] std::shared_ptr<KVCacheBlock> findBlocksInReuseTreeByBlockKey(BlockKey const& blockKey);

-    //! \brief Unpin blocks by starting from a block id and walking prev pointers.
-    void unpinBlocksById(KVCacheBlock::IdType blockId);
+    //! \brief Unpin blocks by block ids directly
+    void unpinBlocksById(std::vector<KVCacheBlock::IdType> const& blockIds);

    void initializeSequenceStorageValidity(LlmRequest::RequestIdType requestId)
    {
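
Review note on the two hunks above: `storeBlocks` now reports every pinned block ID, and `unpinBlocksById` takes that list directly instead of starting from one block and walking `prev` pointers. The sketch below illustrates the pattern with a plain reference-count map. `MiniBlockStore`, its members, and `IdType = int` are hypothetical stand-ins, and the "number stored" value is simplified to the input size; none of this is the `WindowBlockManager` implementation.

```cpp
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

using IdType = int; // stand-in for KVCacheBlock::IdType

// Hypothetical reduced model of pin-on-store with explicit unpinning.
class MiniBlockStore
{
public:
    // Registers the given blocks. When pinBlocks is true, increments each
    // block's ref count and returns the pinned IDs so the caller can release them.
    std::pair<std::size_t, std::vector<IdType>> storeBlocks(
        std::vector<IdType> const& blockIds, bool pinBlocks = false)
    {
        std::vector<IdType> pinned;
        for (auto const id : blockIds)
        {
            auto& refCount = mRefCounts[id]; // inserts a zero count for a new block
            if (pinBlocks)
            {
                ++refCount;
                pinned.push_back(id);
            }
        }
        return {blockIds.size(), std::move(pinned)};
    }

    // Decrements the ref count of exactly the listed blocks, no list walking.
    void unpinBlocksById(std::vector<IdType> const& blockIds)
    {
        for (auto const id : blockIds)
        {
            auto it = mRefCounts.find(id);
            if (it != mRefCounts.end() && it->second > 0)
            {
                --it->second;
            }
        }
    }

    [[nodiscard]] int refCount(IdType id) const
    {
        auto it = mRefCounts.find(id);
        return it == mRefCounts.end() ? 0 : it->second;
    }

private:
    std::unordered_map<IdType, int> mRefCounts;
};
```

A caller in this model keeps the returned vector and passes it back unchanged to `unpinBlocksById`, so the set of blocks released matches the set that was pinned even if the links between blocks change in the meantime.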
@ -1021,6 +1056,21 @@ public:
        std::optional<kvc::BaseAgentConfig> agentConfig = std::nullopt, bool enableIndexerKCache = false,
        SizeType32 indexerKCacheQuantBlockSize = 128, SizeType32 indexerKCacheIndexHeadDim = 0);

+    [[nodiscard]] bool isEnableIndexerKCache() const
+    {
+        return mIsEnableIndexerKCache;
+    }
+
+    [[nodiscard]] SizeType32 getIndexerKCacheQuantBlockSize() const
+    {
+        return mIndexerKCacheQuantBlockSize;
+    }
+
+    [[nodiscard]] SizeType32 getIndexerKCacheIndexHeadDim() const
+    {
+        return mIndexerKCacheIndexHeadDim;
+    }
+
    BlockManager(BlockManager const&) = delete;
    BlockManager& operator=(BlockManager const&) = delete;

@ -1053,7 +1103,7 @@ public:
    std::optional<KVCacheBlock::IdType> releaseBlocks(
        GenerationRequest& sequence, OptionalRef<LlmRequest const> llmRequest = std::nullopt, bool pinBlocks = false);

-    [[nodiscard]] std::optional<KVCacheBlock::IdType> storeBlocksForReuse(
+    [[nodiscard]] std::vector<KVCacheBlock::IdType> storeBlocksForReuse(
        GenerationRequest& sequence, OptionalRef<LlmRequest const> llmRequest = std::nullopt, bool pinBlocks = false);

    void schedulingReleaseBlocks(LlmRequest::RequestIdType requestId);
@ -1062,7 +1112,7 @@ public:
    /// @param sequence The generation request whose blocks should be pinned.
    void pinBlocks(GenerationRequest& sequence);

-    void unpinBlocksById(KVCacheBlock::IdType blockId);
+    void unpinBlocksById(std::vector<KVCacheBlock::IdType> const& blockIds);

    void releaseLastBlock(GenerationRequest& sequence, SizeType32 windowSize);

@ -1083,7 +1133,7 @@ public:
    void offloadBlock(BlockPtr const& block, SizeType32 windowSize,
        executor::KvCacheTransferMode mode = executor::KvCacheTransferMode::DRAM, std::string const& directory = "");

-    [[nodiscard]] std::pair<SizeType32, std::optional<KVCacheBlock::IdType>> storeBlocks(
+    [[nodiscard]] std::pair<SizeType32, std::vector<KVCacheBlock::IdType>> storeBlocks(
        std::vector<BlockKey> const& blockKeys, std::vector<KVCacheBlock::IdType> const& blockIds,
        SizeType32 windowSize, bool pinBlocks = false)
    {
@ -1283,6 +1333,9 @@ public:
    //! \brief Store newest block for reuse
    void storeNewBlock(GenerationRequest& sequence, OptionalRef<LlmRequest const> llmRequest);

+    //! \brief Sync internal streams used by transfer manager with buffer manager stream
+    void syncTransferManagerWithBufferManager();
+
    //! \brief Perform per-request bookkeeping
    void refreshBlocks();

@ -1398,6 +1451,10 @@ private:
    std::vector<SizeType32> mAbsolutePoolToRelativePoolIndex;
    // Record what sequences are currently managed by the block manager
    std::set<LlmRequest::RequestIdType> mManagedSequences;
+
+    bool mIsEnableIndexerKCache{false};
+    SizeType32 mIndexerKCacheQuantBlockSize{0};
+    SizeType32 mIndexerKCacheIndexHeadDim{0};
 };

 struct OffsetTableDimensions
@ -1500,6 +1557,10 @@ public:

    [[nodiscard]] virtual bool isEnableBlockReuse() const = 0;

+    [[nodiscard]] virtual bool isEnableIndexerKCache() const = 0;
+    [[nodiscard]] virtual SizeType32 getIndexerKCacheIndexHeadDim() const = 0;
+    [[nodiscard]] virtual SizeType32 getIndexerKCacheQuantBlockSize() const = 0;
+
    // void removeToken(SizeType32 seqSlotIdx);
    virtual void rewindKVCache(LlmRequest::RequestIdType requestId, SizeType32 rewindLengths) = 0;

@ -1523,7 +1584,7 @@ public:
    virtual void storeNewBlock(LlmRequest const& llmRequest) = 0;

    /// \brief Store blocks for reuse for a given request id
-    [[nodiscard]] virtual std::optional<KVCacheBlock::IdType> storeBlocksForReuse(
+    [[nodiscard]] virtual std::vector<KVCacheBlock::IdType> storeBlocksForReuse(
        LlmRequest::RequestIdType requestId, OptionalRef<LlmRequest const> llmRequest, bool pinBlocks = false)
        = 0;

@ -1546,6 +1607,7 @@ public:
    [[nodiscard]] virtual runtime::ITensor::SharedPtr getIndexerKCachePool() const = 0;
    [[nodiscard]] virtual SizeType32 getPoolLayerIdx(SizeType32 layer_idx) const = 0;

+    virtual void syncTransferManagerWithBufferManager() = 0;
    virtual void refreshBlocks() = 0;
    virtual void flushIterationEvents() = 0;
    virtual void resetReuseState() = 0;
@ -1616,7 +1678,7 @@ public:
        BlockKey const& blockKey, SizeType32 windowSize)
        = 0;

-    virtual void unpinBlocksById(KVCacheBlock::IdType blockId) = 0;
+    virtual void unpinBlocksById(std::vector<KVCacheBlock::IdType> const& blockIds) = 0;
 };

 class KVCacheManager : public BaseKVCacheManager
@ -1834,6 +1896,21 @@ public:
        return mEnableBlockReuse;
    }

+    [[nodiscard]] bool isEnableIndexerKCache() const override
+    {
+        return mBlockManager.isEnableIndexerKCache();
+    }
+
+    [[nodiscard]] SizeType32 getIndexerKCacheIndexHeadDim() const override
+    {
+        return mBlockManager.getIndexerKCacheIndexHeadDim();
+    }
+
+    [[nodiscard]] SizeType32 getIndexerKCacheQuantBlockSize() const override
+    {
+        return mBlockManager.getIndexerKCacheQuantBlockSize();
+    }
+
    void removeToken(LlmRequest::RequestIdType requestId);
    void rewindKVCache(LlmRequest::RequestIdType requestId, SizeType32 rewindLengths) override;

@ -1862,7 +1939,7 @@ public:
    //! \brief Store newest blocks for reuse
    void storeNewBlock(LlmRequest const& llmRequest) override;

-    [[nodiscard]] std::optional<KVCacheBlock::IdType> storeBlocksForReuse(
+    [[nodiscard]] std::vector<KVCacheBlock::IdType> storeBlocksForReuse(
        LlmRequest::RequestIdType requestId, OptionalRef<LlmRequest const> llmRequest, bool pinBlocks = false) override;

    [[nodiscard]] static SizeType32 getSinkBubbleLength(SizeType32 sinkTokenLen, SizeType32 tokensPerBlock);
@ -1883,7 +1960,7 @@ public:

    void pinBlocks(LlmRequest::RequestIdType requestId) override;

-    void unpinBlocksById(KVCacheBlock::IdType blockId) override;
+    void unpinBlocksById(std::vector<KVCacheBlock::IdType> const& blockIds) override;

    std::optional<KVCacheBlock::IdType> getLastBlockId(LlmRequest::RequestIdType requestId) const override;

@ -1912,6 +1989,11 @@ public:
        return mBlockManager.getPoolLayerIdx(layer_idx);
    }

+    void syncTransferManagerWithBufferManager() override
+    {
+        mBlockManager.syncTransferManagerWithBufferManager();
+    }
+
    //! \brief Perform per-iteration bookkeeping
    void refreshBlocks() override
    {
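Review note on the kvCacheManager.h hunks above: `storeBlocksForReuse` and `storeBlocks` now return `std::vector<KVCacheBlock::IdType>` instead of `std::optional<KVCacheBlock::IdType>`, and `unpinBlocksById` takes the whole vector. A minimal caller-side sketch of that shape follows. `FakeCache` is a hypothetical stand-in, `IdType` is assumed to be `int32_t`, and the choice to return every pinned id is an assumption, since the diff shows only the signatures.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Assumption: the real KVCacheBlock::IdType may be a different integer type.
using IdType = std::int32_t;

// Hypothetical stand-in for the block manager; only the signatures mirror the diff.
class FakeCache
{
public:
    // New shape: a vector of ids is returned; an empty vector takes the place of std::nullopt.
    [[nodiscard]] std::vector<IdType> storeBlocksForReuse(std::vector<IdType> const& blockIds, bool pinBlocks)
    {
        std::vector<IdType> pinned;
        if (pinBlocks)
        {
            for (auto id : blockIds)
            {
                mPinned.insert(id);
                pinned.push_back(id);
            }
        }
        return pinned;
    }

    // New shape: the caller hands back the whole list in one call.
    void unpinBlocksById(std::vector<IdType> const& blockIds)
    {
        for (auto id : blockIds)
        {
            mPinned.erase(id);
        }
    }

    [[nodiscard]] std::size_t numPinned() const
    {
        return mPinned.size();
    }

private:
    std::set<IdType> mPinned;
};
```

Callers that previously tested `has_value()` would test `empty()` instead, and can pass the returned vector straight to `unpinBlocksById`.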
--- a/cpp/include/tensorrt_llm/batch_manager/kvCacheTransferManager.h
+++ b/cpp/include/tensorrt_llm/batch_manager/kvCacheTransferManager.h
@ -46,7 +46,15 @@ public:
        int numTokensToCopy = 0, executor::KvCacheTransferMode mode = executor::KvCacheTransferMode::DRAM,
        std::string const& directory = "");

-    //! \brief Synchronize the offload/onboard streams with the bufferManager stream.
+    //! \brief Synchronize internal streams with bufferManager stream.
+    //! \details The buffer manager uses the same stream as the prefill and decode kernels. This method ensures that the
+    //! internal kernels used for offloading and onboarding will wait for prefill and decode kernels before performing
+    //! any block copies. This method must be called before the first call to KVCacheManager::addSequence in every step.
+    void syncWithBufferManager();
+
+    //! \brief Synchronize bufferManager stream with internal streams. This method ensures that prefill and decode
+    //! kernels for next step will wait for offloading and onboarding work that has already been scheduled. This method
+    //! must be called after last call to KVCacheManager::addSequence in every step.
    void syncTransfers();

 private:
@ -75,8 +83,10 @@ private:
    runtime::BufferManager mOnboardManager;
    runtime::BufferManager mOffloadManager;

-    // Track the block ids offloaded in this iteration.
-    std::unordered_map<int32_t, tr::CudaEvent> mPendingOffloads;
+    // Track reads and writes for blocks. Note that it is the memory pool index that
+    // identifies the raw memory blocks involved in I/O, not the block Id.
+    std::unordered_map<kernels::KVCacheIndex::UnderlyingType, tr::CudaEvent> mPendingReads;
+    std::unordered_map<kernels::KVCacheIndex::UnderlyingType, tr::CudaEvent> mPendingWrites;
    // Reference to parent loopback agent
    std::shared_ptr<kvc::BaseLoopbackAgent> mLoopbackAgent;
    int mDeviceId;
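Review note on the kvCacheTransferManager.h hunks above: the new doc comments fix a per-step call order, with `syncWithBufferManager` before the first `KVCacheManager::addSequence` and `syncTransfers` after the last one. The sketch below only records that order. `StepRecorder` and `runStep` are hypothetical names, and no CUDA streams or events are involved.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical recorder; it logs call order and performs no CUDA work.
struct StepRecorder
{
    std::vector<std::string> calls;

    void syncWithBufferManager()
    {
        calls.push_back("syncWithBufferManager");
    }

    void addSequence(int requestId)
    {
        calls.push_back("addSequence:" + std::to_string(requestId));
    }

    void syncTransfers()
    {
        calls.push_back("syncTransfers");
    }
};

// One scheduling step in the order the doc comments require.
inline void runStep(StepRecorder& rec, std::vector<int> const& newRequestIds)
{
    rec.syncWithBufferManager(); // before the first addSequence of the step
    for (int id : newRequestIds)
    {
        rec.addSequence(id);
    }
    rec.syncTransfers(); // after the last addSequence of the step
}
```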
--- a/cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
+++ b/cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
@ -73,7 +73,8 @@ public:
        BaseKVCacheManager& cacheManager, BlockKey const& lastBlockKey, int32_t indexFromEnd)
    {

-        auto poolNum = cacheManager.getNumPools();
+        auto poolNum = cacheManager.getBlockManager().getNumPools(
+            /*includeBlockScalePools=*/false, /*includeIndexerKCachePools=*/false);
        TLLM_CHECK_WITH_INFO(poolNum == 1, "Reuse tree is not supported for multiple pools or variable window size");

        auto windowSize = cacheManager.getBlockManager().getWindowSizesMetadata().begin()->first;
@ -136,13 +137,21 @@ public:
        return blockHashesPerWindow;
    }

-    BlockRangeForWindow getBlockRangeForWindow(SizeType32 windowSize) const
+    BlockRangeForWindow getBlockRangeForWindow(SizeType32 windowSize, bool useIndexerKCache = false) const
    {
        TLLM_CHECK_WITH_INFO(
            mPoolsPerWindow.find(windowSize) != mPoolsPerWindow.end(), "Window size %d not found", windowSize);
        auto pool = mPoolsPerWindow.at(windowSize).front();
        auto blockIds = mBlockIdsPerWindow.at(windowSize);
-        return BlockRangeForWindow(mManager, windowSize, std::move(blockIds), std::move(pool));
+        if (useIndexerKCache)
+        {
+            TLLM_CHECK(mIndexerKCachePool);
+            return BlockRangeForWindow(mManager, windowSize, std::move(blockIds), mIndexerKCachePool);
+        }
+        else
+        {
+            return BlockRangeForWindow(mManager, windowSize, std::move(blockIds), std::move(pool));
+        }
    }

    std::vector<SizeType32> getWindowSizes() const
@ -167,21 +176,25 @@ private:
        , mRequestId(requestId)
        , mBlockIdsPerWindow(std::move(blockIdsPerWindow))
    {
-
-        // cacheManager.getBlockManager.getPrimaryPool(0);
-        auto poolNum = mManager->getNumPools();
+        auto poolNum = mManager->getBlockManager().getNumPools(
+            /*includeBlockScalePools=*/false, /*includeIndexerKCachePools=*/false);
        for (SizeType32 poolIdx = 0; poolIdx < poolNum; ++poolIdx)
        {
            auto windowSize = cacheManager.getBlockManager().getPoolWindowSize(poolIdx);
            mPoolsPerWindow[windowSize].push_back(cacheManager.getBlockManager().getPrimaryPool(poolIdx));
        }
+        if (cacheManager.isEnableIndexerKCache())
+        {
+            mIndexerKCachePool = cacheManager.getIndexerKCachePool();
+        }
    }

    BlockRange(BaseKVCacheManager const& cacheManager, LlmRequest::RequestIdType requestId)
        : mManager(&cacheManager)
        , mRequestId(requestId)
    {
-        auto poolNum = mManager->getNumPools();
+        auto poolNum = mManager->getBlockManager().getNumPools(
+            /*includeBlockScalePools=*/false, /*includeIndexerKCachePools=*/false);
        for (SizeType32 poolIdx = 0; poolIdx < poolNum; ++poolIdx)
        {
            auto windowSize = cacheManager.getBlockManager().getPoolWindowSize(poolIdx);
@ -189,6 +202,10 @@ private:
            mBlockIdsPerWindow[windowSize]
                = cacheManager.getSequence(mRequestId).getCacheBlockIds(windowSize).at(kFIRST_AND_ONLY_BEAM);
        }
+        if (cacheManager.isEnableIndexerKCache())
+        {
+            mIndexerKCachePool = cacheManager.getIndexerKCachePool();
+        }
    }

 private:
@ -196,6 +213,7 @@ private:
    LlmRequest::RequestIdType const mRequestId;
    std::unordered_map<SizeType32, std::vector<SizeType32>> mBlockIdsPerWindow;
    std::unordered_map<SizeType32, std::vector<runtime::ITensor::SharedPtr>> mPoolsPerWindow;
+    runtime::ITensor::SharedPtr mIndexerKCachePool;

    static constexpr SizeType32 kFIRST_AND_ONLY_BEAM = 0;
    static constexpr SizeType32 kFIRST_POOL_INDEX = 0;
--- a/cpp/include/tensorrt_llm/batch_manager/llmRequest.h
+++ b/cpp/include/tensorrt_llm/batch_manager/llmRequest.h
@ -1667,6 +1667,12 @@ public:
            [](auto reason) { return reason == executor::FinishReason::kLENGTH; });
    }

+    [[nodiscard]] bool isFinishedDueToCancellation() const noexcept
+    {
+        return std::all_of(mFinishReasons.begin(), mFinishReasons.end(),
+            [](auto reason) { return reason == executor::FinishReason::kCANCELLED; });
+    }
+
    [[nodiscard]] bool isTimedOut() const
    {
        if (!mAllottedTimeMs.has_value())
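Review note on `isFinishedDueToCancellation` above: `std::all_of` returns `true` for an empty range, so the predicate holds vacuously when there are no finish reasons. Whether `mFinishReasons` can be empty at the call sites is not visible in this diff. The sketch below reproduces the check on a hypothetical miniature enum to show the three cases.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical miniature of executor::FinishReason; the values are illustrative.
enum class FinishReason
{
    kNOT_FINISHED,
    kLENGTH,
    kCANCELLED
};

// Same predicate as in the hunk, taking the reasons as a parameter.
inline bool isFinishedDueToCancellation(std::vector<FinishReason> const& reasons) noexcept
{
    return std::all_of(
        reasons.begin(), reasons.end(), [](auto reason) { return reason == FinishReason::kCANCELLED; });
}
```

If an empty list must not count as cancelled, a caller could add an explicit `!reasons.empty()` guard.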
--- a/cpp/include/tensorrt_llm/batch_manager/makeDecodingBatchInputOutput.h
+++ b/cpp/include/tensorrt_llm/batch_manager/makeDecodingBatchInputOutput.h
@ -40,19 +40,17 @@ public:
    constexpr static auto name{"MakeDecodingBatchInputOutput"};

    using SizeType32 = tensorrt_llm::runtime::SizeType32;
-    using TensorPtr = runtime::decoder_batch::Input::TensorPtr;
+    using TensorPtr = runtime::ITensor::SharedPtr;
    template <typename T>
    using OptionalRef = tensorrt_llm::common::OptionalRef<T>;

    MakeDecodingBatchInputOutput() = default;

-    std::unique_ptr<runtime::decoder_batch::Input> operator()(DecoderInputBuffers& inputBuffers,
-        runtime::decoder::DecoderState& decoderState, runtime::ModelConfig const& modelConfig,
-        SizeType32 maxNumSequences, OptionalRef<RuntimeBuffers> fusedRuntimeBuffers) const;
+    void operator()(DecoderInputBuffers& inputBuffers, runtime::decoder::DecoderState& decoderState,
+        runtime::ModelConfig const& modelConfig, OptionalRef<RuntimeBuffers> fusedRuntimeBuffers) const;

-    [[nodiscard]] static std::unique_ptr<runtime::decoder_batch::Input> createDecoderBatchInputs(
-        std::vector<SizeType32> const& activeSlots, runtime::decoder::DecoderState const& decoderState,
-        std::vector<TensorPtr> const& logits, SizeType32 maxNumSequences, std::vector<TensorPtr> const& batchSlots);
+    static void createDecoderBatchInputs(DecoderInputBuffers& inputBuffers, std::vector<SizeType32> const& activeSlots,
+        runtime::decoder::DecoderState const& decoderState);
 };

 } // namespace tensorrt_llm::batch_manager
--- a/cpp/include/tensorrt_llm/common/algorithm.h
+++ b/cpp/include/tensorrt_llm/common/algorithm.h
@ -16,8 +16,9 @@

 #pragma once

-namespace tensorrt_llm
-{
+#include "tensorrt_llm/common/config.h"
+
+TRTLLM_NAMESPACE_BEGIN

 // Base class for algorithms
 struct Algorithm
@ -29,4 +30,4 @@ struct Algorithm
    Algorithm& operator=(Algorithm const&) = delete;
 };

-} // namespace tensorrt_llm
+TRTLLM_NAMESPACE_END
--- a/cpp/include/tensorrt_llm/common/arrayView.h
+++ b/cpp/include/tensorrt_llm/common/arrayView.h
@ -17,9 +17,13 @@
 #pragma once

 #include "tensorrt_llm/common/assert.h"
+#include "tensorrt_llm/common/config.h"
+
 #include <cstdint>

-namespace tensorrt_llm::common
+TRTLLM_NAMESPACE_BEGIN
+
+namespace common
 {

 //!
@ -100,4 +104,6 @@ private:
    size_type mSize;
 };

-} // namespace tensorrt_llm::common
+} // namespace common
+
+TRTLLM_NAMESPACE_END
--- a/cpp/include/tensorrt_llm/common/assert.h
+++ b/cpp/include/tensorrt_llm/common/assert.h
@ -16,14 +16,19 @@

 #pragma once

+#include "tensorrt_llm/common/config.h"
 #include "tensorrt_llm/common/tllmException.h"

+TRTLLM_NAMESPACE_BEGIN
+
 class DebugConfig
 {
 public:
    static bool isCheckDebugEnabled();
 };

+TRTLLM_NAMESPACE_END
+
 #if defined(_WIN32)
 #define TLLM_LIKELY(x) (__assume((x) == 1), (x))
 #define TLLM_UNLIKELY(x) (__assume((x) == 0), (x))
@ -35,8 +40,8 @@ public:
 #define TLLM_CHECK(val)                                                                                                \
    do                                                                                                                 \
    {                                                                                                                  \
-        TLLM_LIKELY(static_cast<bool>(val)) ? ((void) 0)                                                               \
-                                            : tensorrt_llm::common::throwRuntimeError(__FILE__, __LINE__, #val);       \
+        TLLM_LIKELY(static_cast<bool>(val))                                                                            \
+        ? ((void) 0) : tensorrt_llm::common::throwRuntimeError(__FILE__, __LINE__, #val);                              \
    } while (0)

 #define TLLM_CHECK_WITH_INFO(val, info, ...)                                                                           \
@ -51,17 +56,17 @@ public:
 #define TLLM_CHECK_DEBUG(val)                                                                                          \
    do                                                                                                                 \
    {                                                                                                                  \
-        if (TLLM_UNLIKELY(DebugConfig::isCheckDebugEnabled()))                                                         \
+        if (TLLM_UNLIKELY(tensorrt_llm::DebugConfig::isCheckDebugEnabled()))                                           \
        {                                                                                                              \
-            TLLM_LIKELY(static_cast<bool>(val)) ? ((void) 0)                                                           \
-                                                : tensorrt_llm::common::throwRuntimeError(__FILE__, __LINE__, #val);   \
+            TLLM_LIKELY(static_cast<bool>(val))                                                                        \
+            ? ((void) 0) : tensorrt_llm::common::throwRuntimeError(__FILE__, __LINE__, #val);                          \
        }                                                                                                              \
    } while (0)

 #define TLLM_CHECK_DEBUG_WITH_INFO(val, info, ...)                                                                     \
    do                                                                                                                 \
    {                                                                                                                  \
-        if (TLLM_UNLIKELY(DebugConfig::isCheckDebugEnabled()))                                                         \
+        if (TLLM_UNLIKELY(tensorrt_llm::DebugConfig::isCheckDebugEnabled()))                                           \
        {                                                                                                              \
            TLLM_LIKELY(static_cast<bool>(val))                                                                        \
            ? ((void) 0)                                                                                               \
--- a/cpp/include/tensorrt_llm/common/bindingUtils.h
+++ b/cpp/include/tensorrt_llm/common/bindingUtils.h
@ -17,9 +17,13 @@
 #pragma once

 #include "c10/util/intrusive_ptr.h"
+#include "tensorrt_llm/common/config.h"
+
 #include <Python.h>

-namespace tensorrt_llm::common
+TRTLLM_NAMESPACE_BEGIN
+
+namespace common
 {

 // Adapted from pybind11's example implementation:
@ -69,4 +73,6 @@ c10::intrusive_ptr<T> get_intrusive_ptr(PyObject* py_obj, std::string pybind11_a
    return c10::intrusive_ptr<T>::reclaim_copy(p);
 }

-} // namespace tensorrt_llm::common
+} // namespace common
+
+TRTLLM_NAMESPACE_END
--- a/cpp/include/tensorrt_llm/common/config.h
+++ b/cpp/include/tensorrt_llm/common/config.h
@ -0,0 +1,62 @@
+/*
+ * Copyright (c) 2022-2025, NVIDIA CORPORATION.  All rights reserved.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#pragma once
+#ifndef TRTLLM_CONFIG_H
+#define TRTLLM_CONFIG_H
+
+/**
+ * \def TRTLLM_ABI_NAMESPACE
+ * This macro is used to open an implicitly inline namespace block for the ABI version.
+ * This macro can be overridden to change the ABI version.
+ * The default ABI version is _v1.
+ */
+#ifndef TRTLLM_ABI_NAMESPACE
+#define TRTLLM_ABI_NAMESPACE _v1
+#endif
+
+#ifndef TRTLLM_ABI_NAMESPACE_BEGIN
+#define TRTLLM_ABI_NAMESPACE_BEGIN                                                                                     \
+    inline namespace TRTLLM_ABI_NAMESPACE                                                                              \
+    {
+#endif
+
+#ifndef TRTLLM_ABI_NAMESPACE_END
+#define TRTLLM_ABI_NAMESPACE_END }
+#endif
+
+/**
+ * \def TRTLLM_NAMESPACE_BEGIN
+ * This macro is used to open a `tensorrt_llm::` namespace block, along with any
+ * enclosing namespaces requested by TRTLLM_WRAPPED_NAMESPACE, etc.
+ * This macro is defined by TensorRT-LLM and may not be overridden.
+ */
+#define TRTLLM_NAMESPACE_BEGIN                                                                                         \
+    namespace tensorrt_llm                                                                                             \
+    {                                                                                                                  \
+    TRTLLM_ABI_NAMESPACE_BEGIN
+
+/**
+ * \def TRTLLM_NAMESPACE_END
+ * This macro is used to close a `tensorrt_llm::` namespace block, along with any
+ * enclosing namespaces requested by TRTLLM_WRAPPED_NAMESPACE, etc.
+ * This macro is defined by TensorRT-LLM and may not be overridden.
+ */
+#define TRTLLM_NAMESPACE_END                                                                                           \
+    TRTLLM_ABI_NAMESPACE_END                                                                                           \
+    }  /* end namespace tensorrt_llm */
+
+#endif // TRTLLM_CONFIG_H
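Review note on config.h above: because `TRTLLM_ABI_NAMESPACE` opens an inline namespace, existing spellings such as `tensorrt_llm::common::fmtstr` keep resolving while the symbols themselves live in `tensorrt_llm::_v1`. The sketch below reproduces the macros from the new header in condensed form and checks that both spellings name the same function. `whoAmI` and `sameFunction` are hypothetical names used only for the demonstration.

```cpp
#include <cassert>
#include <string>

// Macros reproduced from config.h in this diff, condensed onto single lines.
#ifndef TRTLLM_ABI_NAMESPACE
#define TRTLLM_ABI_NAMESPACE _v1
#endif
#define TRTLLM_ABI_NAMESPACE_BEGIN inline namespace TRTLLM_ABI_NAMESPACE {
#define TRTLLM_ABI_NAMESPACE_END }
#define TRTLLM_NAMESPACE_BEGIN namespace tensorrt_llm { TRTLLM_ABI_NAMESPACE_BEGIN
#define TRTLLM_NAMESPACE_END TRTLLM_ABI_NAMESPACE_END }

TRTLLM_NAMESPACE_BEGIN

namespace common
{
// Hypothetical function, present only to demonstrate name lookup.
inline std::string whoAmI()
{
    return "common::whoAmI";
}
} // namespace common

TRTLLM_NAMESPACE_END

// The unversioned and the versioned spelling refer to the same entity.
inline bool sameFunction()
{
    return &tensorrt_llm::common::whoAmI == &tensorrt_llm::_v1::common::whoAmI;
}
```

This also explains the assert.h hunk: `DebugConfig` now sits inside `tensorrt_llm`, so macros expanded at global scope have to spell it `tensorrt_llm::DebugConfig`.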
--- a/cpp/include/tensorrt_llm/common/cudaFp8Utils.h
+++ b/cpp/include/tensorrt_llm/common/cudaFp8Utils.h
@ -16,6 +16,8 @@

 #pragma once

+#include "tensorrt_llm/common/config.h"
+
 #ifdef ENABLE_FP8
 #include <cuda_fp8.h>
 #include <cuda_runtime.h>
@ -29,8 +31,8 @@
 #define USE_QGMMA
 #endif

-namespace tensorrt_llm
-{
+TRTLLM_NAMESPACE_BEGIN
+
 namespace common
 {

@ -320,5 +322,6 @@ void invokeComputeScalesAndQuantizeMatrix(T_OUT* output, T_S* quant_ptr, const T
    const int64_t lda, QuantizeMode quantize_mode, cudaStream_t stream);

 } // namespace common
-} // namespace tensorrt_llm
+
+TRTLLM_NAMESPACE_END
 #endif // ENABLE_FP8
--- a/cpp/include/tensorrt_llm/common/cudaProfilerUtils.h
+++ b/cpp/include/tensorrt_llm/common/cudaProfilerUtils.h
@ -14,12 +14,18 @@
 * limitations under the License.
 */

+#pragma once
+
+#include "tensorrt_llm/common/config.h"
+
 #include <cstdint>
 #include <optional>
 #include <string>
 #include <unordered_set>

-namespace tensorrt_llm::common
+TRTLLM_NAMESPACE_BEGIN
+
+namespace common
 {

 /// @brief Populate the start and end profiling iteration indexes from the provided environment variables
@ -28,4 +34,6 @@ namespace tensorrt_llm::common
 std::pair<std::unordered_set<int32_t>, std::unordered_set<int32_t>> populateIterationIndexes(
    std::string const& envVarName, std::optional<std::string> const& legacyEnvVarName = std::nullopt);

-} // namespace tensorrt_llm::common
+} // namespace common
+
+TRTLLM_NAMESPACE_END
--- a/cpp/include/tensorrt_llm/common/cudaUtils.h
+++ b/cpp/include/tensorrt_llm/common/cudaUtils.h
@ -16,9 +16,13 @@
 */
 #pragma once

+#include "tensorrt_llm/common/config.h"
 #include "tensorrt_llm/common/cudaBf16Wrapper.h"
 #include "tensorrt_llm/common/cudaDriverWrapper.h"
 #include "tensorrt_llm/common/cudaFp8Utils.h"
+#if ENABLE_FP4
+#include <cuda_fp4.h>
+#endif
 #include "tensorrt_llm/common/logger.h"
 #include "tensorrt_llm/common/tllmException.h"
 #include <algorithm>
@ -35,6 +39,7 @@
 #include <optional>
 #include <sstream>
 #include <string>
+#include <unordered_map>
 #ifndef _WIN32 // Linux
 #include <sys/sysinfo.h>
 #endif         // not WIN32
@ -45,7 +50,9 @@
               // this undef.
 #endif         // WIN32

-namespace tensorrt_llm::common
+TRTLLM_NAMESPACE_BEGIN
+
+namespace common
 {

 // workspace for cublas gemm : 32MB
@ -429,6 +436,21 @@ inline int getMaxSharedMemoryPerBlockOptin()
    return nByteMaxSharedMemoryPerBlockOptin;
 }

+template <typename T>
+inline int getMaxActiveBlocksPerSM(T kernel, int blockSize, size_t dynamicSMemSize)
+{
+    static std::unordered_map<T, int> cache;
+    auto it = cache.find(kernel);
+    if (it != cache.end())
+    {
+        return it->second;
+    }
+    int numBlocks;
+    check_cuda_error(cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, kernel, blockSize, dynamicSMemSize));
+    cache[kernel] = numBlocks;
+    return numBlocks;
+}
+
 template <typename T1, typename T2>
 inline size_t divUp(T1 const& a, T2 const& b)
 {
@ -545,6 +567,9 @@ template void printArrayInfo(__nv_bfloat16 const* ptr, uint64_t nElement, std::s
 #ifdef ENABLE_FP8
 template void printArrayInfo(__nv_fp8_e4m3 const* ptr, uint64_t nElement, std::string name, bool const bPrintElement);
 #endif
+#ifdef ENABLE_FP4
+template void printArrayInfo(__nv_fp4_e2m1 const* ptr, uint64_t nElement, std::string name, bool const bPrintElement);
+#endif
 template void printArrayInfo(uint32_t const* ptr, uint64_t nElement, std::string name, bool const bPrintElement);
 template void printArrayInfo(uint64_t const* ptr, uint64_t nElement, std::string name, bool const bPrintElement);
 template void printArrayInfo(int const* ptr, uint64_t nElement, std::string name, bool const bPrintElement);
@ -1395,7 +1420,9 @@ DEFINE_MEMBER_CHECKER(deq)
 DEFINE_MEMBER_CHECKER(qua)
 DEFINE_MEMBER_CHECKER(high_preciecion_normed_output)

-} // namespace tensorrt_llm::common
+} // namespace common
+
+TRTLLM_NAMESPACE_END

 /*
 * Macros compliant with TensorRT coding conventions
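Review note on `getMaxActiveBlocksPerSM` in the cudaUtils.h hunks above: the memoization map is keyed by the kernel alone, so a second call with the same kernel but a different `blockSize` or `dynamicSMemSize` returns the first result. Access to the map is also unsynchronized. The host-only sketch below reproduces the structure to show the first effect. `KernelFn`, `kernelA`, `queryCount` and `fakeOccupancyQuery` are hypothetical stand-ins, and the occupancy formula is made up.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>

// Hypothetical stand-in for a __global__ function pointer.
using KernelFn = void (*)();

inline void kernelA() {}

// Counts how often the (fake) occupancy query actually runs.
inline int& queryCount()
{
    static int count = 0;
    return count;
}

// Stand-in for cudaOccupancyMaxActiveBlocksPerMultiprocessor; the formula is invented.
inline int fakeOccupancyQuery(KernelFn /*kernel*/, int blockSize, std::size_t /*dynamicSMemSize*/)
{
    ++queryCount();
    return 2048 / blockSize;
}

// Same structure as the helper in the hunk: the cache key is the kernel only.
template <typename T>
inline int getMaxActiveBlocksPerSM(T kernel, int blockSize, std::size_t dynamicSMemSize)
{
    static std::unordered_map<T, int> cache;
    auto it = cache.find(kernel);
    if (it != cache.end())
    {
        return it->second;
    }
    int numBlocks = fakeOccupancyQuery(kernel, blockSize, dynamicSMemSize);
    cache[kernel] = numBlocks;
    return numBlocks;
}
```

This is harmless if every call site uses a fixed launch configuration per kernel. Otherwise the key would need to include `blockSize` and `dynamicSMemSize`.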
--- a/cpp/include/tensorrt_llm/common/dataType.h
+++ b/cpp/include/tensorrt_llm/common/dataType.h
@ -16,11 +16,15 @@

 #pragma once

+#include "tensorrt_llm/common/config.h"
 #include "tensorrt_llm/common/tllmException.h"
+
 #include <NvInferRuntime.h>
 #include <map>

-namespace tensorrt_llm::common
+TRTLLM_NAMESPACE_BEGIN
+
+namespace common
 {

 constexpr static size_t getDTypeSize(nvinfer1::DataType type)
@ -84,4 +88,6 @@ constexpr static size_t getDTypeSizeInBits(nvinfer1::DataType type)
    return "";
 }

-} // namespace tensorrt_llm::common
+} // namespace common
+
+TRTLLM_NAMESPACE_END
--- a/cpp/include/tensorrt_llm/common/logger.h
+++ b/cpp/include/tensorrt_llm/common/logger.h
@ -22,9 +22,12 @@
 #include <string>

 #include "tensorrt_llm/common/assert.h"
+#include "tensorrt_llm/common/config.h"
 #include "tensorrt_llm/common/stringUtils.h"

-namespace tensorrt_llm::common
+TRTLLM_NAMESPACE_BEGIN
+
+namespace common
 {

 class Logger
@ -125,12 +128,12 @@ private:

    static inline std::string getPrefix(Level const level)
    {
-        return fmtstr("%s[%s] ", kPREFIX, getLevelName(level));
+        return tensorrt_llm::common::fmtstr("%s[%s] ", kPREFIX, getLevelName(level));
    }

    static inline std::string getPrefix(Level const level, int const rank)
    {
-        return fmtstr("%s[%s][%d] ", kPREFIX, getLevelName(level), rank);
+        return tensorrt_llm::common::fmtstr("%s[%s][%d] ", kPREFIX, getLevelName(level), rank);
    }
 };

@ -171,6 +174,9 @@ void Logger::log(Logger::Level const level, int const rank, char const* format,
        out << std::endl;
    }
 }
+} // namespace common
+
+TRTLLM_NAMESPACE_END

 #define TLLM_LOG(level, ...)                                                                                           \
    do                                                                                                                 \
@ -188,4 +194,3 @@ void Logger::log(Logger::Level const level, int const rank, char const* format,
 #define TLLM_LOG_WARNING(...) TLLM_LOG(tensorrt_llm::common::Logger::WARNING, __VA_ARGS__)
 #define TLLM_LOG_ERROR(...) TLLM_LOG(tensorrt_llm::common::Logger::ERROR, __VA_ARGS__)
 #define TLLM_LOG_EXCEPTION(ex, ...) tensorrt_llm::common::Logger::getLogger()->log(ex, ##__VA_ARGS__)
-} // namespace tensorrt_llm::common
--- a/cpp/include/tensorrt_llm/common/optionalRef.h
+++ b/cpp/include/tensorrt_llm/common/optionalRef.h
@ -16,11 +16,15 @@

 #pragma once

+#include "tensorrt_llm/common/config.h"
+
 #include <functional>
 #include <memory>
 #include <optional>

-namespace tensorrt_llm::common
+TRTLLM_NAMESPACE_BEGIN
+
+namespace common
 {

 /**
@ -100,4 +104,6 @@ public:
    }
 };

-} // namespace tensorrt_llm::common
+} // namespace common
+
+TRTLLM_NAMESPACE_END
--- a/cpp/include/tensorrt_llm/common/quantization.h
+++ b/cpp/include/tensorrt_llm/common/quantization.h
@ -16,12 +16,14 @@

 #pragma once

+#include "tensorrt_llm/common/config.h"
+
 #include <cstdint>
 #include <optional>
 #include <string>

-namespace tensorrt_llm
-{
+TRTLLM_NAMESPACE_BEGIN
+
 namespace common
 {

@ -480,4 +482,5 @@ public:
 };

 } // namespace common
-} // namespace tensorrt_llm
+
+TRTLLM_NAMESPACE_END
--- a/cpp/include/tensorrt_llm/common/stringUtils.h
+++ b/cpp/include/tensorrt_llm/common/stringUtils.h
@ -16,6 +16,7 @@

 #pragma once

+#include "tensorrt_llm/common/config.h"
 #if ENABLE_BF16
 #include <cuda_bf16.h>
 #endif // ENABLE_BF16
@ -28,7 +29,9 @@
 #include <unordered_set>
 #include <vector>

-namespace tensorrt_llm::common
+TRTLLM_NAMESPACE_BEGIN
+
+namespace common
 {
 #if ENABLE_BF16
 static inline std::basic_ostream<char>& operator<<(std::basic_ostream<char>& stream, __nv_bfloat16 const& val)
@ -228,4 +231,6 @@ inline void toUpper(std::string& s)
    }
 }

-} // namespace tensorrt_llm::common
+} // namespace common
+
+TRTLLM_NAMESPACE_END
--- a/cpp/include/tensorrt_llm/common/tllmException.h
+++ b/cpp/include/tensorrt_llm/common/tllmException.h
@ -16,6 +16,7 @@

 #pragma once

+#include "tensorrt_llm/common/config.h"
 #include "tensorrt_llm/common/stringUtils.h"

 #include <array>
@ -41,7 +42,9 @@
    tensorrt_llm::common::RequestSpecificException(                                                                    \
        __FILE__, __LINE__, tensorrt_llm::common::fmtstr(__VA_ARGS__).c_str(), requestID, errorCode)

-namespace tensorrt_llm::common
+TRTLLM_NAMESPACE_BEGIN
+
+namespace common
 {

 /// @brief Enumeration of different error codes for request-specific exceptions
@ -77,7 +80,8 @@ private:

 [[noreturn]] inline void throwRuntimeError(char const* const file, int const line, char const* info)
 {
-    throw TllmException(file, line, fmtstr("[TensorRT-LLM][ERROR] Assertion failed: %s", info).c_str());
+    throw TllmException(
+        file, line, tensorrt_llm::common::fmtstr("[TensorRT-LLM][ERROR] Assertion failed: %s", info).c_str());
 }

 [[noreturn]] inline void throwRuntimeError(char const* const file, int const line, std::string const& info = "")
@ -102,4 +106,6 @@ private:
    RequestErrorCode mErrorCode;
 };

-} // namespace tensorrt_llm::common
+} // namespace common
+
+TRTLLM_NAMESPACE_END
--- a/cpp/include/tensorrt_llm/common/utils.h
+++ b/cpp/include/tensorrt_llm/common/utils.h
@ -16,6 +16,8 @@

 #pragma once

+#include "tensorrt_llm/common/config.h"
+
 #include <algorithm>
 #include <initializer_list>
 #include <string>
@ -24,7 +26,9 @@
 #include <pthread.h>
 #endif

-namespace tensorrt_llm::common
+TRTLLM_NAMESPACE_BEGIN
+
+namespace common
 {

 inline bool setThreadName(std::string const& name)
@ -43,4 +47,6 @@ bool contains(std::initializer_list<T> const& c, T const& v)
    return std::find(c.begin(), c.end(), v) != c.end();
 }

-} // namespace tensorrt_llm::common
+} // namespace common
+
+TRTLLM_NAMESPACE_END
--- a/cpp/include/tensorrt_llm/executor/cacheCommunicator.h
+++ b/cpp/include/tensorrt_llm/executor/cacheCommunicator.h
@ -17,6 +17,7 @@
 #pragma once

 #include "tensorrt_llm/executor/serialization.h"
+#include <atomic>
 #include <vector>

 namespace tensorrt_llm::executor::kv_cache
@ -27,8 +28,9 @@ class CommState;
 struct DataContext
 {
 public:
-    explicit DataContext(int tag)
+    explicit DataContext(int tag, std::atomic<bool> const& transferTerminate = sDefaultTransferTerminate)
        : mTag{tag}
+        , mTransferTerminate(transferTerminate)
    {
    }

@ -37,8 +39,15 @@ public:
        return mTag;
    }

+    [[nodiscard]] std::atomic<bool> const& getTransferTerminate() const noexcept
+    {
+        return mTransferTerminate;
+    }
+
 private:
+    inline static std::atomic<bool> sDefaultTransferTerminate{false};
    int const mTag;
+    std::atomic<bool> const& mTransferTerminate;
 };

 class Connection
@ -66,6 +75,7 @@ public:
    [[nodiscard]] virtual std::vector<Connection const*> getConnections(CommState const& state) = 0;

    [[nodiscard]] virtual CommState const& getCommState() const = 0;
+    [[nodiscard]] virtual bool isRunning() const = 0;
 };

 } // namespace tensorrt_llm::executor::kv_cache
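Review note on `DataContext` above: the context stores a reference to the termination flag, so a caller-supplied flag must outlive the context, and contexts built without one share the static default. The sketch below is a condensed copy of the struct plus a hypothetical `copyChunks` loop that polls the flag. How the real transfer code consumes `getTransferTerminate()` is not shown in this diff.

```cpp
#include <atomic>
#include <cassert>

// Condensed copy of the struct from the hunk, for illustration only.
struct DataContext
{
public:
    explicit DataContext(int tag, std::atomic<bool> const& transferTerminate = sDefaultTransferTerminate)
        : mTag{tag}
        , mTransferTerminate(transferTerminate)
    {
    }

    [[nodiscard]] int getTag() const noexcept
    {
        return mTag;
    }

    [[nodiscard]] std::atomic<bool> const& getTransferTerminate() const noexcept
    {
        return mTransferTerminate;
    }

private:
    inline static std::atomic<bool> sDefaultTransferTerminate{false};
    int const mTag;
    std::atomic<bool> const& mTransferTerminate; // non-owning: the flag must outlive this object
};

// Hypothetical transfer loop: counts chunks until done or until the flag is raised.
inline int copyChunks(DataContext const& ctx, int numChunks)
{
    int copied = 0;
    for (; copied < numChunks; ++copied)
    {
        if (ctx.getTransferTerminate().load())
        {
            break;
        }
    }
    return copied;
}
```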
--- a/cpp/include/tensorrt_llm/executor/dataTransceiverState.h
+++ b/cpp/include/tensorrt_llm/executor/dataTransceiverState.h
@ -50,7 +50,8 @@ public:

    CacheState(ModelConfig modelConfig, runtime::WorldConfig const& worldConfig,
        std::vector<SizeType32> const& attentionLayerNumPerPP, nvinfer1::DataType dataType,
-        AttentionType attentionType = AttentionType::kDEFAULT, int kvFactor = 2, bool enableBlockReuse = false)
+        AttentionType attentionType = AttentionType::kDEFAULT, int kvFactor = 2, bool enableBlockReuse = false,
+        bool hasIndexerKCache = false, SizeType32 indexerDimPerHead = 0, SizeType32 indexerKCacheQuantBlockSize = 128)
        : mModelConfig(std::move(modelConfig))
        , mParallelConfig{worldConfig.getTensorParallelism(), worldConfig.getPipelineParallelism(),
              worldConfig.getContextParallelism(), worldConfig.enableAttentionDP(), worldConfig.getTensorParallelRank(),
@ -59,13 +60,17 @@ public:
        , mAttentionConfig(attentionType, kvFactor)
    {
        mEnableBlockReuse = enableBlockReuse;
+        mHasIndexerKCache = hasIndexerKCache;
+        mIndexerDimPerHead = indexerDimPerHead;
+        mIndexerKCacheQuantBlockSize = indexerKCacheQuantBlockSize;
    }

    CacheState(std::vector<SizeType32> nbKvHeadPerLayer, SizeType32 sizePerHead, SizeType32 tokensPerBlock,
        SizeType32 tensorParallelism, SizeType32 pipelineParallelism, SizeType32 contextParallelism,
        std::vector<SizeType32> const& attentionLayerNumPerPP, nvinfer1::DataType dataType,
        AttentionType attentionType = AttentionType::kDEFAULT, int kvFactor = 2, bool enableAttentionDP = false,
-        int DPrank = 0, int DPsize = 0, bool enableBlockReuse = false)
+        int DPrank = 0, int DPsize = 0, bool enableBlockReuse = false, bool hasIndexerKCache = false,
+        SizeType32 indexerDimPerHead = 0, SizeType32 indexerKCacheQuantBlockSize = 128)
        : mModelConfig{std::move(nbKvHeadPerLayer), sizePerHead, tokensPerBlock}
        , mParallelConfig{tensorParallelism, pipelineParallelism, contextParallelism, enableAttentionDP, DPrank, DPsize,
              attentionLayerNumPerPP}
@ -73,13 +78,17 @@ public:
        , mAttentionConfig(attentionType, kvFactor)
    {
        mEnableBlockReuse = enableBlockReuse;
+        mHasIndexerKCache = hasIndexerKCache;
+        mIndexerDimPerHead = indexerDimPerHead;
+        mIndexerKCacheQuantBlockSize = indexerKCacheQuantBlockSize;
    }

    CacheState(SizeType32 nbAttentionLayers, SizeType32 nbKvHeads, SizeType32 sizePerHead, SizeType32 tokensPerBlock,
        SizeType32 tensorParallelism, SizeType32 pipelineParallelism, SizeType32 contextParallelism,
        std::vector<SizeType32> const& attentionLayerNumPerPP, nvinfer1::DataType dataType,
        AttentionType attentionType = AttentionType::kDEFAULT, int kvFactor = 2, bool enableAttentionDP = false,
-        int DPrank = 0, int DPsize = 0, bool enableBlockReuse = false)
+        int DPrank = 0, int DPsize = 0, bool enableBlockReuse = false, bool hasIndexerKCache = false,
+        SizeType32 indexerDimPerHead = 0, SizeType32 indexerKCacheQuantBlockSize = 128)
        : mModelConfig{std::vector(nbAttentionLayers, nbKvHeads), sizePerHead, tokensPerBlock}
        , mParallelConfig{tensorParallelism, pipelineParallelism, contextParallelism, enableAttentionDP, DPrank, DPsize,
              attentionLayerNumPerPP}
@ -87,6 +96,9 @@ public:
        , mAttentionConfig(attentionType, kvFactor)
    {
        mEnableBlockReuse = enableBlockReuse;
+        mHasIndexerKCache = hasIndexerKCache;
+        mIndexerDimPerHead = indexerDimPerHead;
+        mIndexerKCacheQuantBlockSize = indexerKCacheQuantBlockSize;
    }

    [[nodiscard]] bool operator==(kv_cache::CacheState const& other) const noexcept
@ -174,6 +186,21 @@ public:
        return mEnableBlockReuse;
    }

+    [[nodiscard]] bool getHasIndexerKCache() const
+    {
+        return mHasIndexerKCache;
+    }
+
+    [[nodiscard]] SizeType32 getIndexerDimPerHead() const
+    {
+        return mIndexerDimPerHead;
+    }
+
+    [[nodiscard]] SizeType32 getIndexerKCacheQuantBlockSize() const
+    {
+        return mIndexerKCacheQuantBlockSize;
+    }
+
    [[nodiscard]] std::string toString() const
    {
        std::stringstream sstring;
@ -194,6 +221,9 @@ public:
        sstring << "dpRank:" << mParallelConfig.mDPrank << "\n";
        sstring << "dpSize:" << mParallelConfig.mDPsize << "\n";
        sstring << "enableBlockReuse:" << mEnableBlockReuse << "\n";
+        sstring << "hasIndexerKCache:" << mHasIndexerKCache << "\n";
+        sstring << "indexerDimPerHead:" << mIndexerDimPerHead << "\n";
+        sstring << "indexerKCacheQuantBlockSize:" << mIndexerKCacheQuantBlockSize << "\n";
        return sstring.str();
    }

@ -204,6 +234,9 @@ private:
    nvinfer1::DataType mDataType;
    AttentionConfig mAttentionConfig;
    bool mEnableBlockReuse{false};
+    bool mHasIndexerKCache{false};
+    SizeType32 mIndexerDimPerHead{0};
+    SizeType32 mIndexerKCacheQuantBlockSize{128};
 };

 struct MpiState
--- a/cpp/include/tensorrt_llm/executor/executor.h
+++ b/cpp/include/tensorrt_llm/executor/executor.h
@ -47,6 +47,12 @@ class BaseKVCacheManager;
 namespace tensorrt_llm::executor
 {

+using SizeType32 = tensorrt_llm::runtime::SizeType32;
+// MmKey is used in KVCacheBlock when multimodal data is present in a block.
+// Type alias for hash array + start offset at per-block granularity.
+// This differs from the per-request level multimodal hash in MultimodalInput.
+using MmKey = std::pair<std::array<uint8_t, 32>, SizeType32>;
+
 /// @brief Version of TRT-LLM
 char const* version() noexcept;

@ -1462,19 +1468,23 @@ public:
        DEFAULT = 0,
        MPI = 1,
        UCX = 2,
-        NIXL = 3
+        NIXL = 3,
+        MOONCAKE = 4
    };
    explicit CacheTransceiverConfig(std::optional<BackendType> backendType = std::nullopt,
-        std::optional<size_t> maxNumTokens = std::nullopt, std::optional<int> kvTransferTimeoutMs = std::nullopt);
+        std::optional<size_t> maxNumTokens = std::nullopt, std::optional<int> kvTransferTimeoutMs = std::nullopt,
+        std::optional<int> kvTransferSenderFutureTimeoutMs = std::nullopt);

    bool operator==(CacheTransceiverConfig const& other) const;
    void setBackendType(std::optional<BackendType> backendType);
    void setMaxTokensInBuffer(std::optional<size_t> maxTokensInBuffer);
    void setKvTransferTimeoutMs(std::optional<int> kvTransferTimeoutMs);
+    void setKvTransferSenderFutureTimeoutMs(std::optional<int> kvTransferSenderFutureTimeoutMs);

-    [[nodiscard]] std::optional<int> getKvTransferTimeoutMs() const;
    [[nodiscard]] std::optional<size_t> getMaxTokensInBuffer() const;
    [[nodiscard]] std::optional<BackendType> getBackendType() const;
+    [[nodiscard]] std::optional<int> getKvTransferTimeoutMs() const;
+    [[nodiscard]] std::optional<int> getKvTransferSenderFutureTimeoutMs() const;

 private:
    std::optional<BackendType> mBackendType;
@ -1483,6 +1493,9 @@ private:
    /// transfer may be degraded.
    std::optional<size_t> mMaxTokensInBuffer;
    std::optional<int> mKvTransferTimeoutMs;
+    /// @brief Timeout in milliseconds to wait for the sender future to be ready when the scheduled batch size is 0.
+    /// This allows the request to be eventually cancelled by the user or because of kv_transfer_timeout_ms.
+    std::optional<int> mKvTransferSenderFutureTimeoutMs;
 };

 /// @brief Configuration class for the model executor
@ -1685,12 +1698,14 @@ struct KVCacheStoredBlockData
 {

    KVCacheStoredBlockData(IdType blockHash, tensorrt_llm::runtime::VecUniqueTokens tokens,
-        std::optional<tensorrt_llm::runtime::LoraTaskIdType> loraId, SizeType32 cacheLevel, SizeType32 priority)
+        std::optional<tensorrt_llm::runtime::LoraTaskIdType> loraId, SizeType32 cacheLevel, SizeType32 priority,
+        std::vector<MmKey> mmKeys = {})
        : blockHash{blockHash}
        , tokens{std::move(tokens)}
        , loraId{loraId}
        , cacheLevel{cacheLevel}
        , priority{priority}
+        , mmKeys{std::move(mmKeys)}
    {
    }

@ -1704,6 +1719,8 @@ struct KVCacheStoredBlockData
    SizeType32 cacheLevel;
    /// @brief The priority of the block
    SizeType32 priority;
+    /// @brief The multimodal keys of the block
+    std::vector<MmKey> mmKeys;
 };

 struct KVCacheStoredData
--- a/cpp/include/tensorrt_llm/executor/transferAgent.h
+++ b/cpp/include/tensorrt_llm/executor/transferAgent.h
@ -274,13 +274,20 @@ private:
    std::optional<SyncMessage> mSyncMessage;
 };

+enum class TransferState : uint8_t
+{
+    kIN_PROGRESS,
+    kSUCCESS,
+    kFAILURE,
+};
+
 // Data structure for checking the status of active transfer operations.
 class TransferStatus
 {
 public:
    virtual ~TransferStatus() = default;
    [[nodiscard]] virtual bool isCompleted() const = 0;
-    virtual void wait() const = 0;
+    virtual TransferState wait(int64_t timeout_ms = -1) const = 0;
 };

 struct BaseAgentConfig
@ -288,6 +295,8 @@ struct BaseAgentConfig
    std::string mName;
    bool useProgThread;
    bool multiThread;
+    bool useListenThread;
+    unsigned int numWorkers;
 };

 class BaseTransferAgent
@ -391,6 +400,14 @@ template <typename... Args>
            "libtensorrt_llm_nixl_wrapper.so", "createNixlTransferAgent");
        return func(std::forward<Args>(args)...);
    }
+    if (backend == "mooncake")
+    {
+        auto& loader = DynLibLoader::getInstance();
+        using CreateMooncakeFuncType = std::unique_ptr<BaseTransferAgent> (*)(BaseAgentConfig const*);
+        auto* func = loader.getFunctionPointer<CreateMooncakeFuncType>(
+            "libtensorrt_llm_mooncake_wrapper.so", "createMooncakeTransferAgent");
+        return func(std::forward<Args>(args)...);
+    }
    TLLM_THROW("Unknown backend name.");
 }

--- a/cpp/include/tensorrt_llm/kernels/archCondition.h
+++ b/cpp/include/tensorrt_llm/kernels/archCondition.h
@ -16,7 +16,11 @@

 #pragma once

-namespace tensorrt_llm::kernels
+#include "tensorrt_llm/common/config.h"
+
+TRTLLM_NAMESPACE_BEGIN
+
+namespace kernels
 {

 namespace detail
@ -110,4 +114,6 @@ inline constexpr bool is_compatible_v = is_compatible<Arch>::value;

 } // namespace arch

-} // namespace tensorrt_llm::kernels
+} // namespace kernels
+
+TRTLLM_NAMESPACE_END
--- a/cpp/include/tensorrt_llm/kernels/decodingCommon.h
+++ b/cpp/include/tensorrt_llm/kernels/decodingCommon.h
@ -17,11 +17,14 @@
 #pragma once

 #include "tensorrt_llm/common/assert.h"
+#include "tensorrt_llm/common/config.h"
 #include "tensorrt_llm/executor/types.h"
 #include <cstdint>
 #include <curand_kernel.h>

-namespace tensorrt_llm::kernels
+TRTLLM_NAMESPACE_BEGIN
+
+namespace kernels
 {

 class FinishedState
@ -308,4 +311,6 @@ template <typename T>
 void invokeScatterDecodingParams(
    T const* src, T scalar, T* dst, int const* batchSlots, int batchSize, cudaStream_t stream);

-} // namespace tensorrt_llm::kernels
+} // namespace kernels
+
+TRTLLM_NAMESPACE_END
--- a/cpp/include/tensorrt_llm/kernels/kvCacheIndex.h
+++ b/cpp/include/tensorrt_llm/kernels/kvCacheIndex.h
@ -17,11 +17,14 @@
 #pragma once

 #include "tensorrt_llm/common/assert.h"
+#include "tensorrt_llm/common/config.h"

 #include <cstdint>
 #include <cuda_runtime.h>

-namespace tensorrt_llm::kernels
+TRTLLM_NAMESPACE_BEGIN
+
+namespace kernels
 {

 class KVCacheIndex
@ -53,4 +56,6 @@ private:
    UnderlyingType value;
 };

-} // namespace tensorrt_llm::kernels
+} // namespace kernels
+
+TRTLLM_NAMESPACE_END
--- a/cpp/include/tensorrt_llm/kernels/kvCachePartialCopy.h
+++ b/cpp/include/tensorrt_llm/kernels/kvCachePartialCopy.h
@ -14,16 +14,18 @@
 * limitations under the License.
 */

+#include "tensorrt_llm/common/config.h"
 #include "tensorrt_llm/runtime/iBuffer.h"

 using namespace tensorrt_llm::runtime;

-namespace tensorrt_llm
-{
+TRTLLM_NAMESPACE_BEGIN
+
 namespace kernels
 {
 void kvCacheBlockPartialCopy(IBuffer& dst, IBuffer const& src, unsigned int numLayers, unsigned int numHeads,
    unsigned int tokensPerBlock, unsigned int numHidden, unsigned int numTokensToCopy, int kvFactor,
    cudaStream_t stream);
 } // namespace kernels
-} // namespace tensorrt_llm
+
+TRTLLM_NAMESPACE_END
--- a/cpp/include/tensorrt_llm/runtime/gptDecoderBatched.h
+++ b/cpp/include/tensorrt_llm/runtime/gptDecoderBatched.h
@ -52,8 +52,9 @@ public:

    void disableLookahead(RequestVector const& genRequests, TensorPtr const& batchSlots) override;

-    CudaEvent forwardAsync(decoder::DecoderState const& decoderState, decoder_batch::Input const& input) override;
-    void forward(decoder::DecoderState const& decoderState, decoder_batch::Input const& input) override;
+    CudaEvent forwardAsync(
+        decoder::DecoderState const& decoderState, batch_manager::DecoderInputBuffers const& input) override;
+    void forward(decoder::DecoderState const& decoderState, batch_manager::DecoderInputBuffers const& input) override;

    //! @brief Gather final beam search results for request `batchSlot`.
    //! Result will only be available after event returned.
@ -77,7 +78,7 @@ public:

 private:
    //! @brief Calls decoders for tokens per engine step
-    void forwardDispatch(decoder::DecoderState const& decoderState, decoder_batch::Input const& input);
+    void forwardDispatch(decoder::DecoderState const& decoderState, batch_manager::DecoderInputBuffers const& input);

 private:
    CudaStreamPtr mRuntimeStream;
--- a/cpp/include/tensorrt_llm/runtime/iGptDecoderBatched.h
+++ b/cpp/include/tensorrt_llm/runtime/iGptDecoderBatched.h
@ -27,8 +27,9 @@

 namespace tensorrt_llm::batch_manager
 {
+class DecoderInputBuffers;
 class LlmRequest;
-}
+} // namespace tensorrt_llm::batch_manager

 namespace tensorrt_llm::runtime
 {
@ -39,43 +40,6 @@ namespace decoder
 class DecoderState;
 }

-namespace decoder_batch
-{
-
-class Input
-{
-public:
-    using TensorConstPtr = ITensor::SharedConstPtr;
-    using TensorPtr = ITensor::SharedPtr;
-
-    explicit Input(std::vector<std::vector<TensorConstPtr>> const& logits, SizeType32 maxDecoderSteps)
-        : logits{logits}
-        , maxDecoderSteps{maxDecoderSteps}
-    {
-        TLLM_CHECK_WITH_INFO(
-            logits.size() == static_cast<size_t>(maxDecoderSteps), "logits vector size does not match maxDecoderSteps");
-    }
-
-    explicit Input(std::vector<TensorConstPtr> const& logits)
-        : Input{{logits}, 1}
-    {
-    }
-
-    //! Mandatory parameters
-    //! Logits
-    // FIXME: remove first dimension of tensors
-    //! [maxDecoderSteps][batchSize][1, beamWidth, vocabSizePadded], on gpu
-    std::vector<std::vector<TensorConstPtr>> logits;
-
-    //! Maximum number of decoding tokens of active slots
-    SizeType32 maxDecoderSteps;
-
-    //! Batch of active decoder slots, sorted by slots, [maxDecoderSteps][batchSize]
-    std::vector<TensorPtr> batchSlots;
-};
-
-} // namespace decoder_batch
-
 //! GPT decoder class with support for in-flight batching
 class IGptDecoderBatched
 {
@ -94,10 +58,13 @@ public:
    virtual void disableLookahead(RequestVector const& genRequests, TensorPtr const& batchSlots) = 0;

    //! @brief Run one step for all requests without blocking the host process and return the token for synchronization.
-    virtual CudaEvent forwardAsync(decoder::DecoderState const& decoderState, decoder_batch::Input const& input) = 0;
+    virtual CudaEvent forwardAsync(
+        decoder::DecoderState const& decoderState, batch_manager::DecoderInputBuffers const& input)
+        = 0;

    //! @brief Run one step for all requests and wait for completion on the host.
-    virtual void forward(decoder::DecoderState const& decoderState, decoder_batch::Input const& input) = 0;
+    virtual void forward(decoder::DecoderState const& decoderState, batch_manager::DecoderInputBuffers const& input)
+        = 0;

    //! @brief Gather final beam search results for request `batchIdx`.
    //! Result will only be available after event returned
--- a/cpp/include/tensorrt_llm/runtime/iTensor.h
+++ b/cpp/include/tensorrt_llm/runtime/iTensor.h
@ -65,7 +65,6 @@ public:

    //!
    //! \brief Returns the tensor n-th dimension. If n is negative, returns the (nbDims - n)th dimension.
-    //! TODO: replace with constexpr parameter when moving to C++20.
    //!
    template <SizeType32 n>
    [[nodiscard]] DimType64 getDimension() const
--- a/cpp/include/tensorrt_llm/runtime/virtualMemory.h
+++ b/cpp/include/tensorrt_llm/runtime/virtualMemory.h
@ -22,9 +22,11 @@
 #include "tensorrt_llm/runtime/iBuffer.h"
 #include "tensorrt_llm/runtime/memoryCounters.h"

+#include <atomic>
 #include <cuda.h>
 #include <map>
 #include <mutex>
+#include <numeric>
 #include <unistd.h>
 #include <utility>

@ -466,7 +468,7 @@ public:
        CudaVirtualMemoryManager& mManager;
        std::string mTag;
        CudaStreamPtr mBackStream;
-        std::size_t mPageSize;
+        std::atomic<std::size_t> mAlignment;
        RestoreMode mMode;
        bool mBackground{};

@ -487,14 +489,45 @@ public:
            : mManager(manager)
            , mTag(std::move(tag))
            , mBackStream(std::move(backStream))
-            , mPageSize(getpagesize())
+            , mAlignment(0)
            , mMode(mode)
        {
        }

-        [[nodiscard]] std::size_t pageAligned(std::size_t n) const noexcept
+        [[nodiscard]] std::size_t aligned(std::size_t n, int device = 0)
        {
-            return (n + mPageSize - 1) & ~(mPageSize - 1);
+            // Lazily load the alignment, since the CUDA driver may not yet be initialized when Configuration is
+            // constructed.
+            // We have one process for each GPU so caching the value is fine.
+            constexpr std::size_t loading = std::numeric_limits<std::size_t>::max();
+            std::size_t alignment = 0;
+            if (mAlignment.compare_exchange_strong(alignment, loading, std::memory_order_relaxed))
+            {
+                std::size_t gpuAlignment = 1;
+                CUmemAllocationProp const prop{CU_MEM_ALLOCATION_TYPE_PINNED, CU_MEM_HANDLE_TYPE_NONE,
+                    {
+                        CU_MEM_LOCATION_TYPE_DEVICE,
+                        device,
+                    }};
+                TLLM_CU_CHECK(
+                    cuMemGetAllocationGranularity(&gpuAlignment, &prop, CU_MEM_ALLOC_GRANULARITY_RECOMMENDED));
+                alignment = std::lcm(getpagesize(), gpuAlignment);
+                mAlignment.store(alignment, std::memory_order_relaxed);
+            }
+            else
+            {
+                // spin wait
+                while (alignment == loading)
+                {
+#if defined(__x86_64__)
+                    asm volatile("pause");
+#elif defined(__aarch64__)
+                    asm volatile("yield");
+#endif
+                    alignment = mAlignment.load(std::memory_order_relaxed);
+                }
+            }
+            return (n + alignment - 1) / alignment * alignment;
        }

        // Background configuration, used to indicate no virtual memory allocator is explicitly configured by the user.
--- a/cpp/include/tensorrt_llm/runtime/worldConfig.h
+++ b/cpp/include/tensorrt_llm/runtime/worldConfig.h
@ -104,12 +104,14 @@ public:

    [[nodiscard]] SizeType32 constexpr getTensorParallelRank() const noexcept
    {
-        return mRank % mTensorParallelism;
+        // Layout: pp is outermost, then tp, then cp is innermost (consecutive).
+        return (mRank % (mTensorParallelism * mContextParallelism)) / mContextParallelism;
    }

    [[nodiscard]] SizeType32 constexpr getContextParallelRank() const noexcept
    {
-        return (mRank % (mTensorParallelism * mContextParallelism)) / mTensorParallelism;
+        // Layout: pp is outermost, then tp, then cp is innermost (consecutive).
+        return mRank % mContextParallelism;
    }

    [[nodiscard]] SizeType32 constexpr getLocalRank() const noexcept
--- a/cpp/kernels/fmha_v2/Makefile
+++ b/cpp/kernels/fmha_v2/Makefile
@ -1,18 +1,18 @@
 # ##################################################################################################
-#  Copyright (c) 2011-2023, NVIDIA CORPORATION.  All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2011-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
 #
-#  Redistribution and use in source and binary forms, with or without modification, are not permit-
-#  ted.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
 #
-#  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
-#  IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
-#  FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
-#  FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
-#  BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFIT;
-#  OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
-#  STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
-#  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+# http://www.apache.org/licenses/LICENSE-2.0
 #
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 # ##################################################################################################

 # #################################################################################################
@ -69,6 +69,11 @@ PREPROCESSOR_FLAGS += -DUSE_SAME_SUM_ORDER_IN_SOFTMAX_AS_REF_CODE
 # Do we want to use half accumulation for flash attention
 PREPROCESSOR_FLAGS += -DHALF_ACCUMULATION_FOR_FLASH_ATTENTION

+# Print the resulting sparsity for the given threshold in Skip-Softmax attention.
+# Note: To use it inside TRTLLM, you only need to run "python scripts/build_wheel.py -D SKIP_SOFTMAX_STAT=ON ...".
+# Turn this on manually only if you want to build and run the unit test (bin/fmha.exe) with SKIP_SOFTMAX_STAT.
+# PREPROCESSOR_FLAGS += -DSKIP_SOFTMAX_STAT
+
 # Add FLAGS when generating cubins.
 ifdef GENERATE_CUBIN
 	PREPROCESSOR_FLAGS += -DGENERATE_CUBIN
--- a/cpp/kernels/fmha_v2/NVIDIA
+++ b/cpp/kernels/fmha_v2/NVIDIA
--- a/cpp/kernels/fmha_v2/conftest.py
+++ b/cpp/kernels/fmha_v2/conftest.py
@ -1,12 +1,17 @@
-# SPDX-FileCopyrightText: Copyright (c) 2023-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: NVIDIA TensorRT Source Code License Agreement
+# SPDX-FileCopyrightText: Copyright (c) 2023-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
 #
-# NVIDIA CORPORATION, its affiliates and licensors retain all intellectual
-# property and proprietary rights in and to this material, related
-# documentation and any modifications thereto. Any use, reproduction,
-# disclosure or distribution of this material and related documentation
-# without an express license agreement from NVIDIA CORPORATION or
-# its affiliates is strictly prohibited.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.

 import subprocess

--- a/cpp/kernels/fmha_v2/setup.py
+++ b/cpp/kernels/fmha_v2/setup.py
@ -1,12 +1,17 @@
-# SPDX-FileCopyrightText: Copyright (c) 2020-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: NVIDIA TensorRT Source Code License Agreement
+# SPDX-FileCopyrightText: Copyright (c) 2020-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
 #
-# NVIDIA CORPORATION, its affiliates and licensors retain all intellectual
-# property and proprietary rights in and to this material, related
-# documentation and any modifications thereto. Any use, reproduction,
-# disclosure or distribution of this material and related documentation
-# without an express license agreement from NVIDIA CORPORATION or
-# its affiliates is strictly prohibited.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.

 import os
 import subprocess
@ -149,7 +154,9 @@ spec_fields = (
    'head_size_v',
    'sage_block_sizes',
    'output_dtype',
-    'is_mtp')
+    'is_mtp',
+    'enable_skip_softmax',
+)
 kernel_spec = namedtuple('kernel_spec', spec_fields)
 kernel_spec.__new__.__defaults__ = (
    1,  # ctas_per_head
@ -174,7 +181,9 @@ kernel_spec.__new__.__defaults__ = (
    0,  # head size of V
    None,  # sage_block_sizes
    None,  # output_dtype, same as dtype by default.
-    False)  # use MTP or not
+    False,  # use MTP or not
+    False,  # enable skip softmax
+)

 generate_cu_trtllm = os.environ.get('GENERATE_CU_TRTLLM',
                                    'False').lower() == 'true'
@ -195,38 +204,22 @@ ns_close = r"""

 copyright = '''\
 /***************************************************************************************************
- * Copyright (c) 2011-2023, NVIDIA CORPORATION.  All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 2011-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
 *
- * Redistribution and use in source and binary forms, with or without modification, are not permit-
- * ted.
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
 *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
- * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
- * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
- * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
- * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
- * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ * http://www.apache.org/licenses/LICENSE-2.0
 *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
 **************************************************************************************************/
-''' if not generate_cu_trtllm else r"""/*
-* SPDX-FileCopyrightText: Copyright (c) 1993-2024 NVIDIA CORPORATION &
-* AFFILIATES. All rights reserved. SPDX-License-Identifier: Apache-2.0
-*
-* Licensed under the Apache License, Version 2.0 (the "License");
-* you may not use this file except in compliance with the License.
-* You may obtain a copy of the License at
-*
-* http://www.apache.org/licenses/LICENSE-2.0
-*
-* Unless required by applicable law or agreed to in writing, software
-* distributed under the License is distributed on an "AS IS" BASIS,
-* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-* See the License for the specific language governing permissions and
-* limitations under the License.
-*/
-"""
+'''

 makefile_template = '''\

@ -1446,6 +1439,7 @@ using Ktraits = {kernel_traits_header}
                USE_TMA_STORE,
                {enable_attn_logit_softcapping_flag},
                {return_softmax_stats_flag},
+                {enable_skip_softmax_flag},
                {output_dtype_},
                {sage_block_size_q},
                {sage_block_size_k},
@ -1469,6 +1463,7 @@ using Ktraits_causal = {kernel_traits_header}
                       USE_TMA_STORE,
                       {enable_attn_logit_softcapping_flag},
                       {return_softmax_stats_flag},
+                       {enable_skip_softmax_flag},
                       {output_dtype_}>;

 using Ktraits_sliding_or_chunked_causal = {kernel_traits_header}
@ -1489,6 +1484,7 @@ using Ktraits_sliding_or_chunked_causal = {kernel_traits_header}
                                      USE_TMA_STORE && false,
                                      {enable_attn_logit_softcapping_flag},
                                      {return_softmax_stats_flag},
+                                      {enable_skip_softmax_flag},
                                      {output_dtype_}>;

 using Ktraits_custom_mask = {kernel_traits_header}
@ -1509,6 +1505,7 @@ using Ktraits_custom_mask = {kernel_traits_header}
                            USE_TMA_STORE && false,
                            {enable_attn_logit_softcapping_flag},
                            {return_softmax_stats_flag},
+                            {enable_skip_softmax_flag},
                            {output_dtype_}>;

 ////////////////////////////////////////////////////////////////////////////////////////////////////
@ -1846,6 +1843,8 @@ def encode_name(kernel_spec):

    if kernel_spec.enable_attn_logit_softcapping:
        feature_tags += '_softcapping'
+    if kernel_spec.enable_skip_softmax:
+        feature_tags += '_skipSoftmax'
    if kernel_spec.sage_block_sizes:
        feature_tags += f"_sage_{'_'.join(map(str, kernel_spec.sage_block_sizes))}"
    if kernel_spec.output_dtype:
@ -2142,6 +2141,8 @@ def get_kernel_code(kspec, kname, lname):

    return_softmax_stats_flag = pythonBoolean2cpp[kspec.return_softmax_stats]

+    enable_skip_softmax_flag = pythonBoolean2cpp[kspec.enable_skip_softmax]
+
    # needed by warpspec kernels.
    fp8_kernel = kspec.dtype in ["e4m3", "e4m3_fp32"]
    kernel_traits_header =  "fmha::ws::Kernel_traits_Hopper_qgmma_e4m3_fp32<" if fp8_kernel \
@ -2170,7 +2171,8 @@ def get_kernel_code(kspec, kname, lname):
    params_str = 'reinterpret_cast<bert::Fused_multihead_attention_params_v2 &>(params)' if generate_cu_trtllm else 'params'
    attn_mask_type_str = 'using Attention_mask_type = ContextAttentionMaskType;' if generate_cu_trtllm else 'using Attention_mask_type = fmha::Attention_mask_type;'
    bert_launch_params = '' if generate_cu_trtllm else 'using Launch_params = bert::Fused_multihead_attention_launch_params;'
-    include_str = '#include "../fused_multihead_attention_common.h"' if generate_cu_trtllm else ''
+    include_str = '#include "../fused_multihead_attention_common.h"\n' if generate_cu_trtllm else ''
+    include_str += '#include "tensorrt_llm/common/config.h"' if generate_cu_trtllm else ''
    num_compute_groups_str = '' if generate_cu_trtllm else 'static constexpr int NUM_COMPUTE_GROUPS = 2;'
    fused_multihead_attention_params_v2_str = 'Fused_multihead_attention_params_v2' if generate_cu_trtllm else f'{params_type}'
    const_fused_multihead_attention_params_v2_str = 'Fused_multihead_attention_params_v2' if generate_cu_trtllm else f'const {params_type}'
@ -2196,8 +2198,19 @@ def get_kernel_code(kspec, kname, lname):
        const int COMPUTE_REG_COUNT = {compute_reg_count};
        asm volatile("{{setmaxnreg.inc.sync.aligned.u32 %0; \n\t}}" ::"n"(COMPUTE_REG_COUNT));'''.format(
        compute_reg_count=compute_reg_count)
-    local_ns_open = ns_open if generate_cu_trtllm else ''
-    local_ns_close = ns_close if generate_cu_trtllm else ''
+    abi_ns_open = r"""
+TRTLLM_NAMESPACE_BEGIN
+namespace kernels
+{
+// clang-format off
+"""
+    abi_ns_close = r"""
+// clang-format on
+} // namespace kernels
+TRTLLM_NAMESPACE_END
+"""
+    local_ns_open = abi_ns_open if generate_cu_trtllm else ''
+    local_ns_close = abi_ns_close if generate_cu_trtllm else ''

    tmp = dict(locals(), **kspec._asdict())

@ -2330,6 +2343,8 @@ def get_api_code(specs_names):
                f'&& sage_block_size_k == {sage_block_size_k} ' \
                f'&& sage_block_size_v == {sage_block_size_v} '

+            il_check += '&& enable_skip_softmax ' if kspec.enable_skip_softmax else '&& !enable_skip_softmax '
+
        il_check += '&& params.use_int8_scale_max ' if kspec.has_scale_max else '&& !params.use_int8_scale_max '

        slen = kspec.seq_len * kspec.ctas_per_head if not kspec.flash_attention else 0
@ -2606,6 +2621,7 @@ const bool warp_specialization               = launch_params.warp_specialization
 const bool use_tma                           = launch_params.use_tma;
 const bool use_flash_attention               = launch_params.flash_attention;
 const bool enable_attn_logit_softcapping     = launch_params.enable_attn_logit_softcapping;
+const bool enable_skip_softmax               = launch_params.enable_skip_softmax;
 const int  attention_input_layout            = static_cast<int>(launch_params.attention_input_layout);
 // tiled variant uses ldgsts
 const bool  use_tiled            = launch_params.use_granular_tiling;
@ -2784,6 +2800,8 @@ def get_kernel_traits_code(specs_names):
        enable_attn_logit_softcapping_flag = pythonBoolean2cpp[
            kspec.enable_attn_logit_softcapping]

+        enable_skip_softmax_flag = pythonBoolean2cpp[kspec.enable_skip_softmax]
+
        tmp = dict(locals(), **kspec._asdict())

        if effective_sm < 90:
@ -2902,7 +2920,8 @@ def get_kernel_traits_code(specs_names):
                                  {input_layout_flag},
                                  __use_tma_store__ /* USE_TMA_STORE */,
                                  {enable_attn_logit_softcapping_flag},
-                                  {return_softmax_stats_flag}>;
+                                  {return_softmax_stats_flag},
+                                  {enable_skip_softmax_flag}>;

            printf("%s %d %d %s %d %d\\n",
                \"{kname}\",
@ -3061,20 +3080,32 @@ def get_kernel_traits_code(specs_names):
 # For now:
 # 1. Hopper head_size 128 kernel uses cubins for performance regressions.
 # 2. Hopper sm89 with e4m3/e4m3_fp32 dtype uses cubins for accuracy regressions (will be fixed).
+# 3. Kernels built with the skip-softmax attention feature never use cubins.
 # You should set the condition `use_cubin_header` to false if you have modified the source codes of those kernels that use cubins.
 # This ensures that the kernels will be recompiled using the updated source code rather than relying on precompiled cubins.
-def use_cubin_header(sm, head_size, dtype):
+def use_cubin_header(sm,
+                     head_size,
+                     dtype,
+                     output_dtype=None,
+                     enable_skip_softmax=False):
+    if enable_skip_softmax:
+        return False
+    if 'e4m3' in dtype and output_dtype in ['bf16', 'fp16']:
+        return False
    return (sm == 90 and head_size == 128) or (sm == 89 and 'e4m3' in dtype)


 def get_cubin_header(kernel_traits, specs_names):
    cubins = []
    cubin_lens = []
+    launchers = []
    cubins_dict = {}
    cubin_lens_dict = {}
+    launchers_dict = {}
    for kspec, fname, lname, kname in specs_names:
        if generate_cu_trtllm and not use_cubin_header(
-                kspec.sm, kspec.head_size, kspec.dtype):
+                kspec.sm, kspec.head_size, kspec.dtype, kspec.output_dtype,
+                kspec.enable_skip_softmax):
            continue
        name = fname.replace('.', '_')
        data = 'extern unsigned char cubin_{name}_cubin[];'.format(name=name)
@ -3106,8 +3137,9 @@ def get_cubin_header(kernel_traits, specs_names):
                                'q_kv_', '').replace('q_paged_kv_', '').replace(
                                    'q_k_v_', '').replace('ws_', '').replace(
                                        'softcapping_',
-                                        '').replace('sage_',
-                                                    '').replace('output_', ''))
+                                        '').replace('sage_', '').replace(
+                                            'skipSoftmax_',
+                                            '').replace('output_', ''))
        flash_attention = 'flash_attention' in kname
        warp_specialization = 'tma_ws' in kname
        toks = tname.split('_')
@ -3204,6 +3236,8 @@ def get_cubin_header(kernel_traits, specs_names):
        return_softmax_stats_flag = pythonBoolean2cpp[sm != '90' or (
            sm == '90' and '_softmax' in kname)]

+        enable_skip_softmax_flag = pythonBoolean2cpp['_skipSoftmax' in kname]
+
        # meta_unroll_step
        meta_unroll_step = unroll_step if ('_nl' in kname
                                           or '_ws' in kname) else '0'
@ -3229,7 +3263,9 @@ def get_cubin_header(kernel_traits, specs_names):
            if generate_cu_trtllm:

                def get_lname_from_kname(kname: str) -> str:
-                    if use_cubin_header(int(sm), int(head_size), prec.lower()):
+                    if use_cubin_header(int(sm), int(head_size), prec.lower(),
+                                        output_prec.lower(),
+                                        '_skipSoftmax' in kname):
                        return 'nullptr'
                    lname = kname.replace('_kernel', '')
                    mask_types = [
@ -3247,14 +3283,15 @@ def get_cubin_header(kernel_traits, specs_names):
 {sage_block_sizes[0]}, {sage_block_sizes[1]}, {sage_block_sizes[2]}, kSM_{sm}, {cubin_name}, \
 {cubin_name}_len, \"{kname}\", {smem}, {threads}, {meta_unroll_step}, {attention_mask_type_value}, \
 {attention_input_layout_value}, {is_il}, {is_flash_atten}, {is_warp_specialization}, {is_fp32_accu}, \
-{is_alibi_supported}, {is_tiled}, {has_softcapping_scale}, {return_softmax_stats_flag}, {lname}}}\
+{is_alibi_supported}, {is_tiled}, {has_softcapping_scale}, {return_softmax_stats_flag}, {enable_skip_softmax_flag}, {lname}}}\
 '''.format(**locals()) if use_cubin_header(int(sm), int(head_size),
-                                           prec.lower()) else '''\
+                                           prec.lower(), output_prec.lower(),
+                                           '_skipSoftmax' in kname) else '''\
 {{ DATA_TYPE_{prec}, DATA_TYPE_{output_prec}, {seq_len}, {q_step}, {kv_step}, {head_size}, {head_size_v}, \
 {sage_block_sizes[0]}, {sage_block_sizes[1]}, {sage_block_sizes[2]}, kSM_{sm}, nullptr, \
 0, \"{kname}\", {smem}, {threads}, {meta_unroll_step}, {attention_mask_type_value}, \
 {attention_input_layout_value}, {is_il}, {is_flash_atten}, {is_warp_specialization}, {is_fp32_accu}, \
-{is_alibi_supported}, {is_tiled}, {has_softcapping_scale}, {return_softmax_stats_flag}, {lname}}}\
+{is_alibi_supported}, {is_tiled}, {has_softcapping_scale}, {return_softmax_stats_flag}, {enable_skip_softmax_flag}, {lname}}}\
 '''.format(**locals())
            else:
                code = '''\
@ -3262,7 +3299,7 @@ def get_cubin_header(kernel_traits, specs_names):
 {sage_block_sizes[0]}, {sage_block_sizes[1]}, {sage_block_sizes[2]}, kSM_{sm}, {cubin_name}, \
 {cubin_name}_len, \"{kname}\", {smem}, {threads}, {meta_unroll_step}, {attention_mask_type_value}, \
 {attention_input_layout_value}, {is_il}, {is_flash_atten}, {is_warp_specialization}, {is_fp32_accu}, \
-{is_alibi_supported}, {is_tiled}, {has_softcapping_scale}, {return_softmax_stats_flag}}}\
+{is_alibi_supported}, {is_tiled}, {has_softcapping_scale}, {return_softmax_stats_flag}, {enable_skip_softmax_flag}}}\
 '''.format(**locals())
            if sm in metadata_v2_dict:
                metadata_v2_dict[sm].append(code)
@ -3273,11 +3310,11 @@ def get_cubin_header(kernel_traits, specs_names):
            if generate_cu_trtllm and lname != 'nullptr':
                launcher = 'extern void {lname}(Fused_multihead_attention_params_v2& params, const Launch_params& launch_params, cudaStream_t stream);'.format(
                    lname=lname)
-                if int(sm) in cubins_dict:
-                    if launcher not in cubins_dict[int(sm)]:
-                        cubins_dict[int(sm)].append(launcher)
+                if int(sm) in launchers_dict:
+                    if launcher not in launchers_dict[int(sm)]:
+                        launchers_dict[int(sm)].append(launcher)
                else:
-                    cubins_dict[int(sm)] = [launcher]
+                    launchers_dict[int(sm)] = [launcher]
        elif 'mhca' in kname:
            code = '''\
 {{ DATA_TYPE_{prec}, {seq_len}, {q_step}, {kv_step}, {head_size}, kSM_{sm},  {cubin_name}, {cubin_name}_len, \"{kname}\", {smem}, {threads}, {meta_unroll_step}, {is_il} }}\
@ -3300,17 +3337,33 @@ def get_cubin_header(kernel_traits, specs_names):
    else:
        metadata_v2 = ',\n'.join(metadata_v2)
    # Add macros to only include needed cubins during compilation.
-    for sm in cubins_dict.keys():
+    # Collect all SM versions from all dictionaries
+    all_sms = sorted(
+        set(
+            list(cubins_dict.keys()) + list(cubin_lens_dict.keys()) +
+            list(launchers_dict.keys())))
+
+    for sm in all_sms:
        macro_begin = f"#ifndef EXCLUDE_SM_{sm}"
        macro_end = f"#endif\n"
-        cubins.extend([macro_begin] + cubins_dict[sm] + [macro_end])
+
+        # Add cubin array declarations
+        if sm in cubins_dict:
+            cubins.extend([macro_begin] + cubins_dict[sm] + [macro_end])
+
+        # Add cubin length declarations
        if sm in cubin_lens_dict:
            cubin_lens.extend([macro_begin] + cubin_lens_dict[sm] + [macro_end])

+        # Add launcher declarations
+        if sm in launchers_dict:
+            launchers.extend([macro_begin] + launchers_dict[sm] + [macro_end])
+
    unroll_config_v1 = ',\n'.join(unroll_config_v1)
    unroll_config_v2 = ',\n'.join(unroll_config_v2)
    cubins = '\n'.join(cubins)
    cubin_lens = '\n'.join(cubin_lens)
+    launchers = '\n'.join(launchers)
    local_ns_open = ns_open
    local_ns_close = ns_close if generate_cu_trtllm else '}'
    launcher_line = '''
@ -3354,7 +3407,8 @@ static const struct FusedMultiHeadAttentionKernelMetaInfoV2
    bool mAlibiSupported;
    bool mTiled;
    bool mEnableAttnLogitSoftcapping;
-    bool mReturnSoftmaxStats;{launcher_line}
+    bool mReturnSoftmaxStats;
+    bool mEnableSkipSoftmax;{launcher_line}
 }} sMhaKernelMetaInfosV2[] = {{
 {metadata_v2}
 }};
@ -3415,6 +3469,7 @@ static const struct TestMetaV2
    bool mTiled;
    bool mEnableAttnLogitSoftcapping;
    bool mReturnSoftmaxStats;
+    bool mEnableSkipSoftmax;
 }} metaV2[] = {{
 {metadata_v2}
 }};
@ -3422,7 +3477,159 @@ static const struct TestMetaV2

 '''.format(**locals(), copyright=copyright)

-    return code
+    # Generate header content (.h file)
+    if "GENERATE_CUBIN" in os.environ:
+        header_content = '''\
+{copyright}
+#pragma once
+
+#include "tensorrt_llm/common/config.h"
+
+TRTLLM_NAMESPACE_BEGIN
+namespace kernels{{
+
+struct FusedMultiHeadAttentionKernelMetaInfoV2
+{{
+    Data_type mDataTypeIn;
+    Data_type mDataTypeOut;
+    unsigned int mS;
+    unsigned int mStepQ;
+    unsigned int mStepKV;
+    unsigned int mD;
+    unsigned int mDV;
+    unsigned int mSageBlockSizeQ;
+    unsigned int mSageBlockSizeK;
+    unsigned int mSageBlockSizeV;
+    unsigned int mSM;
+    const unsigned char* mCubin;
+    unsigned int mCubinSize;
+    const char* mFuncName;
+    unsigned int mSharedMemBytes;
+    unsigned int mThreadsPerCTA;
+    unsigned int mUnrollStep;
+    int mAttentionMaskType;
+    int mAttentionInputLayout;
+    bool mInterleaved;
+    bool mFlashAttention;
+    bool mWarpSpecialization;
+    bool mFP32Accumulation;
+    bool mAlibiSupported;
+    bool mTiled;
+    bool mEnableAttnLogitSoftcapping;
+    bool mReturnSoftmaxStats;
+    bool mEnableSkipSoftmax;{launcher_line}
+}};
+
+extern const FusedMultiHeadAttentionKernelMetaInfoV2 sMhaKernelMetaInfosV2[];
+extern const int sMhaKernelMetaInfosV2Size;
+
+}} // namespace kernels
+TRTLLM_NAMESPACE_END
+'''.format(**locals(), copyright=copyright)
+        # Generate source content (.cpp file)
+        source_content = '''\
+{copyright}
+
+#include "tensorrt_llm/common/config.h"
+
+#include <cstddef>
+#include <cstdint>
+#include <cuda_runtime_api.h>
+
+{local_ns_open}
+
+//--- Cubin Arrays
+{cubins}
+
+//--- Cubin Lengths
+{cubin_lens}
+
+{local_ns_close}
+
+using namespace tensorrt_llm::kernels;
+
+namespace tensorrt_llm::TRTLLM_ABI_NAMESPACE::kernels {{
+
+class Fused_multihead_attention_params_v2;
+class Launch_params;
+
+//--- Kernel Launchers
+{launchers}
+
+// FIXME: These are duplicated declarations, we should remove them in the future.
+constexpr int32_t kSM_70 = 70;
+constexpr int32_t kSM_72 = 72;
+constexpr int32_t kSM_75 = 75;
+constexpr int32_t kSM_80 = 80;
+constexpr int32_t kSM_86 = 86;
+constexpr int32_t kSM_89 = 89;
+constexpr int32_t kSM_90 = 90;
+constexpr int32_t kSM_100 = 100;
+constexpr int32_t kSM_100f = 10100;
+constexpr int32_t kSM_103 = 103;
+constexpr int32_t kSM_120 = 120;
+constexpr int32_t kSM_121 = 121;
+
+// FIXME: These are duplicated declarations, we should remove them in the future.
+enum Data_type
+{{
+    DATA_TYPE_BOOL,
+    DATA_TYPE_FP16,
+    DATA_TYPE_FP32,
+    DATA_TYPE_INT4,
+    DATA_TYPE_INT8,
+    DATA_TYPE_INT32,
+    DATA_TYPE_BF16,
+    DATA_TYPE_E2M1,
+    DATA_TYPE_E4M3,
+    DATA_TYPE_E5M2
+}};
+
+struct FusedMultiHeadAttentionKernelMetaInfoV2
+{{
+    Data_type mDataTypeIn;
+    Data_type mDataTypeOut;
+    unsigned int mS;
+    unsigned int mStepQ;
+    unsigned int mStepKV;
+    unsigned int mD;
+    unsigned int mDV;
+    unsigned int mSageBlockSizeQ;
+    unsigned int mSageBlockSizeK;
+    unsigned int mSageBlockSizeV;
+    unsigned int mSM;
+    const unsigned char* mCubin;
+    unsigned int mCubinSize;
+    const char* mFuncName;
+    unsigned int mSharedMemBytes;
+    unsigned int mThreadsPerCTA;
+    unsigned int mUnrollStep;
+    int mAttentionMaskType;
+    int mAttentionInputLayout;
+    bool mInterleaved;
+    bool mFlashAttention;
+    bool mWarpSpecialization;
+    bool mFP32Accumulation;
+    bool mAlibiSupported;
+    bool mTiled;
+    bool mEnableAttnLogitSoftcapping;
+    bool mReturnSoftmaxStats;
+    bool mEnableSkipSoftmax;{launcher_line}
+}};
+
+extern const FusedMultiHeadAttentionKernelMetaInfoV2 sMhaKernelMetaInfosV2[] = {{
+{metadata_v2}
+}};
+
+extern const int sMhaKernelMetaInfosV2Size = sizeof(sMhaKernelMetaInfosV2) / sizeof(sMhaKernelMetaInfosV2[0]);
+}} // namespace tensorrt_llm::TRTLLM_ABI_NAMESPACE::kernels
+'''.format(**locals(), copyright=copyright)
+    else:
+        # Non-GENERATE_CUBIN mode: use old behavior
+        header_content = code
+        source_content = None
+
+    return header_content, source_content


 # This is used to add some kernels running in cubins for passing CI cases.
@ -3440,9 +3647,20 @@ def modify_cubin_header(cubin_header):
        return result

    target = "#ifndef EXCLUDE_SM_80"
-    addition = """extern unsigned char cubin_fmha_v2_flash_attention_fp16_64_128_S_q_paged_kv_128_sm80_cu_cubin[];
-extern uint32_t cubin_fmha_v2_flash_attention_fp16_64_128_S_q_paged_kv_128_sm80_cu_cubin_len;"""
-    result = add_kernel_line(result, target, addition)
+    addition_cubin_array = """
+#ifndef EXCLUDE_SM_80
+extern unsigned char cubin_fmha_v2_flash_attention_fp16_64_128_S_q_paged_kv_128_sm80_cu_cubin[];
+#endif
+"""
+    addition_cubin_length = """
+#ifndef EXCLUDE_SM_80
+extern uint32_t cubin_fmha_v2_flash_attention_fp16_64_128_S_q_paged_kv_128_sm80_cu_cubin_len;
+#endif
+"""
+    # Add the cubin array and length declarations to their corresponding sections.
+    result = add_kernel_line(result, "//--- Cubin Arrays", addition_cubin_array)
+    result = add_kernel_line(result, "//--- Cubin Lengths",
+                             addition_cubin_length)

    def modify_kernel_line(result, target, new_line):
        lines = result.split('\n')
@ -3453,7 +3671,7 @@ extern uint32_t cubin_fmha_v2_flash_attention_fp16_64_128_S_q_paged_kv_128_sm80_
        return '\n'.join(lines)

    target = "fmha_v2_flash_attention_fp16_64_128_S_q_paged_kv_128_causal_sm80_kernel_nl_tiled"
-    new_line = '{ DATA_TYPE_FP16, DATA_TYPE_FP16, 0, 64, 128, 128, 128, 0, 0, 0, kSM_80, cubin_fmha_v2_flash_attention_fp16_64_128_S_q_paged_kv_128_sm80_cu_cubin, cubin_fmha_v2_flash_attention_fp16_64_128_S_q_paged_kv_128_sm80_cu_cubin_len, "fmha_v2_flash_attention_fp16_64_128_S_q_paged_kv_128_causal_sm80_kernel_nl_tiled", 81920, 128, 64, 1, 2, false, true, false, false, true, true, false, true, nullptr},'
+    new_line = '{ DATA_TYPE_FP16, DATA_TYPE_FP16, 0, 64, 128, 128, 128, 0, 0, 0, kSM_80, cubin_fmha_v2_flash_attention_fp16_64_128_S_q_paged_kv_128_sm80_cu_cubin, cubin_fmha_v2_flash_attention_fp16_64_128_S_q_paged_kv_128_sm80_cu_cubin_len, "fmha_v2_flash_attention_fp16_64_128_S_q_paged_kv_128_causal_sm80_kernel_nl_tiled", 81920, 128, 64, 1, 2, false, true, false, false, true, true, false, true, false, nullptr},'
    result = modify_kernel_line(result, target, new_line)

    # make sure only one empty line at the end
@ -3525,13 +3743,22 @@ def generate_files(specs_names):
    output = output.decode('utf-8').strip()
    # this gives: kname, smem bytes, threads_per_cta, loop_step
    kernel_traits = [traits.split() for traits in output.splitlines()]
-    cubin_header = get_cubin_header(kernel_traits, valid_specs_names)
+    # get_cubin_header returns the contents of both fmha_cubin.h and fmha_cubin.cpp.
+    # The source content is None unless GENERATE_CUBIN is set in the environment.
+    cubin_header, cubin_source = get_cubin_header(kernel_traits,
+                                                  valid_specs_names)
    if generate_cu_trtllm:
-        cubin_header = modify_cubin_header(cubin_header)
+        cubin_source = modify_cubin_header(cubin_source)

+    # Write fmha_cubin.h file
    with open('./generated/fmha_cubin.h', 'w') as f:
        f.write(cubin_header)

+    # Write fmha_cubin.cpp file (same directory as fmha_cubin.h file)
+    if cubin_source is not None:
+        with open('./generated/fmha_cubin.cpp', 'w') as f:
+            f.write(cubin_source)
+

 def enumerate_hgmma_tma_kernels(specs, sm=90):
    specs.append(
@ -3608,7 +3835,10 @@ def enumerate_hgmma_ldgsts_kernels(specs, sm=90, dtype='fp16'):


 # Note this will be used in TRT-LLM.
-def enumerate_hgmma_flash_warpspec_kernels(specs, sm=90, dtype='fp16'):
+def enumerate_hgmma_flash_warpspec_kernels(specs,
+                                           sm=90,
+                                           dtype='fp16',
+                                           enable_skip_softmax=False):

    scheduling_mode = int(os.getenv('SCHEDULING_MODE', '1'))

@ -3658,7 +3888,8 @@ def enumerate_hgmma_flash_warpspec_kernels(specs, sm=90, dtype='fp16'):
                    enable_attn_logit_softcapping=enable_attn_logit_softcapping,
                    return_softmax_stats=return_softmax,
                    scheduling_mode=scheduling_mode,
-                    input_layout=input_layout))
+                    input_layout=input_layout,
+                    enable_skip_softmax=enable_skip_softmax))

            specs.append(
                kernel_spec(
@ -3690,7 +3921,8 @@ def enumerate_hgmma_flash_warpspec_kernels(specs, sm=90, dtype='fp16'):
                    enable_attn_logit_softcapping=enable_attn_logit_softcapping,
                    return_softmax_stats=return_softmax,
                    scheduling_mode=scheduling_mode,
-                    input_layout=input_layout))
+                    input_layout=input_layout,
+                    enable_skip_softmax=enable_skip_softmax))

            specs.append(
                kernel_spec(
@ -3722,7 +3954,8 @@ def enumerate_hgmma_flash_warpspec_kernels(specs, sm=90, dtype='fp16'):
                    enable_attn_logit_softcapping=enable_attn_logit_softcapping,
                    return_softmax_stats=return_softmax,
                    scheduling_mode=scheduling_mode,
-                    input_layout=input_layout))
+                    input_layout=input_layout,
+                    enable_skip_softmax=enable_skip_softmax))
        '''
        smem size = (q_step * d * q_buffers * NUM_COMPUTE_GROUPS
                    + (kv_step * d + kv_step * dv) * kv_buffers) * ele_size
@ -3774,7 +4007,8 @@ def enumerate_qgmma_flash_warpspec_kernels(specs,
                                           sm=90,
                                           dtype='e4m3',
                                           sage_block_sizes=None,
-                                           output_dtype=None):
+                                           output_dtype=None,
+                                           enable_skip_softmax=False):

    scheduling_mode = int(os.getenv('SCHEDULING_MODE', '1'))

@ -3791,7 +4025,7 @@ def enumerate_qgmma_flash_warpspec_kernels(specs,
            continue
        # for normal attention, we do not need return softmax for ws fp8 kernels currently.
        # also fp8 input and bf16 output is only needed for MLA kernel.
-        skip_combination = return_softmax or (output_dtype is not None)
+        skip_combination = return_softmax
        # for context mla, we need separate qkv as input layout when returning softmax.
        skip_mla_combination = return_softmax and input_layout != InputLayout.SEPARATE_Q_K_V
        if not skip_combination:
@ -3828,7 +4062,8 @@ def enumerate_qgmma_flash_warpspec_kernels(specs,
                    scheduling_mode=scheduling_mode,
                    input_layout=input_layout,
                    sage_block_sizes=sage_block_sizes,
-                    output_dtype=output_dtype))
+                    output_dtype=output_dtype,
+                    enable_skip_softmax=enable_skip_softmax))

            # 64 < D <=128: KV_STEP = 128
            specs.append(
@ -3863,7 +4098,8 @@ def enumerate_qgmma_flash_warpspec_kernels(specs,
                    scheduling_mode=scheduling_mode,
                    input_layout=input_layout,
                    sage_block_sizes=sage_block_sizes,
-                    output_dtype=output_dtype))
+                    output_dtype=output_dtype,
+                    enable_skip_softmax=enable_skip_softmax))

            # 128 < D <=256: KV_STEP = 128
            specs.append(
@ -3899,7 +4135,8 @@ def enumerate_qgmma_flash_warpspec_kernels(specs,
                    scheduling_mode=scheduling_mode,
                    input_layout=input_layout,
                    sage_block_sizes=sage_block_sizes,
-                    output_dtype=output_dtype))
+                    output_dtype=output_dtype,
+                    enable_skip_softmax=enable_skip_softmax))

        if not skip_mla_combination:
            # context MLA (192x128)
@ -6181,13 +6418,21 @@ def enumerate_kernels():
    enumerate_igmma_kernels(specs, sm=90)
    enumerate_qgmma_kernels(specs, sm=90)
    # need to add bf16 kernels if needed
-    enumerate_hgmma_flash_warpspec_kernels(specs, sm=90, dtype='fp16')
-    enumerate_hgmma_flash_warpspec_kernels(specs, sm=90, dtype='bf16')
-    enumerate_qgmma_flash_warpspec_kernels(specs, sm=90, dtype='e4m3')
-    enumerate_qgmma_flash_warpspec_kernels(specs,
-                                           sm=90,
-                                           dtype='e4m3',
-                                           output_dtype="bf16")
+    for enable_skip_softmax in [False, True]:
+        if enable_skip_softmax and 'DISABLE_SKIP_SOFTMAX' in os.environ:
+            continue
+        enumerate_hgmma_flash_warpspec_kernels(
+            specs, sm=90, dtype='fp16', enable_skip_softmax=enable_skip_softmax)
+        enumerate_hgmma_flash_warpspec_kernels(
+            specs, sm=90, dtype='bf16', enable_skip_softmax=enable_skip_softmax)
+        enumerate_qgmma_flash_warpspec_kernels(
+            specs, sm=90, dtype='e4m3', enable_skip_softmax=enable_skip_softmax)
+        enumerate_qgmma_flash_warpspec_kernels(
+            specs,
+            sm=90,
+            dtype='e4m3',
+            output_dtype="bf16",
+            enable_skip_softmax=enable_skip_softmax)

    # For now SageAttention only needs BF16
    # block_size_q should be divisible by 64
@ -6389,6 +6634,16 @@ def enumerate_kernels():
                  and kspec.cross_mha     == False
                  and kspec.flash_attention == True
                  and kspec.input_layout != InputLayout.SEPARATE_Q_K_V)
+                  # Gemma3 VL support.
+                  or  (kspec.sm           == 100
+                  and kspec.dtype         in ['fp16', 'bf16', 'fp16_fp32', 'e4m3', 'e4m3_fp32']
+                  and kspec.head_size     == 72
+                  and kspec.head_size_v   == 0
+                  and kspec.sage_block_sizes is None
+                  and kspec.version       == 2
+                  and kspec.cross_mha     == False
+                  and kspec.flash_attention == True
+                  and kspec.input_layout != InputLayout.SEPARATE_Q_K_V)
                  # Deepseek MLA (generation 576/512 paged)
                  or (kspec.sm            in [90, 100, 120]
                  and kspec.dtype         in ['bf16', 'e4m3_fp32']
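
Reviewer note: a minimal, standalone sketch of two pieces of logic from the setup.py changes above. `use_cubin_header` is copied from the diff. `guard_by_sm` is a hypothetical helper, not part of the PR, that mirrors how the declarations for each SM are wrapped in `#ifndef EXCLUDE_SM_<sm>` guards. The sketch assumes callers pass a real Python bool for `enable_skip_softmax`; a non-empty string such as 'false' is truthy in Python and would always disable cubins.

```python
def use_cubin_header(sm, head_size, dtype, output_dtype=None,
                     enable_skip_softmax=False):
    """Decide whether a kernel should be loaded from a precompiled cubin."""
    # Skip-softmax kernels are always compiled from source.
    if enable_skip_softmax:
        return False
    # FP8 input with a 16-bit output type is compiled from source as well.
    if 'e4m3' in dtype and output_dtype in ['bf16', 'fp16']:
        return False
    return (sm == 90 and head_size == 128) or (sm == 89 and 'e4m3' in dtype)


def guard_by_sm(decls_by_sm):
    """Hypothetical helper: wrap each SM's declarations in an exclusion guard."""
    lines = []
    for sm in sorted(decls_by_sm):
        lines.append(f"#ifndef EXCLUDE_SM_{sm}")
        lines.extend(decls_by_sm[sm])
        lines.append("#endif")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hopper, head size 128: cubin unless skip-softmax is enabled.
    print(use_cubin_header(90, 128, 'fp16'))
    print(use_cubin_header(90, 128, 'fp16', enable_skip_softmax=True))
    # The declaration name below is made up for illustration.
    print(guard_by_sm({90: ["extern void run_example_sm90();"]}))
```

Keeping cubin arrays, cubin lengths, and launcher declarations in separate per-SM dictionaries, as the diff does, lets each group be guarded and emitted into its own section of the generated file.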
--- a/cpp/kernels/fmha_v2/src/convert.cu
+++ b/cpp/kernels/fmha_v2/src/convert.cu
@ -1,13 +1,18 @@
 /*
- * SPDX-FileCopyrightText: Copyright (c) 2011-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
- * SPDX-License-Identifier: NVIDIA TensorRT Source Code License Agreement
+ * SPDX-FileCopyrightText: Copyright (c) 2011-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
 *
- * NVIDIA CORPORATION, its affiliates and licensors retain all intellectual
- * property and proprietary rights in and to this material, related
- * documentation and any modifications thereto. Any use, reproduction,
- * disclosure or distribution of this material and related documentation
- * without an express license agreement from NVIDIA CORPORATION or
- * its affiliates is strictly prohibited.
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
 */

 #include <fmha/numeric_types.h>
--- a/cpp/kernels/fmha_v2/src/fmha/alibi_params.h
+++ b/cpp/kernels/fmha_v2/src/fmha/alibi_params.h
@ -1,13 +1,18 @@
 /*
- * SPDX-FileCopyrightText: Copyright (c) 2011-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
- * SPDX-License-Identifier: NVIDIA TensorRT Source Code License Agreement
+ * SPDX-FileCopyrightText: Copyright (c) 2011-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
 *
- * NVIDIA CORPORATION, its affiliates and licensors retain all intellectual
- * property and proprietary rights in and to this material, related
- * documentation and any modifications thereto. Any use, reproduction,
- * disclosure or distribution of this material and related documentation
- * without an express license agreement from NVIDIA CORPORATION or
- * its affiliates is strictly prohibited.
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
 */

 #pragma once
--- a/cpp/kernels/fmha_v2/src/fmha/fragment.h
+++ b/cpp/kernels/fmha_v2/src/fmha/fragment.h
@ -1,13 +1,18 @@
 /*
- * SPDX-FileCopyrightText: Copyright (c) 2011-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
- * SPDX-License-Identifier: NVIDIA TensorRT Source Code License Agreement
+ * SPDX-FileCopyrightText: Copyright (c) 2011-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
 *
- * NVIDIA CORPORATION, its affiliates and licensors retain all intellectual
- * property and proprietary rights in and to this material, related
- * documentation and any modifications thereto. Any use, reproduction,
- * disclosure or distribution of this material and related documentation
- * without an express license agreement from NVIDIA CORPORATION or
- * its affiliates is strictly prohibited.
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
 */

 #pragma once
--- a/cpp/kernels/fmha_v2/src/fmha/gemm.h
+++ b/cpp/kernels/fmha_v2/src/fmha/gemm.h
@ -1,13 +1,18 @@
 /*
- * SPDX-FileCopyrightText: Copyright (c) 2011-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
- * SPDX-License-Identifier: NVIDIA TensorRT Source Code License Agreement
+ * SPDX-FileCopyrightText: Copyright (c) 2011-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
 *
- * NVIDIA CORPORATION, its affiliates and licensors retain all intellectual
- * property and proprietary rights in and to this material, related
- * documentation and any modifications thereto. Any use, reproduction,
- * disclosure or distribution of this material and related documentation
- * without an express license agreement from NVIDIA CORPORATION or
- * its affiliates is strictly prohibited.
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
 */

 #pragma once
--- a/cpp/kernels/fmha_v2/src/fmha/gmem_tile_o.h
+++ b/cpp/kernels/fmha_v2/src/fmha/gmem_tile_o.h
@ -1,13 +1,18 @@
 /*
- * SPDX-FileCopyrightText: Copyright (c) 2011-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
- * SPDX-License-Identifier: NVIDIA TensorRT Source Code License Agreement
+ * SPDX-FileCopyrightText: Copyright (c) 2011-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
 *
- * NVIDIA CORPORATION, its affiliates and licensors retain all intellectual
- * property and proprietary rights in and to this material, related
- * documentation and any modifications thereto. Any use, reproduction,
- * disclosure or distribution of this material and related documentation
- * without an express license agreement from NVIDIA CORPORATION or
- * its affiliates is strictly prohibited.
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
 */

 #pragma once
--- a/cpp/kernels/fmha_v2/src/fmha/gmem_tile_o_packed.h
+++ b/cpp/kernels/fmha_v2/src/fmha/gmem_tile_o_packed.h
@ -1,13 +1,18 @@
 /*
- * SPDX-FileCopyrightText: Copyright (c) 2011-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
- * SPDX-License-Identifier: NVIDIA TensorRT Source Code License Agreement
+ * SPDX-FileCopyrightText: Copyright (c) 2011-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
 *
- * NVIDIA CORPORATION, its affiliates and licensors retain all intellectual
- * property and proprietary rights in and to this material, related
- * documentation and any modifications thereto. Any use, reproduction,
- * disclosure or distribution of this material and related documentation
- * without an express license agreement from NVIDIA CORPORATION or
- * its affiliates is strictly prohibited.
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
 */

 #pragma once
--- a/cpp/kernels/fmha_v2/src/fmha/gmem_tile_ps.h
+++ b/cpp/kernels/fmha_v2/src/fmha/gmem_tile_ps.h
@ -1,13 +1,18 @@
 /*
- * SPDX-FileCopyrightText: Copyright (c) 2011-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
- * SPDX-License-Identifier: NVIDIA TensorRT Source Code License Agreement
+ * SPDX-FileCopyrightText: Copyright (c) 2011-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
 *
- * NVIDIA CORPORATION, its affiliates and licensors retain all intellectual
- * property and proprietary rights in and to this material, related
- * documentation and any modifications thereto. Any use, reproduction,
- * disclosure or distribution of this material and related documentation
- * without an express license agreement from NVIDIA CORPORATION or
- * its affiliates is strictly prohibited.
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
 */

 #pragma once
--- a/cpp/kernels/fmha_v2/src/fmha/gmem_tile_qkv.h
+++ b/cpp/kernels/fmha_v2/src/fmha/gmem_tile_qkv.h
@ -1,13 +1,18 @@
 /*
- * SPDX-FileCopyrightText: Copyright (c) 2011-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
- * SPDX-License-Identifier: NVIDIA TensorRT Source Code License Agreement
+ * SPDX-FileCopyrightText: Copyright (c) 2011-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
 *
- * NVIDIA CORPORATION, its affiliates and licensors retain all intellectual
- * property and proprietary rights in and to this material, related
- * documentation and any modifications thereto. Any use, reproduction,
- * disclosure or distribution of this material and related documentation
- * without an express license agreement from NVIDIA CORPORATION or
- * its affiliates is strictly prohibited.
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
 */

 #pragma once
--- a/cpp/kernels/fmha_v2/src/fmha/gmem_tile_qkv_packed.h
+++ b/cpp/kernels/fmha_v2/src/fmha/gmem_tile_qkv_packed.h
@ -1,13 +1,18 @@
 /*
- * SPDX-FileCopyrightText: Copyright (c) 2011-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
- * SPDX-License-Identifier: NVIDIA TensorRT Source Code License Agreement
+ * SPDX-FileCopyrightText: Copyright (c) 2011-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
 *
- * NVIDIA CORPORATION, its affiliates and licensors retain all intellectual
- * property and proprietary rights in and to this material, related
- * documentation and any modifications thereto. Any use, reproduction,
- * disclosure or distribution of this material and related documentation
- * without an express license agreement from NVIDIA CORPORATION or
- * its affiliates is strictly prohibited.
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
 */

 #pragma once
--- a/cpp/kernels/fmha_v2/src/fmha/hopper/arrive_wait.h
+++ b/cpp/kernels/fmha_v2/src/fmha/hopper/arrive_wait.h
@ -1,13 +1,18 @@
 /*
- * SPDX-FileCopyrightText: Copyright (c) 2011-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
- * SPDX-License-Identifier: NVIDIA TensorRT Source Code License Agreement
+ * SPDX-FileCopyrightText: Copyright (c) 2011-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
 *
- * NVIDIA CORPORATION, its affiliates and licensors retain all intellectual
- * property and proprietary rights in and to this material, related
- * documentation and any modifications thereto. Any use, reproduction,
- * disclosure or distribution of this material and related documentation
- * without an express license agreement from NVIDIA CORPORATION or
- * its affiliates is strictly prohibited.
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
 */

 #pragma once
--- a/cpp/kernels/fmha_v2/src/fmha/hopper/compute_tile.h
+++ b/cpp/kernels/fmha_v2/src/fmha/hopper/compute_tile.h
@ -1,13 +1,18 @@
 /*
- * SPDX-FileCopyrightText: Copyright (c) 2011-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
- * SPDX-License-Identifier: NVIDIA TensorRT Source Code License Agreement
+ * SPDX-FileCopyrightText: Copyright (c) 2011-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
 *
- * NVIDIA CORPORATION, its affiliates and licensors retain all intellectual
- * property and proprietary rights in and to this material, related
- * documentation and any modifications thereto. Any use, reproduction,
- * disclosure or distribution of this material and related documentation
- * without an express license agreement from NVIDIA CORPORATION or
- * its affiliates is strictly prohibited.
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
 */

 #pragma once
--- a/cpp/kernels/fmha_v2/src/fmha/hopper/fragment.h
+++ b/cpp/kernels/fmha_v2/src/fmha/hopper/fragment.h
@ -1,13 +1,18 @@
 /*
- * SPDX-FileCopyrightText: Copyright (c) 2011-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
- * SPDX-License-Identifier: NVIDIA TensorRT Source Code License Agreement
+ * SPDX-FileCopyrightText: Copyright (c) 2011-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
 *
- * NVIDIA CORPORATION, its affiliates and licensors retain all intellectual
- * property and proprietary rights in and to this material, related
- * documentation and any modifications thereto. Any use, reproduction,
- * disclosure or distribution of this material and related documentation
- * without an express license agreement from NVIDIA CORPORATION or
- * its affiliates is strictly prohibited.
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
 */

 #pragma once
--- a/cpp/kernels/fmha_v2/src/fmha/hopper/gmem_tile_o_packed.h
+++ b/cpp/kernels/fmha_v2/src/fmha/hopper/gmem_tile_o_packed.h
@ -1,13 +1,18 @@
 /*
- * SPDX-FileCopyrightText: Copyright (c) 2011-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
- * SPDX-License-Identifier: NVIDIA TensorRT Source Code License Agreement
+ * SPDX-FileCopyrightText: Copyright (c) 2011-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
 *
- * NVIDIA CORPORATION, its affiliates and licensors retain all intellectual
- * property and proprietary rights in and to this material, related
- * documentation and any modifications thereto. Any use, reproduction,
- * disclosure or distribution of this material and related documentation
- * without an express license agreement from NVIDIA CORPORATION or
- * its affiliates is strictly prohibited.
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
 */

 #pragma once
--- a/cpp/kernels/fmha_v2/src/fmha/hopper/gmem_tile_qkv_packed.h
+++ b/cpp/kernels/fmha_v2/src/fmha/hopper/gmem_tile_qkv_packed.h
@ -1,13 +1,18 @@
 /*
- * SPDX-FileCopyrightText: Copyright (c) 2011-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
- * SPDX-License-Identifier: NVIDIA TensorRT Source Code License Agreement
+ * SPDX-FileCopyrightText: Copyright (c) 2011-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
 *
- * NVIDIA CORPORATION, its affiliates and licensors retain all intellectual
- * property and proprietary rights in and to this material, related
- * documentation and any modifications thereto. Any use, reproduction,
- * disclosure or distribution of this material and related documentation
- * without an express license agreement from NVIDIA CORPORATION or
- * its affiliates is strictly prohibited.
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
 */

 #pragma once
--- a/cpp/kernels/fmha_v2/src/fmha/hopper/gmma_descriptor.h
+++ b/cpp/kernels/fmha_v2/src/fmha/hopper/gmma_descriptor.h
@ -1,13 +1,18 @@
 /*
- * SPDX-FileCopyrightText: Copyright (c) 2011-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
- * SPDX-License-Identifier: NVIDIA TensorRT Source Code License Agreement
+ * SPDX-FileCopyrightText: Copyright (c) 2011-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
 *
- * NVIDIA CORPORATION, its affiliates and licensors retain all intellectual
- * property and proprietary rights in and to this material, related
- * documentation and any modifications thereto. Any use, reproduction,
- * disclosure or distribution of this material and related documentation
- * without an express license agreement from NVIDIA CORPORATION or
- * its affiliates is strictly prohibited.
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
 */

 #pragma once
--- a/cpp/kernels/fmha_v2/src/fmha/hopper/kernel_traits.h
+++ b/cpp/kernels/fmha_v2/src/fmha/hopper/kernel_traits.h
@ -1,13 +1,18 @@
 /*
- * SPDX-FileCopyrightText: Copyright (c) 2011-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
- * SPDX-License-Identifier: NVIDIA TensorRT Source Code License Agreement
+ * SPDX-FileCopyrightText: Copyright (c) 2011-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
 *
- * NVIDIA CORPORATION, its affiliates and licensors retain all intellectual
- * property and proprietary rights in and to this material, related
- * documentation and any modifications thereto. Any use, reproduction,
- * disclosure or distribution of this material and related documentation
- * without an express license agreement from NVIDIA CORPORATION or
- * its affiliates is strictly prohibited.
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
 */

 #pragma once
--- a/cpp/kernels/fmha_v2/src/fmha/hopper/smem_tile.h
+++ b/cpp/kernels/fmha_v2/src/fmha/hopper/smem_tile.h
@ -1,13 +1,18 @@
 /*
- * SPDX-FileCopyrightText: Copyright (c) 2011-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
- * SPDX-License-Identifier: NVIDIA TensorRT Source Code License Agreement
+ * SPDX-FileCopyrightText: Copyright (c) 2011-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
 *
- * NVIDIA CORPORATION, its affiliates and licensors retain all intellectual
- * property and proprietary rights in and to this material, related
- * documentation and any modifications thereto. Any use, reproduction,
- * disclosure or distribution of this material and related documentation
- * without an express license agreement from NVIDIA CORPORATION or
- * its affiliates is strictly prohibited.
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
 */

 #pragma once
@ -1 +0,0 @@
-Subproject commit 9fa5965e265e27995f539e0dd73a06351a8a9eaf
@ -1 +0,0 @@
-Subproject commit a1ceb0677f67371ed29a2b1c022794f077db5fe7
@ -1 +0,0 @@
-Subproject commit c94c20743ed7d4aa37835a5c46567ab0790d4acc
@ -1 +0,0 @@
-Subproject commit f3fde58372d33e9a5650ba7b80fc48b3b49d40c8
@ -1 +0,0 @@
-Subproject commit eb787304d67ec22f7c3a184ee8b4c481d04357fd
@ -1 +0,0 @@
-Subproject commit 1408756a88e52a25196b759eaf8db89d2b51b5a1
@ -1 +0,0 @@
-Subproject commit 55f93686c01528224f448c19128836e7df245f72
@ -1 +0,0 @@
-Subproject commit a0ed2587f1089ef7657e2ed49ad6756b01c74e9f
@ -1 +0,0 @@
-Subproject commit f99ffd7e03001810a3e722bf48ad1a9e08415d7d
@ -1 +0,0 @@
-Subproject commit 16eaa57c8d98c8ef54d666a2d2b11e76cfa565f5