Commit Graph

352 Commits

Author SHA1 Message Date
Kaiyu Xie
e58afa510e
doc: Add README for wide EP (#6356)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-07-29 00:36:12 -04:00
Liana Koleva
96d004d800
doc: fix invalid link in llama 4 example documentation (#6340)
Signed-off-by: Liana Koleva <43767763+lianakoleva@users.noreply.github.com>
2025-07-26 11:27:10 -04:00
Michal Guzek
08d57123f9
[nvbug/5374773] chore: Add a runtime flag to enable fail fast when attn window is too large to fit at least one sequence in KV cache (#5974)
Signed-off-by: moraxu <mguzek@nvidia.com>
2025-07-25 18:10:40 -04:00
Simeng Liu
7bff341553
[doc] Add NGram tech blog (#6311)
Signed-off-by: Simeng Liu <simengl@nvidia.com>
2025-07-25 10:26:33 -07:00
nv-guomingz
31d3eff24b
doc: fix invalid links related with llm api example (#6317)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-07-24 00:46:51 -04:00
Yechan Kim
83c3ed128b
chore: set default device to cpu on Multimodal models (#5994)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-07-22 21:45:31 -07:00
Kaiyu Xie
f08286c679
doc: Refactor documents and examples of disaggregated serving and wide ep (#6054)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-07-23 09:20:57 +08:00
Lucas Liebenwein
41fb8aa8b1
[AutoDeploy] merge feat/ad-2025-07-07 (#6196)
Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>
Signed-off-by: Neta Zmora <96238833+nzmora-nvidia@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: nvchenghaoz <211069071+nvchenghaoz@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Co-authored-by: Gal Hubara-Agam <96368689+galagam@users.noreply.github.com>
Co-authored-by: Neta Zmora <nzmora@nvidia.com>
Co-authored-by: nvchenghaoz <211069071+nvchenghaoz@users.noreply.github.com>
Co-authored-by: Frida Hou  <201670829+Fridah-nv@users.noreply.github.com>
Co-authored-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Co-authored-by: Grzegorz Kwasniewski <213329731+greg-kwasniewski1@users.noreply.github.com>
2025-07-23 05:11:04 +08:00
Yiqing Yan
3e18ee5fe1
chore: bump version to 1.0.0rc5 (#6252)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-07-22 16:24:28 +08:00
Mike Iovine
9645814bdf
[chore] Clean up quickstart_advanced.py (#6021)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-07-21 15:00:59 -04:00
Linda
3efad2e58c
feat: nanobind bindings (#6185)
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-07-21 08:56:57 +01:00
nv-guomingz
b4c7e8c9a5
doc: remove cuda_graph_config: {} from doc since cuda_graph enabled b… (#6150)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-07-21 10:49:29 +08:00
bhsueh_NV
2e14c8f443
[Fix][Chore][Qwen3] fix bug of using fp4 on sm120 (#6065)
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-07-20 10:25:25 +08:00
wili
82d3587bb8
[refactor] Unify name of NGram speculative decoding (#5937)
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
2025-07-19 12:59:57 +08:00
Zhenhuan Chen
992b273045
[https://nvbugs/5387375] fix(scaffolding): fix scaffolding aime test in test_e2e (#6140)
Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>
2025-07-18 10:34:37 +08:00
Iman Tabrizian
b75e53ab69
Revert "feat: nanobind bindings (#5961)" (#6160)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-07-18 10:12:54 +08:00
Linda
5bff317abf
feat: nanobind bindings (#5961)
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-07-17 22:42:52 +08:00
Enwei Zhu
21efb50068
[TRTLLM-6406] feat: Enable guided decoding with overlap scheduler (#6000)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-07-17 17:46:10 +08:00
Chuang Zhu
44c70c88f9
chore:[BREAKING CHANGE] use cacheTransceiverConfig as knobs for disagg service (#5234)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-07-17 17:42:07 +08:00
Wanli Jiang
2d2b8bae32
feat: TRTLLM-5574 Add phi-4-multimodal pytorch-backend support (#5644)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-07-17 06:30:58 +08:00
Yan Chunwei
a02606a9e2
[TRTLLM-5530][BREAKING CHANGE] refactor: unify KvCacheConfig in LLM class for pytorch backend (#5752)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-07-16 16:42:59 +08:00
Yiqing Yan
e51c541617
chore: Bump version to 1.0.0rc4 (#6086)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-07-16 13:02:23 +08:00
Xiaodong (Vincent) Huang
0523f77b36
support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow run deep-ep on memory-con… (#5684)
Signed-off-by: Vincent Huang <vincenth@nvidia.com>
2025-07-15 18:34:21 +03:00
nv-guomingz
4e4d18826f
chore: [Breaking Change] Rename cuda_graph_config padding_enabled fie… (#6003)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-07-15 15:50:03 +09:00
jiahanc
24dfd4cd0b
Doc: Update llama-3.3-70B guide (#6028)
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
2025-07-15 11:37:26 +09:00
Yechan Kim
2320f12321
doc: update EXAONE 4.0 news (#6034)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-07-15 10:26:51 +09:00
Yechan Kim
63139fdcff
feat: EXAONE4.0 support (#5696)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-07-14 22:28:10 +09:00
Zhanrui Sun
3a0ef73414
infra: [TRTLLM-6242] install cuda-toolkit to fix sanity check (#5709)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-07-14 18:52:13 +09:00
Yan Chunwei
9c673e9707
[TRTLLM-6160] chore: add sampling examples for pytorch (#5951)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-07-14 15:28:32 +09:00
Yan Chunwei
c30eead09f
[TRTLLM-6164][TRTLLM-6165] chore: add runtime example for pytorch (#5956)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-07-14 14:09:39 +08:00
Xianjie Qiao
c7ffadf692
Fix errors in wide-ep scripts (#5992)
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
2025-07-14 14:07:27 +09:00
Shi Xiaowei
49359574c1
[TRTLLM-5673] Doc: ensure the disagg doc is up to date (#5938) 2025-07-11 17:39:05 +09:00
William Tambellini
fbb4cc7379
[TRTLLM-4770][feat] Enhance cpp executor cmake to listen to ENABLE_MU… (#5104)
...LTI_DEVICE

Signed-off-by: William Tambellini <wtambellini@sdl.com>
2025-07-11 10:59:44 +08:00
Iman Tabrizian
c32c9e2fad
doc: Add instructions for running gemma in disaggregated serving (#5922)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-07-10 10:21:19 -07:00
wili
2e3cf42e03
[refactor] Simplification of Speculative decoding configs (#5639)
Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>
2025-07-10 11:37:30 -04:00
Yan Chunwei
07f6da763d
[TRTLLM-5530] chore: rename LLM.autotuner_enabled to enable_autotuner (#5876)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-07-10 11:31:35 +08:00
Erin
e277766f0d
chores: merge examples for v1.0 doc (#5736)
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-07-08 21:00:42 -07:00
jiahanc
c24eb67054
Doc: fix link in llama4 Maverick example (#5864)
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
2025-07-09 11:09:58 +09:00
jiahanc
607bf4c395
Doc: Add llama4 Maverick eagle3 and max-throughput and low_latency benchmark guide (#5810)
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
2025-07-09 10:10:02 +09:00
Yan Chunwei
e50d95c40d
chore [TRTLLM-6161]: add LLM speculative decoding example (#5706)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-07-09 07:33:11 +08:00
Yiqing Yan
5203a0f6df
chore: bump version to 1.0.0rc3 (#5819)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-07-08 16:04:40 +09:00
Zhenhuan Chen
dee6644ed9
feat(scaffolding): add streaming scaffolding_llm.generate_async support (#5345)
Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>
2025-07-08 15:08:40 +09:00
nv-guomingz
0be41b6524
Revert "chore: [Breaking Change] Rename cuda_graph_config padding_enabled fie…" (#5818) 2025-07-08 13:15:30 +09:00
nv-guomingz
5a8173c121
chore: [Breaking Change] Rename cuda_graph_config padding_enabled fie… (#5795)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-07-08 08:52:36 +08:00
DylanChen-NV
5ca2b9bb15
[TRTLLM-5812][feat] support FP8 row-wise dense GEMM in torch flow (#5615)
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
2025-07-07 18:04:57 +08:00
bhsueh_NV
85e934a7fe
[Doc] update the document of qwen3 and cuda_graph usage (#5703)
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-07-07 09:44:25 +08:00
Xianjie Qiao
b1976c2add
Add wide-ep benchmarking scripts (#5760)
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
Signed-off-by: Xianjie Qiao <5410381+qiaoxj07@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-07-05 19:29:39 +08:00
Stefan Niebler
d1112aac37
[TRTLLM-3442] feat: added beam search support to the PyTorch Workflow (#5333)
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
2025-07-05 01:35:13 +09:00
nv-guomingz
c434147366
chore: update doc by replacing use_cuda_graph with cuda_graph_config (#5680)
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
2025-07-04 15:39:15 +09:00
Linda
94f0252b46 Doc: Update invalid hugging face URLs (#5683)
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-07-04 13:14:13 +08:00