Commit Graph

360 Commits

Author SHA1 Message Date
tburt-nv
b331d62f98
add sqlite to rocky container (#3114)
Signed-off-by: Tyler Burt <195370667+tburt-nv@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-04-10 13:30:24 +08:00
yuxianq
16c8f39fc5
feat: Support TLLM_OVERRIDE_LAYER_NUM and TLLM_TRACE_MODEL_FORWARD for debugging (#3417)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-10 13:18:30 +08:00
hlu1
fbcf954d9c
[MLA] Deallocate tensors after use (#3286)
Signed-off-by: Hao Lu <haolu@nvidia.com>
2025-04-09 21:36:07 -07:00
brb-nv
c59abae436
feat: Add Gemma3 text-only model support (#3247)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-04-10 12:34:58 +08:00
QI JUN
b5473f7eca
waive llama3.1 8B test cases with pipeline parallelism (#3433)
* waive llama3.1 8B test cases with pipeline parallelism

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* update

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

---------

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-04-10 11:07:58 +08:00
Frank
9307ff95ae
fix: Add nested aliases for Llama 4 (#3381)
* Add nested aliases for Llama 4

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Fix missed alias.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

---------

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Co-authored-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
2025-04-10 10:18:53 +08:00
peaceh-nv
215fb20567
chore: split GptExecutor tests out of gpt tests to reduce single test time (#3412)
Signed-off-by: peaceh <103117813+peaceh-nv@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-04-10 09:08:15 +08:00
tburt-nv
8d164f40d7
update allowlist (#3428)
Signed-off-by: Tyler Burt <195370667+tburt-nv@users.noreply.github.com>
2025-04-10 06:41:40 +08:00
Yechan Kim
943218b54a
feat: Add Qwen2.5-VL and refactor Qwen2-VL (#3156)
* feat: Add Qwen2.5-VL and refactor Qwen2-VL

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* fix yapf and codespell

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* add test

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* fix test_e2e

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* generalize get_rope_index

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* fix qwen2.5-vl on README

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* fix test

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* fix image test

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

---------

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-04-10 04:09:03 +08:00
Maximiliano Levi
996696203f
fix: #3137 speculative decoding and multimodal input support (#3276)
* fix: broadcast embeddings input when using speculative decoding

Signed-off-by: Maximiliano Levi <maxilevi77@gmail.com>

* fix: use shape tensor instead of tuple

Signed-off-by: Maximiliano Levi <maxilevi77@gmail.com>

* fix: comment

Signed-off-by: Maximiliano Levi <maxilevi77@gmail.com>

---------

Signed-off-by: Maximiliano Levi <maxilevi77@gmail.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-04-09 23:40:19 +08:00
danielafrimi
47f5cf6c0d
lora_tests (#3201)
LoRA tests and layers

Signed-off-by: Ubuntu <dafrimi@nvidia.com>
Co-authored-by: Ubuntu <dafrimi@nvidia.com>
2025-04-09 18:06:52 +03:00
WeiHaocheng
6eee15900e
feat: Enhance the integrated robustness of scaffolding with __init__.py #3305 (#3312)
Signed-off-by: fredw (generated by with_the_same_user script) <20514172+WeiHaocheng@users.noreply.github.com>
2025-04-09 21:13:47 +08:00
石晓伟
c069abc7d8
Update gh pages build script (#3405)
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
2025-04-09 19:58:38 +08:00
Gabriel Wu
4d78f51608
fix: remove DeepGEMM line info (#3411)
Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>
2025-04-09 18:01:02 +08:00
wili
6f1b2cdb83
Doc: update steps of using Draft-Target-Model (DTM) in the documents. (#3366)
Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com>
2025-04-09 17:35:01 +08:00
QI JUN
d0671494cd
chore: fix wheel version <= 0.45.1 (#3391)
* fix wheel version to 0.45.1

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* relax version

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

---------

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-04-09 12:31:55 +08:00
sugunav14
64abb01a36
Fix failing DSV3 unit tests (#3385)
* Skipping DSV3 module patch unit tests

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>

* update tested

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>

* Fixed failing unit test

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>

---------

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
2025-04-09 11:57:05 +08:00
tburt-nv
3a8443f1e1
extend allowlist (#3379)
Signed-off-by: Tyler Burt <195370667+tburt-nv@users.noreply.github.com>
2025-04-09 11:10:42 +08:00
Iman Tabrizian
8401722245
test: Add single gpu disaggregated tests (#3295)
* test: Add single gpu disaggregated tests

Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>

* Add deepseek with overlap tests

Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>

* Use updated prompt

Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>

* Move test to disaggregated folder

Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>

---------

Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
2025-04-09 09:34:45 +08:00
Tracin
2a2b7bfc66
Fix missing bias add for FP4Linear. (#3361)
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-04-09 09:17:54 +08:00
Mike Iovine
5bdf997963
Add Llama 4 (#3302)
Signed-off-by: Mike Iovine <miovine@nvidia.com>
2025-04-09 03:35:21 +08:00
yuxianq
7225bd8b91
chore: Refine attention backend interface. (#3271)
Refine attention backend interface.

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-09 02:34:53 +08:00
Zhanrui Sun
7199588796
infra: [TRTLLM-4450] Support more files for pytorch only mode (#3365)
* infra: [TRTLLM-4450] Support more files for pytorch only mode

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>

* Test pytorch only mode

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>

* Revert "Test pytorch only mode"

This reverts commit b32f54d7858bd2432251734bc7b31669147ed94b.

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>

* Fix review

Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>

---------

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-04-09 01:39:04 +08:00
wili
54ad95eaa8
Feat: Variable-Beam-Width-Search (VBWS) part3 (#3338)
* feat/Variable-Beam-Width-Search-Part3, v1.0

Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com>

* feat/Variable-Beam-Width-Search-Part3, v1.1

Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com>

* feat/Variable-Beam-Width-Search-Part3, v1.2

Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com>

---------

Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@user.noreply.github.com>
2025-04-08 23:51:27 +08:00
sugunav14
84fc07b011
feat: [TRTLLM-3510] DeepseekV3 support in AutoDeploy (#3281)
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
2025-04-08 21:47:57 +08:00
pcastonguay
02f446a9ff
chore: Adding DS V3-lite tests with overlap + cuda graph (#3342)
* chore: Adding DS V3-lite tests with overlap + cuda graph

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing pre-commit

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

---------

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-04-08 09:36:09 -04:00
Zhanrui Sun
63b0194c50
chore: bump version to 0.19.0.dev2025041500 (#3360)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-04-08 20:45:27 +08:00
Void
316e5c3be3
feat: fix and improve allreduce and fusion kernels (#3064)
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-04-08 19:33:52 +08:00
yuxianq
7b03350527
Add thread leak check and fix thread/memory leak issues. (#3270)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-08 19:03:18 +08:00
liji-nv
dca6397d1e
feat: Introduce UB allocator for pytorch flow (#3257)
* Instead of allocating UserBuffers at the beginning of runtime, UB buffers
  are now managed by a global allocator. The allocator dynamically
  assigns a free UB buffer or allocates a new buffer for a torch tensor, which
  makes UserBuffers easier to use.

* In the common use case, the UserBuffers are allocated correctly during the
  warm-up stage, so there is no dynamic allocation during inference.

* The UB fusion pattern is rewritten using the new UB allocator. It contains
  the following passes:

1. Fuse quant with allreduce, replace it with the UB implementation, and insert a
   copy_to_userbuffers. The normal allreduce does not yet
   support FP8 quant, so this needs to be done in the UB pass.
2. Convert all supported allreduce ops to UB and insert copy_to_userbuffers.
3. Fuse the op before the allreduce with the copy_to_userbuffers, so the op
   writes directly to the userbuffer.
4. Remove the userbuffers finalize if the output is connected to another UB
   allreduce.

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-04-08 18:39:49 +08:00
Zhanrui Sun
c692474b59
infra: Fix bot help error when " in bot command (#3314)
* Fix bot help error when " in bot command

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>

* Delete a.txt

Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>

---------

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-04-08 18:16:05 +08:00
Chuang Zhu
cdb0906be4
disagg test single h100 (#3353)
2025-04-08 17:45:35 +08:00
amirkl94
e04f6a1b9b
fix: Fix p-tuning test bug (#3326)
* fix: Fix p-tuning test bug

* A change in the vocab_size calculation for T5Tokenizer,
introduced in transformers version 4.34, caused addition of incorrect vtokens for ptuning.
In general, instead of adding tokens which are outside the vocabulary, tokens inside the vocabulary were added.

Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com>
2025-04-08 17:14:00 +08:00
Yan Chunwei
deb876ecdb
clean up trtllm-llmapi-launch logs (#3358)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-04-08 16:00:59 +08:00
Enwei Zhu
8ee019f8c4
test: Accuracy test improvement (Part 3.4): Move LLaMA tests (#3350)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-04-08 15:07:57 +08:00
Pengyun Lin
60e02a3684
Use llm.tokenizer in OpenAIServer (#3199)
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
Co-authored-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
2025-04-08 14:55:02 +08:00
Yukun He
c678774c99
feat: Apply the new torch-flow compatible AutoTuner to both Fused MoE and NVFP4 Linear operators. (#3151)
* Several optimizations and fixings on the Autotuner.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Apply the new Python side Autotuner on current linear for nvFP4 data type.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Apply the new Python side Autotuner on MoE op
* Remove routers from cache key to improve inference perf
* Prevent unnecessary code profiling. Use the do_preparation keyword to select which part should be executed before evaluating any tactic.
* Remove try-catch inside moe profiling process.
* Move default tactic -1 to 0 transforms in cpp runner.
* Revise relevant tests.
* Predefine the bucketizing strategy for fused_moe

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Add specific_profile support for AutoTuner to bypass the standard cache search process for perf optimization
* Add specific_profile for moe
* Add specific profile for linear

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Fixing and revising according to reviewer's suggestions.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Use lru_cache for inference perf optimization.
* Revert gen_custom_cache_key feature

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Replace runner with runner id to achieve a serializable cache.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Code clean up and minor fixings.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Move all tunable runners and custom ops into torch_custom_ops.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Treat min_latency_mode as an independent dynamic tensor. Modify get_valid_tactics to support it.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

---------

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-04-08 14:28:36 +08:00
Gabriel Wu
f1655afb0d
feat: enable DeepGEMM by default (#3341)
Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>
2025-04-08 13:58:57 +08:00
Fanrong Li
62e0876e39
Waive unittest/trt/model/test_mamba.py::TestMamba::test_loaders_mamba_130m_hf_from_checkpoint. Will fix it later. (#3356)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-04-07 22:36:35 -07:00
MinaHuai
31422e7e46
add tp=2 ci test for vision encoder (#3319)
Signed-off-by: mhuai <mhuai@nvidia.com>
2025-04-07 21:46:08 -07:00
Gabriel Wu
42c8574e93
fix: revert extra cmake var (#3351)
Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-04-08 11:57:16 +08:00
Chuang Zhu
1c88af1378
feat: use cudaMalloc to allocate kvCache (#3303)
2025-04-08 10:59:14 +08:00
Kaiyu Xie
0a4e1d5a55
breaking change: perf: Make ipc_periodically the default responses_handler (#3102)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-04-08 10:36:39 +08:00
pcastonguay
add5e5cd93
feat: Add option to run disaggregated serving without ctx servers,… (#3243)
* feat: Add option to run disaggregated serving without ctx servers, to benchmark gen only

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing comment in sanity check

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

---------

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-04-07 21:56:03 -04:00
Void
efe2ecfb37
fix: runtime error in test_deepseek_allreduce.py (#3226)
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-04-08 09:19:47 +08:00
Ivan Sorokin
d40fce474a
fix: redrafter sampling (#3278)
* Fix redrafter sampling

Signed-off-by: Ivan Sorokin <isorokin@nvidia.com>

* Rename redrafter beam search var

Signed-off-by: Ivan Sorokin <isorokin@nvidia.com>

* Remove _beam_search_candidates_v0

Signed-off-by: Ivan Sorokin <isorokin@nvidia.com>

* Remove unused import

Signed-off-by: Ivan Sorokin <isorokin@nvidia.com>

---------

Signed-off-by: Ivan Sorokin <isorokin@nvidia.com>
2025-04-08 07:49:32 +08:00
Enwei Zhu
ba019a43d6
test: Accuracy test improvement (Part 3.3): Move DeepSeek tests (#3260)
add skip

fix

fix

update

update test list

fixqa list

move bf16 to postmerge

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-04-08 07:19:04 +08:00
Chuang Zhu
f3237e52ed
update readme for disaggregated (#3323)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-04-07 21:29:15 +08:00
Gabriel Wu
376731013d
feat: use NVRTC for DeepGEMM JIT compilation (#3239)
* feat: use NVRTC for DeepGEMM JIT compilation

Signed-off-by: Zihua Wu 

* fix: add license

Signed-off-by: Zihua Wu

* feat: store NVRTC JIT results in memory by default

Signed-off-by: Zihua Wu

* feat: refinement

Signed-off-by: Zihua Wu

* feat: refinement

Signed-off-by: Zihua Wu

* test: set timeout to 7200

Signed-off-by: Zihua Wu

---------

Signed-off-by: Zihua Wu
2025-04-07 20:29:23 +08:00
YueWeng
aab6214801
test: fix conflicting test names (#3316)
Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
2025-04-07 20:10:01 +08:00