Commit Graph

926 Commits

Author SHA1 Message Date
Zhanrui Sun
17d48e0009
infra: [TRTLLM-5072] Add SBSA release images (#4231)
* infra: [TRTLLM-5072] Add SBSA release images and move SBSA to blossom

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>

* Fix review

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>

* Easy to review

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>

* Fix BUILD_JOBS

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>

* Use gitlab mirror for nixl and ucx

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>

* Update BuildDockerImage.groovy

Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>

---------

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-05-18 00:00:06 +08:00
Venky
fb663b637a
Extend the Llama-Nemotron-Nano-8B perf-integration-tests (cpp) (#4195)
* add ll-nm-nano tests that map to nim requirements

Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>

* prune some pytorch cases (fp8)

Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>

* removing pyt backend test changes

- Validating the pytorch tests with the isl/osl/conc/quant settings (the same ones used for the cpp backend) produces hangs that need further debugging.
- Removing them so that they do not block this PR.
- Seeing

Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>

---------

Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
2025-05-17 22:46:21 +08:00
Yuxian Qiu
cc1bba1686
test: Waive tests for nvbugs/5286795. (#4409)
* Waive tests for nvbugs/5286795.

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>

* Apply suggestions from code review

Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>

---------

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
2025-05-17 19:41:05 +08:00
Jinyang Yuan
b618e1f55b
perf: Eliminate the need for attention DP padding when possible (#3439)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
Co-authored-by: raccoonliukai <raccoonliu@tencent.com>
2025-05-17 13:30:55 +08:00
hlu1
befb93cbff
[Deepseek] Add accuracy test references for fp8 kvcache (#4374)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-05-17 11:23:00 +08:00
Lucas Liebenwein
7c85890ec7
[AutoDeploy] eager pattern matcher new pattern (#4370)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-05-16 12:35:44 -04:00
Lucas Liebenwein
0e872ef0b0
[AutoDeploy] fix: proper process group clean up (#4373)
[AutoDeploy] proper process group clean up

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-05-16 12:35:25 -04:00
Netanel Haber
9cd8148f28
API Breaking Change + Readability: "decoder"->"sampler" (#4121)
* *decoder*->*sampler*; new_tensors_device: dict[str, torch.Tensor] -> device: SampleStateTensors

* **Breaking Change**, as it changes public interfaces, main changes:
* PyTorchConfig [consumed via LLM(pytorch_backend_config)]: Configuration parameters mixed_decoder and enable_trtllm_decoder -> sampler.
* Command-line argument --enable_trtllm_decoder becomes --enable_trtllm_sampler in examples/pytorch/quickstart_advanced.py.

---------

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-05-16 23:52:25 +08:00
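
For scripts that still pass the old flag, the rename described above can be applied mechanically. The following is only a sketch: `migrate_argv` and `RENAMED_FLAGS` are hypothetical names that are not part of the repository, and the only mapping included is the one the commit message states explicitly.

```python
# Old flag -> new flag, taken from the commit message above.
RENAMED_FLAGS = {"--enable_trtllm_decoder": "--enable_trtllm_sampler"}


def migrate_argv(argv):
    """Hypothetical helper: rewrite renamed flags, leave everything else alone."""
    return [RENAMED_FLAGS.get(arg, arg) for arg in argv]


old_cmd = ["quickstart_advanced.py", "--enable_trtllm_decoder"]
print(migrate_argv(old_cmd))
```

The configuration parameters mentioned in the same commit are not covered here, because the message does not spell out their new names.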
ixlmar
13b61405e8
fix: improve PyExecutor resource allocations (#4299)
chore: restore symmetry of worker start/shutdown
chore: fix return type of cal_max_tokens
chore: type some more return values
fix: free resources before re-claiming

Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-05-16 16:28:10 +01:00
Tracin
7b19acfab1
fix: Fix chat template kwargs bug. (#4387)
* Fix chat template kwargs bug.

Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>

* Fix chat template kwargs bug.

Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>

* Fix chat template kwargs bug.

Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>

---------

Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
2025-05-16 23:07:46 +08:00
Lucas Liebenwein
8e4320ede5
[AutoDeploy] configurable cache resize (#4372)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-05-16 10:07:09 -04:00
Robin Kobus
4e370a509a
refactor: Copy sequence lengths once in decoder setup (#4102)
* refactor: Copy sequence lengths once in decoder setup

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update DecoderInputBuffers to remove duplicated buffers

- Renamed and reorganized buffer variables in decoderBuffers.h and decoderBuffers.cpp for better readability.
- Adjusted references in generateRequestOptions.cpp to align with the new buffer structure.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Move getEmbeddingBias to anonymous namespace

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Filter context requests

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: GenerateRequestOptions using more fine-grained functions

- Added a new method `createDecoderRequests` to encapsulate the logic for creating decoder requests from finished context requests.
- Updated the `operator()` method to utilize the new method, improving code clarity and maintainability.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update TRTLLMDecoder

- Updated the `generate_request_options` call.
- Updated the `make_decoding_batch_input_output` call.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Remove const where we modify input buffers

- Changed `DecoderInputBuffers` parameters from const references to non-const references in multiple functions to allow modifications.
- Updated related function calls to ensure compatibility with the new parameter types.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fixup! refactor: Copy sequence lengths once in decoder setup

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-16 22:03:55 +08:00
Fridah-nv
bce281d592
feat: [AutoDeploy] update rope matcher with minor variants (Deepseek) (#3638)
* add docstring to summarize current rope support

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* minor: replace call_method, adjust inserting order of cos_sin_cache calculation node

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* add unit test for triton rope and ds rope

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* update rope matcher to match DS RoPE, add custom op for reference, add unit test case

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* cache cos[pos_idx].unsqueeze and sin[pos_idxs].unsqueeze

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* minor doc update

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* separate pattern matching and optimization for explicit and complex rope + minor updates

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* clean rope impl in repo

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* replace fused_flattened_mla_with_cache's rope impl with torch_apply_rope_with_qk_interleaving, update unit test

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* minor

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* separate layout infer and transpose to a new transformation

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* update rope_with_explicit_freqs and rope_with_input_interleaved to expose unsqueeze_dim and support match_rope_layout, add unit tests

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* solve merge conflict in transform.py, need to fix optimize_rope with cuda graph capture

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* minor clean up after rebase

Signed-off-by: Ubuntu <201670829+Fridah-nv@users.noreply.github.com>

* fix pre-commit

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* support map to bnsd layout and infer unsqueeze_dim from op

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* fix issue where cos/sin differ across prompts in the same batch when mapping to flashinfer op

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* fix for unit test

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* fix custom op input/output node ordering issue for DeepSeek V3 rope

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* clean code

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* minor

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* move flattening of cos_sin_cache to the graph, update flashinfer op docstring and test

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

* debug transform unit test failure

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

---------

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Ubuntu <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
2025-05-16 09:55:32 -04:00
Kefeng-Duan
f5b6d453aa
doc: DS r1 min latency blog (#4386)
* add best perf practice on DSR1

Signed-off-by: Jun Yang <143764042+juney-nvidia@users.noreply.github.com>

* add ds-r1 min latency tech blog

Signed-off-by: Jun Yang <143764042+juney-nvidia@users.noreply.github.com>

* rm redundant doc

Signed-off-by: Jun Yang <143764042+juney-nvidia@users.noreply.github.com>

* refine table content

Signed-off-by: Jun Yang <143764042+juney-nvidia@users.noreply.github.com>

* refine table content

Signed-off-by: Jun Yang <143764042+juney-nvidia@users.noreply.github.com>

* relative path for images

Signed-off-by: Jun Yang <143764042+juney-nvidia@users.noreply.github.com>

* refine precommit

Signed-off-by: Jun Yang <143764042+juney-nvidia@users.noreply.github.com>

* pr4280 is merged

Signed-off-by: Jun Yang <143764042+juney-nvidia@users.noreply.github.com>

---------

Signed-off-by: Jun Yang <143764042+juney-nvidia@users.noreply.github.com>
2025-05-16 20:20:28 +08:00
liji-nv
fb437ed709
[CI] waive accuracy/test_cli_flow.py::TestTinyLlama1_1BChat::test_pp4 (#4397)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-05-16 20:18:07 +08:00
Nikita Korobov
fa3879629e
feat: TRT-LLM Gen integration for BMM and MoE refactoring (#4280)
- Adds BatchedGemm cubins and the respective call interface from TensorRT-LLM Generator.
- Refactors the TRT-LLM Gen MoE runner to call the BMM interface.
- The accuracy is verified for DeepSeek R1 FP4.

Signed-off-by: Nikita Korobov <nkorobov@nvidia.com>
2025-05-16 13:31:53 +02:00
Emma Qiao
27bdd0c82d
[TRTLLM-4886][infra]Try another timeout opt to exit test thread directly instead of gracefully (#4341)
* Try another timeout opt to kill test thread

Signed-off-by: qqiao <qqiao@nvidia.com>

* Return true when trying to delete a non-existent result file

Signed-off-by: qqiao <qqiao@nvidia.com>

* quick test for the result file

Signed-off-by: qqiao <qqiao@nvidia.com>

* Change back the global timeout setting

Signed-off-by: qqiao <qqiao@nvidia.com>

* Try to kill test in internal pytest

Signed-off-by: qqiao <qqiao@nvidia.com>

---------

Signed-off-by: qqiao <qqiao@nvidia.com>
2025-05-16 17:56:40 +08:00
NVJiangShao
a6f2a1e918
Fix test_fused_moe_w4afp8 (#4393)
Signed-off-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com>
2025-05-16 17:21:33 +08:00
Daniel Cámpora
df19430629
chore: Mass Integration 0.19 (#4255)
* fix: Fix/fused moe 0.19 (#3799)

* fix bug of stream init

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix bug

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

---------

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix: Add pre-download of checkpoint before benchmark. (#3772)

* Add pre-download of checkpoint before benchmark.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Add missing remote code flag.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Move from_pretrained to throughput benchmark.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Move download and use snapshot_download.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Removed trusted flag.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Fix benchmark command in iteration log test.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

---------

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* [https://nvbugspro.nvidia.com/bug/5241495][fix] CUDA Graph padding with overlap scheduler (#3839)

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fuse

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

---------

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* TRTLLM-4875 feat: Add version switcher to doc (#3871)

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>

* waive a test (#3897)

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* docs:fix https://nvbugs/5244616 by removing new invalid links. (#3939)

Signed-off-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>

* fix: remote mpi session abort (#3884)

* fix remote mpi session

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* fix

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

---------

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* skip fp8 gemm for pre-hopper (#3931)

Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>

* [https://nvbugspro.nvidia.com/bug/5247148][fix] Attention DP with overlap scheduler (#3975)

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* update multigpu list

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix namings

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

---------

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* Doc: Fix H200 DeepSeek R1 perf doc (#4006)

* fix doc

Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>

* update perf number

Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>

---------

Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>

* Fix the perf regression caused by insufficient cache warmup. (#4042)

Force tuning up to 8192 sequence length for NVFP4 linear op. Also, make this runtime-selectable with UB enabled.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* doc: Update 0.19.0 release notes (#3976)

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>

* Optimize the AutoTuner cache access code to reduce host code overhead. (#4060)

The NVFP4 Linear op is very sensitive to the host overhead.
This PR introduces customizable `find_nearest_profile` and `get_cache_key_specifc`, which allow users to override the default method for generating the cache key.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Update switcher (#4098)

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>

* doc: update release notes (#4108)

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>

* docs:update 0.19 doc. (#4120)

Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>

* docs:add torch flow supported model list. (#4129)

Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>

* doc: Release V0.19 Perf Overview Update (#4166)

Signed-off-by: zpatel <22306219+zbpatel@users.noreply.github.com>

* Fix readme of autodeploy.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update tensorrt_llm/_torch/pyexecutor/llm_request.py

Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>

* Revert mgmn worker node.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Change to disable_overlap_scheduler.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

---------

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
Signed-off-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: zpatel <22306219+zbpatel@users.noreply.github.com>
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>
Co-authored-by: bhsueh_NV <11360707+byshiue@users.noreply.github.com>
Co-authored-by: Frank <3429989+FrankD412@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Co-authored-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Co-authored-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Co-authored-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Co-authored-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Co-authored-by: Zac Patel <22306219+zbpatel@users.noreply.github.com>
2025-05-16 10:53:25 +02:00
ixlmar
f7ad49bb9b
chore: improve log-level setting UX (#4352)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-05-16 09:47:44 +01:00
HuiGao-NV
d5578b37fc
Change the method to calculate kv memory size in tests (#4332)
* Change the method to calculate kv memory size in tests
* Set a larger peak memory size for the llama case

Signed-off-by: Hui Gao <huig@nvidia.com>
2025-05-16 15:35:40 +08:00
Yuan Tong
f5ddb7ab4a
fix: support TensorRT 10.11+ in FindTensorRT.cmake (#4353)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-05-16 14:04:56 +08:00
xinhe-nv
500b43e90c
test: [CI] remove closed bugs (#4345)
update waive list

Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>
2025-05-16 13:47:42 +08:00
Barry Kang
0e14941b7f
[fix] Fixed incorrect mixed precision MoE conversion (#4351)
Fix for mixed precision MoE conversion

Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
2025-05-16 13:43:41 +08:00
Tracin
46c5a56444
Support dynamic per-tensor FP8 (#4250)
* Support dynamic per-tensor FP8

Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>

* Update test cases.

Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>

---------

Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
2025-05-16 13:33:58 +08:00
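
As general background, "dynamic per-tensor" scaling usually means deriving one scale for the whole tensor from its current absolute maximum at run time, rather than using a precomputed calibration value. The sketch below illustrates that idea only. It assumes the E4M3 variant of FP8 (largest finite magnitude 448), does not emulate FP8 rounding, and is not taken from this commit; all function names are hypothetical.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude in the FP8 E4M3 format


def dynamic_per_tensor_scale(values):
    """Return one scale for the whole tensor, derived from its current amax."""
    amax = max((abs(v) for v in values), default=0.0)
    # Guard against an all-zero tensor to avoid dividing by zero later.
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def scale_and_clamp(values):
    """Scale values into the FP8 range; actual FP8 rounding is not emulated."""
    scale = dynamic_per_tensor_scale(values)
    scaled = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]
    return scaled, scale


activations = [0.5, -2.0, 1.25]  # flattened toy tensor
scaled, scale = scale_and_clamp(activations)
dequantized = [q * scale for q in scaled]  # approximately the original values
```

Because the scale is recomputed from each tensor, the largest element always maps to the edge of the representable range, at the cost of an extra reduction per tensor.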
Stanley Sun
11aa50d1ea
test: add kv cache aware test cases to qa test list (#4257)
add kv cache_aware test cases

Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
2025-05-16 12:47:01 +08:00
WeiHaocheng
54d28718c7
feat: support benchmark on scaffolding (#3328) (#4286)
Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
2025-05-16 12:28:49 +08:00
Zhanrui Sun
23a63ef9c1
update README version (#4381)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-05-16 10:36:39 +08:00
QI JUN
c4cd403af9
[CI] waive test_chunked_prefill test cases (#4380)
waive test_chunked_prefill

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-05-16 10:27:20 +08:00
NVJiangShao
6cc3f2093a
Fix bias shape in weightOnlyGroupwiseQuantMatmulPlugin for TRT workflow (#4348)
Signed-off-by: Jiang Shao <91270701+StudyingShao@users.noreply.github.com>
Co-authored-by: AIDC-AI <AIDC-AIB@365fanyi.com>
2025-05-16 10:02:30 +08:00
yuxianq
a1daa22970
doc: Add docstring for Attention and MLA module. (#4354)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
2025-05-16 09:37:04 +08:00
QI JUN
13cdf98278
[CI] update multi-gpu test triggering file list (#4378)
update multi-gpu test triggering file list

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-05-16 09:05:44 +08:00
Suyog Gupta
b0f7522c82
[AutoDeploy]feat: Add an AutoDeploy compile backend that only calls torch.compile (#4240)
* add a torch-compile backend

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>

* readme changes

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>

* plumb torch-compile through build_and_run_ad.py

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>

* plumb torch-compile through build_and_run_ad.py

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>

* plumb torch-compile through build_and_run_ad.py

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>

* add torch-cudagraph backend

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>

* update readme

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>

* update readme

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>

* further enhanced compiler backends

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

* further enhance readme

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

* better specified defaults in simple_config.py

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

* fix typo in simple_config.py

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

* updated deepseek-v3 support

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

* revert accidental deletion in AD Readme

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

---------

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Co-authored-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-05-16 08:38:15 +08:00
rakib-hasan
25407249a5
[TRTLLM-5054][fix] Removing repeated loading of input processor (#4161)
removing repeated loading of input processor

Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
2025-05-16 08:04:58 +08:00
Lucas Liebenwein
4883121477
[AutoDeploy] fix: disable overlap scheduler until supported (#4365)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-05-15 16:19:30 -07:00
Yechan Kim
c6e2111f4e
feat: enhance trtllm serve multimodal (#3757)
* feat: enhance trtllm serve multimodal

1. made the load_image and load_video asynchronous
2. add image_encoded input support to be compatible with genai-perf
3. support text-only on multimodal models (currently, Qwen2-VL & Qwen2.5-VL)

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* add test

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* fix bandit

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* trimming utils

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* trimming for test

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* genai perf command fix

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* command fix

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* refactor chat_utils

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

* stress test genai-perf command

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

---------

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-05-15 16:16:31 -07:00
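
The commit does not show how the loaders were made asynchronous. One common pattern, sketched here under that caveat, is to push a blocking loader onto a worker thread so a server's event loop can keep handling other requests. `load_image_blocking`, `load_image_async`, and `load_all` are hypothetical stand-ins, not the repository's functions.

```python
import asyncio
import time


def load_image_blocking(source: str) -> bytes:
    """Hypothetical stand-in for a blocking image fetch/decode."""
    time.sleep(0.05)  # simulate I/O latency
    return f"pixels:{source}".encode()


async def load_image_async(source: str) -> bytes:
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(load_image_blocking, source)


async def load_all(sources: list[str]) -> list[bytes]:
    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(load_image_async(s) for s in sources))


if __name__ == "__main__":
    images = asyncio.run(load_all(["a.png", "b.png", "c.png"]))
    print([img.decode() for img in images])
```

With this shape, several images in one request are loaded concurrently instead of one after another.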
Iman Tabrizian
4c7191af67
Move Triton backend to TRT-LLM main (#3549)
* Move TRT-LLM backend repo to TRT-LLM repo

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>

* Address review comments

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>

* debug ci

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>

* Update triton backend

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>

* Fixes after update

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>

---------

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-05-16 07:15:23 +08:00
Erin
c44cf34373
fix: update checks that broke medusa tests when use_py_session=True (#4339)
fix check

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-05-15 15:47:28 -07:00
yuxianq
4f8afe4cc6
feat: [nvbugs/5261055][nvbugs/5170160] non-invasive pipeline parallelism (#4034)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-16 04:16:53 +08:00
Venky
5ebe32f06f
enh: Enable option in trtllm-bench build subcommand to avoid loading weights (#4142)
* expose load_format 

Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>

* yapf

Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>

---------

Signed-off-by: Venky <23023424+venkywonka@users.noreply.github.com>
Co-authored-by: Frank <3429989+FrankD412@users.noreply.github.com>
2025-05-16 03:50:53 +08:00
Venky
adb0839a33
test(perf): Add Phi-4-mini-instruct to perf tests (#4267)
* add phi-4-mini-instruct

Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>

* trim tests

Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>

---------

Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
2025-05-15 21:27:03 +08:00
yuxianq
0e87fcc228
refactor: use x is None instead of x == None. (#4244)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-15 20:00:04 +08:00
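
The motivation for this kind of refactor is that `==` dispatches to a user-definable `__eq__`, while `is` compares identity with the `None` singleton. A small illustration, using a hypothetical class that is not from the repository:

```python
class AlwaysEqual:
    """Hypothetical class whose __eq__ claims equality with everything."""

    def __eq__(self, other):
        return True


def is_missing_eq(x):
    # Equality check: dispatches to x.__eq__, which a class may override.
    return x == None  # noqa: E711


def is_missing_identity(x):
    # Identity check: true only for the None singleton itself.
    return x is None


obj = AlwaysEqual()
print(is_missing_eq(obj))         # True, although obj is not None
print(is_missing_identity(obj))   # False
print(is_missing_identity(None))  # True
```

The same concern applies to array-like objects whose `__eq__` returns an elementwise result rather than a single boolean.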
Yanchao Lu
5ce1102a02
Revert "[test] add qa test mentioned in docs" (#4355)
Revert "[test] add qa test mentioned in docs (#4248)"

This reverts commit b0ce1371ee.
2025-05-15 18:47:30 +08:00
Stanley Sun
9d3e05486b
test: add qa test list for rtx5090 and rtx_pro_6000 (#4254)
* add test list for rtx5090 and rtx_pro_6000

Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>

* add 2gpu llama70b test cases

Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>

* remove duplicate and invalid test cases

Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>

* add 2gpus test cases

Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>

---------

Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
2025-05-15 17:57:31 +08:00
zhhuang-nv
d6b741ddfe
[fix] test_no_kv_cache_reuse for overlap_scheduler (#4350)
fix test_no_kv_cache_reuse for overlap_scheduler

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
2025-05-15 16:43:53 +08:00
Yuan Tong
593f65ff6a
fix: better method to help torch find nvtx3 (#4110)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-05-15 16:42:30 +08:00
ixlmar
4ee82fc0fd
chore: reduce code duplication (#4297)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-05-15 09:25:37 +01:00
Zongfei Jing
f0ca60a95d
Add allreduce and rmsnorm fusion for qwen3 (#4304)
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-05-15 16:22:11 +08:00
xinhe-nv
14bfb5e0d6
test: FIX test_ptp_quickstart_advanced_deepseek_v3_2nodes_8gpus (#4283)
* update test_ptp_quickstart_advanced_deepseek_v3_2nodes_8gpus

Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>

* skip llava-v1.6-mistral-7b-hf-vision-trtllm on L40S

Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>

---------

Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-05-15 15:57:44 +08:00
zhhuang-nv
97bc680cd8
feat: support kv cache reuse for MLA (#3571)
* support kv cache reuse for MLA

load compressed_kv and k_pe and do up-projection
use 192/128 head size MLA context kernel
support Blackwell and Hopper now

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* add CI test

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix: set k_pe head_num to 1 for kernel 2 and kernel 2V2

Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>

* resolve comments

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* use GPTJ style RoPE for MLA

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix rebase error and some docs

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix kv_lens

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* tiny fix

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix torch compile

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix: use normal device memory instead of pinned memory for unit test

Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>

* fix L0 tests

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix torch compile after rebase

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* resolve comments

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* resolve comments again

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

---------

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
Signed-off-by: zhhuang-nv <145532724+zhhuang-nv@users.noreply.github.com>
Co-authored-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
2025-05-15 15:22:21 +08:00