Commit Graph

125 Commits

Author SHA1 Message Date
Daniel Cámpora
cc3f8e6431
fix: Fix trtllm sampler beam width bug (#4507)
* Fix TRTLLMSampler.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Added type hint.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

---------

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-05-21 14:21:39 +08:00
liji-nv
58e405624a
[https://nvbugs/5123103][fix] Fix torch compile for DeepSeekV3 (#3952)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-05-19 22:12:25 +08:00
Yuxian Qiu
cf6cd940e5
feat: Add pp support for hybrid attn/mamba model (#4358)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-19 14:47:45 +08:00
shaharmor98
27afcb9928
add changes for fp8, nemotron-nas, API (#4180)
Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>
2025-05-18 23:27:25 +08:00
Netanel Haber
9cd8148f28
API Breaking Change + Readability: "decoder"->"sampler" (#4121)
* *decoder*->*sampler*; new_tensors_device: dict[str, torch.Tensor] -> device: SampleStateTensors

* **Breaking Change**, as it changes public interfaces, main changes:
* PyTorchConfig [consumed via LLM(pytorch_backend_config)]: Configuration parameters mixed_decoder and enable_trtllm_decoder -> sampler.
* Command-line argument --enable_trtllm_decoder becomes --enable_trtllm_sampler in examples/pytorch/quickstart_advanced.py.

---------

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-05-16 23:52:25 +08:00
ixlmar
13b61405e8
fix: improve PyExecutor resource allocations (#4299)
chore: restore symmetry of worker start/shutdown
chore: fix return type of cal_max_tokens
chore: type some more return values
fix: free resources before re-claiming

Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-05-16 16:28:10 +01:00
Robin Kobus
4e370a509a
refactor: Copy sequence lengths once in decoder setup (#4102)
* refactor: Copy sequence lengths once in decoder setup

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update DecoderInputBuffers to remove duplicated buffers

- Renamed and reorganized buffer variables in decoderBuffers.h and decoderBuffers.cpp for better readability.
- Adjusted references in generateRequestOptions.cpp to align with the new buffer structure.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Move getEmbeddingBias to anonymous namespace

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Filter context requests

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: GenerateRequestOptions using more fine-grained functions

- Added a new method `createDecoderRequests` to encapsulate the logic for creating decoder requests from finished context requests.
- Updated the `operator()` method to utilize the new method, improving code clarity and maintainability.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update TRTLLMDecoder

- Updated the `generate_request_options` call.
- Updated the `make_decoding_batch_input_output` call.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Remove const where we modify input buffers

- Changed `DecoderInputBuffers` parameters from const references to non-const references in multiple functions to allow modifications.
- Updated related function calls to ensure compatibility with the new parameter types.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fixup! refactor: Copy sequence lengths once in decoder setup

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-16 22:03:55 +08:00
Daniel Cámpora
df19430629
chore: Mass Integration 0.19 (#4255)
* fix: Fix/fused moe 0.19 (#3799)

* fix bug of stream init

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix bug

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

---------

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix: Add pre-download of checkpoint before benchmark. (#3772)

* Add pre-download of checkpoint before benchmark.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Add missing remote code flag.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Move from_pretrained to throughput benchmark.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Move download and use snapshot_download.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Removed trusted flag.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Fix benchmark command in iteration log test.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

---------

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* [https://nvbugspro.nvidia.com/bug/5241495][fix] CUDA Graph padding with overlap scheduler (#3839)

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fuse

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

---------

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* TRTLLM-4875 feat: Add version switcher to doc (#3871)

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>

* waive a test (#3897)

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* docs:fix https://nvbugs/5244616 by removing new invalid links. (#3939)

Signed-off-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>

* fix: remote mpi session abort (#3884)

* fix remote mpi session

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* fix

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

---------

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* skip fp8 gemm for pre-hopper (#3931)

Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>

* [https://nvbugspro.nvidia.com/bug/5247148][fix] Attention DP with overlap scheduler (#3975)

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* update multigpu list

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix namings

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

---------

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* Doc: Fix H200 DeepSeek R1 perf doc (#4006)

* fix doc

Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>

* update perf number

Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>

---------

Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>

* Fix the perf regression caused by insufficient cache warmup. (#4042)

Force tuning up to 8192 sequence length for NVFP4 linear op. Also, make this runtime-selectable with UB enabled.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* doc: Update 0.19.0 release notes (#3976)

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>

* Optimize the AutoTuner cache access code to reduce host code overhead. (#4060)

The NVFP4 Linear op is very sensitive to the host overhead.
This PR introduces customizable `find_nearest_profile` and `get_cache_key_specifc`, which allow users to override the default method for generating the cache key.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Update switcher (#4098)

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>

* doc: update release notes (#4108)

Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>

* docs:update 0.19 doc. (#4120)

Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>

* docs:add torch flow supported model list. (#4129)

Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>

* doc: Release V0.19 Perf Overview Update (#4166)

Signed-off-by: zpatel <22306219+zbpatel@users.noreply.github.com>

* Fix readme of autodeploy.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update tensorrt_llm/_torch/pyexecutor/llm_request.py

Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>

* Revert mgmn worker node.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Change to disable_overlap_scheduler.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

---------

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
Signed-off-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: zpatel <22306219+zbpatel@users.noreply.github.com>
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>
Co-authored-by: bhsueh_NV <11360707+byshiue@users.noreply.github.com>
Co-authored-by: Frank <3429989+FrankD412@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Co-authored-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
Co-authored-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Co-authored-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Co-authored-by: Yukun He <23156053+hyukn@users.noreply.github.com>
Co-authored-by: Zac Patel <22306219+zbpatel@users.noreply.github.com>
2025-05-16 10:53:25 +02:00
yuxianq
4f8afe4cc6
feat: [nvbugs/5261055][nvbugs/5170160] non-invasive pipeline parallelism (#4034)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-16 04:16:53 +08:00
yuxianq
0e87fcc228
refactor: use x is None instead of x == None. (#4244)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-15 20:00:04 +08:00
ixlmar
4ee82fc0fd
chore: reduce code duplication (#4297)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-05-15 09:25:37 +01:00
zhhuang-nv
97bc680cd8
feat: support kv cache reuse for MLA (#3571)
* support kv cache reuse for MLA

load compressed_kv and k_pe and do up-projection
use 192/128 head size MLA context kernel
support Blackwell and Hopper now

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* add CI test

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix: set k_pe head_num to 1 for kernel 2 and kernel 2V2

Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>

* resolve comments

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* use GPTJ style RoPE for MLA

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix rebase error and some docs

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix kv_lens

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* tiny fix

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix torch compile

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix: use normal device memory instead of pinned memory for unit test

Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>

* fix L0 tests

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* fix torch compile after rebase

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* resolve comments

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

* resolve comments again

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>

---------

Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>
Signed-off-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
Signed-off-by: zhhuang-nv <145532724+zhhuang-nv@users.noreply.github.com>
Co-authored-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>
2025-05-15 15:22:21 +08:00
Kaiyu Xie
b4e5df0ee0
Breaking change: perf: Enable scheduling overlap by default (#4174)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-05-15 14:27:36 +08:00
Mike Iovine
f9adac3dea
[feat] Enable chunked context for flashinfer (#4132)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-05-15 10:59:38 +08:00
HuiGao-NV
f4059c6e2e
Add test case for kv memory estimation (#4158)
* Add test case for kv memory estimation
* Dump running log into file and parse kv cache memory size from file
* Set bigger peak memory size for mixed percision case and test_ptp_quickstart_advanced_eagle3 case
* Revert change to usage of fraction
* use context manager to guard temp files

Signed-off-by: Hui Gao <huig@nvidia.com>
2025-05-14 18:39:25 +08:00
Anurag Mukkara
b0a03a289c
fix: Merge PP overlap and non-overlap executor loop (#3878)
Signed-off-by: Anurag Mukkara <134339030+amukkara@users.noreply.github.com>
2025-05-14 06:04:36 +08:00
Perkz Zheng
e8d7834c50
fix: [https://nvbugspro.nvidia.com/bug/5238626] illegal memory address when running llama 4 with cuda graph enabled (#4101)
Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-05-13 14:58:54 +08:00
nvpohanh
13c8e5a8a8
feat: Prefetch safetensors files before loading them (#4140)
Prefetching safetensors files so that they are stored in the system file
cache. This significantly speeds up the model weight loading for the
very first run after entering the docker container.

This is beneficial because model weight loading is done layer-by-layer,
which means reading from the safetensors chunk-by-chunk, and that cannot
utilize the internet bandwidth very well, assuming that these files are
stored in some network drives. Instead, loading the whole files in bulk
can achieve higher internet bandwidth utilization.

When running with world_size>1, all ranks collaboratedly prefetch these
files.

In theory, we should add heuristics to decide whether to prefetch the
files or not, but that is beyond the scope of this commit.

For example, when the CPU memory is small, doing prefetching may result
in file cache thrashing, resulting in slower weight loading time.

Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
2025-05-13 13:35:30 +08:00
pcastonguay
9643be5f20
[TRTLLM-5050][feat] Enable per-request stats with PyT backend (#4156)
* feat: Add per-request stats support with PyT backend

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Adding unit test

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing stats unit test

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing test with overlap

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

---------

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-05-12 21:35:15 -04:00
Erin
4becf32360
fix: reshape token_ids for lp in torch backend (#4239)
reshape token_ids

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-05-13 08:43:47 +08:00
Yixin Dong
c90ebadd84
feat: Support the Structural Tag in guided decoding (#4066)
* finish

Signed-off-by: Ubospica <ubospica@gmail.com>

* update

Signed-off-by: Ubospica <ubospica@gmail.com>

* update

Signed-off-by: Ubospica <ubospica@gmail.com>

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* exc overlap scheduler

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* add test

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix api ref

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

---------

Signed-off-by: Ubospica <ubospica@gmail.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-05-12 17:24:50 +08:00
Chuang Zhu
1333f4f5d5
remove cache_transceiver_prealloc_size (#4153)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-05-12 11:53:53 +08:00
Fanrong Li
0cf0fce5d3
[fix] Fix add_dummy_requests for spec decoding cases (#4084)
* fix add_dummy_requests.

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

* add max_seq_len to eagle3 test and fix add_dummy_requests.

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

* fix prompt_len in add_dummy_requests.

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

* add prepare_resource condition in add_dummy_requests.

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

* add some description of token_nums to add_dummy_requests and fix token_nums in torch compile warmup.

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

* fix available_tokens.

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>

---------

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-05-09 16:52:51 +08:00
Erin
cdf5ae1547
fix: change pp broadcast pattern for LPs (#4130)
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-05-08 20:07:13 -07:00
Yi Zhang
91bf5e6a8e
[TRTLLM-3105][feat] Add Piecewise CUDA Graph Support (#3804)
Add Piecewise CUDA Graph Support

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-05-09 11:04:01 +08:00
Mike Iovine
9afe510367
[fix] Fix llama4 + eagle3 (#3998)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-05-08 19:20:27 -04:00
shaharmor98
7d94c9561f
feat: support multi lora adapters and TP (#3885)
* support multi lora, tp

Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>
2025-05-08 23:45:45 +08:00
Yuan Tong
5b93273156
feat: adopt new logprob definition in PyTorch flow (#4057)
feat: align logprob definition of PyTorch flow

Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Co-authored-by: Erin <14718778+hchings@users.noreply.github.com>
2025-05-08 20:16:40 +08:00
zihaok
81cc60a0fd
[feat/] enable attention DP in Llama4 maverick model - part 1 (#4065)
* add feature

cosmetic changes

Signed-off-by: Zihao Kong <zihaok@nvidia.com>

address precommit fix

cosmetic

Signed-off-by: Zihao Kong <zihaok@nvidia.com>

* add feature

Signed-off-by: Zihao Kong <zihaok@nvidia.com>

* fix bug

Signed-off-by: Zihao Kong <zihaok@nvidia.com>

* address comments

Signed-off-by: Zihao Kong <zihaok@nvidia.com>

* remove WAR

Signed-off-by: Zihao Kong <zihaok@nvidia.com>

* fix format precommit

Signed-off-by: Zihao Kong <zihaok@nvidia.com>

* Update tensorrt_llm/_torch/models/modeling_llama.py

Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
Signed-off-by: zihaok <161090975+zihaok@users.noreply.github.com>

---------

Signed-off-by: Zihao Kong <zihaok@nvidia.com>
Signed-off-by: zihaok <161090975+zihaok@users.noreply.github.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
2025-05-08 05:06:40 +08:00
Daniel Cámpora
c56a2aca46
fix: Properly get decoding mode according to same logic as cpp. (#4026)
* Properly get decoding mode according to same logic as cpp.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Cross reference getDecodingMode implementations in pytorch - cpp.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Better bindings for DecodingMode.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Revert to version in main.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Revert configuration.py.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

---------

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-05-06 21:53:17 +08:00
Robin Kobus
72057a0a64
[TRTLLM-3429] feat: Overlap scheduling in C++ runtime (#3625)
* disable overlap in encoder

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* feat: invokeGatherBatch

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* feat: overlap same batch

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: add enableTrtOverlap to ExecutorConfig

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* disable overlap for beam search and spec decode

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* skip overlap tests with beam search or speculative decoding

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* moveFinishedContextRequestsToGeneration and skip unfinished requests in updateRequests

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* enable overlap in GptChunkedLongContextTests

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* feat: Enable overlap in gptManagerBenchmark

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* feat: Improve early exit

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Use OptionalRef for newOutputTokens tensor

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* feat: Add overlap scheduling support to TRTLLMDecoder

- Updated TRTLLMDecoder to accept an `enable_overlap_scheduler` parameter.
- Modified the decoder's internal logic to utilize the overlap scheduling feature.
- Adjusted the sequence lengths handling to ensure compatibility with the new scheduling approach.
- Enhanced unit tests to include cases for the overlap scheduler with the TRTLLMDecoder.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fix: allNewTokens in PP

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-06 15:06:46 +02:00
HuiGao-NV
5a4794b387
fix: skip add new slot if request has slot 0 (#3991)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-05-06 07:46:39 +02:00
Daniel Cámpora
aa980dc92f
fix: instantiate decoder early in pytorch (#4029)
* Instantiate decoder early to have better mem estimation.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Improve mem estimation by instantiating decoder earlier.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

---------

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-05-05 10:31:53 +02:00
yuxianq
266fef88f2
feat: support to trace executor loop. (#3983)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-05 10:26:33 +08:00
Daniel Cámpora
cb2c1cc829
[https://nvbugs/5248923] fix: Correctly sizes seqslotmanager considering pp. (#3984)
* Correctly sizes seqslotmanager considering pp.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Formatting.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Adapt order.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

---------

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-05-02 20:06:32 +02:00
Daniel Cámpora
c7cf032b89
fix: Move all casters to customCasters. (#3945)
* Move all casters to customCasters.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Use customCasters in all bindings.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Added customCasters to userbuffers.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

---------

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-05-02 19:08:28 +08:00
Erin
8fe7bdeacf
feat: LogitsProcessor in PyTorch backend (#3145)
* support lp in pytorch backend

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

* fix tp

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

---------

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-05-01 14:15:30 -07:00
YueWeng
b1621e8d4e
feat: add relaxed acceptance for DS (#3865)
* add relaxed acceptance for DS R1

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>

* clean and update docs

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>

* fix

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>

* Modified based on review

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>

* fix mtp manager issue

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>

---------

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-05-01 21:50:36 +08:00
Kate Cheng
7dbe618683
feat: Add multimodal embedding field in LlmRequest (#3855)
* Add a new param to LlmRequest and Request to natively support mm

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* update comment

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Update tests to match the new LlmRequest constructor parameters

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Modify unitTest and modify mm_embeding's dict name in llama4

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Fix based on comments

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Fix comment

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Fix LlmRequest initialization in kvCacheManagerTest

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Clean up code for promt_tuning_config

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Clean up prompt_tuning_config in GenerationRequest

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

---------

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-05-01 12:23:30 +08:00
Mike Iovine
8c2c969fcb
[fix] Pad requests to maximum draft length in spec decode (#3957)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-04-30 11:02:18 -04:00
Fanrong Li
e6b482ef47
fix: change the seq_lens sync copy to an async one (#3786)
---------

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-04-29 23:56:49 +08:00
yuxianq
0f8ec693b2
fix: get head_dim from model’s config. (#3916)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-29 23:04:29 +08:00
Dom Brown
8709fe8b53
chore: bump version to 0.19.0 (#3598) (#3841)
test: add test cases for 0.19 release (#3608)

* fix test name



* add quickstart test for nemotron-ultra



* add rcca multi-node test case for deepseek-v3



* add rcca info



---------




squash (#3642)



fix: nvbugs/5187237: fix deterministic mode crash (#3448)

* nvbugs/5187237 nvbugs/5112075: fix deterministic mode error

* remove waive


* Revert "remove waive"

This reverts commit 0bf5486d19906d692bfb7a6262333c296b0087ac.



* revert ar fusion



---------



update fp8 doc (#3647)




tests: change qa perf test to trtllm-bench (#3619)




 fix: FP8 quantized lm_head (NvBug 5214229) (#3567)



infra: Add PR approval protection for the release branch (#3634)



fix: nvbugs/5231298: pytorch allreduce issue (#3673)



Fix: nvbugs/5222698 variable not defined (#3630)

* Fix: nvbugs/5222698 variable not defined



* Tidy code



---------



test:sync waives.txt from main branch by disabling test_perf/gpt_350m-cppmanager case (#3685)



test:restore fp8 kv cache testing for L0 (#3671)



doc: Update DeepSeek perf docs (#3693)

* Update DeepSeek perf docs



* update



* Apply suggestions from code review




---------




tests: waive test_llm_multi_node (#3664)



fix: update test_user_buffers_mm_add_prologue atol (#3711)



Fix: cherry-pick hmac encryption from main branch (#3635)

* security fix cherry-pick changes from main



* fix hmac in remote mpi session (#3649)



---------





Un-waive DS-V3-Lite tests. (#3621)



fix: FP8 kv accuracy (#3675)

* fix FP8 kv accuracy



* update doc



---------



Fix script options for engines. (#3622)



unwaive multi-node test (#3721)



chore : Split more tests out of gpt tests (#3524) (#3674)



doc:add torch examples link into torch backend documentation (#3749)




test: Get Eagle tests working (#3593) (#3722)




Waive L0 test (#3756)



waive failed case in perf test, change default max_batch_size to 512 and write config.json to output log (#3656)





Update ds v3 parameters in stress test. (#3676)

waive gemma on L20 (#3766)



https://nvbugs/5141291: Fix convert.py script for Qwen model. (#3758)

Include Qwen2VLDecoderLayer in the smooth_qwen2_model function.



fix: PP4 fixes and cleanup (#3688)




remove benchmark test list (#3643)



skip disagg deepseek test if sm!=90 (#3720)



test: skip failed cases on B200 (#3710)

* add skip condition to tests



* fix error



---------



test: [nvbug: 5234494] skip_pre_ada for fp8 cases (#3718)

* skip_pre_ada for fp8 cases



* update



* update after rebase



---------



add know issue to deepseek doc. (#3800)



Fix ModelOpt Mixtral AWQ OOM (#3714) (#3761)




Waive L0 tests (#3826)



fix: Reduce memory usage in fused moe op associated with AutoTuning and fix moe fallback issue. (#3793)

* Reduce memory usage in fused moe op associated with AutoTuning.
* Replace pre-defined bucket size strategy with a generating function based on the tune_max_num_tokens.
* Add free_memory logic of workspace in min_latency_mode fused moe path.



* Fix fused_moe fallback issue. (#3652)

min_latency_mode is only set to False during warmup phase. Thus when it becomes true during inference, all tactics fall back to the default one and thus cause perf regression.



---------



[doc] Better document for Draft-Target-Model (DTM) speculative decoding (#3797)




Fix pre-commit



Fix again



Address some review comments for the MI

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-04-29 16:57:22 +08:00
bhsueh_NV
2e230b73ec
change log level of some text from info to debug (#3930)
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-04-29 13:38:34 +08:00
bhsueh_NV
0610d0ff84
add num_scheduled_requests into print_log (#3914)
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-04-29 11:22:22 +08:00
Perkz Zheng
35c5e4f1c5
feat: add CGA reduction fmha kernels on Blackwell. (#3763)
* update cubins

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

* add trtllm-gen kernels for eagle3 and also kernels with cga-reduction

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

* address the comments

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

---------

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-04-29 10:43:54 +08:00
yuxianq
b91da764de
chore: remove DummyKvCacheManager. (#3896)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-29 09:59:37 +08:00
Mike Iovine
e534bf09cc
[fix] Fix flashinfer + speculation issues (#3686)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-04-28 14:34:22 -04:00
bhsueh_NV
76f2c631fb
fix: add warmup flag into py_executor to prevent enable profiler during wa… (#3852)
* add warmup flag into py_executor to prevent enable profiler during warmup

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix bug of pre-commit

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* change setting warmup to all ranks

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

---------

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-04-27 19:22:42 +08:00
Chuang Zhu
e2318756ed
cacheTransceiver buffer manager (#3798)
* cacheTransceiver buffer manager

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* fix args

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* cpp kvCacheManager

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* format

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

---------

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-04-27 11:48:15 +08:00