Commit Graph

102 Commits

Author SHA1 Message Date
Yilin Fan
01423ac183
[None][feat] perf_metrics endpoint functionality improvement (#8005)
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>
Signed-off-by: nv-yilinf <206948969+nv-yilinf@users.noreply.github.com>
2025-10-02 17:43:25 -07:00
Patrice Castonguay
fefa7d8fa3
[None][feat] Support for cancelling requests with disaggregation (#8114)
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-10-02 11:04:26 -07:00
Iman Tabrizian
33282351a2
[TRTLLM-6106][feat] Add support for KVCache transfer from KVCache reuse path (#6348)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-09-27 19:29:30 -04:00
Zheng Duan
24fc1f9acf
[None][fix] using arrival time in llmapi when creating LlmRequest in pytorch workflow (#7553)
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
2025-09-15 07:26:01 -04:00
Chuang Zhu
77657a1c12
[TRTLLM-7361][feat] KV cache transfer for uneven pp (#7117)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-09-08 13:37:46 -04:00
Chang Liu
23500b55c3
[TRTLLM-7398][feat] Support KV cache salting for secure KV cache reuse (#7106)
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
2025-09-06 17:58:32 -04:00
Enwei Zhu
5ff3a65b23
[TRTLLM-7028][feat] Enable guided decoding with speculative decoding (part 2: one-model engine) (#6948)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-09-03 15:16:11 -07:00
Tian Zheng
e257cb3533
[None][feat] Support NVFP4 KV Cache (#6244)
Signed-off-by: Tian Zheng <29906817+Tom-Zheng@users.noreply.github.com>
2025-09-01 09:24:52 +08:00
Richard Huo
ce580ce4f5
[None][feat] KV Cache Connector API (#7228)
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Signed-off-by: richardhuo-nv <rihuo@nvidia.com>
Co-authored-by: jthomson04 <jwillthomson19@gmail.com>
Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>
Co-authored-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>
2025-08-28 23:09:27 -04:00
qixiang-99
b165f8bc97
fix/improve kvcache allocation in PyTorch runtime (#5933)
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
2025-08-26 12:40:22 +08:00
Robin Kobus
45c7518032
[None][refactor] Simplify decoder state initialization (#6559)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-08-12 21:44:41 +02:00
Ziyi Xiong
de472828b9
[TRTLLM-6637][feat] Resolve KV cache divergence issue (#6628)
Signed-off-by: ziyixiong-nv <219238287+ziyixiong-nv@users.noreply.github.com>
2025-08-09 23:15:04 +08:00
pcastonguay
453a06e6ab
[TRTLLM-6881][feat] Include attention dp rank info with KV cache events (#6563)
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-08-07 14:17:07 +02:00
amitz-nv
85af62184b
[TRTLLM-6683][feat] Support LoRA reload CPU cache evicted adapter (#6510)
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-08-07 09:05:36 +03:00
ixlmar
1ebceb790d
[TRTLLM-5508][feat] check input tokens + improve error handling (#5170)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-08-05 18:27:43 +01:00
Olya Kozlova
13cc1c4878
[TRTLLM-5271][feat] best_of/n for pytorch workflow (#5997)
Signed-off-by: Olya Kozlova <okozlova@nvidia.com>
2025-08-04 14:08:06 +02:00
Robin Kobus
d3c14682f0
refactor: Remove unused buffers and bindings from sampler (#6484)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-08-01 00:43:03 -04:00
amitz-nv
98428f330e
[TRTLLM-5826][feat] Support pytorch LoRA adapter eviction (#5616)
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-07-20 08:00:14 +03:00
Robin Kobus
ec2b953e7e
refactor: Enhanced handling of decoder requests and logits within the batch manager (#6055)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-18 12:12:08 +02:00
Chuang Zhu
44c70c88f9
chore:[BREAKING CHANGE] use cacheTransceiverConfig as knobs for disagg service (#5234)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-07-17 17:42:07 +08:00
qixiang-99
e09e409dfb
Fix: Enhance ModelConfig for kv cache size calculations (#5868)
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
2025-07-16 14:41:31 -07:00
Robin Kobus
6d4b045d1f
refactor: Remove enforced sorted order of batch slots (#3502)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-14 17:23:02 +02:00
Daniel Cámpora
1260e2f33f
feat: Optimize TRTLLM Sampler perf single beam single step (#5550)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-07-07 15:44:47 +02:00
Stefan Niebler
d1112aac37
[TRTLLM-3442] feat: added beam search support to the PyTorch Workflow (#5333)
Signed-off-by: Stefan Niebler <82932102+stnie@users.noreply.github.com>
2025-07-05 01:35:13 +09:00
Robin Kobus
1a3bd140ed
chore: Remove unused isFullContextRequest method (#5666)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-07-03 15:08:09 +02:00
qixiang-99
ca7b6ec8d8
Feat/pytorch vswa kvcachemanager (#5151)
Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>
2025-07-02 15:58:00 +08:00
Robin Kobus
a8141a4513
refactor: Speculative decoding buffers part 2 (#5316)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-27 17:41:48 +02:00
Aurelien Chartier
833c0dea4a
[TRTLLM-6104] feat: add request_perf_metrics to LLMAPI (#5497)
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-06-27 17:03:05 +02:00
Robin Kobus
e2a8cbc80b
refactor: manage cache indirection in decoder state (#5315)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-24 09:15:59 +02:00
Robin Kobus
627062c265
refactor: Update decoder buffer and logits management (#4450)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-18 08:10:32 +08:00
QI JUN
f899c4d294
Re-implement LlmResponse in Python to reduce host overhead of pybind (#5224)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-17 21:28:09 +08:00
Robin Kobus
b6ca677741
refactor: remove decoder request from decoder interface (#5129)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-16 09:12:30 +02:00
liji-nv
10ab9791ec
[fix] Do not reuse dummy request KVCache (#4804)
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
2025-06-12 15:24:50 +08:00
Netanel Haber
e692779ead
Solve underallocation in VSWA+/VGQA (#4667)
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2025-06-12 12:12:46 +08:00
Daniel Cámpora
d68b8180d3
feat: port MakeDecodingBatchInputOutput to python in TRTLLMSampler (#4828)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-06-10 07:28:34 +08:00
Chang Liu
f70815c945
[TRTLLM-5007][feat] Add multimodal hashing support (image hashing) (#4145)
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
2025-06-10 01:59:56 +08:00
Robin Kobus
3de02582dd
refactor: Separate DecoderState from GptDecoderBatched (#4700)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-06-03 09:42:01 +02:00
Robin Kobus
b9263a8e10
fix: max_num_sequences calculation with overlap scheduling (#4532)
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Co-authored-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-06-03 09:31:22 +02:00
Daniel Cámpora
69c7fe8905
[TRTLLM-4987][feat] Partial support of context logits in TRTLLMSampler (#4538)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-06-01 03:32:43 +08:00
Robin Kobus
7b2818a47b
refactor: CreateNewDecoderRequests (#4452)
* refactor: CreateNewDecoderRequests

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Consolidate request generation in CreateNewDecoderRequests

- Removed the GenerateRequestOptions class and integrated its functionality into CreateNewDecoderRequests.
- Updated the constructor of CreateNewDecoderRequests to accept parameters for speculative decoding and normalization options.
- Modified the operator() method to handle request generation directly, improving code organization and reducing redundancy.
- Cleaned up associated includes and references throughout the codebase.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Simplify request handling in CreateNewDecoderRequests

- Removed the generateRequestOptions method and integrated its logic directly into the operator() method.
- Updated the request generation process to improve clarity and reduce redundancy.
- Adjusted the return type to streamline the handling of batch slots, decoder requests, and sampling configurations.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Enhance createDecoderRequests method in CreateNewDecoderRequests

- Updated the createDecoderRequests method to include additional parameters for decoder state and CUDA streams, improving flexibility in request handling.
- Removed redundant request generation logic from the operator() method, streamlining the process.
- Adjusted the newRequest method to utilize the updated decoder request structure, enhancing clarity and maintainability.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Use MedusaBuffers instead of RuntimeBuffers in CreateNewDecoderRequests

- Updated references from RuntimeBuffers to MedusaBuffers across the CreateNewDecoderRequests class and its methods, enhancing clarity in buffer management.
- Adjusted method signatures and internal logic to accommodate the new MedusaBuffers type, ensuring compatibility with existing functionality.
- Cleaned up unnecessary includes and improved code organization for better maintainability.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update CreateNewDecoderRequests to use DecoderState and CudaStream parameters

- Modified method signatures in CreateNewDecoderRequests to replace GptDecoderBatched with runtime::decoder::DecoderState and added a separate CudaStream for the decoder.
- Adjusted the implementation of the operator() method to accommodate the new parameters, enhancing flexibility in request handling.
- Updated associated bindings in the pybind11 interface to reflect the changes in method signatures, ensuring consistency across the codebase.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update TRTLLMSampler to use refactored create_new_decoder_requests

- Updated the sampler.py to reflect changes in the request handling logic, replacing generate_request_options with create_new_decoder_requests for improved clarity and consistency.
- Updated bindings and method signatures for decoder stream handling.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update gptDecoderBatchedTest to use CreateNewDecoderRequests::newRequest

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-23 22:54:37 +08:00
pcastonguay
d7d455e7ea
[feat][TRTLLM-5018] Dis serving python runtime trt backend (#4243)
* feat: Enabling dis serving with TRT backend with Python runtime

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing formatting

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing disagg mtp test

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

---------

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-05-22 22:01:06 -04:00
Chuang Zhu
44cfd757b2
Agent interface impl for NIXL (#4125)
* agentConnection

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

recv

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

agentState

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

NIXL interfaces

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>

update cmakelists

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>

nixl improve

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

remove cppzmq

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

fix

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

transferAgent remove register

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

work for cache Test

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

reduce sleep time

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

fix test

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

intergarte

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

nixl env

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

fix rebase error

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

cpp test

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

stash for send metaData

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

loadRemoteMD after fetchRemoteMD

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

workaround for mixed gen and context

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

test_env

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

avoid port conflict in test

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* format

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* use std::string

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* typo

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* fix transferAgentTest

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

---------

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-05-22 09:09:41 +08:00
Robin Kobus
4e370a509a
refactor: Copy sequence lengths once in decoder setup (#4102)
* refactor: Copy sequence lengths once in decoder setup

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update DecoderInputBuffers to remove duplicated buffers

- Renamed and reorganized buffer variables in decoderBuffers.h and decoderBuffers.cpp for better readability.
- Adjusted references in generateRequestOptions.cpp to align with the new buffer structure.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Move getEmbeddingBias to anonymous namespace

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Filter context requests

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: GenerateRequestOptions using more fine-grained functions

- Added a new method `createDecoderRequests` to encapsulate the logic for creating decoder requests from finished context requests.
- Updated the `operator()` method to utilize the new method, improving code clarity and maintainability.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update TRTLLMDecoder

- Updated the `generate_request_options` call.
- Updated the `make_decoding_batch_input_output` call.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Remove const where we modify input buffers

- Changed `DecoderInputBuffers` parameters from const references to non-const references in multiple functions to allow modifications.
- Updated related function calls to ensure compatibility with the new parameter types.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fixup! refactor: Copy sequence lengths once in decoder setup

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-16 22:03:55 +08:00
pcastonguay
9643be5f20
[TRTLLM-5050][feat] Enable per-request stats with PyT backend (#4156)
* feat: Add per-request stats support with PyT backend

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Adding unit test

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing stats unit test

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing test with overlap

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

---------

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-05-12 21:35:15 -04:00
Robin Kobus
72057a0a64
[TRTLLM-3429] feat: Overlap scheduling in C++ runtime (#3625)
* disable overlap in encoder

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* feat: invokeGatherBatch

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* feat: overlap same batch

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: add enableTrtOverlap to ExecutorConfig

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* disable overlap for beam search and spec decode

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* skip overlap tests with beam search or speculative decoding

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* moveFinishedContextRequestsToGeneration and skip unfinished requests in updateRequests

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* enable overlap in GptChunkedLongContextTests

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* feat: Enable overlap in gptManagerBenchmark

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* feat: Improve early exit

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Use OptionalRef for newOutputTokens tensor

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* feat: Add overlap scheduling support to TRTLLMDecoder

- Updated TRTLLMDecoder to accept an `enable_overlap_scheduler` parameter.
- Modified the decoder's internal logic to utilize the overlap scheduling feature.
- Adjusted the sequence lengths handling to ensure compatibility with the new scheduling approach.
- Enhanced unit tests to include cases for the overlap scheduler with the TRTLLMDecoder.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fix: allNewTokens in PP

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-05-06 15:06:46 +02:00
Kate Cheng
7dbe618683
feat: Add multimodal embedding field in LlmRequest (#3855)
* Add a new param to LlmRequest and Request to natively support mm

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* update comment

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Update tests to match the new LlmRequest constructor parameters

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Modify unitTest and modify mm_embeding's dict name in llama4

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Fix based on comments

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Fix comment

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Fix LlmRequest initialization in kvCacheManagerTest

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Clean up code for promt_tuning_config

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Clean up prompt_tuning_config in GenerationRequest

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

---------

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-05-01 12:23:30 +08:00
Chuang Zhu
e2318756ed
cacheTransceiver buffer manager (#3798)
* cacheTransceiver buffer manager

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* fix args

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* cpp kvCacheManager

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* format

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

---------

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-04-27 11:48:15 +08:00
Daniel Cámpora
1299f27c74
fix: Fix C++ decoder synchronization in PyTorch (#3106)
* Use updateDecoderBuffers in python decoder.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix synchronize in trtllm decoder.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Enable by default.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Use guided_decoder to setup seqslots and free them.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Use always decode_async and update_requests.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update decoder buffers.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix speculative decoding tests.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Send new_tensors_host instead of assuming dict.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Make default False in enable_trtllm_decoder.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Partially fix mtp, partially fix py_executor.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update request states before sending disagg ctx cache.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix disagg test for torch decoder.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Make isend_tensor_list and recv_tensor_list for sending the tensors_host.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Formatting.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix rebase.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Add disagg serving case to guided decoder.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Get overlap scheduling to work.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update cutlass to main.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update after rebasing.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Formatting.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update to use decode async and update requests.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Properly pass information to update_requests

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Formatting.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Make disaggregated serving a step closer to working.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix rebase.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix rebase and format.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Copy new device tokens more pythonic.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Restore MTP add dummy reqs.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Add ordereddict import to py_executor.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Formatting.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Added seq slot manager. Add test.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Use transmission for single tensor except when list of tensors is received.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Add TRTLLMDecoder allocation to estimate max kv cache tokens.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Add stream synchronization

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Formatting.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Make memory calculation of decoder adapt to the chosen decoder. Recognize decoder option passed in executorconfig. Make overlap scheduler test run on TinyLlama.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Format

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Add decoder creation to estimate max kv.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Formatting.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update submodule UCXX inline with main.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

---------

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-04-23 23:55:27 +08:00
Robin Kobus
8340657ae4
refactor: Introduce DecoderOutputBuffers per batch (#3506)
* refactor: Restructure DecoderBuffers and DecoderStepAsyncSend

- Move communication logic from `DecoderBuffers` to `DecoderStepAsyncSend`.
- Updated `DecoderStepAsyncSend` constructor to utilize the `DecoderBuffers`, enhancing clarity and reducing parameter complexity.
- Refactored related methods to align with the new class structure, improving maintainability and readability of the code.

These changes streamline the handling of decoding buffers and improve the overall architecture of the batch manager.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Restructure SlotDecoderBuffers and DecoderSlotAsyncSend

- Move communication logic from `SlotDecoderBuffers` to `DecoderSlotAsyncSend`.
- Updated `DecoderSlotAsyncSend` constructor to utilize the `SlotDecoderBuffers`, enhancing clarity and reducing parameter complexity.
- Refactored related methods to align with the new class structure, improving maintainability and readability of the code.

These changes enhance the structure and readability of the batch manager's decoding process.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* chore: Log DecodingMode

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Introduce DecoderOutputBuffers and update related classes

- Moved buffers from `DecoderBuffers` to `DecoderOutputBuffers` to better reflect its purpose.
- Updated the `DecoderStepAsyncSend` class to utilize `DecoderOutputBuffers`, enhancing clarity in the communication logic.
- Refactored the constructor and methods in `DecoderBuffers` to accommodate the new structure, improving maintainability.
- Added Python bindings for `DecoderOutputBuffers` to ensure compatibility with existing interfaces.

These changes streamline the handling of output buffers in the decoding process, improving the overall architecture of the batch manager.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Update MPI communicator handling

- Changed the `commSession` parameter type from `std::shared_ptr<mpi::MpiComm>` to `mpi::MpiComm` in `DecoderStepAsyncSend` and `DecoderSlotAsyncSend` classes for improved clarity and reduced complexity.
- Updated related methods and constructors to reflect the new parameter type, enhancing maintainability.
- Refactored the `TrtGptModelInflightBatching` class to accommodate these changes, ensuring consistent usage of `MpiComm`.

These modifications streamline the communication logic in the decoding process, improving the overall architecture of the batch manager.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* refactor: Replace shared_ptr with unique_ptr for buffer management

- Updated the `TrtGptModelInflightBatching` class to use `std::unique_ptr` instead of `std::shared_ptr` for various buffer types, including `AllReduceBuffers`, `RuntimeBuffers`, `DecoderBuffers`, and `SlotDecoderBuffers`.
- This change enhances memory management and ownership semantics, reducing overhead and improving performance.

These modifications contribute to a cleaner and more efficient architecture in the batch manager.

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-04-22 12:25:53 +08:00
Zheng Duan
ae48abefc1
bind block key and hasher (#3712) 2025-04-21 18:50:57 +08:00