80 Commits

Author SHA1 Message Date
Chuang Zhu
44c70c88f9
chore:[BREAKING CHANGE] use cacheTransceiverConfig as knobs for disagg service (#5234)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-07-17 17:42:07 +08:00
shaharmor98
e0836f9ca9
[TRTLLM-5493] Add core infrastructure to enable loading of custom checkpoint formats (#5372)
Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>
2025-07-17 00:50:30 +08:00
Enwei Zhu
055c4a9fe6
[NvBug 5370718, 5371538] fix: Fix incremental detokenization (#5825)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-07-10 16:30:00 +08:00
DylanChen-NV
74dca0aa7b
[NVBUG-5304516/5319741]Qwen2.5VL FP8 support (#5029)
Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>
2025-07-09 23:16:42 +08:00
Yechan Kim
5bc3a15f10
feat: add MultimodalParams & putting all multimodal params into it and refactor HyperCLOVAX & Qwen2/2.5-VL (#5522)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-07-07 18:03:12 -07:00
Shunkangz
32339d1b20
Raise shut down error for each request (#4936)
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-07-04 18:58:24 +09:00
Kaiyu Xie
f9a455651b
perf: Use tokenizers API to optimize incremental detokenization perf (#5574)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-07-01 09:35:25 -04:00
Aurelien Chartier
833c0dea4a
[TRTLLM-6104] feat: add request_perf_metrics to LLMAPI (#5497)
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-06-27 17:03:05 +02:00
Yibin Li
0f3bd7800e
[TRTLLM-4971]: Use safe deserialization in ParallelConfig (#4630)
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
2025-06-27 09:58:41 +08:00
amitz-nv
e0bb123ae7
[TRTLLM-5921][feat] Prevent serialization of entire LoRA adapters in each request (#5080)
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
2025-06-26 08:15:06 +03:00
QI JUN
d93a5e04b5
Chore: remove unused variables (#5314)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-24 22:27:32 +08:00
Yan Chunwei
3946e798db
fix[nvbug5298640]: trtllm-llmapi-launch multiple LLM instances (#4727)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-19 06:13:53 +08:00
QI JUN
f899c4d294
Re-implement LlmResponse in Python to reduce host overhead of pybind (#5224)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-17 21:28:09 +08:00
amirkl94
8451a87742
chore: Mass integration of release/0.20 (#5082)
Signed-off-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Co-authored-by: Stanley Sun <190317771+StanleySun639@users.noreply.github.com>
Co-authored-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Co-authored-by: Erin <14718778+hchings@users.noreply.github.com>
Co-authored-by: Frank <3429989+FrankD412@users.noreply.github.com>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Co-authored-by: Yechan Kim <161688079+yechank-nvidia@users.noreply.github.com>
Co-authored-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
2025-06-17 14:32:02 +03:00
Yilin Fan
dd29063538
[feat] Add llm args to tune python gc threshold (#5141)
Signed-off-by: Yilin Fan <206948969+nv-yilinf@users.noreply.github.com>
2025-06-16 17:45:22 +08:00
Kaiyu Xie
dce1dcc4f9
feat: Support post_proc for bench (#5122)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-06-15 13:02:38 +08:00
Yuan Tong
6bce7337a9
perf: avoid dynamic import overhead in is_llm_response with duck typing (#5110)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-06-15 07:45:02 +08:00
ixlmar
e055af1bc9
chore: improve disagg test failure detection (#4738)
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
2025-06-15 01:28:26 +08:00
Yibin Li
b79eb34bfe
[fix]: Fall back to HMAC to Avoid IPC Serialization Churn (#5074)
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
2025-06-13 11:37:50 +08:00
Yuxian Qiu
08dc369a4d
fix: pytorch_backend_config is deprecated in update_llm_args_with_extra_dict. (#4890)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-06-10 18:40:29 +08:00
Chang Liu
f70815c945
[TRTLLM-5007][feat] Add multimodal hashing support (image hashing) (#4145)
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
2025-06-10 01:59:56 +08:00
Yechan Kim
8b4104d34a
feat: add HyperCLOVAX-SEED-Vision support in refactored way (#4799)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-06-09 11:04:04 +08:00
dongxuy04
1e369658f1
feat: large-scale EP(part 6: Online EP load balancer integration for GB200 nvfp4) (#4818)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
Signed-off-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Co-authored-by: ShiXiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
2025-06-08 10:25:18 +08:00
QI JUN
b8c5e3892b
Revert "fix: build_config in TorchLlmArgs and avoid invalid args" (#4949)
Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-06-05 17:43:30 +08:00
Lucas Liebenwein
743fb0a159
[AutoDeploy] _AutoDeployLlmArgs as primary config object (#4891)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-06-05 17:20:55 +08:00
Yan Chunwei
ac20159d32
fix: build_config in TorchLlmArgs and avoid invalid args (#4600)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-04 13:17:29 +08:00
Shi Xiaowei
b13f8c9cba
Fix: NVBug 5302895 (#4835)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-06-04 09:31:39 +08:00
Shunkangz
c835f06371
Refactor the first token response in PD (#4692)
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-06-04 09:11:23 +08:00
Yan Chunwei
80b4026775
chore: remove request_error ipc in LLM.submit (#4763)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-03 20:55:59 +08:00
Enwei Zhu
3fe4a1842a
fix: Register MoeLoadBalancerConfig to serialization.py (#4864)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-03 19:22:36 +08:00
Yan Chunwei
e013c8cbc2
fix [nvbug5256044]: bench hang due to llmapi ipc (#4798)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-03 10:10:53 +08:00
Enwei Zhu
5b4852b7b5
feat: large-scale EP(part 5: Static EP load balancer with offline statistics) (#4695)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-06-02 01:25:02 +08:00
Yuxian Qiu
a02df6aa4b
fix: re-enable tp/pp for quickstart_advanced.py. (#4766)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-05-31 19:13:46 +08:00
Yan Chunwei
93c0632ee4
opt: the performance for dist-agg streaming generation (#4214)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-31 17:40:32 +08:00
Yan Chunwei
33a9ba55f5
fix: test trtllm-bench mgmn (#4613)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-29 14:43:47 +08:00
Yan Chunwei
ac17142495
chore: rename ExecutorBindingsWorker/Proxy (#4716)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-29 10:32:35 +08:00
Arthur Rasmusson
812b1abf86
feature: KV Cache GPUDirect Storage (#3209)
Signed-off-by: Arthur Rasmusson <47877520+arthurrasmusson@users.noreply.github.com.>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Co-authored-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-05-28 23:27:43 +00:00
Yan Chunwei
5506f60037
chore [BREAKING CHANGE]: Flatten PyTorchConfig knobs into TorchLlmArgs (#4603)
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-28 18:43:04 +08:00
Iman Tabrizian
c875184f78
Add missing serialization classes (#4642)
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-05-28 16:40:23 +08:00
Shunkangz
fd27f89df6
fix: Remove duplicate tokenization in generation server (#4492)
* Add nvtx

Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>

* Add draft change

Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>

* Refactor and add support of chat

Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>

---------

Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-05-26 16:43:07 +08:00
Yibin Li
bb2f545729
fix pipeline tests due to rebase (#4640)
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
2025-05-26 08:38:08 +08:00
coldwaterq
1cf0e672e7
fix: [nvbugs/5066257] serialization improvements (#3869)
* added a restricted pickler and depickler in a separate serialization function.

Signed-off-by: coldwaterq@users.noreply.github.com <coldwaterq@users.noreply.github.com>

* updated IPC to remove approved classes, removed the serialization function because it didn't work for all objects that made debugging harder, added tests.

Signed-off-by: coldwaterq@users.noreply.github.com <coldwaterq@users.noreply.github.com>

* removed LLM arg and moved class registration to a serialization module function. Also added missing classes to approved list.

Signed-off-by: coldwaterq <coldwaterq@users.noreply.github.com>

* cleaned up a couple files to reduce conflicts with main.

Signed-off-by: coldwaterq <coldwaterq@users.noreply.github.com>

* fix unit tests

Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>

* reorder BASE_ZMQ_CLASSES list alphabetically

Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>

* fix tests and move LogitsProcessor registration to base class

Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>

* revert changes to import log of tensorrt_llm._torch.models

Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>

* added comments to explain why BASE_ZMQ_CLASSES has to be passed into spawned child processes

Signed-off-by: coldwaterq <coldwaterq@users.noreply.github.com>

* fix tests and move LogitsProcessor registration to base class

Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>

* additional comments for multiprocess approved list sync

Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>

* add dataclass from tests

Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>

---------

Signed-off-by: coldwaterq@users.noreply.github.com <coldwaterq@users.noreply.github.com>
Signed-off-by: coldwaterq <coldwaterq@users.noreply.github.com>
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
Co-authored-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
2025-05-23 13:06:29 +08:00
pcastonguay
d7d455e7ea
[feat][TRTLLM-5018] Dis serving python runtime trt backend (#4243)
* feat: Enabling dis serving with TRT backend with Python runtime

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing formatting

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing disagg mtp test

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

---------

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-05-22 22:01:06 -04:00
Yan Chunwei
4798d088d9
chore: Partition LlmArgs into TorchLlmArgs and TrtLlmArgs (#3823)
* partition LlmArgs

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* update backend

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

---------

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-22 09:40:56 +08:00
Yan Chunwei
9199793848
fix: llmapi-launch add trtllm-bench test with engine building (#4091)
* add trtllm-bench mgmn test

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-21 10:18:01 +08:00
Pengyun Lin
039f7e3118
[https://nvbugspro.nvidia.com/bug/5243740][fix] deduce default max_tokens for trtllm-serve (#4265)
* Deduce default max_tokens for trtllm-serve

Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>

* Improve executor_config.max_seq_len assignment in TRT workflow

Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>

* Enhance error message

Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>

* Add deduced max_tokens test

Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>

---------

Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-05-19 00:34:40 +08:00
Kaiyu Xie
b4e5df0ee0
Breaking change: perf: Enable scheduling overlap by default (#4174)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-05-15 14:27:36 +08:00
pcastonguay
9643be5f20
[TRTLLM-5050][feat] Enable per-request stats with PyT backend (#4156)
* feat: Add per-request stats support with PyT backend

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Adding unit test

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing stats unit test

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing test with overlap

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

---------

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-05-12 21:35:15 -04:00
Yan Chunwei
0c26059703
chore: Cleanup deprecated APIs from LLM-API (part 1/2) (#3732)
* beam_width and max_new_token

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* remove beam_width

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* remove min_length

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* remove return_num_sequences

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

---------

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-05-07 13:20:25 +08:00
Erin
cba1793cda
cleanup logprob params (#4039)
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-05-07 00:50:16 +08:00