Zheng Duan
4d0a5ad384
chore: gracefully exit disagg process in tests; better startup and logging ( #5109 )
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
2025-06-13 14:03:55 +08:00
Yibin Li
b79eb34bfe
[fix]: Fall back to HMAC to Avoid IPC Serialization Churn ( #5074 )
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
2025-06-13 11:37:50 +08:00
pcastonguay
3a04c9fa7b
chore: Include prompt_token_ids only for context-only disagg requests ( #5055 )
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-06-12 15:00:08 -04:00
Yechan Kim
8b4104d34a
feat: add HyperCLOVAX-SEED-Vision support in refactored way ( #4799 )
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-06-09 11:04:04 +08:00
Shunkangz
c835f06371
Refactor the first token response in PD ( #4692 )
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-06-04 09:11:23 +08:00
rakib-hasan
d0eb47d33a
[TRTLLM-5053] Refactoring and Unifying the Multimodal input preparation ( #4506 )
* refactoring the multimodal input prep
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
* adding out-of-tree override option
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
* adding exceptional case for llava-next
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
* fixing typo
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
* addressing review comments, adding placement option, handling tokenizer variations
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
* addressing pytest-asyncio behavior change
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
---------
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
2025-06-03 12:02:07 -07:00
Shunkangz
ae9a6cf24f
feat: Add integration of etcd ( #3738 )
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Signed-off-by: BatshevaBlack <132911331+BatshevaBlack@users.noreply.github.com>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Co-authored-by: Batsheva Black <bblack@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: BatshevaBlack <132911331+BatshevaBlack@users.noreply.github.com>
2025-06-03 20:01:44 +08:00
Pengyun Lin
bac22ff7b5
[feat] support sharegpt downloading in benchmark_serving ( #4578 )
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-05-30 17:27:53 +08:00
Pengyun Lin
971d16a2ee
[TRTLLM-1658][feat] Enable multiple response in trtllm-serve for TRT backend ( #4623 )
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-05-28 11:36:44 +08:00
Shunkangz
fd27f89df6
fix: Remove duplicate tokenization in generation server ( #4492 )
* Add nvtx
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
* Add draft change
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
* Refactor and add support of chat
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
---------
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-05-26 16:43:07 +08:00
coldwaterq
1cf0e672e7
fix: [nvbugs/5066257] serialization improvements ( #3869 )
* added a restricted pickler and unpickler in a separate serialization function.
Signed-off-by: coldwaterq@users.noreply.github.com <coldwaterq@users.noreply.github.com>
* updated IPC to remove approved classes; removed the serialization function because it didn't work for all objects, which made debugging harder; added tests.
Signed-off-by: coldwaterq@users.noreply.github.com <coldwaterq@users.noreply.github.com>
* removed LLM arg and moved class registration to a serialization module function. Also added missing classes to approved list.
Signed-off-by: coldwaterq <coldwaterq@users.noreply.github.com>
* cleaned up a couple files to reduce conflicts with main.
Signed-off-by: coldwaterq <coldwaterq@users.noreply.github.com>
* fix unit tests
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
* reorder BASE_ZMQ_CLASSES list alphabetically
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
* fix tests and move LogitsProcessor registration to base class
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
* revert changes to import log of tensorrt_llm._torch.models
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
* added comments to explain why BASE_ZMQ_CLASSES has to be passed into spawned child processes
Signed-off-by: coldwaterq <coldwaterq@users.noreply.github.com>
* fix tests and move LogitsProcessor registration to base class
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
* additional comments for multiprocess approved list sync
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
* add dataclass from tests
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
---------
Signed-off-by: coldwaterq@users.noreply.github.com <coldwaterq@users.noreply.github.com>
Signed-off-by: coldwaterq <coldwaterq@users.noreply.github.com>
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
Co-authored-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
2025-05-23 13:06:29 +08:00
pcastonguay
d7d455e7ea
[feat][TRTLLM-5018] Dis serving python runtime trt backend ( #4243 )
* feat: Enabling dis serving with TRT backend with Python runtime
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
* Fixing formatting
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
* Fixing disagg mtp test
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
---------
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-05-22 22:01:06 -04:00
Kaiyu Xie
2898d268f9
feat: add health_generate route to openai serving (Cherry-pick https://github.com/NVIDIA/TensorRT-LLM/pull/3856 ) ( #4349 )
Cherry-pick https://github.com/NVIDIA/TensorRT-LLM/pull/3856
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Co-authored-by: Dhruv Singal <dhruvsingalabc@gmail.com>
2025-05-22 11:46:06 +08:00
Zheng Duan
77a0189554
feat: conditional disaggregation in disagg server ( #3974 )
2025-05-21 09:57:46 +08:00
Pengyun Lin
039f7e3118
[ https://nvbugspro.nvidia.com/bug/5243740 ][fix] deduce default max_tokens for trtllm-serve ( #4265 )
* Deduce default max_tokens for trtllm-serve
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
* Improve executor_config.max_seq_len assignment in TRT workflow
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
* Enhance error message
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
* Add deduced max_tokens test
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
---------
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-05-19 00:34:40 +08:00
rakib-hasan
49f993d862
Removing the outdated argument ( #4408 )
removing the outdated argument
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
2025-05-18 15:52:15 +08:00
Tracin
7b19acfab1
fix: Fix chat template kwargs bug. ( #4387 )
* Fix chat template kwargs bug.
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
* Fix chat template kwargs bug.
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
* Fix chat template kwargs bug.
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
---------
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
2025-05-16 23:07:46 +08:00
Yechan Kim
c6e2111f4e
feat: enhance trtllm serve multimodal ( #3757 )
* feat: enhance trtllm serve multimodal
1. made load_image and load_video asynchronous
2. added image_encoded input support for compatibility with genai-perf
3. supported text-only input on multimodal models (currently Qwen2-VL & Qwen2.5-VL)
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
* add test
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
* fix bandit
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
* trimming utils
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
* trimming for test
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
* genai perf command fix
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
* command fix
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
* refactor chat_utils
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
* stress test genai-perf command
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
---------
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-05-15 16:16:31 -07:00
Zheng Duan
c9e2a963e0
feat: add kv cache aware router ( #3831 )
* kv cache aware router
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
* add tests
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
* router config
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
* eviction test
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
add test
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
* eviction detect in worker test
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
* move worker tests to single gpu
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
* reduce memory fraction
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
* fix partial block
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
---------
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
2025-05-12 07:23:57 -04:00
Yixin Dong
c90ebadd84
feat: Support the Structural Tag in guided decoding ( #4066 )
* finish
Signed-off-by: Ubospica <ubospica@gmail.com>
* update
Signed-off-by: Ubospica <ubospica@gmail.com>
* update
Signed-off-by: Ubospica <ubospica@gmail.com>
* fix
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* exc overlap scheduler
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* add test
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
* fix api ref
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
---------
Signed-off-by: Ubospica <ubospica@gmail.com>
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
Co-authored-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-05-12 17:24:50 +08:00
Yuan Tong
5b93273156
feat: adopt new logprob definition in PyTorch flow ( #4057 )
feat: align logprob definition of PyTorch flow
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Co-authored-by: Erin <14718778+hchings@users.noreply.github.com>
2025-05-08 20:16:40 +08:00
Pengyun Lin
721f84a0ac
fix: Align default setting & remove unnecessary check for chat and completion ( #3888 )
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-05-07 14:42:53 +08:00
Kaiyu Xie
52d4302dda
bench: TRTLLM-4936 Port benchmark_serving.py ( #4011 )
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Co-authored-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
2025-05-07 09:45:14 +08:00
Erin
cba1793cda
cleanup logprob params ( #4039 )
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-05-07 00:50:16 +08:00
pansicheng
e84dc6b3c7
feat: add deepseek-r1 reasoning parser to trtllm-serve ( #3354 )
* add deepseek-r1 reasoning parser
Signed-off-by: pansicheng <sicheng.pan.chn@gmail.com>
* fix test
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
---------
Signed-off-by: pansicheng <sicheng.pan.chn@gmail.com>
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
Co-authored-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-05-06 08:13:04 +08:00
Erin
83f37614ef
feat: Support Top-K logprobs and prompt_logprobs in LLMAPI ( #3388 )
* support return logprob in llmapi
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
update and add test
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
stability test
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
* revert removal of old flag
Signed-off-by: Erin Ho <erinh@nvidia.com>
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
---------
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Signed-off-by: Erin Ho <erinh@nvidia.com>
2025-05-01 12:47:14 -04:00
Yechan Kim
5460d18b10
feat: trtllm-serve multimodal support ( #3590 )
* feat: trtllm-serve multimodal support
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
* remove disable argument
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
* remove disable
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
* add and separate tests and move the doc
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
* remove block_reuse arg from serve.py
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
---------
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-04-19 05:01:28 +08:00
pcastonguay
ae5671644a
feat: Disaggregated router class ( #3584 )
* Add draft scheduler class
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
* Refactor the design
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
* feat: Introduce router class for disaggregated server
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
* Add unit tests for router class
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
* Adding tests for disagg_utils
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
* Fixing missing import
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
* Fixing disagg integration tests
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
* Addressing MR review comments
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
---------
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-04-19 00:34:12 +08:00
Zheng Duan
bce7ea8c38
test: add kv cache event tests for disagg workers ( #3602 )
2025-04-18 18:30:19 +08:00
Kaiyu Xie
e037d3e99b
chore: Unify Python NVTX call ( #3450 )
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-04-15 23:25:36 +08:00
Shunkangz
ea050084ad
feat: Add support of chat completion in PD ( #2985 )
* Add support of chat completion in PD
Add support of include_usage in PD
Reformat
* Remove redundant code
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
* Refactor code
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
* Add chat completion test
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
* Refactor code
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
---------
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-04-11 17:53:28 +08:00
Pengyun Lin
60e02a3684
Use llm.tokenizer in OpenAIServer ( #3199 )
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
Co-authored-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
2025-04-08 14:55:02 +08:00
Kaiyu Xie
0a4e1d5a55
breaking change: perf: Make ipc_periodically the default responses_handler ( #3102 )
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-04-08 10:36:39 +08:00
pcastonguay
add5e5cd93
feat: Add option to run disaggregated serving without ctx servers,… ( #3243 )
* feat: Add option to run disaggregated serving without ctx servers, to benchmark gen only
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
* Fixing comment in sanity check
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
---------
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-04-07 21:56:03 -04:00
pansicheng
ef1ba468a1
feat: support abort disconnected requests ( #3214 )
Signed-off-by: pansicheng <sicheng.pan.chn@gmail.com>
2025-04-07 16:14:58 +08:00
Yan Chunwei
b21cfcfed1
chore: refactor the LlmArgs with Pydantic and migrate remaining pybinding configs to python ( #3025 )
* make LlmArgs Pydantic
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
* amending doc
fix api_stability
fix tests
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
* restore yaml groups
refine StackTrace
singleton
clean tests
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
* fix trtllm-bench
fix pytorch
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
* fix serve disagg
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
* fix
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
---------
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-04-05 13:31:48 +08:00
Zheng Duan
35b828ca2d
fix streaming in dist-serving ( #3087 )
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
2025-04-02 10:08:07 +08:00
Shunkangz
dda7354d1a
Refactor return of first gen token in PD ( #2986 )
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-04-01 12:28:27 +08:00
Kaiyu Xie
2631f21089
Update ( #2978 )
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-03-23 16:39:35 +08:00
Kaiyu Xie
3aa6b11d13
Update TensorRT-LLM ( #2936 )
* Update TensorRT-LLM
---------
Co-authored-by: changcui <cuichang147@gmail.com>
2025-03-18 21:25:19 +08:00
Kaiyu Xie
9b931c0f63
Update TensorRT-LLM ( #2873 )
2025-03-11 21:13:42 +08:00
Kaiyu Xie
77d7fe1eb2
Update TensorRT-LLM ( #2849 )
* Update TensorRT-LLM
---------
Co-authored-by: aotman <chenhangatm@gmail.com>
2025-03-04 18:44:00 +08:00
Kaiyu Xie
ab5b19e027
Update TensorRT-LLM ( #2820 )
2025-02-25 21:21:49 +08:00
Kaiyu Xie
2ea17cdad2
Update TensorRT-LLM ( #2792 )
* Update TensorRT-LLM
---------
Co-authored-by: jlee <jungmoolee@clika.io>
2025-02-18 21:27:39 +08:00
Dan Blanaru
16d2467ea8
Update TensorRT-LLM ( #2755 )
* Update TensorRT-LLM
---------
Co-authored-by: Denis Kayshev <topenkoff@gmail.com>
Co-authored-by: akhoroshev <arthoroshev@gmail.com>
Co-authored-by: Patrick Reiter Horn <patrick.horn@gmail.com>
Update
2025-02-11 03:01:00 +00:00
Kaiyu Xie
535c9cc673
Update TensorRT-LLM ( #2460 )
2024-11-19 18:30:34 +08:00
Kaiyu Xie
c629546ce4
Update TensorRT-LLM ( #2436 )
2024-11-12 15:27:49 +08:00