Pengyun Lin
c1e7fb9042
[TRTLLM-7207][feat] Chat completions API for gpt-oss ( #7261 )
...
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-08-28 10:22:06 +08:00
Iman Tabrizian
bc84758626
[None][feat] Add logging for OAI disagg server ( #7232 )
2025-08-26 21:02:03 -07:00
Zheng Duan
cf50ba2980
[TRTLLM-6549][feat] add perf metrics endpoint to openai server and openai disagg server ( #6985 )
...
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
2025-08-26 15:34:44 +08:00
Yechan Kim
0893afae3d
[TRTLLM-6771][feat] Support MMMU for multimodal models ( #6828 )
...
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-08-21 08:54:12 +08:00
Chang Liu
ce53832610
[TRTLLM-7326][feat] Add standalone multimodal encoder ( #6743 )
...
Signed-off-by: Chang Liu <9713593+chang-l@users.noreply.github.com>
Signed-off-by: Chang Liu (Enterprise Products) <9713593+chang-l@users.noreply.github.com>
2025-08-19 21:42:50 -07:00
Zero Zeng
953f4fd69e
[None][fix] acceptance rate calculation fix in benchmark_serving ( #6746 )
...
Signed-off-by: Zero Zeng <38289304+zerollzeng@users.noreply.github.com>
2025-08-19 17:29:36 +08:00
Bo Deng
d8acca495b
[TRTLLM-6675][infra] Cherry-pick https://github.com/NVIDIA/TensorRT-LLM/pull/6623 ( #6735 )
...
Signed-off-by: Bo Deng <deemod@nvidia.com>
2025-08-14 04:36:38 +00:00
Yechan Kim
12102e2d48
[TRTLLM-6772][feat] Multimodal benchmark_serving support ( #6622 )
...
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-08-12 19:34:02 -07:00
Zero Zeng
4b4b91ab51
[None][feat] improve dataloading for benchmark_dataset by using batch… ( #6548 )
...
Signed-off-by: Zero Zeng <38289304+zerollzeng@users.noreply.github.com>
2025-08-11 09:50:41 +08:00
Ye Zhang
bcf5ec0c9a
[None][feat] Core Metrics Implementation ( #5785 )
...
Signed-off-by: Ye Zhang <zhysishu@gmail.com>
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-08-09 02:48:53 -04:00
Venky
61da2daeb4
[TRTLLM-6761][refactor] Replace LogitBiasLogitsProcessor with embedding bias tensor system ( #6464 )
...
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
2025-08-05 07:14:24 -07:00
Zero Zeng
48768fd720
fix: Fix missing key ( #6471 )
...
Signed-off-by: Zero Zeng <38289304+zerollzeng@users.noreply.github.com>
2025-08-01 14:25:58 +08:00
shaharmor98
f9cf683e39
add propagation of trust_remote_code to OpenAIServer ( #6446 )
...
Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>
2025-07-30 15:25:41 -04:00
Zheng Duan
c9ed1ab436
[TRTLLM-6549] chore: record delay introduced by disaggregated serving in kv cache measure ( #6135 )
...
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
2025-07-30 10:39:40 +08:00
Zero Zeng
c9b8b6180f
Add Acceptance Rate calculation to benchmark_serving ( #6240 )
...
Signed-off-by: Zero Zeng <38289304+zerollzeng@users.noreply.github.com>
2025-07-28 14:00:58 +08:00
nv-guomingz
b8d4cb8beb
feat: Support JSON Schema in OpenAI-Compatible API ( #6321 )
...
Signed-off-by: noiji <52301388+noiji@users.noreply.github.com>
2025-07-25 12:55:56 -04:00
xiaoqi
a0aecf0476
[feat]: support logit_bias ( #5354 )
...
Signed-off-by: xq25478 <xq25478@qq.com>
Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
Signed-off-by: hexiao.xq <hexiao.xq@antgroup.com>
Co-authored-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>
Co-authored-by: hexiao.xq <hexiao.xq@antgroup.com>
Co-authored-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-07-25 09:37:41 +00:00
Linda
9a99e6d6d7
fix: integration tests with nanobind ( #6326 )
...
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com>
2025-07-25 09:23:20 +08:00
Rashid Kaleem
152e2df43b
[Disaggregated] Add retry knobs and handling ( #5808 )
...
Signed-off-by: Rashid Kaleem <4079439+arekay@users.noreply.github.com>
Signed-off-by: Shi Xiaowei <39303645+Shixiaowei02@users.noreply.github.com>
Co-authored-by: Shi Xiaowei <39303645+Shixiaowei02@users.noreply.github.com>
2025-07-19 07:27:59 +08:00
Wanli Jiang
2d2b8bae32
feat: TRTLLM-5574 Add phi-4-multimodal pytorch-backend support ( #5644 )
...
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
2025-07-17 06:30:58 +08:00
Pengyun Lin
6992616c1f
[nvbug 5004744][fix] rewrite completion API to avoid repetitive tokens ( #5201 )
...
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-07-14 17:17:30 +08:00
Kaiyu Xie
7b09a415c1
fix: Make the bench serving script compatible with different usages ( #5905 )
...
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-07-10 19:36:26 +08:00
Xianjie Qiao
5ab1cf5ae6
Remove unnecessary benchmarking results ( #5852 )
...
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
2025-07-09 11:19:06 +08:00
Iman Tabrizian
c508b994b6
Fix lost requests for disaggregated serving ( #5815 )
...
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
2025-07-09 08:42:45 +09:00
Yegor
b01d1c28f7
[feat] Detokenize option in /v1/completions request ( #5382 )
...
Signed-off-by: Yegor <75512761+Wokzy@users.noreply.github.com>
Signed-off-by: Yegor Yershov <yegor6741@gmail.com>
2025-07-08 19:36:04 +08:00
Zheng Duan
de10774c2e
chore: log stack trace on error in openai server ( #5749 )
...
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
2025-07-07 14:54:36 +08:00
Xianjie Qiao
b1976c2add
Add wide-ep benchmarking scripts ( #5760 )
...
Signed-off-by: Xianjie <5410381+qiaoxj07@users.noreply.github.com>
Signed-off-by: Xianjie Qiao <5410381+qiaoxj07@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-07-05 19:29:39 +08:00
Talor Abramovich
70e34a3291
[TRTLLM-5831][feat] Add LoRA support for pytorch backend in trtllm-serve ( #5376 )
...
Signed-off-by: Talor Abramovich <talora@nvidia.com>
2025-06-29 12:46:30 +00:00
Kaiyu Xie
2eb6502b1d
feat: Add support for TRTLLM CustomDataset ( #5511 )
...
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-06-26 18:27:37 +08:00
Kaiyu Xie
c5ae3272b9
feat: Make benchmark_serving part of the library ( #5428 )
...
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-06-25 23:13:56 +08:00
Shunkangz
d5354897c0
feat: Dynamically remove servers in PD ( #5270 )
...
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-06-25 09:50:04 +08:00
Yan Chunwei
9bd42ecf9b
[TRTLLM-5208][BREAKING CHANGE] chore: make pytorch LLM the default ( #5312 )
...
Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-06-20 03:01:10 +08:00
Yi Zhang
e44f7687af
feat: Add no_kv_cache_reuse option and streaming support for trtllm serve bench ( #4971 )
...
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
2025-06-18 13:37:31 +08:00
Zheng Duan
4d0a5ad384
chore: gracefully exit disagg process in tests; better startup and logging ( #5109 )
...
Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>
2025-06-13 14:03:55 +08:00
Yibin Li
b79eb34bfe
[fix]: Fall back to HMAC to Avoid IPC Serialization Churn ( #5074 )
...
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
2025-06-13 11:37:50 +08:00
pcastonguay
3a04c9fa7b
chore: Include prompt_token_ids only for context-only disagg requests ( #5055 )
...
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-06-12 15:00:08 -04:00
Yechan Kim
8b4104d34a
feat: add HyperCLOVAX-SEED-Vision support in refactored way ( #4799 )
...
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
2025-06-09 11:04:04 +08:00
Shunkangz
c835f06371
Refactor the first token response in PD ( #4692 )
...
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-06-04 09:11:23 +08:00
rakib-hasan
d0eb47d33a
[TRTLLM-5053] Refactoring and Unifying the Multimodal input preparation ( #4506 )
...
* refactoring the multimodal input prep
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
* adding out-of-tree override option
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
* adding exceptional case for llava-next
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
* fixing typo
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
* addressing review comments, adding placement option, handling tokenizer variations
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
* addressing pytest-asyncio behavior change
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
---------
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
2025-06-03 12:02:07 -07:00
Shunkangz
ae9a6cf24f
feat: Add integration of etcd ( #3738 )
...
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Signed-off-by: BatshevaBlack <132911331+BatshevaBlack@users.noreply.github.com>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Co-authored-by: Batsheva Black <bblack@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: BatshevaBlack <132911331+BatshevaBlack@users.noreply.github.com>
2025-06-03 20:01:44 +08:00
Pengyun Lin
bac22ff7b5
[feat] support sharegpt downloading in benchmark_serving ( #4578 )
...
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-05-30 17:27:53 +08:00
Pengyun Lin
971d16a2ee
[TRTLLM-1658][feat] Enable multiple response in trtllm-serve for TRT backend ( #4623 )
...
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-05-28 11:36:44 +08:00
Shunkangz
fd27f89df6
fix: Remove duplicate tokenization in generation server ( #4492 )
...
* Add nvtx
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
* Add draft change
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
* Refactor and add support of chat
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
---------
Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
2025-05-26 16:43:07 +08:00
coldwaterq
1cf0e672e7
fix: [nvbugs/5066257] serialization improvments ( #3869 )
...
* added a restricted pcikler and depickler in a sepparate serialization function.
Signed-off-by: coldwaterq@users.noreply.github.com <coldwaterq@users.noreply.github.com>
* updated IPC to remove approved classes, removed the serialization function because it didn't work for all objects that made debugging harder, added tests.
Signed-off-by: coldwaterq@users.noreply.github.com <coldwaterq@users.noreply.github.com>
* removed LLM arg and moved class registration to a serialization module function. Also added missing classes to approved list.
Signed-off-by: coldwaterq <coldwaterq@users.noreply.github.com>
* cleaned up a couple files to reduce conflicts with main.
Signed-off-by: coldwaterq <coldwaterq@users.noreply.github.com>
* fix unit tests
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
* reorder BASE_ZMQ_CLASSES list alphabetically
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
* fix tests and move LogitsProcessor registration to base class
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
* revert changes to import log of tensorrt_llm._torch.models
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
* added comments to explain why BASE_ZMQ_CLASSES has to be passed into spawned child processes
Signed-off-by: coldwaterq <coldwaterq@users.noreply.github.com>
* fix tests and move LogitsProcessor registration to base class
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
* additional comments for multiprocess approved list sync
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
* add dataclass from tests
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
---------
Signed-off-by: coldwaterq@users.noreply.github.com <coldwaterq@users.noreply.github.com>
Signed-off-by: coldwaterq <coldwaterq@users.noreply.github.com>
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
Co-authored-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
2025-05-23 13:06:29 +08:00
pcastonguay
d7d455e7ea
[feat][TRTLLM-5018] Dis serving python runtime trt backend ( #4243 )
...
* feat: Enabling dis serving with TRT backend with Python runtime
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
* Fixing formatting
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
* Fixing disagg mtp test
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
---------
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-05-22 22:01:06 -04:00
Kaiyu Xie
2898d268f9
feat: add health_generate route to openai serving (Cherry-pick https://github.com/NVIDIA/TensorRT-LLM/pull/3856 ) ( #4349 )
...
Cherry-pick https://github.com/NVIDIA/TensorRT-LLM/pull/3856
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Co-authored-by: Dhruv Singal <dhruvsingalabc@gmail.com>
2025-05-22 11:46:06 +08:00
Zheng Duan
77a0189554
feat: conditional disaggregation in disagg server ( #3974 )
2025-05-21 09:57:46 +08:00
Pengyun Lin
039f7e3118
[ https://nvbugspro.nvidia.com/bug/5243740 ][fix] deduce default max_tokens for trtllm-serve ( #4265 )
...
* Deduce default max_tokens for trtllm-serve
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
* Improve executor_config.max_seq_len assignment in TRT workflow
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
* Enhance error message
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
* Add deduced max_tokens test
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
---------
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
2025-05-19 00:34:40 +08:00
rakib-hasan
49f993d862
Removing the outdated argument ( #4408 )
...
removing the outdated argument
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
2025-05-18 15:52:15 +08:00
Tracin
7b19acfab1
fix: Fix chat template kwargs bug. ( #4387 )
...
* Fix chat template kwargs bug.
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
* Fix chat template kwargs bug.
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
* Fix chat template kwargs bug.
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
---------
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
2025-05-16 23:07:46 +08:00