TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

Author	SHA1	Message	Date
Zheng Duan	ce7f5fae5a	sort llm request state (#4607 ) Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>	2025-05-26 13:47:01 +08:00
QI JUN	4a81991b65	Chore: refine shutdown signal of PyExecutor (#4614 ) * refine shutdown signal of PyExecutor Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com> * clean Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com> --------- Signed-off-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-05-26 11:14:54 +08:00
Yiqing Yan	2fee408536	Waive L0 tests (#4645 ) * Waive L0 tests Signed-off-by: Yiqing Yan <yiqingy@nvidia.com> * Apply suggestions from code review Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> --------- Signed-off-by: Yiqing Yan <yiqingy@nvidia.com> Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>	2025-05-26 11:05:01 +08:00
Yuxian Qiu	8f055f5d14	feat: Skip sampler for intermediate pp stages. (#4514 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-05-26 10:08:51 +08:00
Perkz Zheng	4d711be8f4	Feat: add sliding-window-attention generation-phase kernels on Blackwell (#4564 ) * move cubins to LFS Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> * update cubins Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> * add sliding-window-attention generation-phase kernels on Blackwell Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> * address comments Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> --------- Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>	2025-05-26 09:06:33 +08:00
Yibin Li	bb2f545729	fix pipeline tests due to rebase (#4640 ) Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>	2025-05-26 08:38:08 +08:00
shaharmor98	2b8f6d2871	Fix snake case format (#4559 ) fix snake case format Signed-off-by: Shahar Mor <17088876+shaharmor98@users.noreply.github.com>	2025-05-25 17:57:17 +08:00
juney-nvidia	9472c86661	Update main README.md with the LLaMA4 perf news (#4636 ) Signed-off-by: Jun Yang <143764042+juney-nvidia@users.noreply.github.com>	2025-05-25 16:57:48 +08:00
Anton	5dff0bff8f	[#4633 ][doc] Fixed typo in scaffolding README.md (#4634 ) * Fixed typos in the scaffolding README.MD Signed-off-by: Anton <44649959+amemov@users.noreply.github.com> * Fixed links for 'More examples' and 'Contribute Guide' Signed-off-by: Anton <44649959+amemov@users.noreply.github.com> --------- Signed-off-by: Anton <44649959+amemov@users.noreply.github.com>	2025-05-25 09:04:12 +08:00
Yiqing Yan	7a067a8edf	[TRTLLM-5327] - Add scan stage (#4602 ) * [TRTLLM-5327] - Add scan stage Signed-off-by: Yiqing Yan * Add post-merge condition Signed-off-by: Yiqing Yan * fix Signed-off-by: Yiqing Yan --------- Signed-off-by: Yiqing Yan	2025-05-25 08:55:08 +08:00
hlu1	4a236d107d	[Fix][Deepseek] Fix bugs in TestDeepSeekR1 (#4413 ) [Deepseek] Fix bugs in TestDeepSeekR1 Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com> Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com@users.noreply.github.com>	2025-05-24 09:52:57 +08:00
Chuang Zhu	b60846b47d	fix datatype check (#4606 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-05-24 08:36:17 +08:00
Yanchao Lu	20c15fc04f	Fix invalid testcase name (#4626 ) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>	2025-05-24 00:40:00 +08:00
Yao Yao	ef763b0ddc	fix: rename some terms (#4534 ) Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>	2025-05-23 23:23:49 +08:00
Robin Kobus	7b2818a47b	refactor: CreateNewDecoderRequests (#4452 ) * refactor: CreateNewDecoderRequests Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Consolidate request generation in CreateNewDecoderRequests - Removed the GenerateRequestOptions class and integrated its functionality into CreateNewDecoderRequests. - Updated the constructor of CreateNewDecoderRequests to accept parameters for speculative decoding and normalization options. - Modified the operator() method to handle request generation directly, improving code organization and reducing redundancy. - Cleaned up associated includes and references throughout the codebase. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Simplify request handling in CreateNewDecoderRequests - Removed the generateRequestOptions method and integrated its logic directly into the operator() method. - Updated the request generation process to improve clarity and reduce redundancy. - Adjusted the return type to streamline the handling of batch slots, decoder requests, and sampling configurations. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Enhance createDecoderRequests method in CreateNewDecoderRequests - Updated the createDecoderRequests method to include additional parameters for decoder state and CUDA streams, improving flexibility in request handling. - Removed redundant request generation logic from the operator() method, streamlining the process. - Adjusted the newRequest method to utilize the updated decoder request structure, enhancing clarity and maintainability. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Use MedusaBuffers instead of RuntimeBuffers in CreateNewDecoderRequests - Updated references from RuntimeBuffers to MedusaBuffers across the CreateNewDecoderRequests class and its methods, enhancing clarity in buffer management. 
- Adjusted method signatures and internal logic to accommodate the new MedusaBuffers type, ensuring compatibility with existing functionality. - Cleaned up unnecessary includes and improved code organization for better maintainability. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Update CreateNewDecoderRequests to use DecoderState and CudaStream parameters - Modified method signatures in CreateNewDecoderRequests to replace GptDecoderBatched with runtime::decoder::DecoderState and added a separate CudaStream for the decoder. - Adjusted the implementation of the operator() method to accommodate the new parameters, enhancing flexibility in request handling. - Updated associated bindings in the pybind11 interface to reflect the changes in method signatures, ensuring consistency across the codebase. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Update TRTLLMSampler to use refactored create_new_decoder_requests - Updated the sampler.py to reflect changes in the request handling logic, replacing generate_request_options with create_new_decoder_requests for improved clarity and consistency. - Updated bindings and method signatures for decoder stream handling. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Update gptDecoderBatchedTest to use CreateNewDecoderRequests::newRequest Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-23 22:54:37 +08:00
dominicshanshan	ca3eaf4070	[nvbug/5028235][fix]pytest bindings tokens logtis comparison. (#4424 ) * fix bug 5028235. Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com> * fix bug 5028235 and update comments. Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com> * Update tests/unittest/bindings/test_executor_bindings.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: dominicshanshan <30051912+dominicshanshan@users.noreply.github.com> * Remove redundant code. Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com> * Update based on review comments. Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com> --------- Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com> Signed-off-by: dominicshanshan <30051912+dominicshanshan@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-05-23 20:41:00 +08:00
juney-nvidia	7b2bb67491	Update CODEOWNERS for PyTorch backend - runtime component (#4620 ) Signed-off-by: Jun Yang <143764042+juney-nvidia@users.noreply.github.com>	2025-05-23 20:40:44 +08:00
Robin Kobus	15a59e57f6	[nvbugs/5301492] ci: waive test_workers_kv_cache_aware_router (#4617 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-23 20:14:28 +08:00
zhhuang-nv	8452775db8	[TRTLLM-5070][feat] Support FP8 KV Cache Reuse for MLA (#4535 ) * optimize kv cache reuse workflow for MLA write kv cache first and only call up-projection GEMM once relax contiguous requirements of k/v for setting paged kv cache return two contiguous tensors when loading MLA KV Cache Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * support fp8 kv cache for MLA kv cache reuse Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> * resolve comments Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com> --------- Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>	2025-05-23 19:47:50 +08:00
Anthony Chang	bbea2647b1	Qwen3 supports TRTLLM FP4 MoE backend (#4530 ) * MoE TRTLLM backend for Qwen3 Signed-off-by: Anthony Chang <anchengc@nvidia.com> * add extra moe_backend to test Signed-off-by: Anthony Chang <anchengc@nvidia.com> * address comments Signed-off-by: Anthony Chang <anchengc@nvidia.com> * conditionally compile kernels on newer archs Signed-off-by: Anthony Chang <anchengc@nvidia.com> * missing positional arg Signed-off-by: Anthony Chang <anchengc@nvidia.com> * Update the routing kernels Signed-off-by: Christina Zhang <christinaz@nvidia.com> * Revise usage of TLLM_LOG_ERROR Signed-off-by: Christina Zhang <christinaz@nvidia.com> * Add unit test for Qwen3 moe (trtllm_gen backend) Signed-off-by: Christina Zhang <christinaz@nvidia.com> * improve weight processing speed of moe_backend=TRTLLM; roughly 2x Signed-off-by: Anthony Chang <anchengc@nvidia.com> * tidy and minor fix Signed-off-by: Anthony Chang <anchengc@nvidia.com> * temporarily disable accuracy test that has known issue Signed-off-by: Anthony Chang <anchengc@nvidia.com> --------- Signed-off-by: Anthony Chang <anchengc@nvidia.com> Signed-off-by: Christina Zhang <christinaz@nvidia.com> Co-authored-by: Christina Zhang <christinaz@nvidia.com>	2025-05-23 18:31:08 +08:00
juney-nvidia	419151f358	Update the GH main page to expose tech blogs (#4610 ) * Update the main page to expose the tech blogs Signed-off-by: Jun Yang <143764042+juney-nvidia@users.noreply.github.com> * fix formating issue Signed-off-by: Jun Yang <143764042+juney-nvidia@users.noreply.github.com> * fixing format issue Signed-off-by: Jun Yang <143764042+juney-nvidia@users.noreply.github.com> --------- Signed-off-by: Jun Yang <143764042+juney-nvidia@users.noreply.github.com>	2025-05-23 17:03:56 +08:00
Yiqing Yan	3ca05330f9	Waive L0 test (#4609 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2025-05-23 15:54:11 +08:00
Bo Li	9ae705af1b	perf: Add fused q_norm/k_norm/RoPE for Qwen3. (#4482 ) * Add Julien's origina kernel. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Get rid of UpdateKVCache functionality. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Add kernels. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Add torch OP. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Update cmake. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Torch OP must use double as argument dtype. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Add unittest. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Add unittest. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Fix misaligned access when head_dim=64. In this case, numElemsPerThread=2, numVecPerThread=0. But the store code incorrectly perform vectorized store, some threads (e.g., lane1) issue store to address that is not aligned to 64 bit. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Remove unroll (compiler can do that). Cleanup code. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Add switch for interleave. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Refactor vectorized load/store. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Implement is_neox. Result not correct yet. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Fix is_neox=True. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> * Add q_weight and k_weight. Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com> --------- Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>	2025-05-23 15:31:04 +08:00
bhsueh_NV	6527c055cf	chore: fix bug of llama lora test (#4566 ) * fix bug of llama lora test Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * Update test_llm.py fix bug detected by pre-commit Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> --------- Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-05-23 14:06:40 +08:00
Fanrong Li	862bde99b6	draft[doc]: add mtp tech blog (#4580 ) * add mtp tech blog. Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> * update figure size. Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> * update the figure caption style and add some code/pr links. Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> * fix figure captions. Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> * fix figure size and perf data. Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> * fix. Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> * fix. Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> * fix. Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> * fix. Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> * fix based on comments Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com> * fix figure links. Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> --------- Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com> Co-authored-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>	2025-05-23 13:54:21 +08:00
bhsueh_NV	d69c662215	[Fix][Qwen3] fix bug of qwen3 fp4 workflow with EP (#4575 ) * fix bug of qwen3 fp4 workflow with EP Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> * fix bug of qwen3_moe with ep Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com> --------- Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-05-23 13:34:05 +08:00
coldwaterq	1cf0e672e7	fix: [nvbugs/5066257] serialization improvments (#3869 ) * added a restricted pcikler and depickler in a sepparate serialization function. Signed-off-by: coldwaterq@users.noreply.github.com <coldwaterq@users.noreply.github.com> * updated IPC to remove approved classes, removed the serialization function because it didn't work for all objects that made debugging harder, added tests. Signed-off-by: coldwaterq@users.noreply.github.com <coldwaterq@users.noreply.github.com> * removed LLM arg and moved class registration to a serialization module function. Also added missing classes to approved list. Signed-off-by: coldwaterq <coldwaterq@users.noreply.github.com> * cleaned up a couple files to reduce conflicts with main. Signed-off-by: coldwaterq <coldwaterq@users.noreply.github.com> * fix unit tests Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com> * reorder BASE_ZMQ_CLASSES list alphabetically Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com> * fix tests and move LogitsProcessor registration to base class Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com> * revert changes to import log of tensorrt_llm._torch.models Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com> * added comments to explain why BASE_ZMQ_CLASSES has to be passed into spawned child processes Signed-off-by: coldwaterq <coldwaterq@users.noreply.github.com> * fix tests and move LogitsProcessor registration to base class Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com> * additional comments for multiprocess approved list sync Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com> * add dataclass from tests Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com> --------- Signed-off-by: coldwaterq@users.noreply.github.com <coldwaterq@users.noreply.github.com> Signed-off-by: coldwaterq <coldwaterq@users.noreply.github.com> 
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com> Co-authored-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>	2025-05-23 13:06:29 +08:00
djns99	87f734b563	[https://nvbugs/5297775 ] fix: Correct memory guard for large MOE tests to account for TP space (#4553 ) fix: Correct memory guard for large MOE tests to account for TP space Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>	2025-05-23 14:57:49 +12:00
Yuxian Qiu	38241b2346	fix: Fix moe_ep_groups/moe_cluster_groups in Mapping. (#4555 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-05-23 10:41:49 +08:00
CarstyYou	ef280e687e	[feat] support fp8 blockscale gemm on sm89 (#4481 ) * [feat] integrate ada blockwise gemm Signed-off-by: CarstyYou <xiy@nvidia.com> * [fix] align scale M Signed-off-by: CarstyYou <xiy@nvidia.com> * [feat] swizzle mma output Signed-off-by: CarstyYou <xiy@nvidia.com> * [test] add ut for sm89 Signed-off-by: CarstyYou <xiy@nvidia.com> * [delete] remove useless comments Signed-off-by: CarstyYou <xiy@nvidia.com> * [chore] codestyle Signed-off-by: CarstyYou <xiy@nvidia.com> * [fix] fix review comments Signed-off-by: CarstyYou <xiy@nvidia.com> * [chore] fix license Signed-off-by: CarstyYou <xiy@nvidia.com> * [chore] fix license Signed-off-by: CarstyYou <xiy@nvidia.com> --------- Signed-off-by: CarstyYou <xiy@nvidia.com> Co-authored-by: bhsueh_NV <11360707+byshiue@users.noreply.github.com>	2025-05-23 10:39:10 +08:00
Enwei Zhu	d7443b6068	[https://nvbugspro.nvidia.com/bug/5181262 ] [test] Unwaive Mistral Nemo test (#4515 ) unwaive Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-05-23 10:14:00 +08:00
nv-guomingz	e3a534d0ee	chore: guardword clean for header file. (#4540 ) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>	2025-05-23 10:08:14 +08:00
pcastonguay	d7d455e7ea	[feat][TRTLLM-5018] Dis serving python runtime trt backend (#4243 ) * feat: Enabling dis serving with TRT backend with Python runtime Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> * Fixing formatting Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> * Fixing disagg mtp test Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> --------- Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>	2025-05-22 22:01:06 -04:00
Kunyao Wu	60a6c20174	Scaffoldingllm supports MCP (#4410 ) * support mcp # Conflicts: # tensorrt_llm/scaffolding/worker.py Signed-off-by: wu1du2 <wu1du2@gmail.com> * move all into contrib/mcp # Conflicts: # examples/scaffolding/contrib/mcp/mcptest.py # tensorrt_llm/scaffolding/__init__.py # tensorrt_llm/scaffolding/contrib/__init__.py # tensorrt_llm/scaffolding/contrib/mcp/__init__.py # tensorrt_llm/scaffolding/contrib/mcp/mcp_controller.py # tensorrt_llm/scaffolding/task.py # tensorrt_llm/scaffolding/worker.py Signed-off-by: wu1du2 <wu1du2@gmail.com> * support sandbox, websearch # Conflicts: # examples/scaffolding/contrib/mcp/mcptest.py # examples/scaffolding/contrib/mcp/weather/weather.py # tensorrt_llm/scaffolding/contrib/mcp/mcp_controller.py # tensorrt_llm/scaffolding/contrib/mcp/mcp_utils.py # tensorrt_llm/scaffolding/contrib/mcp/mcp_worker.py # tensorrt_llm/scaffolding/worker.py Signed-off-by: wu1du2 <wu1du2@gmail.com> * remove pics Signed-off-by: wu1du2 <wu1du2@gmail.com> * pre-commit fix # Conflicts: # tensorrt_llm/scaffolding/contrib/mcp/__init__.py # tensorrt_llm/scaffolding/contrib/mcp/mcp_utils.py # tensorrt_llm/scaffolding/contrib/mcp/mcp_worker.py Signed-off-by: wu1du2 <wu1du2@gmail.com> * fix spell Signed-off-by: wu1du2 <wu1du2@gmail.com> * rebase Signed-off-by: wu1du2 <wu1du2@gmail.com> --------- Signed-off-by: wu1du2 <wu1du2@gmail.com>	2025-05-23 01:54:49 +00:00
dongxuy04	338744fba6	fix[nvbug-5295425]: [TRTLLM-5385] fix race condition in MoeLoadBalancer (#4573 ) fix moe possible race cond and add bypass worker thread for no updates Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>	2025-05-23 09:24:23 +08:00
QI JUN	1e55d616da	Chore: clean up _gather_dp_requests_num method of PyExecutor (#4571 ) clean up _gather_dp_requests_num method of PyExecutor Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-05-23 08:37:39 +08:00
nv-guomingz	3549b68c1c	chroe:clean useless flag (#4567 ) Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>	2025-05-23 07:05:15 +08:00
Mike Iovine	9c0de251db	[feat] Integrate Hopper chunked attention kernels (#4330 ) * Integrate chunked attention kernels Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> * Fix cache key Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> * Fix lint Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> --------- Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-05-22 17:10:57 -04:00
Mike Iovine	14fc48ada7	[nvbug/5285881][fix] Fix chunked prefill + overlap scheduler (#4402 ) [fix] Fix chunked prefill + overlap scheduler Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-05-23 04:38:22 +08:00
Venky	c713eb5799	test(perf): Add `Llama-3_1-Nemotron-Ultra-253B-v1` perf tests (cpp) (#4446 ) ultra Signed-off-by: Venky Ganesh <23023424+venkywonka@users.noreply.github.com>	2025-05-22 13:07:33 -07:00
Robin Kobus	e5c90883a9	fix: Move cv2 import to load_video function (#4541 ) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-22 17:56:07 +02:00
Chuang Zhu	558eaecf16	fix sequence data race (#4565 ) stash for debug broken promise Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-05-22 23:13:48 +08:00
QI JUN	1e5d526db4	Chore: clean up _merge_dummy_request method of PyExecutor (#4438 ) * clean up _merge_dummy_request method of PyExecutor Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * clean Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * update comment Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> --------- Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-05-22 18:19:07 +08:00
xinhe-nv	22c01d5b21	test: [CI] Add failed cases into waives.txt (#4549 ) * update waive list Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com> * fix test issues Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com> --------- Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>	2025-05-22 17:18:53 +08:00
ruodil	1a45890dae	test: waive hanging cases for perf test (#4562 ) waive hanging cases Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com>	2025-05-22 15:50:05 +08:00
Chuang Zhu	3410508020	cache_transceiver_config (#4556 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-05-22 13:59:51 +08:00
Iman Tabrizian	e741d2b8d0	Add tritonrelease container (#4455 ) * Add tritonrelease container Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com> * Review comments Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com> * Update docker/Makefile Co-authored-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com> Signed-off-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com> --------- Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com> Signed-off-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com> Co-authored-by: Martin Marciniszyn Mehringer <11665257+MartinMarciniszyn@users.noreply.github.com>	2025-05-21 23:47:50 -04:00
Kaiyu Xie	2898d268f9	feat: add health_generate route to openai serving (Cherry-pick https://github.com/NVIDIA/TensorRT-LLM/pull/3856 ) (#4349 ) Cherry-pick https://github.com/NVIDIA/TensorRT-LLM/pull/3856 Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com> Co-authored-by: Dhruv Singal <dhruvsingalabc@gmail.com>	2025-05-22 11:46:06 +08:00
HuiGao-NV	bc9f1dbede	fix[nvbug-5228840]: Remove test cases of feature not supported anymore (#3972 ) * Remove waived cases * Remove test cases of not supported feature Signed-off-by: Hui Gao <huig@nvidia.com>	2025-05-22 11:18:58 +08:00
Aurelien Chartier	f491244c84	feat: add dataset support for benchmark_core_model with LLMAPI (#4457 ) * feat: add dataset support for benchmark_core_model with LLMAPI Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>	2025-05-21 19:18:43 -07:00

1056 Commits