TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

Author	SHA1	Message	Date
QI JUN	26ebd95302	chore: update multi gpu trigger file list (#3665 ) * update multi gpu trigger file list Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * update multi gpu trigger file list Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> --------- Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-04-17 11:19:01 -07:00
QI JUN	91660939fd	tests: waive test_llm_multi_node (#3664 ) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-04-18 01:59:16 +08:00
Frank	5a6cb2b985	fix: Correct reporting of text dtype for Llama 4 (#3494 ) Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>	2025-04-18 00:07:49 +08:00
Yukun He	83b36ebecd	Fix fused_moe fallback issue. (#3652 ) min_latency_mode is only set to False during warmup phase. Thus when it becomes true during inference, all tactics fall back to the default one and thus cause perf regression. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-04-17 23:17:04 +08:00
yuxianq	b9b1c1368c	feat: Support unfused rope in MLA. (#3610 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-04-17 16:50:49 +08:00
Ivy Zhang	ad19ca3cbf	remove benchmark test list (#3644 ) Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com> Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>	2025-04-17 16:23:41 +08:00
Netanel Haber	3c52ac098f	feat: allocate minimal blocks per window size (#3028 ) * implement variable window attention by breaking the block manager into window block managers per window size Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * revert isCyclic to be true if the min attention window is reached, not per window size Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * add explanatory comment to mCyclicThreshold Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * load correct gemma config Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * don't shadow inputLength in addSequence - it should remain the function scope input length between window size loop iterations Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix KVCacheManagerVariableWindowAttentionWithReuseTest for multiple window block managers Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * if TYPE_CHECKING Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * set temp_attention_window_inputs to None explicitly Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * set temp_attention_window_inputs to None explicitly Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * pass dtype as well Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * test_gemma variable sliding window attention Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * allot a fraction of primary/secondaryBlocks to different window size heaps, depending on the window size's total contribution to the kvcache size (i.e., including all layers) Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * remove \|\| mEnableBlockReuse which erroneously triggers beamsearch code for cyclic variable attention 
window code Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * turn off request delaying for MaxUtil Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * make comments better Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * windowSizesTotalSum using std::accumulate Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix error handling of forwardAsync - forwardAsync catch-all catch cleanup code that runs terminateRequest can also fail and must be caught Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix comments Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * remove assert that kills disagg tests, since it isn't necessary Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix corrupted expression: 'isNewTask && (peftCacheManager ?' -> '(isNewTask && peftCacheManager) ?' which caused boolean algebra. 
Main is correct Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * add Gemma3 to SUPPORTED_HF_ARCHITECTURES Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * support Gemma3 Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * finally fix test_gemma - always spread at least {} into generate_summary_cmd, never None Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * finally fix test_gemma - always spread at least {} into generate_summary_cmd, never None Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix kvfactor field for deepseek Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix comment Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix gemma-3 entries in testlist to include vswa Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * only quantize gemma2 VSWA Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> remove misleading comment Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> fix test_gemma Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix test_gemma Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix test_gemma Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * in sendRequestInfo, fromOldAllocatedBlockIds->fromOldAllocatedBlockIds, like in main Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix: disable KV cache reuse if using attention sink (#3021) * fix: disable KV cache reuse if using attention sink Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fix: disable KV cache reuse if sink bubble Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * add comment Signed-off-by: Robin Kobus 
<19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-04-17 16:04:57 +08:00
Yiqing Yan	1c6f3debbb	Waive L0 tests (#3651 ) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>	2025-04-17 15:13:56 +08:00
xinhe-nv	b82a4e8d01	test: [CI] Add failed cases into waives.txt (#3627 ) * update waive list Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com> * fix waives Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com> --------- Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>	2025-04-17 14:45:41 +08:00
Tao Li @ NVIDIA	e4476bf521	update fp8 doc (#3647 ) (#3650 ) Signed-off-by: taoli <litaotju@users.noreply.github.com> Co-authored-by: taoli <litaotju@users.noreply.github.com>	2025-04-17 13:37:08 +08:00
danielafrimi	0f084d9566	added loraOp into lora layer + test for mlp and comparison to lora plugin (#3455 ) Loraop integration into torch modules Signed-off-by: Ubuntu <dafrimi@nvidia.com>	2025-04-17 12:48:27 +08:00
yuxianq	239fe0ff26	chore: Use ellipsis as default value to detect whether residual argument is provided (#3626 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-04-17 12:31:58 +08:00
Luis Vega	a06bff5052	Fix rotary_emb param in NemotronH attention (#3646 ) Signed-off-by: Luis Vega <vegaluisjose@users.noreply.github.com>	2025-04-16 21:03:07 -07:00
Void	950cadf2bd	add support for smaller hidden_dim (#3609 ) Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com> Co-authored-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-04-17 12:00:32 +08:00
Ivy Zhang	b2fb0fe843	test: add quickstart test for nemotron-ultra (#3596 ) * add quickstart test for nemotron-ultra Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com> * fix test name Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com> --------- Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com> Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>	2025-04-17 11:16:41 +08:00
ruodil	5e2ebebe76	tests: change qa perf test to trtllm-bench (#3189 ) Signed-off-by: Ruodi <200874449+ruodil@users.noreply.github.com> Co-authored-by: Larry <197874197+LarryXFly@users.noreply.github.com>	2025-04-17 09:53:32 +08:00
Chuang Zhu	f4ddc304f2	disable ib for ucx test (#3613 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-04-17 06:43:57 +08:00
QI JUN	57cafe7f9b	waive test_fp8_scaled_mm (#3637 ) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-04-16 15:07:30 -07:00
Luis Vega	0bda1f9780	feat: Nemotron-H model support (#3430 ) * added files for nemotron-h Signed-off-by: Luis Vega <lvega@nvidia.com> * use try/except to import RMSNorm Signed-off-by: Luis Vega <lvega@nvidia.com> --------- Signed-off-by: Luis Vega <lvega@nvidia.com> Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-04-16 14:05:56 -07:00
Mike Iovine	41a6c98544	Support CUDA graphs for EAGLE3 (#3176 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-04-17 04:53:50 +08:00
hlu1	b6bae33453	Clean up linear.py, mlp.py, gated_mlp.py (#3553 ) Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>	2025-04-16 12:21:44 -07:00
Yibin Li	351808efeb	fix: Use hmac authentication for pickle encryption (#3384 ) * hmac initial implementation to encrypt worker and proxy queue Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com> * set different hmac key for each pair of server/client queue Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com> * fix comments Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com> * fix style Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com> --------- Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>	2025-04-17 00:40:13 +08:00
QI JUN	fac1a905e9	waive test_llm_multi_node_with_postproc (#3628 ) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-04-16 05:49:39 -07:00
Olya Kozlova	b3e6723dbc	feat: Adding FP8 BMM from Codegen (#3541 ) * Adding FP8 BMM from Codegen Signed-off-by: Olya Kozlova <okozlova@s4124-0110.nvidia.com> * Fixed licenses Signed-off-by: Olya Kozlova <okozlova@s4124-0062.nvidia.com> --------- Signed-off-by: Olya Kozlova <okozlova@s4124-0110.nvidia.com> Signed-off-by: Olya Kozlova <okozlova@s4124-0062.nvidia.com> Co-authored-by: Olya Kozlova <okozlova@6u1g-0018.nvidia.com> Co-authored-by: Olya Kozlova <okozlova@s4124-0062.nvidia.com>	2025-04-16 10:37:15 +02:00
Yiteng Niu	ca88674210	update user list (#3614 ) Signed-off-by: Yiteng Niu <6831097+niukuo@users.noreply.github.com>	2025-04-16 15:13:29 +08:00
Gabriel Wu	2e0cd7922e	fix: add SM90 guard for FP8 Blockscale GEMM (#3575 ) * fix: add SM90 guard for FP8 Blockscale GEMM Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> * fix: add SM90 guard for FP8 Blockscale GEMM Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> --------- Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com>	2025-04-16 14:44:37 +08:00
yuxianq	fd8ded2b2b	feat: Support cos_sin_cache in all cases. (#3517 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-04-16 13:48:44 +08:00
QI JUN	ab29348db2	waive test_llm_phi_quantization_1gpu (#3603 ) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-04-16 13:33:46 +08:00
Jinyang Yuan	efabf6b443	chore: Add comments to modifications that fix TP size of DeepSeek-V3/R1 when using more than 16 GPUs (#3572 ) Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>	2025-04-15 21:51:42 -07:00
Zhanrui Sun	9d88ee3e45	chore: bump version to 0.20.0rc0 (#3561 ) Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-04-16 11:41:21 +08:00
narutolhy	ccd73c71a5	feat: Add stream generation task scaffolding examples (#3527 ) * stream generation task/controller Signed-off-by: narutolhy <582909902@qq.com> * edit README Signed-off-by: narutolhy <582909902@qq.com> * rename README Signed-off-by: narutolhy <582909902@qq.com> --------- Signed-off-by: narutolhy <582909902@qq.com>	2025-04-16 11:33:55 +08:00
Yan Chunwei	409c294c4e	fix trtllm-bench mgmn (#3563 ) Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-04-16 11:04:09 +08:00
Yan Chunwei	63f3fba679	waive test_llm_multi_node_pytorch (#3592 ) Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-04-16 10:49:07 +08:00
Enwei Zhu	44da0e8d60	fix: LLM API _hf_model_dir for non-cached case (#3562 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-04-16 10:39:34 +08:00
Daniel Cámpora	41ce5440fe	chore: Mass integration of release/0.18 (#3421 ) * [Infra][TRTLLM-4063] - Branch out for the TRT-LLM v0.18.0 release Signed-off-by: Zhanrui Sun <zhanruis@nvidia.com> (cherry picked from commit de90312020e51c22ba5e75b3502c7ee90c059265) * [Infra][TRTLLM-3652] - Update dependencies to TRT 10.9 / CUDA 12.8.1 / DLFW 25.03(Internal) Signed-off-by: Yiqing Yan <yiqingy@nvidia.com> (cherry picked from commit 58db1340ef7db22f1910f878d220a92be5b830d1) * [None][Doc] - Update docs for v0.18.0 Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> (cherry picked from commit d23e75bc95619ce3b116213d55319272888e0c88) * [Infra] - Fix or WAR issues in the package sanity check stages Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> (cherry picked from commit e874e2b127515c52ba10c8df1cc2631627f74ffe) * [https://nvbugs/5173454] [https://nvbugs/5173432] [https://nvbugs/5175863] fix chatglm tokenizer and tmp model path Signed-off-by: Yuki Huang <yukih@nvidia.com> (cherry picked from commit 731811d4e182d70a66193d646152cb71dfafe83a) * cherry-pick 'test: Updat cluster and multi node test lists and trtllm-bench' test to fix perf drop issue Signed-off-by: Ruodi Lu <ruodil@nvidia.com> (cherry picked from commit 5214616283fbc15ae98871a1d84c78d8e1f2e6e8) * Revert "Merge branch 'user/yukih/fix_5173454_5173432' into 'release/0.18'" Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> (cherry picked from commit 8d34831cb2b81ee2dfa8021b68e7158b33789a5f) * [Infra]Restrict setuptools version to avoid sasb pip install issue Signed-off-by: Emma Qiao <qqiao@nvidia.com> (cherry picked from commit 1e60ad29e0dafec0e295bedb5d89b716a02a707c) * [https://nvbugs/5173454] [https://nvbugs/5173432] [https://nvbugs/5175863] fix chatglm tokenizer and tmp model path Signed-off-by: Yuki Huang <yukih@nvidia.com> (cherry picked from commit 3ed8164e5bfea1d5aa2039b5408439fd6cf59dac) * WAR for bug 5173448 Signed-off-by: Thor Johnsen <tjohnsen@nvidia.com> (cherry picked from commit 
b6528b2ba15322b6c6a4c81a8b74c04d4973de4f) * [Infra][TRTLLM-3652] - Update dependencies to CUDA 12.8.1 / DLFW 25.03 Signed-off-by: Yiqing Yan <yiqingy@nvidia.com> (cherry picked from commit 6560983d132d9d257ee15849664eb055e94adaa9) * [Docs] - Doc changes for v0.18.0 Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> (cherry picked from commit 26769b61218a947c8f9d070f73b63d576fcc20c4) * [Doc] - Doc change for v0.18.0 Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> (cherry picked from commit 4b3b5ed6bfbc2300e3775fe75456083faad7b235) * [Infra] update version to 0.18.1 Signed-off-by: Zhanrui Sun <zhanruis@nvidia.com> (cherry picked from commit 59e8326c75639275837d34de8e140358737a3365) * Add back nemotron file. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Fix recurrentgemma reqs. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Adding WAR for bug 5173448. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Formatting. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Remove duplicated file. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Update examples/prompt_lookup/requirements.txt Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com> * Remove glm-4-9b from model dir in chatglm test. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Remove indent change. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Yanchao Lu <yanchaol@nvidia.com> Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Yanchao Lu <yanchaol@nvidia.com> Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com> * Revert changes on l0_test.groovy. 
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Update dev images Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> * Remove duplicated import. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Fix custom op Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com> * Fix flashinfer & vanilla backend Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com> * Skip problematic case. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Skip problematic test_moe_w4a8_1_14336_4096_8_bfloat16_True_False case. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> --------- Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com> Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com> Co-authored-by: Zhanrui Sun <zhanruis@nvidia.com> Co-authored-by: Yiqing Yan <yiqingy@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Yuki Huang <yukih@nvidia.com> Co-authored-by: Ruodi Lu <ruodil@nvidia.com> Co-authored-by: Emma Qiao <qqiao@nvidia.com> Co-authored-by: Thor Johnsen <tjohnsen@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Co-authored-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com> Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com>	2025-04-16 10:03:29 +08:00
xiweny	da47d5f27e	fix: nvbugs/5075538: fix cross attention mask when decoder input len > 1 (#3585 ) * fix: nvbugs/5075538: fix cross attention mask when decoder input len > 1 Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> * remove waiver Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> --------- Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>	2025-04-16 08:31:33 +08:00
Kaiyu Xie	f5f68ded26	Minor fixes for documents (#3577 ) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-04-16 07:47:18 +08:00
Robin Kobus	fffb403125	fix: disable KV cache reuse if using attention sink (#3021 ) * fix: disable KV cache reuse if using attention sink Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fix: disable KV cache reuse if sink bubble Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * add comment Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-04-16 03:07:32 +08:00
Pengyun Lin	1899e71364	doc: add genai-perf benchmark & slurm multi-node for trtllm-serve doc (#3407 ) Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>	2025-04-16 00:11:58 +08:00
Kaiyu Xie	e037d3e99b	chore: Unify Python NVTX call (#3450 ) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-04-15 23:25:36 +08:00
Kaiyu Xie	258ae9c58c	Revert "infra: move nvrtc_wrapper to conan (#3282 )" (#3573 ) This reverts commit `c0dd6cbce0`. Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-04-15 22:45:13 +08:00
HuiGao-NV	d35db254e2	test: Enable 4 multi-gpu test cases for deepseek (#3569 ) Signed-off-by: Hui Gao <huig@nvidia.com> Signed-off-by: Hui Gao <huig@nvidia.com>	2025-04-15 22:01:52 +08:00
Yan Chunwei	c27e130be0	unwaive test (#3559 ) Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>	2025-04-15 19:42:06 +08:00
jiahanc	1d3b98b920	perf: Optimize quantization kernels used in DeepSeek on Hopper (#3466 ) Signed-off-by: jiahanc <jiahanc@nvidia.com>	2025-04-15 17:49:57 +08:00
xinhe-nv	5cfa927132	update waive list (#3503 ) Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>	2025-04-15 16:53:53 +08:00
bhsueh_NV	3aa37e6b72	fix bug (#3570 ) Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>	2025-04-15 16:50:22 +08:00
Yuan Tong	d4c0423cdb	refactor: collect executor and decoder states into dataclass (#3234 ) * fix: Proper error bubbling for PyExecutor Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-04-15 16:31:45 +08:00
Robin Kobus	b7a38feb14	chore: Clean up cpp runtime (#3537 ) * add space in test output Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * perf: reduce executor lock scope Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Move TokenRangeRetentionConfig implementation to cpp file Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fix: Improve finished steps handling for external draft tokens - Fixed a bug where the whole finished steps tensor was being zeroes instead of the slices. - Replaced the creation of a temporary tensor for finished steps with a direct slice from the input tensor, improving efficiency and readability. - Updated the tensor management logic to streamline the process of setting zero values for finished steps during batch processing. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Clean up includes Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-04-15 16:06:14 +08:00
shaharmor98	ede7058544	Feat/ Integrate peftCacheManager in PyExecutor creation (#3372 ) * integrate peftCacheManager in PyExecutor creation Signed-off-by: Shahar Mor <smor@nvidia.com>	2025-04-15 15:14:43 +08:00
hlu1	5881a65374	Fix test_fp4_quantize_gemm_torch (#3551 ) Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>	2025-04-14 23:58:31 -07:00

504 Commits