Commit Graph

241 Commits

Author SHA1 Message Date
HuiGao-NV
d3608d6818
Remove dummy forward path (#3669)
Remove dummy forward path
2025-04-18 16:17:50 +08:00
Dom Brown
dbd9a83b0d
feat: Integrate GPUDirect Storage (GDS) into Executor API (#3582)
* feat: Integrate GPUDirect Storage (GDS) into Executor API

Squash of several dev commits

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
2025-04-18 15:59:21 +08:00
Chang Liu
b8818b45be
fix: llama4: address a couple of issues in llama4 attention module (#3491)
* fix attn module for llama4

* Address comments

* Rebase to accommodate latest attn refactor and refactor l4attn

* Remove aux_stream from classic attn

* Use RMSNorm for L2Norm

* Update tensorrt_llm/_torch/models/modeling_llama.py

Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
Signed-off-by: Chang Liu <lc9114@gmail.com>

* Add typing information for _attn_qkv

* Remove redundant comment

* Simplify llama4 DecoderLayer logic

---------

Signed-off-by: Chang Liu <lc9114@gmail.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
2025-04-18 01:54:59 +00:00
rakib-hasan
ff3b741045
feat: adding multimodal (only image for now) support in trtllm-bench (#3490)
* feat: adding multimodal (only image for now) support in trtllm-bench

Signed-off-by: Rakib Hasan <rhasan@nvidia.com>

* fix: add  in load_dataset() calls to maintain the v2.19.2 behavior

Signed-off-by: Rakib Hasan <rhasan@nvidia.com>

* re-adding prompt_token_ids and using that for prompt_len

Signed-off-by: Rakib Hasan <rhasan@nvidia.com>

* updating the datasets version in examples as well

Signed-off-by: Rakib Hasan <rhasan@nvidia.com>

* api changes are not needed

Signed-off-by: Rakib Hasan <rhasan@nvidia.com>

* moving datasets requirement and removing a missed api change

Signed-off-by: Rakib Hasan <rhasan@nvidia.com>

* addressing review comments

Signed-off-by: Rakib Hasan <rhasan@nvidia.com>

* refactoring the quickstart example

Signed-off-by: Rakib Hasan <rhasan@nvidia.com>

---------

Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
2025-04-18 07:06:16 +08:00
Frank
5a6cb2b985
fix: Correct reporting of text dtype for Llama 4 (#3494)
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-04-18 00:07:49 +08:00
Yukun He
83b36ebecd
Fix fused_moe fallback issue. (#3652)
min_latency_mode is only set to False during the warmup phase. Thus, when it becomes True during inference, all tactics fall back to the default one, causing a perf regression (see the sketch below).

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-04-17 23:17:04 +08:00
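
The failure mode described above is generic tactic caching keyed on a runtime flag. A minimal sketch, assuming a dict-based tactic cache; all names here are illustrative, not the actual fused_moe code:

```python
# Illustrative sketch of the #3652 failure mode, not the real fused_moe code:
# tactics are profiled per configuration key during warmup, so a key that
# never occurs during warmup silently falls back to the default tactic.
profiled_tactics: dict[bool, str] = {}

def profile_best_tactic(min_latency_mode: bool) -> str:
    return "tuned_tactic"  # stand-in for autotuning over candidate kernels

def warmup() -> None:
    # Warmup only ever exercises min_latency_mode=False ...
    profiled_tactics[False] = profile_best_tactic(False)

def select_tactic(min_latency_mode: bool) -> str:
    # ... so min_latency_mode=True misses the cache at inference time.
    return profiled_tactics.get(min_latency_mode, "default_tactic")

warmup()
assert select_tactic(False) == "tuned_tactic"
assert select_tactic(True) == "default_tactic"  # the perf regression
```
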
yuxianq
b9b1c1368c
feat: Support unfused rope in MLA. (#3610)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-17 16:50:49 +08:00
Netanel Haber
3c52ac098f
feat: allocate minimal blocks per window size (#3028)
* implement variable window attention by breaking the block manager into window block managers per window size

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* revert isCyclic to be true if the min attention window is reached, not per window size

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* add explanatory comment to mCyclicThreshold

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* load correct gemma config

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* don't shadow inputLength in addSequence - it should remain the function scope input length between window size loop iterations

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* fix KVCacheManagerVariableWindowAttentionWithReuseTest for multiple window block managers

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* if TYPE_CHECKING

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* set temp_attention_window_inputs to None explicitly

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* set temp_attention_window_inputs to None explicitly

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* pass dtype as well

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* test_gemma variable sliding window attention

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* allot a fraction of primary/secondaryBlocks to different window size heaps, depending on each window size's total contribution to the kvcache size (i.e., including all layers) - see the sketch after this commit entry

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* remove '|| mEnableBlockReuse', which erroneously triggers beamsearch code paths in the cyclic variable attention window code

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* turn off request delaying for MaxUtil

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* make comments better

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* windowSizesTotalSum using std::accumulate

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* fix error handling of forwardAsync - the catch-all cleanup code that runs terminateRequest can itself fail and must also be caught

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* fix comments

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* remove assert that kills disagg tests, since it isn't necessary

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* fix corrupted expression: 'isNewTask && (peftCacheManager ?' -> '(isNewTask && peftCacheManager) ?', which caused incorrect boolean evaluation. Main is correct

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* add Gemma3 to SUPPORTED_HF_ARCHITECTURES

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* support Gemma3

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* finally fix test_gemma - always spread at least {} into generate_summary_cmd, never None

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* fix kvfactor field for deepseek

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* fix comment

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* fix gemma-3 entries in testlist to include vswa

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* only quantize gemma2 VSWA

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* remove misleading comment

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* fix test_gemma

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* fix test_gemma

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* in sendRequestInfo, fromOldAllocatedBlockIds->fromOldAllocatedBlockIds, like in main

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>

* fix: disable KV cache reuse if using attention sink (#3021)

* fix: disable KV cache reuse if using attention sink

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* fix: disable KV cache reuse if sink bubble

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

* add comment

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>

---------

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
2025-04-17 16:04:57 +08:00
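
One change in the entry above allots blocks to window-size heaps in proportion to each window size's total KV-cache contribution across all layers. A minimal sketch of that proportional split, with illustrative names (the real BlockManager accounting also handles primary vs. secondary pools and rounding):

```python
from collections import Counter

def blocks_per_window(total_blocks: int, window_size_per_layer: list[int]) -> dict[int, int]:
    # Layers sharing a window size pool their contribution to the KV cache.
    layers_per_window = Counter(window_size_per_layer)
    # Each window size's weight = window size * number of layers using it.
    weight = {w: w * n for w, n in layers_per_window.items()}
    total = sum(weight.values())
    return {w: total_blocks * v // total for w, v in weight.items()}

# E.g. alternating sliding-window (512) and global (4096) layers:
print(blocks_per_window(1000, [512, 4096] * 4))  # {512: 111, 4096: 888}
```
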
danielafrimi
0f084d9566
added loraOp into lora layer + test for mlp and comparison to lora plugin (#3455)
LoraOp integration into torch modules

Signed-off-by: Ubuntu <dafrimi@nvidia.com>
2025-04-17 12:48:27 +08:00
yuxianq
239fe0ff26
chore: Use ellipsis as default value to detect whether residual argument is provided (#3626)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-17 12:31:58 +08:00
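
The ellipsis trick above is a standard Python sentinel idiom: using ... (Ellipsis) as the default lets the callee distinguish "argument omitted" from an explicit None. A minimal sketch with an illustrative signature, not the actual TensorRT-LLM one:

```python
def decoder_layer(hidden_states, residual=...):
    # Ellipsis distinguishes "argument omitted" from an explicit None,
    # which a plain residual=None default cannot do.
    if residual is ...:
        residual = hidden_states   # omitted: start a new residual stream
    elif residual is None:
        return hidden_states       # explicitly disabled
    return hidden_states + residual

assert decoder_layer(1.0) == 2.0          # omitted
assert decoder_layer(1.0, None) == 1.0    # explicit None
assert decoder_layer(1.0, 0.5) == 1.5     # explicit residual
```
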
Luis Vega
a06bff5052
Fix rotary_emb param in NemotronH attention (#3646)
Signed-off-by: Luis Vega <vegaluisjose@users.noreply.github.com>
2025-04-16 21:03:07 -07:00
Luis Vega
0bda1f9780
feat: Nemotron-H model support (#3430)
* added files for nemotron-h

Signed-off-by: Luis Vega <lvega@nvidia.com>

* use try/except to import RMSNorm

Signed-off-by: Luis Vega <lvega@nvidia.com>

---------

Signed-off-by: Luis Vega <lvega@nvidia.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-04-16 14:05:56 -07:00
Mike Iovine
41a6c98544
Support CUDA graphs for EAGLE3 (#3176)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-04-17 04:53:50 +08:00
hlu1
b6bae33453
Clean up linear.py, mlp.py, gated_mlp.py (#3553)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-04-16 12:21:44 -07:00
Yibin Li
351808efeb
fix: Use hmac authentication for pickle encryption (#3384)
* hmac initial implementation to encrypt worker and proxy queue

Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>

* set different hmac key for each pair of server/client queue

Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>

* fix comments

Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>

* fix style

Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>

---------

Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
2025-04-17 00:40:13 +08:00
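
The scheme above follows the standard authenticate-before-unpickle pattern, with a separate random key per server/client queue pair. A minimal sketch using only stdlib primitives; this is not the actual TensorRT-LLM queue code:

```python
import hashlib
import hmac
import os
import pickle

key = os.urandom(32)  # one random key per server/client queue pair

def pack(obj) -> bytes:
    payload = pickle.dumps(obj)
    # Prepend a SHA-256 HMAC (32 bytes) over the pickled payload.
    return hmac.new(key, payload, hashlib.sha256).digest() + payload

def unpack(message: bytes):
    digest, payload = message[:32], message[32:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(digest, expected):
        raise ValueError("HMAC mismatch; refusing to unpickle")
    return pickle.loads(payload)

assert unpack(pack({"cmd": "generate"})) == {"cmd": "generate"}
```

Verifying the digest before deserializing matters because pickle.loads on attacker-controlled bytes can execute arbitrary code.
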
yuxianq
fd8ded2b2b
feat: Support cos_sin_cache in all cases. (#3517)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-16 13:48:44 +08:00
Jinyang Yuan
efabf6b443
chore: Add comments to modifications that fix TP size of DeepSeek-V3/R1 when using more than 16 GPUs (#3572)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-04-15 21:51:42 -07:00
Zhanrui Sun
9d88ee3e45
chore: bump version to 0.20.0rc0 (#3561)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-04-16 11:41:21 +08:00
Enwei Zhu
44da0e8d60
fix: LLM API _hf_model_dir for non-cached case (#3562)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-04-16 10:39:34 +08:00
Daniel Cámpora
41ce5440fe
chore: Mass integration of release/0.18 (#3421)
* [Infra][TRTLLM-4063] - Branch out for the TRT-LLM v0.18.0 release

Signed-off-by: Zhanrui Sun <zhanruis@nvidia.com>
(cherry picked from commit de90312020e51c22ba5e75b3502c7ee90c059265)

* [Infra][TRTLLM-3652] - Update dependencies to TRT 10.9 / CUDA 12.8.1 / DLFW 25.03(Internal)

Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
(cherry picked from commit 58db1340ef7db22f1910f878d220a92be5b830d1)

* [None][Doc] - Update docs for v0.18.0

Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit d23e75bc95619ce3b116213d55319272888e0c88)

* [Infra] - Fix or WAR issues in the package sanity check stages

Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit e874e2b127515c52ba10c8df1cc2631627f74ffe)

* [https://nvbugs/5173454] [https://nvbugs/5173432] [https://nvbugs/5175863] fix chatglm tokenizer and tmp model path

Signed-off-by: Yuki Huang <yukih@nvidia.com>
(cherry picked from commit 731811d4e182d70a66193d646152cb71dfafe83a)

* cherry-pick 'test: Update cluster and multi node test lists and trtllm-bench' test to fix perf drop issue

Signed-off-by: Ruodi Lu <ruodil@nvidia.com>
(cherry picked from commit 5214616283fbc15ae98871a1d84c78d8e1f2e6e8)

* Revert "Merge branch 'user/yukih/fix_5173454_5173432' into 'release/0.18'"

Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit 8d34831cb2b81ee2dfa8021b68e7158b33789a5f)

* [Infra] Restrict setuptools version to avoid sasb pip install issue

Signed-off-by: Emma Qiao <qqiao@nvidia.com>
(cherry picked from commit 1e60ad29e0dafec0e295bedb5d89b716a02a707c)

* [https://nvbugs/5173454] [https://nvbugs/5173432] [https://nvbugs/5175863] fix chatglm tokenizer and tmp model path

Signed-off-by: Yuki Huang <yukih@nvidia.com>
(cherry picked from commit 3ed8164e5bfea1d5aa2039b5408439fd6cf59dac)

* WAR for bug 5173448

Signed-off-by: Thor Johnsen <tjohnsen@nvidia.com>
(cherry picked from commit b6528b2ba15322b6c6a4c81a8b74c04d4973de4f)

* [Infra][TRTLLM-3652] - Update dependencies to CUDA 12.8.1 / DLFW 25.03

Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
(cherry picked from commit 6560983d132d9d257ee15849664eb055e94adaa9)

* [Docs] - Doc changes for v0.18.0

Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit 26769b61218a947c8f9d070f73b63d576fcc20c4)

* [Doc] - Doc change for v0.18.0

Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit 4b3b5ed6bfbc2300e3775fe75456083faad7b235)

* [Infra] update version to 0.18.1

Signed-off-by: Zhanrui Sun <zhanruis@nvidia.com>
(cherry picked from commit 59e8326c75639275837d34de8e140358737a3365)

* Add back nemotron file.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix recurrentgemma reqs.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Adding WAR for bug 5173448.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Formatting.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Remove duplicated file.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update examples/prompt_lookup/requirements.txt

Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>

* Remove glm-4-9b from model dir in chatglm test.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Remove indent change.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>

* Revert changes on l0_test.groovy.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Update dev images

Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>

* Remove duplicated import.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Fix custom op

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>

* Fix flashinfer & vanilla backend

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>

* Skip problematic case.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Skip problematic test_moe_w4a8_1_14336_4096_8_bfloat16_True_False case.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

---------

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Co-authored-by: Zhanrui Sun <zhanruis@nvidia.com>
Co-authored-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yuki Huang <yukih@nvidia.com>
Co-authored-by: Ruodi Lu <ruodil@nvidia.com>
Co-authored-by: Emma Qiao <qqiao@nvidia.com>
Co-authored-by: Thor Johnsen <tjohnsen@nvidia.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
Co-authored-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com>
2025-04-16 10:03:29 +08:00
xiweny
da47d5f27e
fix: nvbugs/5075538: fix cross attention mask when decoder input len > 1 (#3585)
* fix: nvbugs/5075538: fix cross attention mask when decoder input len > 1

Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>

* remove waiver

Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>

---------

Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
2025-04-16 08:31:33 +08:00
Kaiyu Xie
e037d3e99b
chore: Unify Python NVTX call (#3450)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-04-15 23:25:36 +08:00
bhsueh_NV
3aa37e6b72
fix bug (#3570)
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-04-15 16:50:22 +08:00
Yuan Tong
d4c0423cdb
refactor: collect executor and decoder states into dataclass (#3234)
* fix: Proper error bubbling for PyExecutor

Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-04-15 16:31:45 +08:00
shaharmor98
ede7058544
Feat/ Integrate peftCacheManager in PyExecutor creation (#3372)
* integrate peftCacheManager in PyExecutor creation

Signed-off-by: Shahar Mor <smor@nvidia.com>
2025-04-15 15:14:43 +08:00
Yuan Tong
668a0335e4
fix: Proper error bubbling for PyExecutor (#3321)
* fix: Proper error bubbling for PyExecutor
* fix: Proper shutdown
* fix: multi gpu proper shutdown

Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
2025-04-15 14:49:46 +08:00
Jinyang Yuan
0305942808
chore: Modifications that should have been included but were mistakenly overwritten in PR #3467 (#3557)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-04-15 14:08:07 +08:00
yuxianq
0e7e949feb
refactor: Split llama4 model from llama model. (#3530)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-15 13:41:05 +08:00
Jinyang Yuan
175adb94ab
chore: Log memory sizes of weights and activations separately (#3467)
Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>
2025-04-15 09:48:35 +08:00
nv-guomingz
b32ae7ac92
test:add fp8_kv_cache functionality test case. (#3457)
Signed-off-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>
2025-04-15 09:16:46 +08:00
QI JUN
112f716155
chore: move all distributed related codes into _torch.distributed directory (#3511)
* move all distributed related codes into _torch.distributed directory

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix ci

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

---------

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-04-15 08:39:17 +08:00
brb-nv
098ca7f68c
test: Fix breaking Phi3 multimodal tests (#3544)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-04-15 08:02:34 +08:00
Pamela Peng
6cdfc54883
feat: Add FP8 support for SM 120 (#3248)
* Allow FP8 on SM120

Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>

* fix sm121

Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>

* fix

Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>

* fix pre-commit

Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>

* review update

Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>

---------

Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>
Co-authored-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>
2025-04-14 16:05:41 -07:00
Aurelien Chartier
8cf2785bc6
chore: unify pp_layers helpers (#3429)
* chore: unify pp_layers helpers

Fix assumptions about equal number of layers per PP rank
in prepare_attention_inputs

Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
2025-04-15 04:49:17 +08:00
Chang Liu
01cb3ccb04
use global expert idx to load expert weights (#3386)
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
2025-04-14 11:58:30 -07:00
Chang Liu
1902d73eb5
fix: llama4: add an option apply_router_weight_on_input in FusedMoE (#3492)
* apply a tentative fix to moe bypass kernel update

* Pass none to disable final stage in moe

Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
Signed-off-by: Chang Liu <lc9114@gmail.com>

---------

Signed-off-by: Chang Liu <lc9114@gmail.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
2025-04-14 11:56:42 -07:00
Kaiyu Xie
b286b51118
feat: Support torch profiler (#3470)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-04-14 22:06:06 +08:00
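
For context, hooking PyTorch's profiler around a loop of forward steps looks roughly like the sketch below; this shows the public torch.profiler API, not the actual integration points of #3470:

```python
import torch
from torch.profiler import ProfilerActivity, profile

model = torch.nn.Linear(16, 16)   # stand-in for the real model
x = torch.randn(4, 16)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(8):            # stand-in for executor forward steps
        model(x)

prof.export_chrome_trace("trace.json")  # open in chrome://tracing / Perfetto
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```
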
Zhanrui Sun
714ff3eedd
chore: bump version to 0.19.0rc0 (#3535)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-04-14 18:11:20 +08:00
Zhanrui Sun
ee4ce0379d
chore: bump version to 0.19.0rc0 (#3514)
* chore: bump version to 0.19.0rc0

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>

* Update README

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>

---------

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-04-14 17:32:30 +08:00
dongjiyingdjy
2fb1d65d43
fix: fix max_seq_len in executor_config (#3487)
Signed-off-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com>
2025-04-14 15:13:29 +08:00
HuiGao-NV
9f41e826bf
fix: remove one duplicated line of code (#3523)
Signed-off-by: Hui Gao <huig@nvidia.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-04-14 14:52:46 +08:00
brb-nv
44090a5388
Add support for Phi-4-MM (#3296)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-04-14 14:24:10 +08:00
yuxianq
9d64b6b890
Cache sin cos in model instead of global LRU cache. (#3378)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-14 11:19:09 +08:00
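
The change above moves rotary sin/cos tables from a process-global LRU cache into per-model state, so they are freed together with the model. A sketch contrasting the two patterns, with illustrative names and a simplified RoPE table:

```python
import functools
import torch

# Before: a global LRU cache keyed on (seq_len, dim) outlives any one model.
@functools.lru_cache(maxsize=None)
def global_sin_cos(seq_len: int, dim: int):
    pos = torch.arange(seq_len, dtype=torch.float32)
    inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = torch.outer(pos, inv_freq)
    return angles.sin(), angles.cos()

# After: the tables live on the module and are released with the model.
class Rope(torch.nn.Module):
    def __init__(self, seq_len: int, dim: int):
        super().__init__()
        pos = torch.arange(seq_len, dtype=torch.float32)
        inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
        angles = torch.outer(pos, inv_freq)
        self.register_buffer("sin", angles.sin())
        self.register_buffer("cos", angles.cos())
```
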
pcastonguay
fe6f14b2b1
fix: Fixing issue with first gen token being returned twice in streaming (#3427)
* fix: Fixing issue with first gen token being returned twice with streaming

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing not_expectring_strings in test

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

---------

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-04-13 22:45:09 -04:00
yuxianq
baeec63dda
refactor: Remove _pp_forward. (#3496)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-14 09:49:44 +08:00
HuiGao-NV
d0f83d19f1
fix: add kv memory size per token of draft model to calculate max number of tokens of kv cache (#3497)
* fix: add kv memory size per token of draft model to calculate max number of tokens of kv cache

Signed-off-by: Hui Gao <huig@nvidia.com>

* Fix code to get model_config of draft model

Signed-off-by: Hui Gao <huig@nvidia.com>

---------

Signed-off-by: Hui Gao <huig@nvidia.com>
2025-04-13 23:02:14 +08:00
Yan Chunwei
b37c5c0a4d
make LLM-API slurm examples executable (#3402)
Signed-off-by: chunweiy <328693+Superjomn@users.noreply.github.com>
2025-04-13 21:42:45 +08:00
Yan Chunwei
74850c61e9
fix: switch ZMQ from file socket to tcp socket in RemoteMpiCommSession (#3462)
* switch ZMQ from file socket to tcp

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* fix comment

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

---------

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-04-13 09:15:55 +08:00
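
The switch above matters because ipc:// endpoints are file-backed and reachable only on the local host, which breaks multi-node setups. A minimal pyzmq sketch of the tcp:// binding; illustrative, not the actual RemoteMpiCommSession code:

```python
import zmq

ctx = zmq.Context.instance()
sock = ctx.socket(zmq.REP)
# Before: file socket, invisible to other nodes.
#   sock.bind("ipc:///tmp/mpi_session.sock")
# After: TCP socket, reachable across nodes (e.g. in Slurm allocations).
port = sock.bind_to_random_port("tcp://*")
print(f"clients connect with: tcp://<this-host>:{port}")
```
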
WeiHaocheng
c6081abb0e
feat: Make scaffolding Controller more generic #3408 (#3416)
Signed-off-by: fredw (generated by with_the_same_user script) <20514172+WeiHaocheng@users.noreply.github.com>
2025-04-12 21:35:38 +08:00
QI JUN
012fb9a1c4
remove useless max_num_tokens member in PyTorchConfig (#3493)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-04-12 21:09:58 +08:00