Commit Graph

326 Commits

Author SHA1 Message Date
Daniel Cámpora
c7cf032b89
fix: Move all casters to customCasters. (#3945)
* Move all casters to customCasters.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Use customCasters in all bindings.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

* Added customCasters to userbuffers.

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>

---------

Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-05-02 19:08:28 +08:00
hlu1
52edabab30
Fix Deepseek MTP with moe_backend=TRTLLM (#4001)
Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-05-02 14:47:22 +08:00
Simeng Liu
873c7532fd
feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator. (#3438)
* feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator.

Previously, the RMSNorm implementation only supported a single input tensor. With group_rms_norm, multiple tensors can be normalized together:
```python
input_a, input_b, ... = group_rms_norm([input_a, input_b, ...])
```
All input tensors must share the same batch dimension. The kernel partitions work by dynamically assigning warp groups proportional to the last dimension of each input, improving launch efficiency and reducing overhead.

This MR provides two implementations:
GroupRMSNormKernel: Optimized for small-to-medium batch sizes
GroupRMSNormKernelLargeBatch: Contains additional optimizations for large batch sizes

Both kernels are currently exposed as custom PyTorch ops. A future MR will implement heuristic-based kernel selection and expose a unified interface.
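For illustration, a minimal PyTorch reference of what the op computes, written only from the description above; `group_rms_norm_reference` is a hypothetical name, and the real op is a fused CUDA kernel rather than this Python loop:
```python
# Reference semantics only, assuming plain RMS normalization over the
# last dimension (no weights); the fused kernel does this in one launch.
import torch

def group_rms_norm_reference(inputs, eps=1e-6):
    batch = inputs[0].shape[0]
    # All input tensors must share the same batch dimension.
    assert all(t.shape[0] == batch for t in inputs)
    outputs = []
    for t in inputs:
        # rms(x) = sqrt(mean(x^2)) over the last dim of each tensor.
        rms = torch.sqrt(t.pow(2).mean(dim=-1, keepdim=True) + eps)
        outputs.append(t / rms)
    return outputs

# Same batch dim (8), different last dims -- one call normalizes both.
a, b = torch.randn(8, 4096), torch.randn(8, 1024)
a_n, b_n = group_rms_norm_reference([a, b])
```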

Signed-off-by: Simeng Liu <simengl@nvidia.com>

* Resolve comments and fix typo with IS_FLASHINFER_AVAILABLE

Signed-off-by: Simeng Liu <simengl@nvidia.com>

---------

Signed-off-by: Simeng Liu <simengl@nvidia.com>
2025-05-02 13:25:30 +08:00
Lucas Liebenwein
be916b19e0
feat: [AutoDeploy] unfusing attention for native support (#3668)
* [AutoDeploy] unfused streamlined attention + caching

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

* improved unit testing

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

* reviewer feedback

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

* some updates to attn_mask handling

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

* updated manual benchmarking and cudagraph capture

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

---------

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
2025-05-02 09:06:49 +08:00
Yukun He
a1645c922b
Fallback to NCCL for various patterns when input size is large. (#4009)
When the input size is larger than the max workspace size, we fall back to NCCL plus the corresponding pre/post functions to preserve AllReduce functionality.
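A schematic of that dispatch rule follows; every name in it (max_workspace_numel, the pre/post hooks, the fused-op stand-in) is an assumption for illustration, not the actual TensorRT-LLM code:
```python
# Hypothetical sketch of the fallback rule described above.
import torch.distributed as dist

def fused_allreduce_op(x):
    # Stand-in for the custom fused AllReduce kernel (fast path).
    dist.all_reduce(x)
    return x

def allreduce_with_fallback(x, max_workspace_numel, pre_fn=None, post_fn=None):
    if x.numel() > max_workspace_numel:
        # Input exceeds the kernel's workspace: fall back to NCCL and
        # wrap it with the pattern's pre/post processing functions.
        if pre_fn is not None:
            x = pre_fn(x)
        dist.all_reduce(x)  # plain NCCL all-reduce
        if post_fn is not None:
            x = post_fn(x)
        return x
    return fused_allreduce_op(x)
```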

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-05-01 15:17:16 -07:00
Erin
8fe7bdeacf
feat: LogitsProcessor in PyTorch backend (#3145)
* support lp in pytorch backend

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

* fix tp

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

---------

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
2025-05-01 14:15:30 -07:00
Suyog Gupta
f94af0fb86
[AutoDeploy] Make all ranks agree on kv-cache size (#4007)
* make all ranks agree on kv-cache size

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* lint

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* lint

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* lint

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* lint

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* minor cleanups

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

* use all_gather_object wrapper (sketched below)

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>

---------

Signed-off-by: Suyog Gupta <suyogg@nvidia.com>
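The idea behind the all_gather_object step above, as a hedged sketch (the wrapper name and its call site in AutoDeploy are assumptions): every rank proposes its locally computed kv-cache size, and all ranks adopt the minimum so that no rank over-allocates relative to its peers.
```python
# Hedged sketch: agree on one kv-cache size across ranks via the minimum.
import torch.distributed as dist

def agree_on_kv_cache_size(local_num_blocks: int) -> int:
    world_size = dist.get_world_size()
    gathered = [None] * world_size
    # all_gather_object exchanges arbitrary picklable Python objects.
    dist.all_gather_object(gathered, local_num_blocks)
    # The smallest proposal is the only size every rank can satisfy.
    return min(gathered)
```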
2025-05-02 04:07:28 +08:00
Erin
83f37614ef
feat: Support Top-K logprobs and prompt_logprobs in LLMAPI (#3388)
* support return logprob in llmapi

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

update and add test

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

stability test

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

* revert removal of old flag

Signed-off-by: Erin Ho <erinh@nvidia.com>
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

---------

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Signed-off-by: Erin Ho <erinh@nvidia.com>
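As a rough usage sketch for the feature this commit names (the parameter names and result fields are assumptions inferred from the title, not verified against the PR):
```python
# Hedged sketch: request top-k logprobs for generated and prompt tokens.
# The exact SamplingParams fields are assumed from the commit title.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
params = SamplingParams(
    max_tokens=16,
    logprobs=5,         # top-5 logprobs per generated token
    prompt_logprobs=5,  # top-5 logprobs per prompt token
)
for output in llm.generate(["Hello, my name is"], params):
    print(output.outputs[0].logprobs)
```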
2025-05-01 12:47:14 -04:00
bhsueh_NV
129bf19980
model: support Qwen3 (#4010)
* add qwen3 dense model pytorch backend support, initial commit

solve the results error issue

add qwen3 moe model pytorch backend support

reformat the code

* perf - use flashinfer rmsnorm for qwen3

* feat - support qwen3 moe rmsnorm

* Put the computation of Q and K norm (in attn) into a single CUDA stream, and get a 5%-8% throughput improvement on Qwen3 4B and Qwen3-MoE 30B-A3B.

* Put the computation of Q and K norm (in attn) into a single CUDA stream, and get a 5%-8% throughput improvement on Qwen3 4B and Qwen3-MoE 30B-A3B. -- Forgot to update all modifications.

* fix bugs of running qwen3 public models and fp8 models

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix bugs due to rebase

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix bugs captured by pre-commit

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix bug of attention

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

---------

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
Co-authored-by: Keddy Jin <jin.gq@aliyun.com>
Co-authored-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com>
Co-authored-by: shao <shao@nvidia.com>
2025-05-01 23:12:41 +08:00
YueWeng
b1621e8d4e
feat: add relaxed acceptance for DS (#3865)
* add relaxed acceptance for DS R1

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>

* clean and update docs

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>

* fix

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>

* Modified based on review

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>

* fix mtp manager issue

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>

---------

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-05-01 21:50:36 +08:00
milesial
6ded5f984b
Llama4 processor fixes (#3994)
* fix: Propagate sampling params

Signed-off-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com>

* fix: type hints

Signed-off-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com>

---------

Signed-off-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com>
Co-authored-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-05-01 12:45:53 +08:00
Kate Cheng
7dbe618683
feat: Add multimodal embedding field in LlmRequest (#3855)
* Add a new param to LlmRequest and Request to natively support mm

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* update comment

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Update tests to match the new LlmRequest constructor parameters

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Modify unit tests and modify mm_embedding's dict name in llama4

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Fix based on comments

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Fix comment

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Fix LlmRequest initialization in kvCacheManagerTest

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Clean up code for prompt_tuning_config

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

* Clean up prompt_tuning_config in GenerationRequest

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>

---------

Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-05-01 12:23:30 +08:00
Frank
1e317c98c6
[feat]: Allow for a settable end-of-sequence/padding token in max throughput benchmark. (#3776)
* Move world options to a different group for clarity.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Add eos_id option.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

---------

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-05-01 09:42:46 +08:00
Yukun He
9cc5922a0b
Clean up allreduce op in Deepseek V3 model. (#3829)
* Replace deepseek_allreduce op with the new unified allreduce op and moe_allreduce op.
* Minor revision of moe_allreduce op argument names.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-05-01 07:56:36 +08:00
Mike Iovine
8c2c969fcb
[fix] Pad requests to maximum draft length in spec decode (#3957)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-04-30 11:02:18 -04:00
Julien Debache
83670571dd
feat: Mistral-Large-2 support in the Pytorch workflow
- Added a modeling file for models configured by a `MistralConfiguration` object, as it differs slightly from the Llama one
2025-04-30 20:12:39 +08:00
Zhanrui Sun
86e7474a9b
chore: bump version to 0.20.0rc2 (#3949)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-04-30 11:44:43 +08:00
yuxianq
f568cbb671
chore: Remove duplicated get_sm_version. (#3935)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-30 11:43:53 +08:00
Fanrong Li
e6b482ef47
fix: change the seq_lens sync copy to an async one (#3786)
---------

Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-04-29 23:56:49 +08:00
tomeras91
35010e8073
Support NemotronH FP8 Quantization (#3891)
(1) Match quant exclude module names to TRTLLM names.
(2) No special weight loading is needed for quantization scale weights.

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-04-29 18:51:43 +03:00
yuxianq
0f8ec693b2
fix: get head_dim from model’s config. (#3916)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-29 23:04:29 +08:00
HuiGao-NV
8e6eead6a5
refactor: (part1) Add constraints doc for FusedMoe module. (#3882)
* Add doc string for FusedMoe module
* Address comments.

Signed-off-by: Hui Gao <huig@nvidia.com>
2025-04-29 22:23:02 +08:00
Junhong Liu
06e76020d7
feat: parallel q_b_proj and concat (#3917)
* add parallel_q_b_proj_and_concat

Signed-off-by: junliu <65336694+hello-11@users.noreply.github.com>

* code cleanup

Signed-off-by: junliu <65336694+hello-11@users.noreply.github.com>

* one GEMM/concat, then split the latent_cache and pass the pieces separately to context/gen (sketched below)

Signed-off-by: junliu <65336694+hello-11@users.noreply.github.com>

---------

Signed-off-by: junliu <65336694+hello-11@users.noreply.github.com>
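The single-GEMM pattern referenced in the bullet above, as a standalone sketch; the shapes and names are illustrative assumptions, not the PR's code:
```python
# Illustrative only: fuse two projections into one GEMM, then split.
import torch

hidden, q_dim, latent_dim = 512, 256, 64
w_q = torch.randn(q_dim, hidden)             # q_b_proj weight (assumed shape)
w_latent = torch.randn(latent_dim, hidden)
w_fused = torch.cat([w_q, w_latent], dim=0)  # concatenated once at load

x = torch.randn(8, hidden)
fused_out = x @ w_fused.t()                  # one GEMM instead of two
q, latent_cache = fused_out.split([q_dim, latent_dim], dim=-1)
# q and latent_cache can now be passed separately to context/gen paths.
```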
2025-04-29 22:07:05 +08:00
Dom Brown
8709fe8b53
chore: bump version to 0.19.0 (#3598) (#3841)
test: add test cases for 0.19 release (#3608)

* fix test name
* add quickstart test for nemotron-ultra
* add rcca multi-node test case for deepseek-v3
* add rcca info

---------

squash (#3642)

fix: nvbugs/5187237: fix deterministic mode crash (#3448)

* nvbugs/5187237 nvbugs/5112075: fix deterministic mode error
* remove waive
* Revert "remove waive"

This reverts commit 0bf5486d19906d692bfb7a6262333c296b0087ac.

* revert ar fusion

---------

update fp8 doc (#3647)

tests: change qa perf test to trtllm-bench (#3619)

fix: FP8 quantized lm_head (NvBug 5214229) (#3567)

infra: Add PR approval protection for the release branch (#3634)

fix: nvbugs/5231298: pytorch allreduce issue (#3673)

Fix: nvbugs/5222698 variable not defined (#3630)

* Fix: nvbugs/5222698 variable not defined
* Tidy code

---------

test: sync waives.txt from main branch by disabling test_perf/gpt_350m-cppmanager case (#3685)

test: restore fp8 kv cache testing for L0 (#3671)

doc: Update DeepSeek perf docs (#3693)

* Update DeepSeek perf docs
* update
* Apply suggestions from code review

---------

tests: waive test_llm_multi_node (#3664)

fix: update test_user_buffers_mm_add_prologue atol (#3711)

Fix: cherry-pick hmac encryption from main branch (#3635)

* security fix cherry-pick changes from main
* fix hmac in remote mpi session (#3649)

---------

Un-waive DS-V3-Lite tests. (#3621)

fix: FP8 kv accuracy (#3675)

* fix FP8 kv accuracy
* update doc

---------

Fix script options for engines. (#3622)

unwaive multi-node test (#3721)

chore: Split more tests out of gpt tests (#3524) (#3674)

doc: add torch examples link into torch backend documentation (#3749)

test: Get Eagle tests working (#3593) (#3722)

Waive L0 test (#3756)

waive failed case in perf test, change default max_batch_size to 512 and write config.json to output log (#3656)

Update ds v3 parameters in stress test. (#3676)

waive gemma on L20 (#3766)

https://nvbugs/5141291: Fix convert.py script for Qwen model. (#3758)
Include Qwen2VLDecoderLayer in the smooth_qwen2_model function.

fix: PP4 fixes and cleanup (#3688)

remove benchmark test list (#3643)

skip disagg deepseek test if sm != 90 (#3720)

test: skip failed cases on B200 (#3710)

* add skip condition to tests
* fix error

---------

test: [nvbug: 5234494] skip_pre_ada for fp8 cases (#3718)

* skip_pre_ada for fp8 cases
* update
* update after rebase

---------

add known issue to deepseek doc. (#3800)

Fix ModelOpt Mixtral AWQ OOM (#3714) (#3761)

Waive L0 tests (#3826)

fix: Reduce memory usage in fused moe op associated with AutoTuning and fix moe fallback issue. (#3793)

* Reduce memory usage in fused moe op associated with AutoTuning.
* Replace the pre-defined bucket size strategy with a generating function based on tune_max_num_tokens.
* Add free_memory logic for the workspace in the min_latency_mode fused moe path.
* Fix fused_moe fallback issue. (#3652)

min_latency_mode is only set to False during the warmup phase, so when it becomes True during inference, all tactics fall back to the default one, causing a perf regression.

---------

[doc] Better document for Draft-Target-Model (DTM) speculative decoding (#3797)

Fix pre-commit

Fix again

Address some review comments for the MR

Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-04-29 16:57:22 +08:00
bhsueh_NV
2e230b73ec
change log level of some messages from info to debug (#3930)
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-04-29 13:38:34 +08:00
yuxianq
adfa04745e
fix: revert https://github.com/NVIDIA/TensorRT-LLM/pull/3858 (#3928)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-29 11:26:13 +08:00
bhsueh_NV
0610d0ff84
add num_scheduled_requests into print_log (#3914)
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-04-29 11:22:22 +08:00
Frank
cf15efa15e
[TRTLLM-4883][fix]: Update output speed calculation. (#3923)
* Update gen tps calculation.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Add back output speed for comparison.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Fix issue with f-string.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Fix some spacing.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Replace output speed with per-request gen-phase throughput.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Add gen TPS breakdown.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

* Update some tagging.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

---------

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-04-29 11:04:12 +08:00
Perkz Zheng
35c5e4f1c5
feat: add CGA reduction fmha kernels on Blackwell. (#3763)
* update cubins

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

* add trtllm-gen kernels for eagle3 and also kernels with cga-reduction

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

* address the comments

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

---------

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
2025-04-29 10:43:54 +08:00
hlu1
d2f312b8e4
Fix fp8 kvcache (#3877)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
Co-authored-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-04-29 10:31:10 +08:00
WeiHaocheng
8a994d879f
feat: fix errors on scaffolding README (#3899)
Signed-off-by: fredw (generated by with_the_same_user script) <20514172+WeiHaocheng@users.noreply.github.com>
2025-04-29 10:15:06 +08:00
yuxianq
b91da764de
chore: remove DummyKvCacheManager. (#3896)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-29 09:59:37 +08:00
Mike Iovine
e534bf09cc
[fix] Fix flashinfer + speculation issues (#3686)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-04-28 14:34:22 -04:00
Mike Iovine
e6f7ff3a46
[chore] Make llama4 MoE use maybe_execute_in_parallel (#3779)
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
2025-04-28 10:58:03 -04:00
Zhenhuan Chen
ad15e45f07
[TRTLLM-4638][feat] add best of n support with reward model in scaffolding (#3807)
Signed-off-by: Zhenhuan Chen <chenzhh3671@gmail.com>
2025-04-28 17:15:33 +08:00
bhsueh_NV
f77252e9ff
fix bug of creating a CUDA stream as a default parameter, which gets initialized during importing (#3764)
* fix bug of creating a CUDA stream as a default parameter, which gets initialized during importing (see the sketch below)

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* add torch.cuda.Stream() for the leader node

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix pre-commit issue

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

---------

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
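The Python pitfall this commit fixes, as a minimal sketch (the function names are hypothetical): a default argument is evaluated once at import time, so a `torch.cuda.Stream()` default would create a CUDA context the moment the module is imported.
```python
import torch

# Buggy pattern (shown as a comment so this snippet stays import-safe):
#   def run(work, stream=torch.cuda.Stream()):  # runs at import time!
#       ...

def run(work, stream=None):
    # Fixed pattern: create the stream lazily, at call time.
    if stream is None:
        stream = torch.cuda.Stream()
    with torch.cuda.stream(stream):
        return work()
```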
2025-04-28 08:16:03 +08:00
Yan Chunwei
ad4226d946
fix: trtllm-bench build trt engine on slurm (#3825)
* add submit_sync to RemoteMpiSessionClient

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

add barrier

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

fix comment

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

disable test

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* fix

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

---------

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-04-27 22:26:23 +08:00
bhsueh_NV
76f2c631fb
fix: add warmup flag into py_executor to prevent enabling the profiler during warmup (#3852)
* add warmup flag into py_executor to prevent enabling the profiler during warmup

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix bug of pre-commit

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* change setting warmup to all ranks

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

---------

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-04-27 19:22:42 +08:00
Chuang Zhu
e2318756ed
cacheTransceiver buffer manager (#3798)
* cacheTransceiver buffer manager

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* fix args

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* cpp kvCacheManager

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* format

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

---------

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-04-27 11:48:15 +08:00
HuiGao-NV
136aab5c54
fix: Update num_of_ctx_tokens in iteration stats (#3785)
* Update num_of_ctx_tokens in iteration stats
* Revert unnecessary change of importing module
2025-04-27 10:24:47 +08:00
bhsueh_NV
e9fab4f3d9
fix bug of deepseek group_size setting (#3860)
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-04-27 09:10:37 +08:00
yuxianq
e6c14ca97a
fix: Detect pmix and raise error when mpirun is not used. (#3858)
Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
2025-04-26 21:49:41 +08:00
milesial
362a8272f8
feat: llama4 input processor (#3383)
Signed-off-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com>
Signed-off-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
Co-authored-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
2025-04-25 16:47:14 -07:00
sugunav14
5b9897a8cd
fix: [AutoDeploy] update hf loading for e_score_correction_bias (#3847)
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
2025-04-26 02:03:47 +08:00
dongxuy04
16535991b2
feat: Add MNNVL MoE A2A support (#3504)
* add MNNVL memory mapping support

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add more MPI environment for trtllm-llmapi-launch

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add MoE communication and prepare kernels

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add MNNVL AlltoAll support for DeepSeekV3

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* add output dump for throughput benchmark

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* support dynamic kernel launch grid

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* address review comments

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

* address review comments #2

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>

---------

Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
2025-04-25 17:29:08 +08:00
Yuan Tong
57944206ba
feat: return logits in PyTorch flow (#3221)
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-04-24 16:56:03 -07:00
hlu1
d72add1794
[Deepseek] Pass hidden_states_fp4 to shared_experts (#3819)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-04-24 13:12:12 -07:00
HuiGao-NV
7420ddc3d0
fix: fix lora case failure (#3838)
Signed-off-by: Hui Gao <huig@nvidia.com>
2025-04-24 07:29:08 -07:00
WeiHaocheng
3fc2a16920
feat(part 2): Enhance the integrated robustness of scaffolding with __init__.py #3305 (#3731)
Signed-off-by: fredw (generated by with_the_same_user script) <20514172+WeiHaocheng@users.noreply.github.com>
2025-04-24 18:47:03 +08:00
Zhanrui Sun
ae34d60108
chore: bump version to 0.20.0rc1 (#3834)
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-04-24 17:43:37 +08:00