TensorRT-LLMs

mirror of https://github.com/NVIDIA/TensorRT-LLM.git synced 2026-01-14 06:27:45 +08:00

Author	SHA1	Message	Date
Robin Kobus	e943ad5a2a	[https://nvbugs/5247414 ] fix: draft/target probs shape (#4055 ) Shape was wrongly changed in DecoderState introduction. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-06 09:56:43 +02:00
Yuan Tong	4b6c19737b	feat: support add internal cutlass kernels as subproject (#3658 ) Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>	2025-05-06 11:35:07 +08:00
brb-nv	5b1aeb6730	test: Test OOB access issue in penaltyKernel for endId=-1 (#4035 ) Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>	2025-05-05 10:24:28 -07:00
Mike Iovine	8caf200322	[fix] Skip debugCheckSemaphores in stream capture mode (#4032 ) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>	2025-05-05 10:24:10 -07:00
Robin Kobus	ccff86068e	fix: request termination in pipeline parallelism (#3892 ) * feat: Implement synchronous request termination in batch manager - Added `terminateRequestSync` method to `TrtEncoderModel` and `TrtGptModelInflightBatching` for handling request termination in the next `forwardSync` call. - Updated existing request termination logic to utilize the new synchronous method, ensuring generated tokens are cleared appropriately. - Enhanced logging for clarity in token management during request processing. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fixup! feat: Implement synchronous request termination in batch manager Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fix: MockedModelCancelRequest Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fixup! feat: Implement synchronous request termination in batch manager Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fix: terminate with timeout Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fixup! feat: Implement synchronous request termination in batch manager Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * docs: Update doc string for allottedTimeMs Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-05 21:51:41 +08:00
Robin Kobus	9f9edd783c	refactor: Introduce MpiTag enumeration and update MPI function signatures (#3893 ) * refactor: Move executor recv functions into classes Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Enhance MPI logging and error handling - Updated MPI logging to include destination and tag information for better traceability during send and receive operations. - Added error checking for MPI_Wait and MPI_Cancel calls to ensure proper handling of multi-device requests. - Improved code structure for clarity and maintainability. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Introduce MpiTag enumeration and update MPI function signatures - Added a new header file `mpiTags.h` to define an enumeration for MPI tags, improving code readability and maintainability. - Updated function signatures in `mpiUtils.h` and `mpiUtils.cpp` to use the new `MpiTag` type instead of raw integers for tags. - Refactored various MPI calls across the codebase to utilize the new `MpiTag` enumeration, enhancing type safety and clarity. - Removed redundant MPI tag constants from several classes, streamlining the code. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fixup! refactor: Introduce MpiTag enumeration and update MPI function signatures Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Rename tags for consistency Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-04 13:24:29 +02:00
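The typed-tag idea in 9f9edd783c (replacing raw integer MPI tags with an `MpiTag` enumeration) can be illustrated with a small sketch. This is a Python analogy, not the C++ code in `mpiTags.h`: the tag names, their values, and the `send` helper are all hypothetical.

```python
from enum import IntEnum


class MpiTag(IntEnum):
    """Hypothetical tag names and values; the real ones are defined in mpiTags.h."""
    ORCHESTRATOR_ID = 1
    PENDING_REQUEST = 2
    CANCEL_REQUEST = 3


def send(payload: bytes, dest: int, tag: MpiTag) -> dict:
    """Stand-in for an MPI send that rejects raw integers used as tags."""
    if not isinstance(tag, MpiTag):
        raise TypeError(f"tag must be an MpiTag, got {type(tag).__name__}")
    # A real implementation would hand int(tag) to the MPI library here.
    return {"dest": dest, "tag": int(tag), "size": len(payload)}
```

Passing `MpiTag.CANCEL_REQUEST` works, while passing a bare `3` raises `TypeError`, which is the kind of mistake the enumeration is meant to catch early.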
Robin Kobus	403370af62	refactor: Move ModelSpec to core library (#3980 ) * refactor: Move ModelSpec from tests to core library Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Move ModelSpec from runtime to separate dir Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Use new bindings path and clean up Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Updated licenses Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Remove script_dir from path Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-05-04 01:39:09 +08:00
Daniel Cámpora	c7cf032b89	fix: Move all casters to customCasters. (#3945 ) * Move all casters to customCasters. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Use customCasters in all bindings. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Added customCasters to userbuffers. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> --------- Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-05-02 19:08:28 +08:00
Simeng Liu	873c7532fd	feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator. (#3438 ) * feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator. Previously, the RMSNorm implementation only supported a single input tensor. With group_rms_norm, multiple tensors can be normalized together: ```python input_a, input_b, ... = group_rms_norm([input_a, input_b, ...]) ``` All input tensors must share the same batch dimension. The kernel partitions work by dynamically assigning warp groups proportional to the last dimension of each input, improving launch efficiency and reducing overhead. This MR provides two implementations: GroupRMSNormKernel: Optimized for small-to-medium batch sizes GroupRMSNormKernelLargeBatch: Contains additional optimizations for large batch sizes Both kernels are currently exposed as custom PyTorch ops. A future MR will implement heuristic-based kernel selection and expose a unified interface. Signed-off-by: Simeng Liu <simengl@nvidia.com> * Resolve comments and fix typo with IS_FLASHINFER_AVAILABLE Signed-off-by: Simeng Liu <simengl@nvidia.com> --------- Signed-off-by: Simeng Liu <simengl@nvidia.com>	2025-05-02 13:25:30 +08:00
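As a reading aid for 873c7532fd, the semantics of normalizing several tensors in one call can be sketched in plain Python. This is a reference sketch only, not the CUDA kernel: the function name `group_rms_norm_reference`, the per-input `weights` argument, and the `eps` default are assumptions, and the warp-group partitioning described in the commit is not modeled.

```python
import math
from typing import List

Matrix = List[List[float]]  # shape [batch][hidden_dim]


def _rms_norm_row(row: List[float], weight: List[float], eps: float) -> List[float]:
    """Scale one row so that its root mean square is (approximately) 1."""
    mean_sq = sum(v * v for v in row) / len(row)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(row, weight)]


def group_rms_norm_reference(
    inputs: List[Matrix], weights: List[List[float]], eps: float = 1e-5
) -> List[Matrix]:
    """Normalize each input independently; all inputs must share the batch size."""
    if not inputs:
        return []
    batch = len(inputs[0])
    if any(len(x) != batch for x in inputs):
        raise ValueError("all inputs must share the same batch dimension")
    return [
        [_rms_norm_row(row, w, eps) for row in x]
        for x, w in zip(inputs, weights)
    ]
```

Each input keeps its own last dimension, so `input_a` with hidden size 2 and `input_b` with hidden size 3 can be passed together as long as both have the same number of rows.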
Erin	8fe7bdeacf	feat: LogitsProcessor in PyTorch backend (#3145 ) * support lp in pytorch backend Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> * fix tp Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> --------- Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>	2025-05-01 14:15:30 -07:00
Erin	83f37614ef	feat: Support Top-K logprobs and prompt_logprobs in LLMAPI (#3388 ) * support return logprob in llmapi Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> update and add test Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> stability test Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> * revert removal of old flag Signed-off-by: Erin Ho <erinh@nvidia.com> Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> --------- Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> Signed-off-by: Erin Ho <erinh@nvidia.com>	2025-05-01 12:47:14 -04:00
YueWeng	b1621e8d4e	feat: add relaxed acceptance for DS (#3865 ) * add relaxed acceptance for DS R1 Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com> * clean and update docs Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com> * fix Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com> * Modified based on review Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com> * fix mtp manager issue Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com> --------- Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com> Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-05-01 21:50:36 +08:00
hlu1	1294ecb12f	Add attention workspace memory check (#3970 ) Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>	2025-04-30 23:51:09 -07:00
Kate Cheng	7dbe618683	feat: Add multimodal embedding field in LlmRequest (#3855 ) * Add a new param to LlmRequest and Request to natively support mm Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * update comment Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Update tests to match the new LlmRequest constructor parameters Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Modify unitTest and modify mm_embedding's dict name in llama4 Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Fix based on comments Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Fix comment Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Fix LlmRequest initialization in kvCacheManagerTest Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Clean up code for prompt_tuning_config Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Clean up prompt_tuning_config in GenerationRequest Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> --------- Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>	2025-05-01 12:23:30 +08:00
Yukun He	9cc5922a0b	Clean up allreduce op in Deepseek V3 model. (#3829 ) * Replace deepseek_allreduce op with the new unified allreduce op and moe_allreduce op. * Minor revision of moe_allreduce op argument names. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-05-01 07:56:36 +08:00
Dom Brown	b40f351b7a	[TRTLLM-4460] test: Use Llama 3.2 1B for Llama C++ tests (#3206 ) * Squash of dev commits Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> * Add timer + waive test with suspected GptSession bug Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> * Respond to reviewer comments Signed-off-by: domb <3886319+DomBrown@users.noreply.github.com> --------- Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> Signed-off-by: domb <3886319+DomBrown@users.noreply.github.com>	2025-05-01 05:31:08 +08:00
tburt-nv	7053d0ad5a	infra: add conan (#3744 ) This MR integrates Conan into the build system, so that it can be used to fetch dependencies in future changes. Also installs all requirements-dev.txt inside a virtualenv instead of the system, since some of Conan's dependencies may conflict with the system packages. Virtualenv is used instead of venv because the triton server backend container has only virtualenv installed. This also allows developers to cache the requirements-dev.txt packages between container launches. Signed-off-by: Tyler Burt <195370667+tburt-nv@users.noreply.github.com>	2025-04-30 11:53:14 -07:00
nv-guomingz	dd959de0fd	chore: update internal_cutlass_kernels. (#3973 ) Signed-off-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com> Co-authored-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>	2025-04-30 22:13:17 +08:00
Ming Wei	ed887940d4	infra: open source XQA kernels (#3762 ) Replace libtensorrt_llm_nvrtc_wrapper.so with its source code, which consists of two parts: 1. NVRTC glue code 2. XQA kernel code During TensorRT-LLM build, XQA kernel code is embedded as C++ arrays via gen_cpp_header.py and passed to NVRTC for JIT compilation. Signed-off-by: Ming Wei <2345434+ming-wei@users.noreply.github.com>	2025-04-30 18:05:15 +08:00
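The embedding step described in ed887940d4 (kernel source turned into C++ arrays at build time, then handed to NVRTC) can be sketched as follows. This is not the actual `gen_cpp_header.py`: the function name `to_cpp_array`, the emitted variable naming, and the layout are assumptions made for illustration.

```python
def to_cpp_array(name: str, data: bytes, per_line: int = 12) -> str:
    """Render `data` as a C++ byte array definition plus a length constant.

    A trailing NUL byte is appended so the array can be used as a C string,
    which is the form a runtime compiler typically expects for source text.
    """
    terminated = data + b"\x00"
    lines = []
    for i in range(0, len(terminated), per_line):
        chunk = terminated[i : i + per_line]
        lines.append("    " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    body = "\n".join(lines)
    return (
        f"constexpr unsigned char {name}[] = {{\n{body}\n}};\n"
        f"constexpr unsigned int {name}_len = {len(data)};\n"
    )
```

Calling `to_cpp_array("kKernelSrc", source_bytes)` yields text that can be written to a header and compiled into the library, so the kernel source ships inside the binary instead of in a separate prebuilt `.so`.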
Bo Li	a80d2373a3	fix: [https://nvbugspro.nvidia.com/bug/5243482 ] If FlashMLA is used, the existence of FMHA based MLA kernels should not be checked. (#3862 ) * Add mIsGenerationMLA to differentiate ctx and gen MLA in AttentionOp. For Generation MLA, if FlashMLA is used, do not check the existence of FMHA based MLA kernel. Signed-off-by: Bo Li <bobboli0202@gmail.com> * Run pre-commit. Signed-off-by: Bo Li <bobboli0202@gmail.com> * Fix compile error. Signed-off-by: Bo Li <bobboli0202@gmail.com> --------- Signed-off-by: Bo Li <bobboli0202@gmail.com>	2025-04-30 14:27:38 +08:00
djns99	cc989ea49f	perf: Optimise MOE prologue to use fused setup function (#3790 ) Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>	2025-04-30 11:44:48 +08:00
Pamela Peng	f98a80f9d9	sync internal cutlass kernel changes (#3968 ) Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>	2025-04-30 08:57:28 +08:00
xiweny	68a19a33d4	TRTLLM-4624 feat: Add nvfp4 gemm and moe support for SM120 (#3770 ) * upgrade cutlass to 3.9 Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com> update latest internal_cutlass_kernels; revert cutlass version update; fix fp4 gemm for sm100 Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com> * update internal cutlass kernels Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> * fix file Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> * remove unnecessary change Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> * update hash Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> --------- Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com> Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com> Co-authored-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com>	2025-04-29 11:19:11 -04:00
Dom Brown	8709fe8b53	chore: bump version to 0.19.0 (#3598 ) (#3841 ) test: add test cases for 0.19 release (#3608) * fix test name * add quickstart test for nemotron-ultra * add rcca multi-node test case for deepseek-v3 * add rcca info --------- squash (#3642) fix: nvbugs/5187237: fix deterministic mode crash (#3448) * nvbugs/5187237 nvbugs/5112075: fix deterministic mode error * remove waive * Revert "remove waive" This reverts commit 0bf5486d19906d692bfb7a6262333c296b0087ac. * revert ar fusion --------- update fp8 doc (#3647) tests: change qa perf test to trtllm-bench (#3619) fix: FP8 quantized lm_head (NvBug 5214229) (#3567) infra: Add PR approval protection for the release branch (#3634) fix: nvbugs/5231298: pytorch allreduce issue (#3673) Fix: nvbugs/5222698 variable not defined (#3630) * Fix: nvbugs/5222698 variable not defined * Tidy code --------- test:sync waives.txt from main branch by disabling test_perf/gpt_350m-cppmanager case (#3685) test:restore fp8 kv cache testing for L0 (#3671) doc: Update DeepSeek perf docs (#3693) * Update DeepSeek perf docs * update * Apply suggestions from code review --------- tests: waive test_llm_multi_node (#3664) fix: update test_user_buffers_mm_add_prologue atol (#3711) Fix: cherry-pick hmac encryption from main branch (#3635) * security fix cherry-pick changes from main * fix hmac in remote mpi session (#3649) --------- Un-waive DS-V3-Lite tests. (#3621) fix: FP8 kv accuracy (#3675) * fix FP8 kv accuracy * update doc --------- Fix script options for engines. (#3622) unwaive multi-node test (#3721) chore : Split more tests out of gpt tests (#3524) (#3674) doc:add torch examples link into torch backend documentation (#3749) test: Get Eagle tests working (#3593) (#3722) Waive L0 test (#3756) waive failed case in perf test, change default max_batch_size to 512 and write config.json to output log (#3656) Update ds v3 parameters in stress test. 
(#3676) waive gemma on L20 (#3766) https://nvbugs/5141291: Fix convert.py script for Qwen model. (#3758) Include Qwen2VLDecoderLayer in the smooth_qwen2_model function. fix: PP4 fixes and cleanup (#3688) remove benchmark test list (#3643) skip disagg deepseek test if sm!=90 (#3720) test: skip failed cases on B200 (#3710) * add skip condition to tests * fix error --------- test: [nvbug: 5234494] skip_pre_ada for fp8 cases (#3718) * skip_pre_ada for fp8 cases * update * update after rebase --------- add known issue to deepseek doc. (#3800) Fix ModelOpt Mixtral AWQ OOM (#3714) (#3761) Waive L0 tests (#3826) fix: Reduce memory usage in fused moe op associated with AutoTuning and fix moe fallback issue. (#3793) * Reduce memory usage in fused moe op associated with AutoTuning. * Replace pre-defined bucket size strategy with a generating function based on the tune_max_num_tokens. * Add free_memory logic of workspace in min_latency_mode fused moe path. * Fix fused_moe fallback issue. (#3652) min_latency_mode is only set to False during warmup phase. Thus when it becomes true during inference, all tactics fall back to the default one and thus cause perf regression. --------- [doc] Better document for Draft-Target-Model (DTM) speculative decoding (#3797) Fix pre-commit Fix again Address some review comments for the MI Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>	2025-04-29 16:57:22 +08:00
zhhuang-nv	94e6167879	optimize cudaMemGetInfo for TllmGenFmhaRunner (#3907 ) Signed-off-by: Zhen Huang <145532724+zhhuang-nv@users.noreply.github.com>	2025-04-29 14:17:07 +08:00
Perkz Zheng	35c5e4f1c5	feat: add CGA reduction fmha kernels on Blackwell. (#3763 ) * update cubins Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> * add trtllm-gen kernels for eagle3 and also kernels with cga-reduction Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> * address the comments Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com> --------- Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>	2025-04-29 10:43:54 +08:00
Jinyang Yuan	dafc28fb85	fix: Fix FMHA-based MLA in the generation phase and add MLA unit test (#3863 )	2025-04-29 09:09:43 +08:00
Yukun He	5502a522d2	Fixing minor typo in allreduce kernel selection (#3912 ) Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> Co-authored-by: Kefeng-Duan <176893526+Kefeng-Duan@users.noreply.github.com>	2025-04-28 23:06:49 +08:00
Chuang Zhu	e2318756ed	cacheTransceiver buffer manager (#3798 ) * cacheTransceiver buffer manager Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * fix args Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * cpp kvCacheManager Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * format Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> --------- Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-04-27 11:48:15 +08:00
Dom Brown	7ff9fd345c	Test: Split C++ unit tests for CI granularity (#3868 ) Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>	2025-04-25 13:30:58 -07:00
qixiang-99	ecd621fb0a	feat: Add head size 72 support for QKV Preprocessing kernel (#3743 ) * refactor: Fix headsize 72 attention error for TRTLLM attn backend in PyTorch workflow - Remove the head size pre-check logic in AttentionOp because head size 72 can be supported with fmha kernels. - Added support for head size 72 in unfused attention kernels(QKVPreprocessing). - Enhanced unit tests by introducing a scenario generation function for better test coverage of attention configurations(include head size 72). Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> * update: Waive head_dim=72 test cases and enhance test representation - Added a waiver for head_dim=72 cases on post sm100 in the test suite to address known issues. - Introduced a custom __repr__ method in the Scenario class for pytest substring match. Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com> --------- Signed-off-by: qixiang-99 <203170375+qixiang-99@users.noreply.github.com>	2025-04-25 11:07:40 -07:00
dongxuy04	16535991b2	feat: Add MNNVL MoE A2A support (#3504 ) * add MNNVL memory mapping support Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * add more MPI environment for trtllm-llmapi-launch Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * add MoE communication and prepare kernels Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * add MNNVL AlltoAll support for DeepSeekV3 Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * add output dump for throughput benchmark Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * support dynamic kernel launch grid Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * address review comments Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> * address review comments #2 Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com> --------- Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>	2025-04-25 17:29:08 +08:00
QI JUN	991939a0f4	chore: increase A30 for cpp test (#3811 ) * increase A30 for cpp test Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * enable parallel run test for gpt_executor Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * clean Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * decrease freeGpuMemoryFraction of cpp tests Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> --------- Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-04-24 16:34:39 -07:00
Shi Xiaowei	1d5178814b	Fix: Revert commit `25f9669` (#3832 ) Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>	2025-04-24 14:03:20 +08:00
QI JUN	d0d19e81ca	chore: fix some invalid paths of contrib models (#3818 ) Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-04-24 05:36:16 +08:00
Kaiyu Xie	dfbcb543ce	doc: fix path after examples migration (#3814 ) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-04-24 02:36:45 +08:00
Julien Debache	0c6c8eaffd	fix: 5197419 and removed unused runtime kernels (#3631 ) - Removed kernel under test call, as it was not needed - Removed kernel itself - Removed kernel tests - Removed other unused kernels and their tests - Some static analysis clean up	2025-04-23 18:04:50 +02:00
Daniel Cámpora	1299f27c74	fix: Fix C++ decoder synchronization in PyTorch (#3106 ) * Use updateDecoderBuffers in python decoder. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Fix synchronize in trtllm decoder. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Enable by default. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Use guided_decoder to setup seqslots and free them. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Use always decode_async and update_requests. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Update decoder buffers. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Fix speculative decoding tests. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Send new_tensors_host instead of assuming dict. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Make default False in enable_trtllm_decoder. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Partially fix mtp, partially fix py_executor. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Update request states before sending disagg ctx cache. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Fix disagg test for torch decoder. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Make isend_tensor_list and recv_tensor_list for sending the tensors_host. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Formatting. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Fix rebase. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Add disagg serving case to guided decoder. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Get overlap scheduling to work. 
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Update cutlass to main. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Update after rebasing. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Formatting. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Update to use decode async and update requests. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Properly pass information to update_requests Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Formatting. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Make disaggregated serving a step closer to working. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Fix rebase. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Fix rebase and format. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Copy new device tokens more pythonic. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Restore MTP add dummy reqs. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Add ordereddict import to py_executor. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Formatting. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Added seq slot manager. Add test. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Use transmission for single tensor except when list of tensors is received. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Add TRTLLMDecoder allocation to estimate max kv cache tokens. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Add stream synchronization Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Formatting. 
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Make memory calculation of decoder adapt to the chosen decoder. Recognize decoder option passed in executorconfig. Make overlap scheduler test run on TinyLlama. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Format Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Add decoder creation to estimate max kv. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Formatting. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> * Update submodule UCXX inline with main. Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> --------- Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-04-23 23:55:27 +08:00
Shi Xiaowei	25f96697ad	fix: Intercept the error of multi-rank binding to a single card (#3525 )	2025-04-23 15:50:18 +08:00
Zongfei Jing	1e5af736ea	Add smart router for moe (#3641 ) Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>	2025-04-23 12:21:59 +08:00
Perkz Zheng	0324a7389d	add QMMA-based MLA kernels (#3752 )	2025-04-23 10:18:19 +08:00
William Tambellini	44bff85e08	Fix double link to fp8_blockscale_gemm_src (#3707 ) Fix https://github.com/NVIDIA/TensorRT-LLM/issues/3690 Signed-off-by: William Tambellini <wtambellini@sdl.com> Co-authored-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>	2025-04-23 10:16:07 +08:00
Zongfei Jing	7eee9a9d28	doc: Update doc for Deepseek min latency (#3717 ) * Tidy code Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Update doc for min latency deepseek Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Throw exception for RouterKernel when not running on sm90+ Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> --------- Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>	2025-04-22 23:07:59 +08:00
Yukun He	0ae7017342	Unify two versions of AllReduce custom op (#3032 ) * Rewrite unit test for unified allreduce op. Removing the legacy unit test. * Revise formats, fusion_op bindings. Put all tensors as optional inputs. * Move the MoeAllreduceOp to a separate custom op. * Move all the fusion patterns to the new version of the AllReduce fusion kernel. Remove the AllReduce strategy config. Revise the AllReduce strategies and fusion pattern definitions. * Add more TODOs, fixing minor bugs, and remove legacy code. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-04-22 21:58:42 +08:00
Robin Kobus	8340657ae4	refactor: Introduce DecoderOutputBuffers per batch (#3506 ) * refactor: Restructure DecoderBuffers and DecoderStepAsyncSend - Move communication logic from `DecoderBuffers` to `DecoderStepAsyncSend`. - Updated `DecoderStepAsyncSend` constructor to utilize the `DecoderBuffers`, enhancing clarity and reducing parameter complexity. - Refactored related methods to align with the new class structure, improving maintainability and readability of the code. These changes streamline the handling of decoding buffers and improve the overall architecture of the batch manager. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Restructure SlotDecoderBuffers and DecoderSlotAsyncSend - Move communication logic from `SlotDecoderBuffers` to `DecoderSlotAsyncSend`. - Updated `DecoderSlotAsyncSend` constructor to utilize the `SlotDecoderBuffers`, enhancing clarity and reducing parameter complexity. - Refactored related methods to align with the new class structure, improving maintainability and readability of the code. These changes enhance the structure and readability of the batch manager's decoding process. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Log DecodingMode Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Introduce DecoderOutputBuffers and update related classes - Moved buffers from `DecoderBuffers` to `DecoderOutputBuffers` to better reflect its purpose. - Updated the `DecoderStepAsyncSend` class to utilize `DecoderOutputBuffers`, enhancing clarity in the communication logic. - Refactored the constructor and methods in `DecoderBuffers` to accommodate the new structure, improving maintainability. - Added Python bindings for `DecoderOutputBuffers` to ensure compatibility with existing interfaces. These changes streamline the handling of output buffers in the decoding process, improving the overall architecture of the batch manager. 
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Update MPI communicator handling - Changed the `commSession` parameter type from `std::shared_ptr<mpi::MpiComm>` to `mpi::MpiComm` in `DecoderStepAsyncSend` and `DecoderSlotAsyncSend` classes for improved clarity and reduced complexity. - Updated related methods and constructors to reflect the new parameter type, enhancing maintainability. - Refactored the `TrtGptModelInflightBatching` class to accommodate these changes, ensuring consistent usage of `MpiComm`. These modifications streamline the communication logic in the decoding process, improving the overall architecture of the batch manager. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Replace shared_ptr with unique_ptr for buffer management - Updated the `TrtGptModelInflightBatching` class to use `std::unique_ptr` instead of `std::shared_ptr` for various buffer types, including `AllReduceBuffers`, `RuntimeBuffers`, `DecoderBuffers`, and `SlotDecoderBuffers`. - This change enhances memory management and ownership semantics, reducing overhead and improving performance. These modifications contribute to a cleaner and more efficient architecture in the batch manager. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-04-22 12:25:53 +08:00
Zheng Duan	ae48abefc1	bind block key and hasher (#3712 )	2025-04-21 18:50:57 +08:00
Iman Tabrizian	af04b6f6aa	bug: Fix hang bug when context server doesn't have enough capacity for KV Cache (#3095 ) * Fix hang bug when KV cache is low Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com> * Review comments Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com> * Fix attentiondp typo Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com> * Add CI test for this case Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com> * fix: Fix the insertion order for responder futures Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com> * fix: Fix disagg CPP Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com> --------- Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com> Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>	2025-04-21 15:16:55 +08:00
Jinyang Yuan	bc2b01d1dd	chore: update FMHA cubin files (#3680 ) Signed-off-by: Jinyang Yuan <154768711+jinyangyuan-nvidia@users.noreply.github.com>	2025-04-21 15:04:04 +08:00
katec846	eeb605abd6	feat: Offloading Multimodal embedding table to CPU in Chunked Prefill Mode (#3380 ) * Feat: Offload ptable to cpu if enable_chunk_context Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Feat: offload ptable to cpu for chunk context mode Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Fix and add comment Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Update Readme for multimodal and add a new param mm_embedding_offloading Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * fix: Correct prompt table offloading condition in PromptTuningBuffers Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Clean up the code Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Add commits to explain copy from cpu <-> gpu using pinned memory Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Fix namings based on comments Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Fix format based on precommit Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> * Modify --mm_embedding_offloading flag Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> --------- Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com> Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>	2025-04-21 14:31:01 +08:00
hlu1	31624b079a	feat: [Deepseek] Add trtllm-gen MOE FP4 MOE backend (#3387 ) * Add TRT-LLM Gen MOE to Deepseek fix fused moe rebase bug. Fix atol in test_fp4_gemm_quantize.py fix fused moe rebase bug. Fix FusedMoe. Disable 2nd routing kernel preexit Bump routing reduction to fp32 Disable PDL for fc1 [DEBUG] Lift token limit to 16k [Bugfix] Token limit to 16k + fp32 routing + tanh Make fp8 tileN 8 Fix FP8 MoE + Remove redundant temp output for FP4 [FP8-only] Avoid wasting CTAs for activation kernel fix: unblock FP8 weightloading with trtllm-gen Remove max_token limit for trtllm-gen path perf: avoid type-conversion and fill_ from aten Minor fix Signed-off-by: Hao Lu <haolu@nvidia.com> * Fix rebase issues Signed-off-by: Hao Lu <haolu@nvidia.com> * Fix compile issue Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * CI clean Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> --------- Signed-off-by: Hao Lu <haolu@nvidia.com> Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> Co-authored-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>	2025-04-21 10:01:33 +08:00
QI JUN	d51ae53940	move the reset models into `examples/models/core` directory (#3555 ) * move rest models to examples/models/core directory Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * update multimodal readme Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix example path Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix cpp test Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix tensorrt test Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: 
junq <22017000+QiJune@users.noreply.github.com> * fix ci Signed-off-by: junq <22017000+QiJune@users.noreply.github.com> --------- Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-04-19 20:48:59 -07:00
Dom Brown	dbd9a83b0d	feat: Integrate GPUDirect Storage (GDS) into Executor API (#3582 ) * feat: Integrate GPUDirect Storage (GDS) into Executor API Squash of several dev commits Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>	2025-04-18 15:59:21 +08:00
Yuan Tong	0b0e6d8a0a	refactor: Clean up CMakeLists.txt (#3479 ) Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>	2025-04-18 14:39:29 +08:00
Jackch-NV	1b2b112d44	fix sage attention headsize check error in bertAttentionPlugin.cpp (#3660 ) Signed-off-by: Jackch-NV <69230184+Jackch-NV@users.noreply.github.com>	2025-04-18 09:28:04 +08:00
Netanel Haber	3c52ac098f	feat: allocate minimal blocks per window size (#3028 ) * implement variable window attention by breaking the block manager into window block managers per window size Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * revert isCyclic to be true if the min attention window is reached, not per window size Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * add explanatory comment to mCyclicThreshold Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * load correct gemma config Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * don't shadow inputLength in addSequence - it should remain the function scope input length between window size loop iterations Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix KVCacheManagerVariableWindowAttentionWithReuseTest for multiple window block managers Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * if TYPE_CHECKING Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * set temp_attention_window_inputs to None explicitly Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * set temp_attention_window_inputs to None explicitly Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * pass dtype as well Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * test_gemma variable sliding window attention Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * allot a fraction of primary/secondaryBlocks to different window size heaps, depending on the window size's total contribution to the kvcache size (i.e., including all layers) Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * remove \|\| mEnableBlockReuse which erroneously triggers beamsearch code for cyclic variable attention 
window code Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * turn off request delaying for MaxUtil Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * make comments better Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * windowSizesTotalSum using std::accumulate Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix error handling of forwardAsync - forwardAsync catch-all catch cleanup code that runs terminateRequest can also fail and must be caught Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix comments Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * remove assert that kills disagg tests, since it isn't necessary Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix corrupted expression: 'isNewTask && (peftCacheManager ?' -> '(isNewTask && peftCacheManager) ?' which caused boolean algebra. 
Main is correct Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * add Gemma3 to SUPPORTED_HF_ARCHITECTURES Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * support Gemma3 Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * finally fix test_gemma - always spread at least {} into generate_summary_cmd, never None Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * finally fix test_gemma - always spread at least {} into generate_summary_cmd, never None Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix kvfactor field for deepseek Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix comment Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix gemma-3 entries in testlist to include vswa Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * only quantize gemma2 VSWA Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> remove misleading comment Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> fix test_gemma Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix test_gemma Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix test_gemma Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * in sendRequestInfo, fromOldAllocatedBlockIds->fromOldAllocatedBlockIds, like in main Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> * fix: disable KV cache reuse if using attention sink (#3021) * fix: disable KV cache reuse if using attention sink Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fix: disable KV cache reuse if sink bubble Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * add comment Signed-off-by: Robin Kobus 
<19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com> Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> Co-authored-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-04-17 16:04:57 +08:00
danielafrimi	0f084d9566	added loraOp into lora layer + test for mlp and comparison to lora plugin (#3455 ) Loraop integration into torch modules Signed-off-by: Ubuntu <dafrimi@nvidia.com>	2025-04-17 12:48:27 +08:00
Void	950cadf2bd	add support for smaller hidden_dim (#3609 ) Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com> Co-authored-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-04-17 12:00:32 +08:00
Olya Kozlova	b3e6723dbc	feat: Adding FP8 BMM from Codegen (#3541 ) * Adding FP8 BMM from Codegen Signed-off-by: Olya Kozlova <okozlova@s4124-0110.nvidia.com> * Fixed licenses Signed-off-by: Olya Kozlova <okozlova@s4124-0062.nvidia.com> --------- Signed-off-by: Olya Kozlova <okozlova@s4124-0110.nvidia.com> Signed-off-by: Olya Kozlova <okozlova@s4124-0062.nvidia.com> Co-authored-by: Olya Kozlova <okozlova@6u1g-0018.nvidia.com> Co-authored-by: Olya Kozlova <okozlova@s4124-0062.nvidia.com>	2025-04-16 10:37:15 +02:00
Gabriel Wu	2e0cd7922e	fix: add SM90 guard for FP8 Blockscale GEMM (#3575 ) * fix: add SM90 guard for FP8 Blockscale GEMM Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> * fix: add SM90 guard for FP8 Blockscale GEMM Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> --------- Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com>	2025-04-16 14:44:37 +08:00
Robin Kobus	fffb403125	fix: disable KV cache reuse if using attention sink (#3021 ) * fix: disable KV cache reuse if using attention sink Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fix: disable KV cache reuse if sink bubble Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * add comment Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-04-16 03:07:32 +08:00
Kaiyu Xie	258ae9c58c	Revert "infra: move nvrtc_wrapper to conan (#3282 )" (#3573 ) This reverts commit `c0dd6cbce0`. Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-04-15 22:45:13 +08:00
jiahanc	1d3b98b920	perf: Optimize quantization kernels used in DeepSeek on Hopper (#3466 ) Signed-off-by: jiahanc <jiahanc@nvidia.com>	2025-04-15 17:49:57 +08:00
Robin Kobus	b7a38feb14	chore: Clean up cpp runtime (#3537 ) * add space in test output Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * perf: reduce executor lock scope Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Move TokenRangeRetentionConfig implementation to cpp file Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fix: Improve finished steps handling for external draft tokens - Fixed a bug where the whole finished steps tensor was being zeroed instead of the slices. - Replaced the creation of a temporary tensor for finished steps with a direct slice from the input tensor, improving efficiency and readability. - Updated the tensor management logic to streamline the process of setting zero values for finished steps during batch processing. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Clean up includes Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-04-15 16:06:14 +08:00
shaharmor98	ede7058544	Feat/ Integrate peftCacheManager in PyExecutor creation (#3372 ) * integrate peftCacheManager in PyExecutor creation Signed-off-by: Shahar Mor <smor@nvidia.com>	2025-04-15 15:14:43 +08:00
Pamela Peng	6cdfc54883	feat: Add FP8 support for SM 120 (#3248 ) * Allow FP8 on SM120 Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com> * fix sm121 Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com> * fix Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com> * fix pre-commit Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com> * review update Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com> --------- Signed-off-by: Pamela Peng <179191831+pamelap-nvidia@users.noreply.github.com> Co-authored-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>	2025-04-14 16:05:41 -07:00
tburt-nv	c0dd6cbce0	infra: move nvrtc_wrapper to conan (#3282 ) * add pip scripts dir to path * move nvrtc_wrapper to conan * support building nvrtc wrapper from source --------- Signed-off-by: Tyler Burt <195370667+tburt-nv@users.noreply.github.com>	2025-04-15 05:31:01 +08:00
Robin Kobus	f58d4698c8	chore: Clean up cpp runtime (#3505 ) * chore: Remove unused tensors from DecoderBuffers Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fix: Remove unused argument from readme Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: remove unused tensor Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Remove unnecessary newOutputTokens Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Remove unnecessary event in getDecoderSlotHostOutputs Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-04-14 18:00:03 +08:00
yuxianq	9d64b6b890	Cache sin cos in model instead of global LRU cache. (#3378 ) Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>	2025-04-14 11:19:09 +08:00
pcastonguay	fe6f14b2b1	fix: Fixing issue with first gen token being returned twice in streaming (#3427 ) * fix: Fixing issue with first gen token being returned twice with streaming Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> * Fixing not_expectring_strings in test Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> --------- Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-04-13 22:45:09 -04:00
William Tambellini	af67bf00a8	feat: register ENABLE_MULTI_DEVICE and ENABLE_UCX as CMake options (#3343 ) No change of default value (still ON). These were hidden cmake vars before that patch. Fix issue #3289 Signed-off-by: William Tambellini <wtambellini@sdl.com> Co-authored-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>	2025-04-14 10:30:23 +08:00
Chuang Zhu	75e13f4f88	chore: disable some env for disagg by default (#3415 ) * disable some env for disagg by default Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * doc Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * remove Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> --------- Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-04-14 10:08:10 +08:00
Chuang Zhu	6ee021a90d	chore: exchange connection id with tagSend/tagRecv (#3320 ) * exchange connection id with tagSend/tagRecv Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * unwaive Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * tag recv/send Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> --------- Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-04-14 09:30:34 +08:00
Aurelien Chartier	7b38018fa0	feat: Add numNodes to ParallelConfig (#3346 ) * Add numNodes to ParallelConfig If not provided, attempt to find the number of nodes by adding the number of local ranks 0 Update device IDs check accordingly Signed-off-by: Aurelien Chartier <achartier@nvidia.com> * Add ParallelConfig pickle test Signed-off-by: Aurelien Chartier <achartier@nvidia.com> --------- Signed-off-by: Aurelien Chartier <achartier@nvidia.com>	2025-04-13 13:55:04 +02:00
Robin Kobus	ceec4924d9	refactor: batch slot management in decoder classes (#3300 ) * refactor: batch slot management in decoder classes - Changed `forwardBatchSlots` from a single `TensorPtr` to a `std::vector<TensorPtr>` in `decoderBuffers.h` and updated its initialization in `decoderBuffers.cpp`. - Updated `batchSlots` in `iGptDecoderBatched.h` to a `std::vector<TensorPtr>` for better handling of batch sizes. - Modified `mBatchSlotsDecoder` in `statefulGptDecoderBatched.h` to use a `std::vector<TensorPtr>` and adjusted its initialization in `statefulGptDecoderBatched.cpp`. - Ensured proper reshaping of tensors in the setup methods to accommodate the new vector structure. These changes enhance flexibility in managing tensor buffers across different batch sizes. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Setup batch slots outside of the decoder - Refactored batch slot management to utilize `makeBatchSlots`, enhancing clarity and functionality in batch processing. - Introduced `DecoderState` to `MakeDecodingBatchInputOutput` for improved state handling during decoding. - Updated the `operator()` method to include `decoderState` as a parameter, facilitating better integration with the decoding process. - Modified related tests to accommodate changes in batch slot handling and ensure proper functionality. These updates improve the overall structure and efficiency of the decoding process in the batch manager. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Enhance decoder input structure with maxDecodingEngineTokens - Updated the `Input` class in `iGptDecoderBatched.h` to include a new parameter `maxDecodingEngineTokens` for better control over decoding limits. - Modified the `MakeDecodingBatchInputOutput` algorithm to compute the maximum number of decoding tokens based on active slots. 
- Adjusted the `GptDecoderBatched` class to utilize the new `maxDecodingEngineTokens` parameter, improving clarity in token management during decoding. - Updated Python bindings to reflect changes in the `Input` class constructor. - Enhanced tests to ensure proper handling of the new parameter. These changes improve the flexibility and efficiency of the decoding process in the batch manager. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Streamline decoder input creation and batch slot management - Introduced a new function `createDecoderInputs` to encapsulate the logic for creating decoder inputs, improving code organization. - Updated the `operator()` method to utilize the new `createDecoderInputs` function, simplifying the decoding input setup process. - Removed the `maxOfActiveSlots` template function to streamline the logic for determining the maximum number of active decoding engine tokens. - Introduced a direct calculation of `maxActiveDecodingEngineTokens` within the `createDecoderInputs` function, enhancing clarity and reducing complexity. These changes enhance the maintainability and readability of the decoding process in the batch manager. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Update logits handling in decoder batch - Modified the `decoder_batch::Input` to accept a vector of vectors for logits, enhancing flexibility in tensor management. - Adjusted the `createDecoderInputs` function to accommodate the new logits structure, ensuring proper batch processing. - Updated Python bindings to reflect changes in the `Input` class constructor, maintaining compatibility with existing interfaces. - Refactored the `GptDecoderBatched` and `StatefulGptDecoderBatched` classes to utilize the updated logits structure, improving clarity in tensor slicing and batch size management. 
- Enhanced tests to validate the new input structure and ensure correct functionality across various decoding scenarios. These changes streamline the decoding process and improve the overall maintainability of the codebase. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Rename maxDecodingEngineTokens to maxDecoderSteps - Updated the `Input` class in `iGptDecoderBatched.h` to rename `maxDecodingEngineTokens` to `maxDecoderSteps` for improved clarity. - Adjusted the `createDecoderInputs` function to reflect the new naming, ensuring consistency in the decoding process. - Modified the `GptDecoderBatched` class to utilize `maxDecoderSteps` in its logic, enhancing readability and maintainability. - Updated Python bindings to expose the renamed parameter, maintaining compatibility with existing interfaces. These changes enhance the clarity of the decoding parameters and improve the overall structure of the codebase. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: remove usage of `active` vector from prepareForward Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Removed the `active` vector from `decoder_batch::Input` - Removed the `active` vector from the `Input` class constructor in `iGptDecoderBatched.h`, streamlining the input handling for decoding. - Updated the `createDecoderInputs` function and related tests to reflect the changes in the `Input` class, ensuring compatibility and maintaining functionality. - Adjusted Python bindings to accommodate the new constructor signature, enhancing clarity in the interface. These changes improve the maintainability and readability of the decoding process in the batch manager. 
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: remove usage of `active` vector from gptDecoderBatchedTest Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Unify the creation of decoder batch inputs in algorithm and tests - Added a new static method `createDecoderBatchInputs` to streamline the creation of decoder batch inputs, enhancing clarity and maintainability. - Updated the implementation to utilize active slots directly, simplifying the logic for managing batch slots and logits. - Refactored the `operator()` method to leverage the new input creation function, ensuring compatibility with existing decoding processes. - Enhanced tests to validate the new input handling approach, ensuring correct functionality across various scenarios. These changes improve the overall structure and readability of the decoding process in the batch manager. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: remove usage of active vector from createDecoderBatchInputs Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Update maxDecoderSteps calculation - Replaced integer division with `common::ceilDiv` for calculating `maxDecoderSteps` and `numDecoderSteps`, ensuring correct handling of token counts. These changes enhance the robustness of the decoding batch input creation process. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-04-13 05:05:13 +08:00
Robin Kobus	2ab71f9a80	refactor: decoder buffers (#3307 ) * refactor: remove cumLogProbs and logProbs from DecoderBuffers - Eliminated cumLogProbs and logProbs from DecoderBuffers, streamlining the buffer management. - Updated related code in decoderBuffers.cpp and bindings.cpp to reflect these changes, ensuring that only host pointers are used for log probabilities. These modifications enhance code clarity and maintainability by reducing redundancy in buffer management. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: streamline sequence length handling in GptDecoderBatched and StatefulGptDecoderBatched - Updated GptDecoderBatched to directly use output.sequenceLengths for lengths assignment, removing unnecessary reshaping. - Adjusted StatefulGptDecoderBatched to ensure sequence lengths are correctly shaped based on actual batch size and max beam width. These changes enhance clarity and maintainability in the decoding process. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: integrate DecoderState for sequence length management in decoding process - Updated DecoderBuffers to remove direct handling of sequence lengths, now utilizing DecoderState for this purpose. - Adjusted MakeDecodingBatchInputOutput to accept DecoderState, enhancing clarity in the decoding input/output management. - Refactored GptDecoderBatched and StatefulGptDecoderBatched to streamline sequence length handling, ensuring consistency across the decoding workflow. refactor: update SlotDecoderBuffers to manage sequence lengths directly - Introduced sequenceLengths and sequenceLengthsHost to SlotDecoderBuffers for better management of sequence lengths. - Refactored asyncSend and recv methods to utilize the new sequenceLengths member, enhancing clarity and reducing redundancy. - Updated TrtGptModelInflightBatching to align with the new structure, ensuring consistent handling of sequence lengths across the decoding process. 
These changes improve maintainability and streamline the decoding workflow. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Delegate to asyncSend method in SlotDecoderBuffers Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-04-12 11:41:24 +02:00
Robin Kobus	1bd84c6d8c	feat: Allow individual gatherContext for each additional output (#3374 ) * refactor: Update ExecutorConfig to use AdditionalModelOutput type - Changed function signatures and member variables across multiple files to replace std::optional<std::vector<std::string>> with std::optional<std::vector<executor::AdditionalModelOutput>> to include gatherContext flag for each additional output. - Updated related serialization and deserialization methods to accommodate the new type. - Adjusted tests to reflect the changes in the output handling structure. This refactor enhances the flexibility and maintainability of the output configuration in the executor and batch manager components. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Remove equality operator from TrtGptModelOptionalParams - Deleted the operator== implementation from TrtGptModelOptionalParams to simplify the class. - Updated the pybind11 bindings to remove the exposure of the equality operator to Python. This change streamlines the class definition and reduces unnecessary complexity in the bindings. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Enhance copyAdditionalOutputs to utilize AdditionalModelOutput - Updated the copyAdditionalOutputs function to accept a vector of AdditionalModelOutput, allowing for the inclusion of the gatherContext flag. - Adjusted the logic to handle context and non-context outputs separately, improving the output handling mechanism. - Modified related unit tests to incorporate the new gatherContext parameter, ensuring comprehensive testing of the updated functionality. This refactor improves the flexibility and clarity of output management in the batch processing workflow. 
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Introduce findOutputTensor utility function for output tensor retrieval - Added a new utility function, findOutputTensor, to encapsulate the logic for finding output tensors and checking their validity. - Refactored copyAdditionalOutputs to utilize findOutputTensor, reducing code duplication and improving clarity. - Enhanced error checking for additional context and generation output tensors. This change streamlines the output tensor retrieval process, enhancing maintainability and readability in the batch processing workflow. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Check final indices of additional output tensors and update tests - Added checks to verify the final indices of additional output tensors for context and generation outputs. - Updated unit tests to verify the changes. - Add lastTokenIds input tensor to test engines. - Logits output depends on gatherContextLogits parameter. - Removed gatherContextOutputs parameter from the validate method in LlmRequest. - Context outputs do not depend on computeContextLogits parameter. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fixup! refactor: Check final indices of additional output tensors and update tests Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fixup! refactor: Update ExecutorConfig to use AdditionalModelOutput type Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fixup! refactor: Remove equality operator from TrtGptModelOptionalParams Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * docs: Update executor.md Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Clean up includes Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-04-12 17:00:36 +08:00
Robin Kobus	aeecdb0ab9	fix: Eagle decoding (#3456 ) * fix: eagle packAcceptedPaths Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * test: Add wavefront tests for Eagle Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-04-11 22:06:38 +08:00
Yukun He	ff82aef99b	Fix the issues related to fused moe path. (#3435 ) * One of the tactics is not supported during dispatch. * final_hidden_states should be unpacked if it is not min_latency_mode. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-04-11 21:41:15 +08:00
liji-nv	b168adba70	feat: Add NVFP4 UB pattern optimization pass in torch compile (#3371 ) * feat: Add NVFP4 UB pattern optimization pass in torch compile * Add an additional flag for UB fp4 pattern to avoid inverting the scale * Add NVFP4 related UB patterns Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com> * Update atol, some points fail for B200 umbriel. Signed-off-by: liji-nv <59594262+liji-nv@users.noreply.github.com> --------- Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com> Signed-off-by: liji-nv <59594262+liji-nv@users.noreply.github.com>	2025-04-11 21:25:29 +08:00
pansicheng	143edc8153	fix partialMatch (#3413 ) Signed-off-by: pansicheng <sicheng.pan.chn@gmail.com>	2025-04-11 16:42:52 +08:00
Yuan Tong	a139eae425	chore: Stabilize ABI boundary for internal kernel library (#3117 ) chore: Stabilize ABI boundary for internal kernel library Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>	2025-04-11 15:07:50 +08:00
wili	5142c783c0	fix: Beam Search Diversity (#3375 ) Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com> Co-authored-by: wili-65535 <wili-65535@user.noreply.github.com>	2025-04-11 11:58:59 +08:00
Dom Brown	a8310b01dc	feat: trtllm-gen fp4 GEMM for pytorch workflow (#3423 ) * feat: trtllm-gen fp4 GEMM Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> * Clean up Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> * Remove incorrect header Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> * Reviewer comment Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> --------- Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>	2025-04-11 02:28:07 +08:00
HuiGao-NV	3ade9375ba	feat: Run PyExecutor's inference flow to estimate max_num_tokens for kv_cache_manager (#3092 ) Signed-off-by: Hui Gao <huig@nvidia.com>	2025-04-10 18:29:40 +08:00
Gabriel Wu	4d78f51608	fix: remove DeepGEMM line info (#3411 ) Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>	2025-04-09 18:01:02 +08:00
Mike Iovine	5bdf997963	Add Llama 4 (#3302 ) Signed-off-by: Mike Iovine <miovine@nvidia.com>	2025-04-09 03:35:21 +08:00
wili	54ad95eaa8	Feat: Variable-Beam-Width-Search (VBWS) part3 (#3338 ) * feat/Variable-Beam-Width-Search-Part3, v1.0 Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com> * feat/Variable-Beam-Width-Search-Part3, v1.1 Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com> * feat/Variable-Beam-Width-Search-Part3, v1.2 Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com> --------- Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com> Co-authored-by: wili-65535 <wili-65535@user.noreply.github.com>	2025-04-08 23:51:27 +08:00
Void	316e5c3be3	feat: fix and improve allreduce and fusion kernels (#3064 ) Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>	2025-04-08 19:33:52 +08:00
liji-nv	dca6397d1e	feat: Introduce UB allocator for pytorch flow (#3257 ) * Instead of allocating UserBuffers at the beginning of runtime, UB buffers are now managed with a global allocator. The allocator will dynamically assign a free UB buffer or allocate a new buffer for a torch tensor. It makes userbuffers easier to use. * In the common use case, the Userbuffers will be allocated correctly during the warm-up stage. There is no dynamic allocation during inference. * The UB fusion pattern is rewritten using the new UB Allocator. It contains the following passes: 1. Fuse Quant with allreduce, replace with UB impl, and insert a copy_to_userbuffers. Currently the normal allreduce still does not support FP8 quant, so this needs to be done in the UB pass. 2. Convert all supported allreduce with UB and insert copy_to_userbuffers. 3. Fuse the op before the allreduce with the copy_to_userbuffers, so the op directly writes to the userbuffer. 4. Remove userbuffers finalize if the output is connected to another UB allreduce. Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-04-08 18:39:49 +08:00
Yukun He	c678774c99	feat: Apply the new torch-flow compatible AutoTuner to both Fused MoE and NVFP4 Linear operators. (#3151 ) * Several optimizations and fixes on the Autotuner. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> * Apply the new Python side Autotuner on current linear for nvFP4 data type. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> * Apply the new Python side Autotuner on MoE op * Remove routers from cache key to improve inference perf * Prevent unnecessary code profiling. Use do_preparation keyword to select which part should be executed before evaluating any tactic. * Remove try-catch inside moe profiling process. * Move default tactic -1 to 0 transforms in cpp runner. * Revise relevant tests. * Predefine the bucketizing strategy for fused_moe Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> * Add specific_profile support for AutoTuner to bypass the standard cache search process for perf optimization * Add specific_profile for moe * Add specific profile for linear Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> * Fixing and revising according to reviewer's suggestions. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> * Use lru_cache for inference perf optimization. * Revert gen_custom_cache_key feature Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> * Replace runner with runner id to achieve a serializable cache. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> * Code clean up and minor fixes. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> * Move all tunable runners and custom ops into torch_custom_ops. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> * Treat min_latency_mode as an independent dynamic tensor. Modify get_valid_tactics to suit it. 
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> --------- Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>	2025-04-08 14:28:36 +08:00
Gabriel Wu	f1655afb0d	feat: enable DeepGEMM by default (#3341 ) Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>	2025-04-08 13:58:57 +08:00
Chuang Zhu	1c88af1378	feat: use cudaMalloc to allocate kvCache (#3303 )	2025-04-08 10:59:14 +08:00
pcastonguay	add5e5cd93	feat: Add option to run disaggregated serving without ctx servers,… (#3243 ) * feat: Add option to run disaggregated serving without ctx servers, to benchmark gen only Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> * Fixing comment in sanity check Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com> --------- Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>	2025-04-07 21:56:03 -04:00
Gabriel Wu	376731013d	feat: use NVRTC for DeepGEMM JIT compilation (#3239 ) * feat: use NVRTC for DeepGEMM JIT compilation Signed-off-by: Zihua Wu * fix: add license Signed-off-by: Zihua Wu * feat: store NVRTC JIT results in memory by default Signed-off-by: Zihua Wu * feat: refinement Signed-off-by: Zihua Wu * feat: refinement Signed-off-by: Zihua Wu * test: set timeout to 7200 Signed-off-by: Zihua Wu --------- Signed-off-by: Zihua Wu	2025-04-07 20:29:23 +08:00
Yao Yao	3545d59635	Support speculative decoding with Hopper XQA (#3269 ) Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>	2025-04-07 17:14:34 +08:00
pansicheng	ef1ba468a1	feat: support abort disconnected requests (#3214 ) Signed-off-by: pansicheng <sicheng.pan.chn@gmail.com>	2025-04-07 16:14:58 +08:00
Bo Li	515dd0d78f	feat: Add support for FP8 MLA on Hopper and Blackwell. (#3190 ) * fp8 kv + bf16 ctx MLA + fp8 gen MLA Use BF16 for context MLA. mFP8GenerationMLA and mFP8ContextFMHA shouldn't be enabled together. Allow mSM==90 for mFP8GenerationMLA==true. For FMHA, dataTypeKv should be FP8. For FP8 MLA generation, the output is still in BF16. Refine debug info for FMHA kernel metadata. Use inputType, outputType, SM together to hash kernel list. Add FP8 MLA generation FMHA kernel. Special WAR of NUM_COMPUTE_GROUPS for MLA generation kernel. Separate the implementation of fused_multihead_attention_v2.h to CPP and print some debug info if checkIfKernelExist fails. Refine debug info in fused_multihead_attention_v2.cpp Correct FP8 MLA metadata. New kernel provided by Yuxin, which outputs BF16. smem size is not set correctly, which will lead to illegal mem access. Yuxin fixed the error in FMHA MLA kernel: previously the BF16 isn't correctly written: some parts are repeatedly written, while some others are untouched. There are two bmm1 scales that should be set correctly. New kernel generated by Yuxin. Modifications to common/attentionOp for FP8 MLA on Hopper using FMHA. Not necessary. If mFP8GenerationMLA, is_fp8_out is false, so mFP8ContextFMHA is false. Skip a check in fmhaDispatcher. Modifications in fmhaRunner: - Debug dump. - if (!isFP8GenerationMLA) skips a lot of flag setting. - TMA descriptor modification for qo (by Yuxin). Cleanup debug output. Clean up o tma descriptor modifications. Signed-off-by: Bo Li <bobboli0202@gmail.com> * Resolve conflicts. Signed-off-by: Bo Li <bobboli0202@gmail.com> * Apply the patch of FP8 FlashMLA and resolve conflicts. Signed-off-by: Bo Li <bobboli0202@gmail.com> * Fix compilation error. Signed-off-by: Bo Li <bobboli0202@gmail.com> * Fix compile error. 
Signed-off-by: Bo Li <bobboli0202@gmail.com> * pick blackwell support Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com> * Add copyright notice to fused_multihead_attention_v2.cpp. Signed-off-by: Bo Li <bobboli0202@gmail.com> * Add license. Signed-off-by: Bo Li <bobboli0202@gmail.com> * Add missing license. Signed-off-by: Bo Li <bobboli0202@gmail.com> * Exclude building flashMLA kernels under sm90. Signed-off-by: Bo Li <bobboli0202@gmail.com> * Revert "Exclude building flashMLA kernels under sm90." This reverts commit `f0c859d459`. Signed-off-by: Bo Li <bobboli0202@gmail.com> * Use macro to skip compiling FlashMLA for non sm90 targets. Signed-off-by: Bo Li <bobboli0202@gmail.com> --------- Signed-off-by: Bo Li <bobboli0202@gmail.com> Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com> Co-authored-by: Dylan Chen <ziqingc@nvidia.com> Co-authored-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com> Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>	2025-04-07 15:14:13 +08:00
nv-guomingz	a6a4920b1d	chore: update internal cutlass library base #2981 and #3165 . (#3308 ) Signed-off-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com> Co-authored-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>	2025-04-07 13:53:02 +08:00
Chuang Zhu	5aeef6d4c7	ucx interface (#3306 ) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>	2025-04-07 08:44:34 +08:00
tburt-nv	7a659885e3	chore: remove usernames from comments (#3291 ) Signed-off-by: Tyler Burt <195370667+tburt-nv@users.noreply.github.com>	2025-04-05 13:44:28 +08:00
Robin Kobus	e12e7a753d	refactor: Expose DecoderState via bindings and integrate in TRTLLMDecoder (#3139 ) * refactor: Expose DecoderState via bindings and integrate in TRTLLMDecoder - Introduced a new `DecoderState` class in the C++ bindings, encapsulating key functionalities for managing decoding state. - Adjusted the Python `TRTLLMDecoder` to access properties from `decoder_state`, ensuring consistency and clarity in the decoding process. These changes streamline the decoder's architecture and enhance maintainability. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * chore: Remove unused new_tokens from DecoderState bindings Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-04-05 07:42:35 +08:00
qixiang-99	0d4d50a745	feat: no-cache attention in PyTorch workflow (#3085 ) * init trtllm attn no cache Signed-off-by: Qixiang Lin <qixiangl@nvidia.com> * fix: fix the seq_len issue and attn metadata prepare for qwen reward model test fix: fix minor bugs after rebase Signed-off-by: Qixiang Lin <qixiangl@nvidia.com> * refactor: remove unnecessary debug logs and clean up commented code refactor: update max_seq_len documentation and remove max_seq_len for decoder model constructor in PyTorchModelEngine Signed-off-by: Qixiang Lin <qixiangl@nvidia.com> * refactor: update calculate_ref_result function to accept tensor inputs and mask type, enhance test_attention_no_cache to support FULL and CAUSAL masks Signed-off-by: Qixiang Lin <qixiangl@nvidia.com> * refactor: remove unused BERT attention metadata conversion method and add type assertion for no cache attention in PyTorchModelEngine Signed-off-by: Qixiang Lin <qixiangl@nvidia.com> * refactor: remove use_kv_cache parameter from attention function and related classes, update documentation for KV cache handling Signed-off-by: Qixiang Lin <qixiangl@nvidia.com> * refactor: implement setAttentionMaskType method for better mask type handling and remove unused conversion function Signed-off-by: Qixiang Lin <qixiangl@nvidia.com> * refactor: streamline KV cache handling by replacing direct member access with useKVCache method and simplify token per block assignment remove Debug code. Signed-off-by: Qixiang Lin <qixiangl@nvidia.com> * refactor: Resolve comments for Python code Simplify no cache attention metadata preparation and streamline related attributes in TrtllmAttentionMetadata Removed the private method for converting to no cache attention metadata and integrated its logic into the prepare method. Updated the test for BERT sequence classification to reflect these changes and ensure proper handling of attention metadata. 
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com> * docs: Add is_dummy_attention field to attention metadata for simulation operations Signed-off-by: Qixiang Lin <qixiangl@nvidia.com> * refactor: add KVCacheParams to attention backend interface and import relevant metadata classes Updated the attention backend interface to include KVCacheParams and imported TrtllmAttentionMetadata and VanillaAttentionMetadata in model_engine.py for enhanced functionality. Signed-off-by: Qixiang Lin <qixiangl@nvidia.com> * fix: fix rebase format issue Signed-off-by: Qixiang Lin <qixiangl@nvidia.com> * fix: extend attention mask type handling in MHARunnerFixedParams Added support for additional attention mask types (BIDIRECTIONAL, BIDIRECTIONALGLM, BLOCKSPARSE) in the MHARunnerFixedParams structure to fix the mapping issue between ContextAttentionMaskType and AttentionMaskType Signed-off-by: Qixiang Lin <qixiangl@nvidia.com> * fix: enhance attention mask type handling in TllmGenFmhaRunnerParams Updated the setAttentionMaskType method to include a switch-case structure for better handling of attention mask types, ensuring proper mapping and error handling for invalid types. Signed-off-by: Qixiang Lin <qixiangl@nvidia.com> --------- Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>	2025-04-05 01:54:32 +08:00
Robin Kobus	77724b0fcb	Reapply "refactor: Replace DecoderFinishedEvent with CudaEvent in decoder clas…" (#3183 ) (#3195 ) * Reapply "refactor: Replace DecoderFinishedEvent with CudaEvent in decoder clas…" (#3183) This reverts commit `75495730bc`. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * fixup! Reapply "refactor: Replace DecoderFinishedEvent with CudaEvent in decoder clas…" (#3183) Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-04-04 15:56:28 +02:00
tburt-nv	d96c4e3379	update internal_cutlass version.txt to d03df7b27 (#3279 ) Signed-off-by: Tyler Burt <195370667+tburt-nv@users.noreply.github.com>	2025-04-04 15:50:03 +08:00
shaharmor98	ee4aab72ec	feat: Support PeftCacheManager in Torch (#3186 ) * Add PeftCacheManager implementation Signed-off-by: Shahar Mor <smor@nvidia.com>	2025-04-04 12:38:08 +08:00
Yibin Li	32ae1564bd	update FP4 quantize layout (#3045 ) Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>	2025-04-03 13:13:54 -04:00
Robin Kobus	b5bc0a9fcd	chore: Add output of first token to additional generation outputs (#3205 ) - Updated the first dimension of additional output tensors to match mMaxNewTokens. - Copy output of last context token to generation outputs. - Adjusted the expected output size calculations in unit tests to reflect the correct maximum output length. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-04-02 20:14:16 +08:00
Zheng Duan	c9e94ec807	fix: remove test relies on timing (#3228 ) Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>	2025-04-02 18:38:37 +08:00
Zheng Duan	5a72945eec	fix: conditional disagg test name (#3161 ) Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>	2025-04-02 15:34:30 +08:00
William Tambellini	dbc0496f37	fix: upgrade cmake minimum from 3.18 to 3.27 (#3208 ) Required to correctly support recent archs like 90a, ... Fix issue #3173 Signed-off-by: William Tambellini <wtambellini@sdl.com> Co-authored-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>	2025-04-02 15:14:36 +08:00
Julien Debache	76a6a62073	fix: segfault in cudaDriverWrapper (#3017 ) * fix segmentation fault in cudaDriverWrapper Signed-off-by: jdebache <jdebache@nvidia.com> * replace cuGetErrorMessage with cuGetErrorString and added tests Signed-off-by: jdebache <jdebache@nvidia.com> --------- Signed-off-by: jdebache <jdebache@nvidia.com>	2025-04-02 08:55:19 +02:00
wili	34e63d07e6	feat: Variable-Beam-Width-Search (VBWS) Part2 (#3133 ) * feat: Variable-Beam-Width-Search Part2 Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com> * feat: Variable-Beam-Width-Search Part2 Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com> * feat: Variable-Beam-Width-Search Part2, fix CPP tests Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com> * feat: Variable-Beam-Width-Search Part3, simplify CPP tests Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com> * feat: Variable-Beam-Width-Search Part4, move beam_width_array param Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com> * feat: Variable-Beam-Width-Search, fix CI error Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com> * feat: Variable-Beam-Width-Search part2 Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com> * feat: Variable-Beam-Width-Search part2 Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com> * feat: Variable-Beam-Width-Search part2, fix pre-commit Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com> * feat: Variable-Beam-Width-Search part2, fix review Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com> --------- Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com> Co-authored-by: wili-65535 <wili-65535@user.noreply.github.com>	2025-04-02 12:31:28 +08:00
Gabriel Wu	05b50b297f	[feat] open source fp8_blockscale_gemm (#3071 ) Signed-off-by: Zihua Wu <zihuaw@nvidia.com>	2025-04-02 12:12:52 +08:00
Chuang Zhu	bc5811da65	chore: Ucx ip port remove mpi depend (#3101 ) * initial ucx support Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com> * fixes to support dynloading and ucx connection establishment - not stable yet Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com> * update Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com> * more connection bringup fixes - failing on connection vector build Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com> * executor test pass Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com> * update Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com> * passed full benchmark Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com> * changing to TLLM_THROW and removing cout Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com> * stopping progress thread at ucxComm destructor Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com> * fixing build with ENABLE_UCX=0 to not build ucx target at all and removing includes for ucxConnection for cache transceiver, also delete commented cold code Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com> * fix copyrights Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com> * adding ucx flavor to cache transceiver test and insert into the CI pipeline Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com> * allowing sending non ib interfaces IPs Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com> * setting UCX port reuse for the tests in pipeline Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com> * code review fixes Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com> * querying ep after GID message is sent to avoid UCX Errors Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com> * fixing more CR issues 
Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com> * querying ep to not fail if ep_not_connected yet Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com> * remove mpi dependency and debug Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * debug to info Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * mpirun n 2 Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * remove mpi comm split when disaggOrchestrator mode Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * waive disagg_mtp test Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * use future instead of thread Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * use future_promise instead of cv wait Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * connectionId type Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * improve test Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * improve test 2 Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> * gtest_skip Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> --------- Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com> Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com> Co-authored-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>	2025-04-02 09:42:29 +08:00
Zongfei Jing	c7548ad72c	perf: Add optimizations for deepseek in min latency mode (#3093 ) * Add optimizations for deepseek min latency Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Fix compile error Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Update internal cutlass kernel libs Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Format code Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> * Resolve conflicts Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com> --------- Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>	2025-04-02 09:05:24 +08:00
Chang Liu	1d3a5d38af	fix: Update FP8 sf layout for Blackwell and relax blockwise GEMM assertions (#3144 ) * Update fp8 sf layout for blackwell and enable fp8 gemm e2e * Add test case when m needs to be padded * Better comment Signed-off-by: Chang Liu <liuc@nvidia.com> * Add TODO for fp8 quant kernel Signed-off-by: Chang Liu <liuc@nvidia.com> * Enable DCO check Signed-off-by: Chang Liu <liuc@nvidia.com> * Fix lint --------- Signed-off-by: Chang Liu <liuc@nvidia.com>	2025-04-01 13:08:29 -07:00
Robin Kobus	d7386d14a8	refactor: Simplify disableLookahead and improve numDecodingEngineTokens handling (#3103 ) * refactor: Simplify disableLookahead method Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * Update DecoderBuffers comments Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Move numDecodingEngineTokens to DecoderState This commit introduces new methods in the DecoderState class to manage the number of tokens for each request in a batch. The following changes were made: - Added `getNumDecodingEngineTokens()` to retrieve the number of tokens for all requests. - Added `getNumDecodingEngineTokens(SizeType32 batchIdx)` to get the token count for a specific request. - Added `setNumDecodingEngineTokens(SizeType32 batchIdx, SizeType32 numTokens)` to set the token count for a specific request. - Updated the setup method to initialize the token count vector based on the maximum batch size. - Refactored the `CreateNewDecoderRequests` class to utilize the new token management methods, improving clarity and maintainability. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Improve shape variables in DecoderState Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-04-01 18:47:31 +08:00
Yuan Tong	2994527110	chore: cutlass cleanup (#3165 ) Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>	2025-04-01 13:57:38 +08:00
dongjiyingdjy	22ff81b047	fix: fix illegal memory access when mtp >= 2 (#3006 ) * fix - fix illegal memory access when mtp > 2 --------- Signed-off-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com> Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com> Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>	2025-04-01 13:36:45 +08:00
QI JUN	75495730bc	Revert "refactor: Replace DecoderFinishedEvent with CudaEvent in decoder clas…" (#3183 ) This reverts commit `3ee4332fb1`. Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>	2025-04-01 12:49:27 +08:00
liji-nv	e0d0dde058	None - Add one-shot version for UB AR NORM FP16/BF16 (#2995 ) Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>	2025-03-31 11:16:03 +08:00
William Tambellini	9c484b24e6	fix #3109 : early exit cmake if find_library() does not find any lib (#3113 ) Early exit if find_library() does not find any lib. As of today, the find_library_create_target() cmake macro blindly continues even if the lib is not found, adding LIB_PATH-NOTFOUND to the target and making the build fail later anyway for non-obvious reasons. This change simply exits early with a proper error message if the lib is not found. Fix github issue #3109 Signed-off-by: William Tambellini <wtambellini@sdl.com>	2025-03-29 19:59:03 +08:00
Robin Kobus	3ee4332fb1	refactor: Replace DecoderFinishedEvent with CudaEvent in decoder classes (#3078 ) - Updated the `forwardAsync` method in `GptDecoderBatched` and `iGptDecoderBatched` to return `CudaEvent` instead of `DecoderFinishedEventPtr`, simplifying event handling. - Removed the `DecoderFinishedEvent` class and its associated usage across various files, streamlining the codebase. - Adjusted related methods and Python bindings to accommodate the new event structure, ensuring compatibility and maintaining functionality. These changes enhance the clarity and efficiency of the decoding process in the batch manager. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-03-28 14:50:52 +08:00
Robin Kobus	45134d7095	refactor: Improve decoder finalize function (#3077 ) * refactor: Update gatherTree function to accept CUDA stream parameter This commit modifies the gatherTree function signature to include a runtime::CudaStream parameter, enhancing flexibility in stream management. Additionally, it removes unnecessary buffer manager parameters and stream handling from the function, streamlining the code. The finalize method in GptDecoderBatched is also updated to reflect these changes, improving clarity and maintainability in the decoding process. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Update GptDecoderBatched finalize This commit refactors the GptDecoderBatched class to improve method signatures and reduce code complexity: - Modified finalize method to accept DecoderState as a parameter - Updated method signatures to work with the new DecoderState approach - Improved code organization and readability The changes continue the ongoing refactoring to centralize decoder state management and simplify the decoder implementation. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-03-28 14:33:59 +08:00
BatshevaBlack	3e37531c6a	feat: Add BW measurement (#3070 )	2025-03-28 10:53:00 +08:00
Dom Brown	60d4dacc47	Port multi GPU changes to GitHub (#3027 ) Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>	2025-03-27 05:55:03 +08:00
wili	3e035f2219	v1.2 (#3082 ) Signed-off-by: wili <wili@nvidia.com>	2025-03-26 23:31:29 +08:00
Robin Kobus	d9522c5906	feat: Update cutlass (#2981 ) * chore: update cutlass to v3.8.0 Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: update include directives for consistency and organization in weightOnlyBatchedGemv headers Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * Fix fpA_intB_gemm compilation Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-03-26 22:36:27 +08:00
Robin Kobus	3c3629c52a	refactor: simplify forward methods in GptDecoderBatched (#3076 ) * refactor: Remove ForwardType enum from GptDecoderBatched - Remove ForwardType enum from GptDecoderBatched - Simplify forwardDispatch and forwardDecoder methods Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Remove forwardDecoder method from GptDecoderBatched - Eliminate the forwardDecoder method to streamline the decoding process. - Update forwardDispatch to directly call forwardAsync when input batch size is greater than zero. Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> * refactor: Move event handling from forwardDispatch to forwardAsync Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com> --------- Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-03-26 20:45:04 +08:00
Robin Kobus	94dd456bd0	refactor: Remove speculative decoding parameters from stateful decoders (#3024 ) Simplify StatefulGptDecoderBatched constructor: - Remove speculative decoding mode parameter - Initialize with default mode=None - Update GptSession class accordingly Simplify setup method signatures in StatefulGptDecoder and StatefulGptDecoderBatched: - Remove maxTokensPerStep parameter - Initialize decoders with default maxTokensPerStep=1 - Update GptSession class accordingly Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>	2025-03-26 20:16:26 +08:00
Zheng Duan	d70ff79d1d	conditional disagg test (#3012 ) Signed-off-by: Zheng Duan <200704041+zhengd-nv@users.noreply.github.com>	2025-03-26 15:55:33 +08:00
Enwei Zhu	f70b439503	bitmask v3 (#3009 ) Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>	2025-03-26 15:21:29 +08:00
DylanChen-NV	1ac0566a93	fix: fix for cp > kvHeadNum (#3002 ) * fix for cp > kvHeadNum Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com> * fix for None kv_head_num Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com> --------- Signed-off-by: Dylan Chen <191843203+DylanChen-NV@users.noreply.github.com>	2025-03-26 12:39:02 +08:00
Shunkangz	8ee840159b	Add updateKVCacheTransfer (#2984 ) Add kv cache transfer measurement Signed-off-by: Shunkang <182541032+Shunkangz@users.noreply.github.co> Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>	2025-03-25 21:45:35 +08:00
Perkz Zheng	e9df23f815	fix: [MLA] fix the bug with fp8 MLA kernels on Blackwell. (#3008 ) * update cubins * update error message --------- Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>	2025-03-25 18:03:29 +08:00
Aurelien Chartier	a33c595c88	Fix logits dtype in assert (#3038 ) Remove extra methods in trtGptModelInflightBatching.h. The methods were moved out of that class during a previous refactoring, but the definitions have been left behind. Signed-off-by: Aurelien Chartier <achartier@nvidia.com>	2025-03-25 10:35:21 +08:00
nv-guomingz	dc0463b0e2	doc:add version.txt for internal cutlass library and nvrtc_wrapper so files (#3030 ) Signed-off-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>	2025-03-24 23:44:21 +08:00
Netanel Haber	da0b0e0ee3	fix: disable kv cache reuse when minimum window size is reached, instead of maximum window size (#2983 ) * fix variable window size reuse - disable when min attention window starts sliding, not max * isPreCyclic -> isCyclic, and invert logic, for clarity * getDecoderState() Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>	2025-03-24 22:49:52 +08:00
Kaiyu Xie	2631f21089	Update (#2978 ) Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2025-03-23 16:39:35 +08:00
Kaiyu Xie	3aa6b11d13	Update TensorRT-LLM (#2936 ) * Update TensorRT-LLM --------- Co-authored-by: changcui <cuichang147@gmail.com>	2025-03-18 21:25:19 +08:00
Kaiyu Xie	9b931c0f63	Update TensorRT-LLM (#2873 )	2025-03-11 21:13:42 +08:00
Kaiyu Xie	225b77667c	Fix .gitmodules (#2852 )	2025-03-04 22:34:09 +08:00
Kaiyu Xie	77d7fe1eb2	Update TensorRT-LLM (#2849 ) * Update TensorRT-LLM --------- Co-authored-by: aotman <chenhangatm@gmail.com>	2025-03-04 18:44:00 +08:00
Kaiyu Xie	ab5b19e027	Update TensorRT-LLM (#2820 )	2025-02-25 21:21:49 +08:00
Kaiyu Xie	2ea17cdad2	Update TensorRT-LLM (#2792 ) * Update TensorRT-LLM --------- Co-authored-by: jlee <jungmoolee@clika.io>	2025-02-18 21:27:39 +08:00
Kaiyu Xie	e88da961c5	Update TensorRT-LLM (#2783 )	2025-02-13 18:40:22 +08:00
Dan Blanaru	16d2467ea8	Update TensorRT-LLM (#2755 ) * Update TensorRT-LLM --------- Co-authored-by: Denis Kayshev <topenkoff@gmail.com> Co-authored-by: akhoroshev <arthoroshev@gmail.com> Co-authored-by: Patrick Reiter Horn <patrick.horn@gmail.com> Update	2025-02-11 03:01:00 +00:00
Kaiyu Xie	be17881062	Update TensorRT-LLM (#2582 )	2024-12-16 21:50:47 -08:00
Kaiyu Xie	aaacc9bd68	Update TensorRT-LLM (#2562 ) * Update TensorRT-LLM --------- Co-authored-by: Starrick Liu <73152103+StarrickLiu@users.noreply.github.com>	2024-12-11 00:31:05 -08:00
石晓伟	548b5b7310	Update TensorRT-LLM (#2532 ) * blossom-ci.yml: run vulnerability scan on blossom * open source efb18c1256f8c9c3d47b7d0c740b83e5d5ebe0ec --------- Co-authored-by: niukuo <6831097+niukuo@users.noreply.github.com> Co-authored-by: pei0033 <59505847+pei0033@users.noreply.github.com> Co-authored-by: Kyungmin Lee <30465912+lkm2835@users.noreply.github.com> Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>	2024-12-04 21:16:56 +08:00

325 Commits