* Move all casters to customCasters.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Use customCasters in all bindings.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Added customCasters to userbuffers.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
---------
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* feat: Add group_rms_norm kernel to normalize multiple inputs in a single operator.
Previously, the RMSNorm implementation only supported a single input tensor. With group_rms_norm, multiple tensors can be normalized together:
```python
input_a, input_b, ... = group_rms_norm([input_a, input_b, ...])
```
All input tensors must share the same batch dimension. The kernel partitions work by dynamically assigning warp groups proportional to the last dimension of each input, improving launch efficiency and reducing overhead.
This MR provides two implementations:
GroupRMSNormKernel: Optimized for small-to-medium batch sizes
GroupRMSNormKernelLargeBatch: Contains additional optimizations for large batch sizes
Both kernels are currently exposed as custom PyTorch ops. A future MR will implement heuristic-based kernel selection and expose a unified interface.
Signed-off-by: Simeng Liu <simengl@nvidia.com>
* Resolve comments and fix typo with IS_FLASHINFER_AVAILABLE
Signed-off-by: Simeng Liu <simengl@nvidia.com>
---------
Signed-off-by: Simeng Liu <simengl@nvidia.com>
When input size is larger than the max workspace size, we shall fallback to NCCL + corresponding pre/post function to ensure the functionality of AllReduce.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* support lp in pytorch backend
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
* fix tp
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
---------
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
* support return logprob in llmapi
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
update and add test
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
stability test
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
* revert removal of old flag
Signed-off-by: Erin Ho <erinh@nvidia.com>
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
---------
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Signed-off-by: Erin Ho <erinh@nvidia.com>
* add qwen3 dense model pytorch backend support, initial commit
solve the results error issue
add qwen3 moe model pytorch backend support
reformat the code
* perf - use flash_infer rmsnorm for qwen3
* feat - support qwen3 moe rmsnorm
* Put the computation of Q and K norm (in attn) into a single CUDA stream, and get a 5% - 8% throughput improvement on Qwen3 4B and Qwen3 - moe 30B - A3B.
* Put the computation of Q and K norm (in attn) into a single CUDA stream, and get a 5% - 8% throughput improvement on Qwen3 4B and Qwen3 - moe 30B - A3B. -- Forgot to update all modifications.
* fix bugs of running qwen3 public models and fp8 models
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
* fix bugs due to rebase
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
* fix bugs captured by pre-commi
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
* fix bug of attention
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
---------
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
Co-authored-by: Keddy Jin <jin.gq@aliyun.com>
Co-authored-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com>
Co-authored-by: shao <shao@nvidia.com>
* Add a new param to LlmRequest and Request to natively support mm
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* update comment
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Update tests to match the new LlmRequest constructor parameters
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Modify unitTest and modify mm_embeding's dict name in llama4
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Fix based on comments
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Fix comment
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Fix LlmRequest initialization in kvCacheManagerTest
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Clean up code for promt_tuning_config
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Clean up prompt_tuning_config in GenerationRequest
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
---------
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
* Move world options to a different group for clarity.
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
* Add eos_id option.
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
---------
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
* Replace deepseek_allreduce op with the new unified allreduce op and moe_allreduce op.
* Minor revision of moe_allreduce op argument names.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Squash of dev commits
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* Add timer + waive test with suspected GptSession bug
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
* Respond to reviewer comments
Signed-off-by: domb <3886319+DomBrown@users.noreply.github.com>
---------
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Signed-off-by: domb <3886319+DomBrown@users.noreply.github.com>
This MR integrates Conan into the build system, so that it can be used to fetch dependencies in future changes.
Also installs all requirements-dev.txt inside a virtualenv instead of the system, since some of Conan's dependencies may conflict with the system packages. Virtualenv is used instead of venv because the triton server backend container has only virtualenv installed. This also allows developers to cache the requirements-dev.txt packages between container launches.
Signed-off-by: Tyler Burt <195370667+tburt-nv@users.noreply.github.com>
Replace libtensorrt_llm_nvrtc_wrapper.so with its source code, which
consists of two parts:
1. NVRTC glue code
2. XQA kernel code
During TensorRT-LLM build, XQA kernel code is embedded as C++ arries via
gen_cpp_header.py and passed to NVRTC for JIT compilation.
Signed-off-by: Ming Wei <2345434+ming-wei@users.noreply.github.com>
* Add mIsGenerationMLA to differentiate ctx and gen MLA in AttentionOp.
For Generation MLA, if FlashMLA is used, do not check the existence of FMHA based MLA kernel.
Signed-off-by: Bo Li <bobboli0202@gmail.com>
* Run pre-commit.
Signed-off-by: Bo Li <bobboli0202@gmail.com>
* Fix compile error.
Signed-off-by: Bo Li <bobboli0202@gmail.com>
---------
Signed-off-by: Bo Li <bobboli0202@gmail.com>
(1) match quant exclude modules names to TRTLLM names
(2) No need for any special weight loading for quantization scales weights (#3891)
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
* add parallel_q_b_proj_and_concat
Signed-off-by: junliu <65336694+hello-11@users.noreply.github.com>
* code cleanup
Signed-off-by: junliu <65336694+hello-11@users.noreply.github.com>
* one gemm/concat and then split the latent_cache and pass them separately to context/gen
Signed-off-by: junliu <65336694+hello-11@users.noreply.github.com>
---------
Signed-off-by: junliu <65336694+hello-11@users.noreply.github.com>
test: add test cases for 0.19 release (#3608)
* fix test name
* add quickstart test for nemotron-ultra
* add rcca multi-node test case for deepseek-v3
* add rcca info
---------
squash (#3642)
fix: nvbugs/5187237: fix deterministic mode crash (#3448)
* nvbugs/5187237 nvbugs/5112075: fix deterministic mode error
* remove waive
* Revert "remove waive"
This reverts commit 0bf5486d19906d692bfb7a6262333c296b0087ac.
* revert ar fusion
---------
update fp8 doc (#3647)
tests: change qa perf test to trtllm-bench (#3619)
fix: FP8 quantized lm_head (NvBug 5214229) (#3567)
infra: Add PR approval protection for the release branch (#3634)
fix: nvbugs/5231298: pytorch allreduce issue (#3673)
Fix: nvbugs/5222698 variable not defined (#3630)
* Fix: nvbugs/5222698 variable not defined
* Tidy code
---------
test:sync waives.txt from main branch by disabling test_perf/gpt_350m-cppmanager case (#3685)
test:restore fp8 kv cache testing for L0 (#3671)
doc: Update DeepSeek perf docs (#3693)
* Update DeepSeek perf docs
* update
* Apply suggestions from code review
---------
tests: waive test_llm_multi_node (#3664)
fix: update test_user_buffers_mm_add_prologue atol (#3711)
Fix: cherry-pick hmac encryption from main branch (#3635)
* security fix cherry-pick changes from main
* fix hmac in remote mpi session (#3649)
---------
Un-waive DS-V3-Lite tests. (#3621)
fix: FP8 kv accuracy (#3675)
* fix FP8 kv accuracy
* update doc
---------
Fix script options for engines. (#3622)
unwaive multi-node test (#3721)
chore : Split more tests out of gpt tests (#3524) (#3674)
doc:add torch examples link into torch backend documentation (#3749)
test: Get Eagle tests working (#3593) (#3722)
Waive L0 test (#3756)
waive failed case in perf test, change default max_batch_size to 512 and write config.json to output log (#3656)
Update ds v3 parameters in stress test. (#3676)
waive gemma on L20 (#3766)
https://nvbugs/5141291: Fix convert.py script for Qwen model. (#3758)
Include Qwen2VLDecoderLayer in the smooth_qwen2_model function.
fix: PP4 fixes and cleanup (#3688)
remove benchmark test list (#3643)
skip disagg deepseek test if sm!=90 (#3720)
test: skip failed cases on B200 (#3710)
* add skip condition to tests
* fix error
---------
test: [nvbug: 5234494] skip_pre_ada for fp8 cases (#3718)
* skip_pre_ada for fp8 cases
* update
* update after rebase
---------
add know issue to deepseek doc. (#3800)
Fix ModelOpt Mixtral AWQ OOM (#3714) (#3761)
Waive L0 tests (#3826)
fix: Reduce memory usage in fused moe op associated with AutoTuning and fix moe fallback issue. (#3793)
* Reduce memory usage in fused moe op associated with AutoTuning.
* Replace pre-defined bucket size strategy with a generating function based on the tune_max_num_tokens.
* Add free_memory logic of workspace in min_latency_mode fused moe path.
* Fix fused_moe fallback issue. (#3652)
min_latency_mode is only set to False during warmup phase. Thus when it becomes true during inference, all tactics fall back to the default one and thus cause perf regression.
---------
[doc] Better document for Draft-Target-Model (DTM) speculative decoding (#3797)
Fix pre-commit
Fix again
Address some review comments for the MI
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
* Update gen tps calculation.
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
* Add back output speed for comparison.
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
* Fix issue with f-string.
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
* Fix some spacing.
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
* Replace output speed with per-request genphase tput.
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
* Add gen TPS breakdown.
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
* Update some tagging.
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
---------
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>