* support lp in pytorch backend
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
* fix tp
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
---------
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
test: add test cases for 0.19 release (#3608)
* fix test name
* add quickstart test for nemotron-ultra
* add rcca multi-node test case for deepseek-v3
* add rcca info
---------
squash (#3642)
fix: nvbugs/5187237: fix deterministic mode crash (#3448)
* nvbugs/5187237 nvbugs/5112075: fix deterministic mode error
* remove waive
* Revert "remove waive"
This reverts commit 0bf5486d19906d692bfb7a6262333c296b0087ac.
* revert ar fusion
---------
update fp8 doc (#3647)
tests: change qa perf test to trtllm-bench (#3619)
fix: FP8 quantized lm_head (NvBug 5214229) (#3567)
infra: Add PR approval protection for the release branch (#3634)
fix: nvbugs/5231298: pytorch allreduce issue (#3673)
Fix: nvbugs/5222698 variable not defined (#3630)
* Fix: nvbugs/5222698 variable not defined
* Tidy code
---------
test: sync waives.txt from main branch by disabling test_perf/gpt_350m-cppmanager case (#3685)
test:restore fp8 kv cache testing for L0 (#3671)
doc: Update DeepSeek perf docs (#3693)
* Update DeepSeek perf docs
* update
* Apply suggestions from code review
---------
tests: waive test_llm_multi_node (#3664)
fix: update test_user_buffers_mm_add_prologue atol (#3711)
Fix: cherry-pick hmac encryption from main branch (#3635)
* security fix cherry-pick changes from main
* fix hmac in remote mpi session (#3649)
---------
Un-waive DS-V3-Lite tests. (#3621)
fix: FP8 kv accuracy (#3675)
* fix FP8 kv accuracy
* update doc
---------
Fix script options for engines. (#3622)
unwaive multi-node test (#3721)
chore: Split more tests out of gpt tests (#3524) (#3674)
doc: add torch examples link to the torch backend documentation (#3749)
test: Get Eagle tests working (#3593) (#3722)
Waive L0 test (#3756)
waive failed case in perf test, change default max_batch_size to 512, and write config.json to the output log (#3656)
Update ds v3 parameters in stress test. (#3676)
waive gemma on L20 (#3766)
https://nvbugs/5141291: Fix convert.py script for Qwen model. (#3758)
Include Qwen2VLDecoderLayer in the smooth_qwen2_model function.
fix: PP4 fixes and cleanup (#3688)
remove benchmark test list (#3643)
skip disagg deepseek test if sm!=90 (#3720)
test: skip failed cases on B200 (#3710)
* add skip condition to tests
* fix error
---------
test: [nvbug: 5234494] skip_pre_ada for fp8 cases (#3718)
* skip_pre_ada for fp8 cases
* update
* update after rebase
---------
add known issue to deepseek doc. (#3800)
Fix ModelOpt Mixtral AWQ OOM (#3714) (#3761)
Waive L0 tests (#3826)
fix: Reduce memory usage in fused moe op associated with AutoTuning and fix moe fallback issue. (#3793)
* Reduce memory usage in fused moe op associated with AutoTuning.
* Replace the pre-defined bucket size strategy with a generating function based on tune_max_num_tokens (see the sketch after this list).
* Add free_memory logic of workspace in min_latency_mode fused moe path.
* Fix fused_moe fallback issue. (#3652)
min_latency_mode is only set to False during the warmup phase. Thus, when it becomes True during inference, all tactics fall back to the default one, causing a perf regression.
---------
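A hedged sketch of the bucket-generation idea above. The function name and the doubling strategy are assumptions for illustration; the actual fused-MoE tuner code may differ.

```python
# Hypothetical sketch: derive token buckets from tune_max_num_tokens
# instead of a fixed pre-defined list, so tuning never allocates
# workspace for buckets larger than what will actually be used.
def generate_buckets(tune_max_num_tokens: int) -> list[int]:
    buckets = []
    num_tokens = 1
    while num_tokens < tune_max_num_tokens:
        buckets.append(num_tokens)
        num_tokens *= 2
    buckets.append(tune_max_num_tokens)
    return buckets

print(generate_buckets(3072))  # [1, 2, 4, ..., 2048, 3072]
```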
[doc] Better document for Draft-Target-Model (DTM) speculative decoding (#3797)
Fix pre-commit
Fix again
Address some review comments for the MI
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
* adding local paths to the datasets to make them loadable in offline mode
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
* bert datasets should work in both offline and online modes (see the sketch below)
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
---------
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
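A minimal sketch of loading a dataset from local files so it works offline. The path is hypothetical; HF_DATASETS_OFFLINE is the `datasets` library's switch for refusing network access.

```python
import os

# Fail fast instead of silently reaching for the Hub.
os.environ["HF_DATASETS_OFFLINE"] = "1"

from datasets import load_dataset

# Pointing load_dataset at local data files keeps it usable with no
# network access; the path below is illustrative.
ds = load_dataset("json", data_files="/data/bert/train.jsonl", split="train")
print(len(ds))
```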
* Use updateDecoderBuffers in python decoder.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix synchronize in trtllm decoder.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Enable by default.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Use guided_decoder to setup seqslots and free them.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Use always decode_async and update_requests.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update decoder buffers.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix speculative decoding tests.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Send new_tensors_host instead of assuming dict.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Make default False in enable_trtllm_decoder.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Partially fix mtp, partially fix py_executor.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update request states before sending disagg ctx cache.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix disagg test for torch decoder.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Add isend_tensor_list and recv_tensor_list for sending tensors_host (see the sketch after this change list).
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix rebase.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Add disagg serving case to guided decoder.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Get overlap scheduling to work.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update cutlass to main.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update after rebasing.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update to use decode async and update requests.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Properly pass information to update_requests
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Make disaggregated serving a step closer to working.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix rebase.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix rebase and format.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Copy new device tokens in a more Pythonic way.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Restore MTP add dummy reqs.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Add ordereddict import to py_executor.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Add seq slot manager. Add test.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Transmit a single tensor, except when a list of tensors is received.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Add TRTLLMDecoder allocation to estimate max kv cache tokens.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Add stream synchronization
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Make the decoder memory calculation adapt to the chosen decoder. Recognize the decoder option passed in ExecutorConfig. Make the overlap scheduler test run on TinyLlama.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Format
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Add decoder creation to estimate max kv.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update submodule UCXX inline with main.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
---------
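The isend_tensor_list / recv_tensor_list helpers above are internal; this sketch shows one common pattern for the same idea using public torch.distributed APIs, assuming both ranks already agree on the shapes and dtype (the gloo backend handles host tensors).

```python
import torch
import torch.distributed as dist

def isend_tensor_list(tensors, dst):
    # Post one non-blocking send per tensor; the caller waits on the handles.
    return [dist.isend(t, dst=dst) for t in tensors]

def recv_tensor_list(shapes, dtype, src):
    # Allocate destination tensors up front, then receive in order.
    tensors = [torch.empty(shape, dtype=dtype) for shape in shapes]
    for t in tensors:
        dist.recv(t, src=src)
    return tensors

# Sender:   [w.wait() for w in isend_tensor_list(new_tensors_host, dst=1)]
# Receiver: host_tensors = recv_tensor_list(shapes, torch.int32, src=0)
```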
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* generalizing CUDA graph support to multiple dynamic inputs (see the sketch after this list)
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
* fix for failing test
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
---------
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
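A minimal sketch of the standard recipe for CUDA graphs with several inputs whose contents change between replays: capture once against static placeholder buffers, then copy fresh data in and replay. The names here are illustrative, not the actual implementation.

```python
import torch

def make_graphed(fn, *example_inputs):
    static_inputs = [x.clone() for x in example_inputs]

    # Warm up on a side stream before capture, per the CUDA graphs docs.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        fn(*static_inputs)
    torch.cuda.current_stream().wait_stream(s)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_out = fn(*static_inputs)

    def replay(*inputs):
        # Refresh every dynamic input in place, then replay the capture.
        for dst, src in zip(static_inputs, inputs):
            dst.copy_(src)
        graph.replay()
        return static_out

    return replay
```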
* Feat: Offload ptable to cpu if enable_chunk_context
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Feat: offload ptable to cpu for chunk context mode
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Fix and add comment
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Update README for multimodal and add a new param mm_embedding_offloading
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* fix: Correct prompt table offloading condition in PromptTuningBuffers
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Clean up the code
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Add comments to explain the CPU <-> GPU copy using pinned memory (see the sketch after this change list)
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Fix namings based on comments
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Fix format based on precommit
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
* Modify --mm_embedding_offloading flag
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
---------
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
Co-authored-by: Haohang Huang <31998628+symphonylyh@users.noreply.github.com>
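A minimal sketch of the pinned-memory offload pattern referenced above: keep the prompt table in page-locked host memory and copy only the slice a context chunk needs, asynchronously. Shapes and names are illustrative.

```python
import torch

# Page-locked (pinned) host memory lets H2D copies run asynchronously
# and overlap with compute on another stream.
ptable_host = torch.empty(10_000, 4096, dtype=torch.float16, pin_memory=True)

copy_stream = torch.cuda.Stream()
with torch.cuda.stream(copy_stream):
    # Move only the rows needed by the current context chunk.
    chunk_gpu = ptable_host[:2048].to("cuda", non_blocking=True)

# Make the compute stream wait until the chunk has landed on the GPU.
torch.cuda.current_stream().wait_stream(copy_stream)
```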
* Update Nemotron Super and Ultra in Supported Models and add an example
Signed-off-by: Nave Assaf <nassaf@nvidia.com>
* Update README link to match new examples structure
Signed-off-by: Nave Assaf <nassaf@nvidia.com>
---------
Signed-off-by: Nave Assaf <nassaf@nvidia.com>
This change makes the draft target model work without a vocab-size mismatch.
Signed-off-by: mayani-nv <67936769+mayani-nv@users.noreply.github.com>
Co-authored-by: rakib-hasan <rhasan@nvidia.com>
* feat: Integrate GPUDirect Storage (GDS) into Executor API
Squash of several dev commits
Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com>
Signed-off-by: Zheyu Fu <zheyufu2@gmail.com>
Co-authored-by: Junda Chen <32371474+GindaChen@users.noreply.github.com>
Co-authored-by: Yichao Fu <57950249+fuyichao2000@users.noreply.github.com>
Co-authored-by: Andy Dai <zhongdongmin@nvidia.com>
* feat: adding multimodal (only image for now) support in trtllm-bench
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
* fix: add in load_dataset() calls to maintain the v2.19.2 behavior
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
* re-adding prompt_token_ids and using that for prompt_len
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
* updating the datasets version in examples as well
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
* api changes are not needed
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
* moving datasets requirement and removing a missed api change
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
* addressing review comments
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
* refactoring the quickstart example
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
---------
Signed-off-by: Rakib Hasan <rhasan@nvidia.com>
* doc: Update doc to enable FP8 MLA for Deepseek.
Signed-off-by: Bo Li <bobboli0202@gmail.com>
* Update.
Signed-off-by: Bo Li <bobboli0202@gmail.com>
* Update.
Signed-off-by: Bo Li <bobboli0202@gmail.com>
* Update the status on Hopper and Blackwell.
Signed-off-by: Bo Li <bobboli0202@gmail.com>
* Update.
Signed-off-by: Bo Li <bobboli0202@gmail.com>
* Update table of contents.
Signed-off-by: Bo Li <bobboli0202@gmail.com>
---------
Signed-off-by: Bo Li <bobboli0202@gmail.com>
Co-authored-by: bhsueh_NV <11360707+byshiue@users.noreply.github.com>
* Rename nvsmall to nemotron NAS
* Revert nvsmall to nemotron_nas rename in paths in tests that access llm_models_root/nvsmall/tests
* Add NemotronNAS to pytorch supported models table
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
- Added a new entry in the README for the published benchmarking best practices for DeepSeek-R1.
- Introduced a new blog post detailing performance benchmarking configurations and procedures for DeepSeek-R1 in TensorRT-LLM, including installation, dataset preparation, and benchmarking steps for both B200 and H200 GPUs.
Signed-off-by: taoli <litaotju@users.noreply.github.com>
Co-authored-by: taoli <litaotju@users.noreply.github.com>
* fix: Fix p-tuning test bug
* A change in the vocab_size calculation for T5Tokenizer, introduced in transformers version 4.34, caused incorrect virtual tokens to be added for p-tuning: instead of adding token IDs outside the vocabulary, token IDs inside the vocabulary were added (see the sketch below).
Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com>
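An illustrative sketch of the bug: virtual tokens for p-tuning must use IDs past the end of the real vocabulary, so an under-reported vocab_size makes them collide with real tokens. The numbers are made up.

```python
num_virtual_tokens = 8
true_vocab_size = 32_100      # actual size of the tokenizer's vocabulary
reported_vocab_size = 32_000  # what the changed vocab_size calculation returns

# Correct: virtual token IDs start right past the real vocabulary.
good_ids = range(true_vocab_size, true_vocab_size + num_virtual_tokens)

# Buggy: these IDs fall inside the real vocabulary and clash with
# actual tokens, corrupting the p-tuning prompt.
bad_ids = range(reported_vocab_size, reported_vocab_size + num_virtual_tokens)
```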