* [Infra][TRTLLM-4063] - Branch out for the TRT-LLM v0.18.0 release
Signed-off-by: Zhanrui Sun <zhanruis@nvidia.com>
(cherry picked from commit de90312020e51c22ba5e75b3502c7ee90c059265)
* [Infra][TRTLLM-3652] - Update dependencies to TRT 10.9 / CUDA 12.8.1 / DLFW 25.03(Internal)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
(cherry picked from commit 58db1340ef7db22f1910f878d220a92be5b830d1)
* [None][Doc] - Update docs for v0.18.0
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit d23e75bc95619ce3b116213d55319272888e0c88)
* [Infra] - Fix or WAR issues in the package sanity check stages
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit e874e2b127515c52ba10c8df1cc2631627f74ffe)
* [https://nvbugs/5173454] [https://nvbugs/5173432] [https://nvbugs/5175863] fix chatglm tokenizer and tmp model path
Signed-off-by: Yuki Huang <yukih@nvidia.com>
(cherry picked from commit 731811d4e182d70a66193d646152cb71dfafe83a)
* cherry-pick 'test: Update cluster and multi-node test lists and trtllm-bench' to fix a perf drop issue
Signed-off-by: Ruodi Lu <ruodil@nvidia.com>
(cherry picked from commit 5214616283fbc15ae98871a1d84c78d8e1f2e6e8)
* Revert "Merge branch 'user/yukih/fix_5173454_5173432' into 'release/0.18'"
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit 8d34831cb2b81ee2dfa8021b68e7158b33789a5f)
* [Infra] Restrict setuptools version to avoid sasb pip install issue
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
(cherry picked from commit 1e60ad29e0dafec0e295bedb5d89b716a02a707c)
* [https://nvbugs/5173454] [https://nvbugs/5173432] [https://nvbugs/5175863] fix chatglm tokenizer and tmp model path
Signed-off-by: Yuki Huang <yukih@nvidia.com>
(cherry picked from commit 3ed8164e5bfea1d5aa2039b5408439fd6cf59dac)
* WAR for bug 5173448
Signed-off-by: Thor Johnsen <tjohnsen@nvidia.com>
(cherry picked from commit b6528b2ba15322b6c6a4c81a8b74c04d4973de4f)
* [Infra][TRTLLM-3652] - Update dependencies to CUDA 12.8.1 / DLFW 25.03
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
(cherry picked from commit 6560983d132d9d257ee15849664eb055e94adaa9)
* [Docs] - Doc changes for v0.18.0
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit 26769b61218a947c8f9d070f73b63d576fcc20c4)
* [Doc] - Doc change for v0.18.0
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
(cherry picked from commit 4b3b5ed6bfbc2300e3775fe75456083faad7b235)
* [Infra] update version to 0.18.1
Signed-off-by: Zhanrui Sun <zhanruis@nvidia.com>
(cherry picked from commit 59e8326c75639275837d34de8e140358737a3365)
* Add back nemotron file.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix recurrentgemma reqs.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Adding WAR for bug 5173448.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Formatting.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Remove duplicated file.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update examples/prompt_lookup/requirements.txt
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>
* Remove glm-4-9b from model dir in chatglm test.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Remove indent change.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Apply suggestions from code review
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>
* Apply suggestions from code review
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>
* Revert changes on l0_test.groovy.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Update dev images
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
* Remove duplicated import.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Fix custom op
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
* Fix flashinfer & vanilla backend
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
* Skip problematic case.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
* Skip problematic test_moe_w4a8_1_14336_4096_8_bfloat16_True_False case.
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
---------
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Co-authored-by: Zhanrui Sun <zhanruis@nvidia.com>
Co-authored-by: Yiqing Yan <yiqingy@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yuki Huang <yukih@nvidia.com>
Co-authored-by: Ruodi Lu <ruodil@nvidia.com>
Co-authored-by: Emma Qiao <qqiao@nvidia.com>
Co-authored-by: Thor Johnsen <tjohnsen@nvidia.com>
Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
Co-authored-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Co-authored-by: Tao Li @ NVIDIA <tali@nvidia.com>
* add dgx_h200 tests
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
* test
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
* fix pre-commit
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
* fix
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
* fix
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
* change bsl branch
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
* fix
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
* change multi-GPU related file list
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
---------
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
* Rename nvsmall to nemotron NAS
* Revert nvsmall to nemotron_nas rename in paths in tests that access llm_models_root/nvsmall/tests
* Add NemotronNAS to the PyTorch supported models table
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>
* test: Add single gpu disaggregated tests
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
* Add deepseek with overlap tests
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
* Use updated prompt
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
* Move test to disaggregated folder
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
---------
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
* fix: Fix p-tuning test bug
* A change in the vocab_size calculation for T5Tokenizer,
introduced in transformers version 4.34, caused incorrect virtual tokens (vtokens) to be added for p-tuning:
instead of adding tokens that lie outside the vocabulary, tokens inside the vocabulary were added (see the sketch below).
Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com>
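A minimal sketch of the failure mode, assuming a transformers-style tokenizer; make_vtoken_ids is a hypothetical helper for illustration, not code from the test. P-tuning's virtual token ids must start past the real vocabulary, so deriving them from an undercounting vocab_size lands them on real tokens, whereas len(tokenizer) also counts added special tokens:

```python
# Hypothetical sketch: p-tuning prepends "virtual" token ids that must lie
# OUTSIDE the real vocabulary. If vocab_size undercounts added tokens (as
# T5Tokenizer's did after the transformers 4.34 change), the derived ids
# collide with real tokens inside the vocabulary.
from transformers import AutoTokenizer


def make_vtoken_ids(tokenizer, num_vtokens: int) -> list[int]:
    # len(tokenizer) includes added special tokens, so ids derived from it
    # are guaranteed to fall outside the vocabulary.
    base = len(tokenizer)  # safer than tokenizer.vocab_size here
    return [base + i for i in range(num_vtokens)]


tok = AutoTokenizer.from_pretrained("t5-small")
vtokens = make_vtoken_ids(tok, num_vtokens=8)
assert all(v >= len(tok) for v in vtokens)  # no collision with real tokens
```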
* init trtllm attn no cache
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
* fix: fix the seq_len issue and attention metadata preparation for the Qwen reward model test
fix: fix minor bugs after rebase
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
* refactor: remove unnecessary debug logs and clean up commented code
refactor: update max_seq_len documentation and remove max_seq_len from the decoder model constructor in PyTorchModelEngine
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
* refactor: update the calculate_ref_result function to accept tensor inputs and a mask type; enhance test_attention_no_cache to support FULL and CAUSAL masks (see the sketch after this entry)
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
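A hedged illustration of the FULL vs. CAUSAL mask semantics; the real calculate_ref_result lives in the test, so ref_attention below is a hypothetical stand-in, not the test's code:

```python
# Hypothetical reference attention that accepts tensor inputs and a mask
# type, in the spirit of calculate_ref_result.
import torch
import torch.nn.functional as F


def ref_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                  mask_type: str) -> torch.Tensor:
    # q, k, v: [batch, heads, seq, head_dim]
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    if mask_type == "CAUSAL":
        seq = q.shape[-2]
        causal = torch.ones(seq, seq).tril().bool()  # token i sees j <= i
        scores = scores.masked_fill(~causal, float("-inf"))
    elif mask_type != "FULL":  # FULL: every token attends to every token
        raise ValueError(f"unsupported mask type: {mask_type}")
    return F.softmax(scores, dim=-1) @ v


q = k = v = torch.randn(1, 2, 4, 8)
full = ref_attention(q, k, v, "FULL")
causal = ref_attention(q, k, v, "CAUSAL")
```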
* refactor: remove unused BERT attention metadata conversion method and add type assertion for no cache attention in PyTorchModelEngine
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
* refactor: remove use_kv_cache parameter from attention function and related classes, update documentation for KV cache handling
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
* refactor: implement setAttentionMaskType method for better mask type handling and remove unused conversion function
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
* refactor: streamline KV cache handling by replacing direct member access with the useKVCache method, and simplify the tokens-per-block assignment
Remove debug code.
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
* refactor: Resolve review comments on the Python code
Simplify no-cache attention metadata preparation and streamline the related attributes in TrtllmAttentionMetadata.
Remove the private method for converting to no-cache attention metadata and integrate its logic into the prepare method. Update the BERT sequence classification test to reflect these changes and to ensure proper handling of attention metadata.
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
* docs: Add is_dummy_attention field to attention metadata for simulation operations
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
* refactor: add KVCacheParams to attention backend interface and import relevant metadata classes
Updated the attention backend interface to include KVCacheParams and imported TrtllmAttentionMetadata and VanillaAttentionMetadata in model_engine.py for enhanced functionality.
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
* fix: fix rebase format issue
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
* fix: extend attention mask type handling in MHARunnerFixedParams
Added support for additional attention mask types (BIDIRECTIONAL, BIDIRECTIONALGLM, BLOCKSPARSE) in the MHARunnerFixedParams structure to fix the mapping issue between ContextAttentionMaskType and AttentionMaskType.
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
* fix: enhance attention mask type handling in TllmGenFmhaRunnerParams
Updated the setAttentionMaskType method with a switch-case structure for better handling of attention mask types, ensuring proper mapping and error handling for invalid types (see the sketch below).
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
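The map-and-validate pattern the last two fixes describe, sketched in Python rather than the runners' actual C++; the enum members echo the names mentioned above, while the class and function names are hypothetical:

```python
# Hypothetical Python sketch of the switch-case mapping: translate a
# context-phase mask type into the runner's mask type, and fail loudly on
# anything unmapped instead of silently mis-mapping it.
from enum import Enum, auto


class ContextAttentionMaskType(Enum):
    PADDING = auto()
    CAUSAL = auto()
    BIDIRECTIONAL = auto()
    BIDIRECTIONALGLM = auto()
    BLOCKSPARSE = auto()


class AttentionMaskType(Enum):
    PADDING = auto()
    CAUSAL = auto()
    BIDIRECTIONAL = auto()
    BIDIRECTIONALGLM = auto()
    BLOCKSPARSE = auto()


# Members share names here, so the mapping can be built by name lookup.
_MASK_TYPE_MAP = {src: AttentionMaskType[src.name]
                  for src in ContextAttentionMaskType}


def set_attention_mask_type(mask_type: ContextAttentionMaskType) -> AttentionMaskType:
    mapped = _MASK_TYPE_MAP.get(mask_type)
    if mapped is None:  # mirror the error handling for invalid types
        raise ValueError(f"invalid attention mask type: {mask_type}")
    return mapped
```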
---------
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>