Commit Graph

155 Commits

Author SHA1 Message Date
Enwei Zhu
8ee019f8c4
test: Accuracy test improvement (Part 3.4): Move LLaMA tests (#3350)
Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-04-08 15:07:57 +08:00
Yukun He
c678774c99
feat: Apply the new torch-flow compatible AutoTuner to both Fused MoE and NVFP4 Linear operators. (#3151)
* Several optimizations and fixings on the Autotuner.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Apply the new Python side Autotuner on current linear for nvFP4 data type.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Apply the new Python side Autotuner on MoE op
* Remove routers from cache key to improve inference perf
* Prevent unnecessary code profiling. Use do_preparation keyword to select which part should be executed during before evaluating any tactic.
* Remove try-catch inside moe profiling process.
* Move default tactic -1 to 0 transforms in cpp runner.
* Revise relavant tests.
* Predefined the bucketizing strategy for fused_moe

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Add specific_profile support for AutoTuner to bypass the standard cache search process for perf optimization
* Add specific_profile for moe
* Add specific profile for linear

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Fixing and revising according to reviewer's suggestions.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Use lru_cache for inference pref optimization.
* Revert gen_custom_cache_key feature

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Replace runner with runner id to achieve a serializable cache.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Code clean up and minor fixings.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Move all tunable runners and custom ops into torch_custom_ops.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

* Treat min_latency_mode as a independent dynamic tensor. Modify get_valid_tactics to suit for it.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

---------

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-04-08 14:28:36 +08:00
Gabriel Wu
f1655afb0d
feat: enable DeepGEMM by default (#3341)
Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>
2025-04-08 13:58:57 +08:00
Fanrong Li
62e0876e39
Waive unittest/trt/model/test_mamba.py::TestMamba::test_loaders_mamba_130m_hf_from_checkpoint. Will fix it later. (#3356)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-04-07 22:36:35 -07:00
MinaHuai
31422e7e46
add tp=2 ci test for vision encoder (#3319)
Signed-off-by: mhuai <mhuai@nvidia.com>
2025-04-07 21:46:08 -07:00
Gabriel Wu
42c8574e93
fix: revert extra cmake var (#3351)
Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-04-08 11:57:16 +08:00
pcastonguay
add5e5cd93
feat: Add option to run disaggregated serving without ctx servers,… (#3243)
* feat: Add option to run disaggregated serving without ctx servers, to benchmark gen only

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing comment in sanity check

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

---------

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-04-07 21:56:03 -04:00
Void
efe2ecfb37
fix: runtime error in est_deepseek_allreduce.py (#3226)
Signed-off-by: Yilin Zhang <18275976+yilin-void@users.noreply.github.com>
2025-04-08 09:19:47 +08:00
Enwei Zhu
ba019a43d6
test: Accuracy test improvement (Part 3.3): Move DeepSeek tests (#3260)
add skip



fix



fix



update



update test list



fixqa list



move bf16 to postmerge

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-04-08 07:19:04 +08:00
Gabriel Wu
376731013d
feat: use NVRTC for DeepGEMM JIT compilation (#3239)
* feat: use NVRTC for DeepGEMM JIT compilation

Signed-off-by: Zihua Wu 

* fix: add license

Signed-off-by: Zihua Wu

* feat: store NVRTC JIT results in memory by default

Signed-off-by: Zihua Wu


* feat: refinement

Signed-off-by: Zihua Wu

* feat: refinement

Signed-off-by: Zihua Wu

* test: set timeout to 7200

Signed-off-by: Zihua Wu

---------

Signed-off-by: Zihua Wu
2025-04-07 20:29:23 +08:00
YueWeng
aab6214801
test: fix conflicting test names (#3316)
Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
2025-04-07 20:10:01 +08:00
pansicheng
ef1ba468a1
feat: support abort disconnected requests (#3214)
Signed-off-by: pansicheng <sicheng.pan.chn@gmail.com>
2025-04-07 16:14:58 +08:00
QI JUN
a2fad51011
chore: waive a timeout multi-GPU test case (#3310)
* debug CI timeout issue

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* fix

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

* waive timeout case

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>

---------

Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-04-07 14:04:54 +08:00
brb-nv
017361c26c
test: Waive non-Llama Eagle tests (#3309) 2025-04-07 09:25:41 +08:00
tburt-nv
7a659885e3
chore: remove usernames from comments (#3291)
Signed-off-by: Tyler Burt <195370667+tburt-nv@users.noreply.github.com>
2025-04-05 13:44:28 +08:00
Yan Chunwei
b21cfcfed1
chore: refactor the LlmArgs with Pydantic and migrate remaining pybinding configs to python (#3025)
* make LlmArgs Pydantic

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* amending doc

fix api_stability

fix tests

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* restore yaml groups

refine StackTrace

singleton

clean tests

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* fix trtllm-bench

fix pytorch

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* fix serve distagg

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

* fix

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>

---------

Signed-off-by: Superjomn <328693+Superjomn@users.noreply.github.com>
2025-04-05 13:31:48 +08:00
qixiang-99
0d4d50a745
feat: no-cache attention in PyTorch workflow (#3085)
* init trtllm attn no cache

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* fix: fix the seq_len issue and attn metadata prepare for qwen reward model test

fix: fix minor bugs after rebase
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* refactor: remove unnecessary debug logs and clean up commented code

refactor: update max_seq_len documentation and remove max_seq_len for decoder model contructor in PyTorchModelEngine
Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* refactor: update calculate_ref_result function to accept tensor inputs and mask type, enhance test_attention_no_cache to support FULL and CAUSAL masks

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* refactor: remove unused BERT attention metadata conversion method and add type assertion for no cache attention in PyTorchModelEngine

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* refactor: remove use_kv_cache parameter from attention function and related classes, update documentation for KV cache handling

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* refactor: implement setAttentionMaskType method for better mask type handling and remove unused conversion function

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* refactor: streamline KV cache handling by replacing direct member access with useKVCache method and simplify token per block assignment

remove Debug code.

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* refactor: Resolve comments for Python code

Simplify no cache attention metadata preparation and streamline related attributes in TrtllmAttentionMetadata

Removed the private method for converting to no cache attention metadata and integrated its logic into the prepare method. Updated the test for BERT sequence classification to reflect these changes and ensure proper handling of attention metadata.

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* docs: Add is_dummy_attention field to attention metadata for simulation operations

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* refactor: add KVCacheParams to attention backend interface and import relevant metadata classes

Updated the attention backend interface to include KVCacheParams and imported TrtllmAttentionMetadata and VanillaAttentionMetadata in model_engine.py for enhanced functionality.

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* fix: fix rebase format issue

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* fix: extend attention mask type handling in MHARunnerFixedParams

Added support for additional attention mask types (BIDIRECTIONAL, BIDIRECTIONALGLM, BLOCKSPARSE) in the MHARunnerFixedParams structure to fix the mapping issue between ContextAttentionMaskType and AttentionMaskType

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

* fix: enhance attention mask type handling in TllmGenFmhaRunnerParams

Updated the setAttentionMaskType method to include a switch-case structure for better handling of attention mask types, ensuring proper mapping and error handling for invalid types.

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>

---------

Signed-off-by: Qixiang Lin <qixiangl@nvidia.com>
2025-04-05 01:54:32 +08:00
QI JUN
059a34468c
fix deepseek multi gpu tests timeout (#3285)
Signed-off-by: junq <22017000+QiJune@users.noreply.github.com>
2025-04-04 16:19:02 +08:00
yuanjings-nvda
5776b99b70
fix vila test (#3042)
Signed-off-by: Yuanjing Shi <yuanjings@nvidia.com>
2025-04-04 14:30:06 +08:00
shaharmor98
ee4aab72ec
feat: Support PeftCacheManager in Torch (#3186)
* Add PeftCacheManager implementation

Signed-off-by: Shahar Mor <smor@nvidia.com>
2025-04-04 12:38:08 +08:00
Pengyun Lin
f25c7cefb4
doc: refactor trtllm-serve examples and doc (#3187)
Signed-off-by: Pengyun Lin <81065165+LinPoly@users.noreply.github.com>
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Co-authored-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
2025-04-04 11:40:43 +08:00
Tracin
bb6c338730
AWQ support Modelopt ckpts. (#3258)
Signed-off-by: Tracin <10434017+Tracin@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
2025-04-04 08:10:35 +08:00
pcastonguay
b763051ba4
chore: Refactor disaggregated serving scripts (#3073)
* chore: Refactor to reduce duplicated code in disagg server, reuse trtllm-serve

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Updating README, removing launch script

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Fixing integration tests

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

* Adding scripts to populate urls section of disagg config based on SLURM env vars

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>

---------

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
2025-04-03 14:55:05 -04:00
Yibin Li
32ae1564bd
update FP4 quantize layout (#3045)
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
2025-04-03 13:13:54 -04:00
Yukun He
d138795485
Fix minor issues in test_autotuner.py and loose the cache check for test gemms. (#3261)
This test can cause nondeterministic failures on CI with unexpected kernel profiling results. Given longer delay time or cache clear will not solve the issue. Thus, loose the test checks to avoid these false alarms.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
2025-04-03 18:24:08 +08:00
xinhe-nv
2005e5aaaf
remove tests from qa test lists (#3256)
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
2025-04-03 16:06:39 +08:00
xiweny
174a5af779
doc: refine integration test guide (#3215)
* doc: refine integration test guide

Signed-off-by: Xiwen Yu <xiweny@nvidia.com>
2025-04-03 15:36:13 +08:00
Zhanrui Sun
7f03125098
test: [TRTLLM-3994] Support only run pytorch tests (#3013)
* [TRTLLM-3994] Support only run pytorch tests

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>

* Move perf test to TensorRT backend

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>

* Fix review

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>

---------

Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
2025-04-03 13:46:09 +08:00
pcastonguay
b5b83009ff
chore: Reenabling get_stats_async test which seems to have been fixed by recent commit (#3246)
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Co-authored-by: Sharan Chetlur <116769508+schetlur-nv@users.noreply.github.com>
2025-04-02 20:57:31 -07:00
Jinyang Yuan
2fdfa39ea8
fix: Fix an error related to dummy request when MTP is used (#3146) 2025-04-03 11:08:12 +08:00
Chuang Zhu
f5bf74bc7f
enable some disagg test (#3203)
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
2025-04-03 06:10:48 +08:00
Enwei Zhu
3cf7066350
test: Accuracy test improvement (Part 3.2): Move Qwen tests (NvBug 5135332) (#3219)
* remove test_llm_models_multi_gpu.py

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* qwen 2.5

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* upgrade

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

---------

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-04-02 17:29:57 +08:00
Zongfei Jing
8d48b96545
reduce test cases for deepseek (#3211)
Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-04-02 13:57:55 +08:00
wili
34e63d07e6
feat: Variable-Beam-Width-Search (VBWS) Part2 (#3133)
* feat: Variable-Beam-Width-Search Part2

Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com>

* feat: Variable-Beam-Width-Search Part2

Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com>

* feat: Variable-Beam-Width-Search Part2, fix CPP tests

Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com>

* feat: Variable-Beam-Width-Search Part3, simplify CPP tests

Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com>

* feat: Variable-Beam-Width-Search Part4, move beam_width_array param

Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com>

* feat: Variable-Beam-Width-Search, fix CI error

Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com>

* feat: Variable-Beam-Width-Search part2

Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com>

* feat: Variable-Beam-Width-Search part2

Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com>

* feat: Variable-Beam-Width-Search part2, fix pre-commit

Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com>

* feat: Variable-Beam-Width-Search part2, fix review

Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com>

---------

Signed-off-by: wili-65535 <wili-65535@user.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@user.noreply.github.com>
2025-04-02 12:31:28 +08:00
Yiqing Yan
c19b7f7c2a
waive L0 test (#3217)
Signed-off-by: Yiqing Yan <yiqingy@nvidia.com>
2025-04-02 11:16:22 +08:00
Chuang Zhu
bc5811da65
chore: Ucx ip port remove mpi depend (#3101)
* initial ucx support

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>

* fixes to support dynloading and ucx connection establishment - not stable yet

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>

* update

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>

* more connection bringup fixes - faillig on connection vector build

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>

* executor test pass

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>

* update

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>

* passed full benchmark

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>

* changing to TLLM_THROW and removing cout

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>

* stoping progress thread at ucxComm destructor

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>

* fixing build with ENABLE_UCX=0 to not build ucx traget at all and removing includes for ucxConnection for cache transceiver, also delete commented cold code

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>

* fix copyrights

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>

* adding ucx flavor to cache transceiver test and insertto the CI pipeline

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>

* allowing sending non ib interfaces IPs

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>

* setting UCX port reuse for the tests in pipeline

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>

* code review fixes

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>

* querying ep after GID message is sent to avoid UCX Errors

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>

* fixing more CR issues

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>

* querying ep to not fail is ep_not_connected yet

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>

* remove mpi dependency and debug

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* debug to info

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* mpirun n 2

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* remove mpi comm split when disaggOrchestrator mode

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* waive disagg_mtp test

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* use future instead of thread

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* use future_promise instead of cv wait

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* connectionId type

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* improve test

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* imporve test 2

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

* gtest_skip

Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>

---------

Signed-off-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
Co-authored-by: roeya <165803633+RoeyAzran1992@users.noreply.github.com>
2025-04-02 09:42:29 +08:00
Zongfei Jing
c7548ad72c
perf: Add optimizations for deepseek in min latency mode (#3093)
* Add optimizations for deepseek min latency

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Fix compile error

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Update internal cutlass kernel libs

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Format code

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

* Resolve conflicts

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>

---------

Signed-off-by: Zongfei Jing <20381269+zongfeijing@users.noreply.github.com>
2025-04-02 09:05:24 +08:00
brb-nv
1fe3e30356
Add support for Phi-4-mini (#2990)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-04-02 08:34:39 +08:00
Chang Liu
1d3a5d38af
fix: Update FP8 sf layout for Blackwell and relax blockwise GEMM assertions (#3144)
* Update fp8 sf layout for blackwell and enable fp8 gemm e2e

* Add test case when m needs to be padded

* Better comment

Signed-off-by: Chang Liu <liuc@nvidia.com>

* Add TODO for fp8 quant kernel

Signed-off-by: Chang Liu <liuc@nvidia.com>

* Enable DCO check

Signed-off-by: Chang Liu <liuc@nvidia.com>

* Fix lint

---------

Signed-off-by: Chang Liu <liuc@nvidia.com>
2025-04-01 13:08:29 -07:00
Enwei Zhu
b2f69db507
test: Accuracy test improvement (Part 3.1): Extend accuracy test suite with LLM API and initial implementation of trtllm-eval (#3167)
* add eval_llmapi

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

tmp commit

port to CLI tool

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

move

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

setup llmapi

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

fix spec_dec_algo

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

_update_from_hf_quant_config

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

migrate test_pytorch.py

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

fix fp8 block scales

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

fix fp8 rowwise

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

adj alpha

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

move test_pytorch.py cases

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

move

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

rename test_accuracy.py to test_cli.py

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

clean

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix cnn_dailymail

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* renaming to cli flow

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* rename MMLU

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* rename

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* add error

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

* fix

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

---------

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>
2025-04-01 22:20:29 +08:00
amirkl94
bf02b9144f
feature: Add LoRA support for gemma (#3068)
Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com>
2025-04-01 19:15:55 +08:00
WeiHaocheng
ff35af77ea
feat: refactor scaffolding worker and support openai api worker (#3166)
Signed-off-by: Fred Wei <20514172+WeiHaocheng@users.noreply.github.com>
Signed-off-by: fredw <20514172+WeiHaocheng@users.noreply.github.com>
2025-04-01 18:31:52 +08:00
bhsueh_NV
d34202273b
fix bug of glm-4-9b ci (#3184) bug nvbug_5196515
Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-04-01 16:58:42 +08:00
brb-nv
727d78e785
Support prequantized fp8 ckpt for nemotron-mini-4b-instruct (#3046)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-04-01 14:52:09 +08:00
dongjiyingdjy
22ff81b047
fix:fix illeagel memory access when mtp >= 2 (#3006)
* fix - fix illeagel memory access when mtp > 2

---------

Signed-off-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com>
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
2025-04-01 13:36:45 +08:00
brb-nv
1901bfcf76
test: Add Eagle tests with untrained heads (#2991)
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
2025-04-01 11:41:59 +08:00
Frank
8bb3eea285
perf: Readd iteration logging for trtllm-bench. (#3039)
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
2025-04-01 08:13:09 +08:00
Iman Tabrizian
e8731ba3b7
fix: disable cuda graph and MTP for overlap tests (#3155)
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
2025-03-31 11:35:35 -07:00
bhsueh_NV
322ac565fc
chore: clean some ci of qa test (#3083)
* move some models to examples/models/contrib

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* update the document

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* remove arctic, blip2, cogvlm, dbrx from qa test list

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* remove tests of dit, mmdit and stdit from qa test

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* remove grok, jais, sdxl, skywork, smaug from qa test list

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* re-organize the glm examples

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix issues after running pre-commit

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix some typo in glm_4_9b readme

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

* fix bug

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>

---------

Signed-off-by: bhsueh <11360707+byshiue@users.noreply.github.com>
2025-03-31 14:30:41 +08:00
xinhe-nv
86f3b59f81
update waive list (#3094)
Signed-off-by: xinhe-nv <200704525+xinhe-nv@users.noreply.github.com>
Co-authored-by: larryx <larryx@nvidia.com>
2025-03-31 11:42:45 +08:00