* Add nested aliases for Llama 4
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
* Fix missed alias.
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
---------
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Co-authored-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>
* fix: broadcast embeddings input when using speculative decoding
Signed-off-by: Maximiliano Levi <maxilevi77@gmail.com>
* fix: use shape tensor instead of tuple
Signed-off-by: Maximiliano Levi <maxilevi77@gmail.com>
* fix: comment
Signed-off-by: Maximiliano Levi <maxilevi77@gmail.com>
---------
Signed-off-by: Maximiliano Levi <maxilevi77@gmail.com>
Co-authored-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
* test: Add single gpu disaggregated tests
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
* Add deepseek with overlap tests
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
* Use updated prompt
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
* Move test to disaggregated folder
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
---------
Signed-off-by: Iman Tabrizian <itabrizian@nvidia.com>
* infra: [TRTLLM-4450] Support more files for pytorch only mode
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
* Test pytorch only mode
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
* Revert "Test pytorch only mode"
This reverts commit b32f54d7858bd2432251734bc7b31669147ed94b.
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
* Fix review
Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
---------
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
* Instead of allocating userbuffers (UB) at the start of runtime, UB buffers
are now managed by a global allocator. The allocator dynamically assigns a
free UB buffer or allocates a new buffer for a torch tensor, which makes
userbuffers easier to use.
* In the common use case, the userbuffers are allocated during the warm-up
stage, so no dynamic allocation happens during inference.
* The UB fusion pattern is rewritten using the new UB allocator. It contains
the following passes:
1. Fuse quant with allreduce, replace it with the UB implementation, and
insert a copy_to_userbuffers. The normal allreduce does not yet support
FP8 quant, so this needs to be done in the UB pass.
2. Convert all supported allreduce ops to UB and insert copy_to_userbuffers.
3. Fuse the op before the allreduce with the copy_to_userbuffers, so the op
writes directly to the userbuffer.
4. Remove the userbuffers finalize if the output is connected to another UB
allreduce.
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
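The allocator behaviour described in the UB commit above (reuse a free buffer, otherwise allocate a new one, so that steady-state inference allocates nothing) can be sketched as a simple pool. This is a minimal illustration only: `Buffer`, `PoolAllocator`, and the byte sizes are hypothetical names and values, not the actual UB allocator API.

```python
from dataclasses import dataclass, field


@dataclass
class Buffer:
    size: int             # capacity in bytes
    in_use: bool = False  # True while a tensor owns this buffer


@dataclass
class PoolAllocator:
    """Hypothetical stand-in for a global buffer allocator (not the real API)."""
    buffers: list = field(default_factory=list)
    new_allocations: int = 0

    def allocate(self, size: int) -> Buffer:
        # Reuse the first free buffer that is large enough.
        for buf in self.buffers:
            if not buf.in_use and buf.size >= size:
                buf.in_use = True
                return buf
        # Otherwise grow the pool; ideally this only happens during warm-up.
        buf = Buffer(size=size, in_use=True)
        self.buffers.append(buf)
        self.new_allocations += 1
        return buf

    def release(self, buf: Buffer) -> None:
        # Return the buffer to the pool without freeing it.
        buf.in_use = False


pool = PoolAllocator()

# Warm-up pass: the pool is empty, so both requests create buffers.
a = pool.allocate(1024)
b = pool.allocate(4096)
pool.release(a)
pool.release(b)
warmup_allocations = pool.new_allocations

# Inference pass: the same sizes are served from the pool.
c = pool.allocate(1024)
d = pool.allocate(4096)
```

After the warm-up pass the allocation counter stops growing, which mirrors the stated goal of having no dynamic allocation during inference.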
* fix: Fix p-tuning test bug
* A change in the vocab_size calculation for T5Tokenizer,
introduced in transformers version 4.34, caused incorrect virtual tokens to be added for p-tuning.
Instead of tokens outside the vocabulary, tokens inside the vocabulary were added.
Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com>
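The failure mode in the p-tuning fix above can be illustrated with plain integers. The sketch assumes virtual token ids are placed immediately after the vocabulary, and that the size used to place them was smaller than the real vocabulary. The helper names and the numbers are illustrative only and are not taken from the test or from transformers.

```python
def virtual_token_ids(vocab_size: int, num_virtual_tokens: int) -> list[int]:
    """Ids for p-tuning virtual tokens, placed just past the vocabulary."""
    return list(range(vocab_size, vocab_size + num_virtual_tokens))


def collides_with_vocab(ids: list[int], true_vocab_size: int) -> bool:
    """True if any virtual id falls inside the real vocabulary."""
    return any(i < true_vocab_size for i in ids)


# Illustrative numbers only (assumption, not measured values).
true_vocab_size = 32100
reported_vocab_size = 32000  # hypothetical size that is too small

good = virtual_token_ids(true_vocab_size, 4)
bad = virtual_token_ids(reported_vocab_size, 4)
```

With the correct size, `good` lies entirely outside the vocabulary. With the smaller size, `bad` overlaps real token ids, which is the kind of error the fix addresses.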
* Several optimizations and fixes for the AutoTuner.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Apply the new Python side Autotuner on current linear for nvFP4 data type.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Apply the new Python side Autotuner on MoE op
* Remove routers from cache key to improve inference perf
* Prevent unnecessary code profiling. Use the do_preparation keyword to select which part should be executed before evaluating any tactic.
* Remove try-catch inside moe profiling process.
* Move default tactic -1 to 0 transforms in cpp runner.
* Revise relevant tests.
* Predefine the bucketizing strategy for fused_moe
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Add specific_profile support for AutoTuner to bypass the standard cache search process for perf optimization
* Add specific_profile for moe
* Add specific profile for linear
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Fixes and revisions according to reviewers' suggestions.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Use lru_cache for inference perf optimization.
* Revert gen_custom_cache_key feature
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Replace runner with runner id to achieve a serializable cache.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
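One way to read the runner id change above: if the cache key holds only plain values such as strings and integers, the whole cache can be written out and restored. The sketch below assumes that reading. `ProfilingCache`, its methods, the key layout, and the use of -1 as the "no entry" value are hypothetical and do not reflect the AutoTuner's actual interface.

```python
import json


class ProfilingCache:
    """Hypothetical tactic cache keyed by plain values so it can be saved."""

    def __init__(self):
        self._best = {}

    @staticmethod
    def _key(op_name: str, runner_id: int, shape: tuple) -> str:
        # Only strings and ints go into the key; no live runner object.
        return json.dumps([op_name, runner_id, list(shape)])

    def put(self, op_name: str, runner_id: int, shape: tuple, tactic: int) -> None:
        self._best[self._key(op_name, runner_id, shape)] = tactic

    def get(self, op_name: str, runner_id: int, shape: tuple, default: int = -1) -> int:
        # Returns `default` on a cache miss.
        return self._best.get(self._key(op_name, runner_id, shape), default)

    def dumps(self) -> str:
        # The dict has string keys and int values, so JSON can encode it.
        return json.dumps(self._best)

    @classmethod
    def loads(cls, text: str) -> "ProfilingCache":
        cache = cls()
        cache._best = json.loads(text)
        return cache


cache = ProfilingCache()
cache.put("fused_moe", 0, (8, 4096), tactic=3)

# Round-trip through a string, as if saved to and loaded from disk.
restored = ProfilingCache.loads(cache.dumps())
hit = restored.get("fused_moe", 0, (8, 4096))
miss = restored.get("fused_moe", 1, (8, 4096))
```

Keying on the runner object itself would tie the cache to one process, since a live object generally cannot be encoded this way.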
* Code cleanup and minor fixes.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Move all tunable runners and custom ops into torch_custom_ops.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Treat min_latency_mode as an independent dynamic tensor. Modify get_valid_tactics to support it.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
---------
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* feat: Add option to run disaggregated serving without ctx servers, to benchmark gen only
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
* Fixing comment in sanity check
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
---------
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>