* chore: unify pp_layers helpers
Fix assumptions about equal number of layers per PP rank
in prepare_attention_inputs
Signed-off-by: Aurelien Chartier <2567591+achartier@users.noreply.github.com>
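For context on the uneven-layer case above, here is a minimal sketch of splitting layers across pipeline-parallel (PP) ranks when the layer count is not divisible by the PP size. The function name, signature, and the choice to give the extra layers to the lowest ranks are assumptions for illustration only, not the repository's actual helper.

```python
def pp_layers(num_layers: int, pp_size: int, pp_rank: int) -> list[int]:
    """Return the layer indices owned by pp_rank (hypothetical helper)."""
    base, extra = divmod(num_layers, pp_size)
    # Assumption: the first `extra` ranks each take one additional layer.
    start = pp_rank * base + min(pp_rank, extra)
    count = base + (1 if pp_rank < extra else 0)
    return list(range(start, start + count))


# 10 layers over 4 ranks: ranks 0 and 1 hold 3 layers, ranks 2 and 3 hold 2.
layers_per_rank = [pp_layers(10, 4, rank) for rank in range(4)]
```

Code that sizes per-rank buffers from `num_layers // pp_size` would undercount on ranks 0 and 1 in this example, which is the kind of assumption the fix removes.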
* apply a tentative fix to moe bypass kernel update
* Pass none to disable final stage in moe
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
Signed-off-by: Chang Liu <lc9114@gmail.com>
---------
Signed-off-by: Chang Liu <lc9114@gmail.com>
Co-authored-by: hlu1 <14827759+hlu1@users.noreply.github.com>
* fix: Fixing issue with first gen token being returned twice with streaming
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
* Fixing not_expectring_strings in test
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
---------
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Co-authored-by: QI JUN <22017000+QiJune@users.noreply.github.com>
* fix: add kv memory size per token of draft model to calculate max number
of tokens of kv cache
Signed-off-by: Hui Gao
* Fix code to get model_config of draft model
Signed-off-by: Hui Gao
---------
Signed-off-by: Hui Gao
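The KV cache sizing change above can be illustrated with a rough sketch: when a draft model shares the KV cache memory budget, its per-token footprint is added to the target model's before dividing the free memory. All names and the simple per-token formula below are assumptions for illustration, not the repository's code.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    """Approximate KV cache bytes needed per token (hypothetical helper)."""
    # The factor 2 accounts for the separate key and value tensors.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_kv_tokens(free_bytes: int, target_bytes_per_token: int,
                  draft_bytes_per_token: int = 0) -> int:
    """Max tokens that fit when target and draft caches share the budget."""
    return free_bytes // (target_bytes_per_token + draft_bytes_per_token)


target = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2)
draft = kv_bytes_per_token(num_layers=4, num_kv_heads=8, head_dim=128, dtype_bytes=2)
without_draft = max_kv_tokens(1 << 30, target)
with_draft = max_kv_tokens(1 << 30, target, draft)
```

Ignoring the draft model's share overestimates the token capacity, so the budget computed without it can exceed what actually fits in memory.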
* refactor: remove cumLogProbs and logProbs from DecoderBuffers
- Eliminated cumLogProbs and logProbs from DecoderBuffers, streamlining the buffer management.
- Updated related code in decoderBuffers.cpp and bindings.cpp to reflect these changes, ensuring that only host pointers are used for log probabilities.
These modifications enhance code clarity and maintainability by reducing redundancy in buffer management.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: streamline sequence length handling in GptDecoderBatched and StatefulGptDecoderBatched
- Updated GptDecoderBatched to directly use output.sequenceLengths for lengths assignment, removing unnecessary reshaping.
- Adjusted StatefulGptDecoderBatched to ensure sequence lengths are correctly shaped based on actual batch size and max beam width.
These changes enhance clarity and maintainability in the decoding process.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: integrate DecoderState for sequence length management in decoding process
- Updated DecoderBuffers to remove direct handling of sequence lengths, now utilizing DecoderState for this purpose.
- Adjusted MakeDecodingBatchInputOutput to accept DecoderState, enhancing clarity in the decoding input/output management.
- Refactored GptDecoderBatched and StatefulGptDecoderBatched to streamline sequence length handling, ensuring consistency across the decoding workflow.
refactor: update SlotDecoderBuffers to manage sequence lengths directly
- Introduced sequenceLengths and sequenceLengthsHost to SlotDecoderBuffers for better management of sequence lengths.
- Refactored asyncSend and recv methods to utilize the new sequenceLengths member, enhancing clarity and reducing redundancy.
- Updated TrtGptModelInflightBatching to align with the new structure, ensuring consistent handling of sequence lengths across the decoding process.
These changes improve maintainability and streamline the decoding workflow.
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* refactor: Delegate to asyncSend method in SlotDecoderBuffers
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
---------
Signed-off-by: Robin Kobus <19427718+Funatiq@users.noreply.github.com>
* add test to map flashinfer rope op with triton custom rope ops and pytorch rope in fused_mha
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* add rope matcher and unit tests
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* capture cos and sin from graph
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* revert fuse_mha op change
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* minor update to address comment and remove redundant unit test
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* move view and transpose into graph nodes and update unit test to test custom op directly
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* move view into custom op, update bfs with bound, update custom op return type to be half precision
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* custom op update to support 3D input
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* handle bnsd and bsnd format, update tests, handle 3D cos/sin input to the custom op
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* add llama4 rope test, update custom op with is_neox flag
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* add llama4 style rope to matcher and update unit test
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* separate into two transformations
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* fix when num_head != num_kv_head; add support for cached position_ids and cos_sin_cache in graph; update unit tests
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* minor update, cache locally and propagate meta info of qk nodes
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* minor: fix cos_sin_cache not float
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* minor: move cache into matcher
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
---------
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
* One of the tactics is not supported during dispatch.
* final_hidden_states should be unpacked if it is not min_latency_mode.
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* feat: Add NVFP4 UB pattern optimization pass in torch compile
* Add an additional flag for UB fp4 pattern to avoid inverting the scale
* Add NVFP4 related UB patterns
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
* Update atol; some points fail on B200 umbriel.
Signed-off-by: liji-nv <59594262+liji-nv@users.noreply.github.com>
---------
Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
Signed-off-by: liji-nv <59594262+liji-nv@users.noreply.github.com>
* Rename nvsmall to nemotron NAS
* Revert nvsmall to nemotron_nas rename in paths in tests that access llm_models_root/nvsmall/tests
* Add NemotronNAS to pytorch supported models table
Signed-off-by: Amit Zuker <203509407+amitz-nv@users.noreply.github.com>