Before this commit, the KV cache manager was initialized the same way regardless of whether it was performing a dry run, which caused a miscalculation of the free memory available to allocate for the KV cache and hence a crash.
This commit fixes this by making KV cache manager initialization aware of whether it is performing a dry run. If it is, the max_tokens budget that was already pre-calculated and stored in kv_cache_config.max_tokens is used.
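A minimal sketch of the dry-run-aware branching, with placeholder names (the estimation helper and its formula below are illustrative stand-ins, not the actual TensorRT-LLM code):

```python
import torch

# Sketch only: the estimation formula is a simplified stand-in for the real
# free-memory calculation done by the KV cache manager.
def estimate_max_tokens_from_free_memory(free_bytes: int, bytes_per_token: int) -> int:
    # Assume a fixed fraction of free device memory is given to the KV cache.
    return int(free_bytes * 0.9) // bytes_per_token

def compute_kv_cache_max_tokens(kv_cache_config, bytes_per_token: int, dry_run: bool) -> int:
    if dry_run and kv_cache_config.max_tokens is not None:
        # Dry run: reuse the budget that was already pre-calculated and
        # stored in kv_cache_config.max_tokens instead of re-deriving it
        # from free memory (the re-derivation is what previously led to
        # the miscalculation and crash).
        return kv_cache_config.max_tokens
    # Real run: derive the budget from the memory actually free on the device.
    free_bytes, _ = torch.cuda.mem_get_info()
    return estimate_max_tokens_from_free_memory(free_bytes, bytes_per_token)
```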
Signed-off-by: eopXD <yuehtingc@nvidia.com>
The performance results of some kernels are easily affected by whether the L2 cache is warm or cold. To obtain more precise profiling results during autotuning, the L2 cache is now cleared before every execution using a circular-buffer method.
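As a rough illustration of the circular-buffer approach (the buffer count and size below are placeholders, not the autotuner's actual values), each timed run first dirties the next buffer in a rotating pool larger than L2, so the kernel always starts from a cold cache:

```python
import torch

NUM_FLUSH_BUFFERS = 4
FLUSH_BYTES = 256 * 1024 * 1024  # comfortably larger than L2 on current GPUs
_flush_pool = [torch.empty(FLUSH_BYTES, dtype=torch.uint8, device="cuda")
               for _ in range(NUM_FLUSH_BUFFERS)]

def run_with_cold_l2(kernel_fn, iteration: int):
    # Writing a buffer much larger than L2 evicts previously cached lines;
    # rotating through the pool avoids re-touching a buffer that is still
    # partially resident from the previous flush.
    _flush_pool[iteration % NUM_FLUSH_BUFFERS].zero_()
    torch.cuda.synchronize()
    return kernel_fn()
```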
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
* Why?
There were a couple of issues with the recently merged custom model
injection for AutoDeploy + the reference implementation of Nemotron H:
- `d_mlp` was left in despite being mathematically always null (could
lead to runtime issues during sharding).
- the custom model mapping was inherited by child factories.
* What?
This commit fixes these issues, and also refactors the key of the
custom implementation so that it is based on the name of the
configuration class.
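A minimal sketch of the corrected registration scheme, with placeholder names rather than the actual AutoDeploy factory API: the mapping is keyed by the configuration class name, and each factory subclass gets its own copy so registrations are no longer inherited.

```python
class AutoModelFactory:
    # Illustrative registry; attribute and method names are assumptions.
    _custom_model_impls: dict[str, type] = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Give every child factory its own mapping instead of sharing the
        # parent's dict, so custom models are not inherited implicitly.
        cls._custom_model_impls = {}

    @classmethod
    def register_custom_model(cls, config_cls: type, model_cls: type) -> None:
        # Key on the configuration class name rather than a free-form string.
        cls._custom_model_impls[config_cls.__name__] = model_cls

    @classmethod
    def lookup_custom_model(cls, config) -> type | None:
        return cls._custom_model_impls.get(type(config).__name__)
```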
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
In this change we rename 3rdparty/README.md (which contains the process
playbook for C++ dependencies) to 3rdparty/cpp-thirdparty.md and add a
new 3rdparty/py-thirdparty.md file which contains the process playbook
for python dependencies.
We also update the main 3rdparty/README.md file to serve as a
starting point referring to both of these files.
Signed-off-by: Josh Bialkowski <1309820+cheshirekow@users.noreply.github.com>
Co-authored-by: Josh Bialkowski <1309820+cheshirekow@users.noreply.github.com>
This change addresses the nitpick comments from coderabbit on the
previous pull request !8986. None of the changes appear to be critical
as the build is healthy without them, but they should provide some
protection against future breakages if we change the CMake version or
modify other build logic.
This change consists of the following:
1. Add GIT_SUBMODULES_RECURSE ON to FetchContent_Declare calls for
   deepgemm and flashmla to ensure submodules are initialized in
   CMake versions where it is not the default.
2. Modify error messages in deep_gemm and flash_mla CMakeLists to
indicate that submodule initialization failed if the expected
submodule directories are not present.
3. Remove the NVTX include directories when the build is configured
   with NVTX disabled (NVTX_DISABLE), to avoid potential confusion if
   NVTX appears on the compile commands even though it is disabled.
4. Fix a minor CMake syntax issue in cpp/CMakeLists.txt where a
message() call was missing parentheses around a string.
Signed-off-by: Josh Bialkowski <1309820+cheshirekow@users.noreply.github.com>
Co-authored-by: Josh Bialkowski <1309820+cheshirekow@users.noreply.github.com>
* Skip the shape profile generation process if the profile has already been found in the cache under tuning mode. This is a prerequisite for nested autotuning, because otherwise that host overhead would be included when profiling the high-level op.
* Enable profiling with CUDA graphs as the default profiling method.
* Apply a heuristic to cap the number of profiling repetitions based on an initial few-run time measurement.
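A simplified sketch of how these pieces could fit together (cache layout, thresholds, and names are illustrative, not the actual autotuner code): cached shapes skip profiling entirely, the repeat count is capped from a few-run measurement, and the timed repeats are replayed inside a CUDA graph.

```python
import time
import torch

_profile_cache: dict[tuple, float] = {}  # (op_name, shape_key) -> measured ms

def profile_op(op_name: str, shape_key: tuple, run_fn, target_ms: float = 10.0) -> float:
    # Already profiled under tuning mode: skip regeneration so nested
    # autotuning does not fold this host overhead into the parent op.
    if (op_name, shape_key) in _profile_cache:
        return _profile_cache[(op_name, shape_key)]

    # Few-run measurement; this also serves as warmup before graph capture.
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(3):
        run_fn()
    torch.cuda.synchronize()
    per_run_ms = (time.perf_counter() - start) * 1000 / 3
    # Heuristic cutoff: repeat cheap kernels more, expensive kernels less.
    repeats = max(1, min(100, int(target_ms / max(per_run_ms, 1e-3))))

    # Capture the repeats in a CUDA graph (run_fn must be capturable, i.e.
    # no host syncs) so launch overhead is excluded from the timing.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        for _ in range(repeats):
            run_fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    graph.replay()
    torch.cuda.synchronize()
    elapsed_ms = (time.perf_counter() - start) * 1000 / repeats

    _profile_cache[(op_name, shape_key)] = elapsed_ms
    return elapsed_ms
```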
* Why?
The reference Nemotron H code on HuggingFace is out of date,
and therefore buggy, and has several untested code paths.
This makes an already hairy patching system even hairier.
The proposal is to do away with those patches, and replace the
original implementation with one that is heavily slimmed down.
* What?
This PR sets the basis for an alternative path with such a
slimmed-down implementation that:
- fixes bugs in the current HF implementation
- adds no new dependencies to TensorRT-LLM
- does away with features that are unnecessary for
  TensorRT-LLM/AutoDeploy:
  - no training-related code (dropout, gradient checkpointing, etc.)
  - no caching logic (we want to replace it with our own anyway)
  - no attention masking where possible
- reuses existing AD custom ops for the mamba SSM update /
  causal conv1d / attention
In order for the above to be usable in the AD apparatus,
`AutoModelForCausalLMFactory` is extended to allow registration
of custom model implementations.
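A hypothetical usage sketch of the extended factory (the registration hook name and the model/config class names are assumptions, not the confirmed API):

```python
import torch

class SlimNemotronHForCausalLM(torch.nn.Module):
    """Slimmed-down Nemotron H reference: no dropout or gradient
    checkpointing, no HF caching logic; mamba SSM update, causal conv1d,
    and attention are delegated to existing AD custom ops."""

    def __init__(self, config):
        super().__init__()
        self.config = config
        # Layer construction is omitted in this sketch.

# With the factory extension, the custom implementation would be registered
# against its configuration class and used in place of the upstream HF
# modeling code, e.g. (hook name is an assumption):
#
#     AutoModelForCausalLMFactory.register_custom_model(
#         NemotronHConfig, SlimNemotronHForCausalLM)
```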
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>