mirror of https://github.com/NVIDIA/TensorRT-LLM.git (synced 2026-01-14 06:27:45 +08:00)
* Why? The reference Nemotron-H code on HuggingFace is out of date, and therefore buggy, and has several untested code paths. This makes an already hairy patching system even hairier. The proposal is to do away with those patches and replace the original implementation with one that is heavily slimmed down.
* What? This PR sets the basis for an alternative path with such a slimmed-down implementation that:
  - fixes bugs in the current HF implementation
  - adds no new dependencies to TensorRT-LLM
  - does away with features unnecessary for TensorRT-LLM/AutoDeploy:
    - no training-related code (dropout, gradient checkpointing, etc.)
    - no caching logic (we want to replace it with our own anyway)
    - no attention masking where possible
  - reuses existing AD custom ops for mamba SSM update / causal conv1d / attention

In order for the above to be usable in the AD apparatus, `AutoModelForCausalLMFactory` is extended to allow registration of custom model implementations.

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
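The registration mechanism described above can be sketched as a simple decorator-based registry. This is a minimal illustration, not the actual TensorRT-LLM API: the names `register_custom_model`, `SlimNemotronH`, and `build_model` are hypothetical stand-ins for how a factory might prefer a registered custom implementation over the stock HF one.

```python
# Hypothetical sketch of a custom-model registry; names are
# illustrative and do NOT reflect the real TensorRT-LLM/AutoDeploy API.

_CUSTOM_MODEL_REGISTRY = {}


def register_custom_model(model_type):
    """Decorator mapping a HF model_type string to a custom class."""
    def wrapper(cls):
        _CUSTOM_MODEL_REGISTRY[model_type] = cls
        return cls
    return wrapper


@register_custom_model("nemotron_h")
class SlimNemotronH:
    """Slimmed-down implementation: no dropout, no caching logic."""
    def __init__(self, config):
        self.config = config


def build_model(model_type, config):
    # Prefer a registered custom implementation; a real factory would
    # otherwise fall back to the stock HF auto-class path (not shown).
    cls = _CUSTOM_MODEL_REGISTRY.get(model_type)
    if cls is None:
        raise KeyError(f"no custom implementation for {model_type!r}")
    return cls(config)


model = build_model("nemotron_h", config={"hidden_size": 4096})
print(type(model).__name__)  # SlimNemotronH
```

The key design point is that the factory consults the registry first, so a slimmed-down implementation can shadow the upstream HF one without any runtime patching.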