Mirror of https://github.com/NVIDIA/TensorRT-LLM.git
A redundant device-to-device (D2D) copy is observed when torch.compile is enabled for the Llama model, caused by the swiglu Triton kernel; this copy adds performance overhead. Wrap the swiglu op in a custom op to avoid it.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
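Wrapping the kernel as a PyTorch custom op makes torch.compile treat it as a single opaque node instead of tracing into the Triton launch, which is the mechanism that sidesteps the extra copy. Below is a minimal sketch of that pattern using `torch.library.custom_op` (PyTorch 2.4+); the namespace `trtllm_example::swiglu` and the eager reference body are illustrative assumptions, not the actual TensorRT-LLM implementation.

```python
import torch

# Hypothetical wrapper: register swiglu as a custom op so torch.compile
# treats it as an opaque call rather than tracing into the underlying
# Triton kernel (tracing the raw kernel is what produced the extra D2D copy).
@torch.library.custom_op("trtllm_example::swiglu", mutates_args=())
def swiglu(x: torch.Tensor) -> torch.Tensor:
    # Reference eager implementation; a real op would dispatch to the fused
    # Triton kernel here. swiglu(x) = silu(gate) * up, where the last
    # dimension of x is split into [gate, up].
    gate, up = x.chunk(2, dim=-1)
    return torch.nn.functional.silu(gate) * up

# Fake (meta) implementation so the compiler can infer output shapes and
# dtypes without launching the kernel.
@swiglu.register_fake
def _(x: torch.Tensor) -> torch.Tensor:
    shape = list(x.shape)
    shape[-1] //= 2
    return x.new_empty(shape)

# Usage: the op shows up as a single node in the compiled graph.
compiled = torch.compile(lambda x: swiglu(x))
device = "cuda" if torch.cuda.is_available() else "cpu"
out = compiled(torch.randn(4, 8, device=device))
```

The `register_fake` implementation is what lets torch.compile reason about the op symbolically; without it, compilation of a graph containing the opaque op would fail during shape propagation.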
Directory listing:

- `__init__.py`
- `cpp_custom_ops.py`
- `flashinfer_custom_ops.py`
- `torch_custom_ops.py`
- `trtllm_gen_custom_ops.py`
- `userbuffers_custom_ops.py`