Mirror of https://github.com/NVIDIA/TensorRT-LLM.git, synced 2026-01-13 22:18:36 +08:00
docs: update 0.19 docs (#3986)
Signed-off-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>
Co-authored-by: nv-guomingz <37257613+nv-guomingz@users.noreply.github.com>
This commit is contained in:
parent cd9c7498d0
commit 034f6f2d91
@@ -778,7 +778,7 @@ Refer to the {ref}`support-matrix-software` section for a list of supported models
 - System prompt caching
 - Enabled split-k for weight-only cutlass kernels
 - FP8 KV cache support for XQA kernel
-- New Python builder API and `trtllm-build` command (already applied to [blip2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/contrib/blip2) and [OPT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/contrib/opt#3-build-tensorrt-engines))
+- New Python builder API and `trtllm-build` command (already applied to [blip2](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/blip2) and [OPT](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/opt#3-build-tensorrt-engines))
 - Support `StoppingCriteria` and `LogitsProcessor` in Python generate API
 - FHMA support for chunked attention and paged KV cache
 - Performance enhancements include:
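The hunk's context mentions `StoppingCriteria` and `LogitsProcessor` support in the Python generate API. As a library-agnostic sketch of what a logits processor does (edit token scores before sampling), here is a toy example in plain Python; the function names `ban_tokens_processor` and `greedy_pick` are illustrative, not TensorRT-LLM's actual API:

```python
import math

def ban_tokens_processor(banned_ids):
    """Return a processor that sets banned token ids' scores to -inf."""
    def process(logits):
        return [-math.inf if i in banned_ids else score
                for i, score in enumerate(logits)]
    return process

def greedy_pick(logits, processors):
    """Apply each processor in order, then pick the argmax token id."""
    for p in processors:
        logits = p(logits)
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [0.1, 2.5, 0.3, 1.9]      # scores for token ids 0..3
proc = ban_tokens_processor({1})   # forbid token id 1
print(greedy_pick(logits, [proc])) # prints 3: id 1 is masked, id 3 wins
```

A real implementation operates on GPU tensors per decoding step, but the hook shape is the same: logits in, modified logits out, before the sampler runs.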
@@ -250,7 +250,7 @@ llm = LLM(

 </details>

-For more examples on TRT-LLM LLM API, visit [`this page`](https://nvidia.github.io/TensorRT-LLM/llm-api-examples/).
+For more examples on TRT-LLM LLM API, visit [`this page`](https://nvidia.github.io/TensorRT-LLM/examples/llm_api_examples.html).