Mirror of https://github.com/NVIDIA/TensorRT-LLM.git, synced 2026-01-13 22:18:36 +08:00
Minor fixes for documents (#3577)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
parent fffb403125
commit f5f68ded26
@@ -1443,7 +1443,6 @@ trtllm-build --checkpoint_dir llama_3.1_405B_HF_FP8_model/trt_ckpts/tp8-pp1/ \
To run inference on the 405B model, we often need multiple nodes to accommodate the entire model. Here, we use Slurm to launch the job across nodes.

Notes:
* For the FP8 model, we can fit it on a single 8xH100 node, but we cannot support a 128k context due to memory limitations, so we test with a 64k context in this demonstration.
* For convenience, we use the Hugging Face tokenizer for tokenization.

The following script shows how to run evaluation on long context:
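The documentation hunk above describes launching the 405B evaluation across nodes with Slurm. Below is a minimal, hypothetical Python launcher sketching such a run; the node and GPU counts, the evaluation script path, its flags, and the engine directory are illustrative assumptions, not values taken from the documentation or this commit.

# Hypothetical multi-node launch sketch. Everything below is an assumption
# for illustration: node/GPU counts, the evaluation script path, and its
# command-line flags are not taken from the documentation in this commit.
import subprocess

NODES = 2            # assumed: two 8-GPU nodes to hold the 405B engine
GPUS_PER_NODE = 8    # assumed: one MPI rank per GPU

cmd = [
    "srun",
    "--nodes", str(NODES),
    "--ntasks-per-node", str(GPUS_PER_NODE),
    "--mpi=pmix",
    "python3", "examples/eval_long_context.py",                        # hypothetical script path
    "--engine_dir", "llama_3.1_405B_HF_FP8_model/trt_engines/tp8-pp1/",  # assumed engine directory
    "--max_input_length", "65536",                                     # 64k context, per the note above
]
# Run from inside a Slurm allocation (e.g., an sbatch script or salloc session).
subprocess.run(cmd, check=True)

In practice this command would typically live in an sbatch script so Slurm allocates the nodes before srun fans the ranks out across them.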
@@ -154,7 +154,7 @@ def parse_arguments():
         default=False,
         action="store_true",
         help=
-        'By default, we use dtype for KV cache. fp8_kv_cache chooses int8 quantization for KV'
+        'By default, we use dtype for KV cache. fp8_kv_cache chooses fp8 quantization for KV'
     )
     parser.add_argument(
         '--quant_ckpt_path',
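For context, here is a minimal sketch of the corrected argument definition. The flag name --fp8_kv_cache is inferred from the help text, and the surrounding parser setup is reconstructed for illustration, so it may differ from the actual file.

# Sketch of the corrected flag definition. The flag name --fp8_kv_cache is
# inferred from the help text; the rest of the parser setup is reconstructed
# for illustration and may not match the original file exactly.
import argparse

def parse_arguments():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--fp8_kv_cache',
        default=False,
        action="store_true",
        help='By default, we use dtype for KV cache. fp8_kv_cache chooses fp8 quantization for KV'
    )
    parser.add_argument('--quant_ckpt_path', type=str, default=None)  # next argument shown in the hunk
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_arguments()
    print(args.fp8_kv_cache, args.quant_ckpt_path)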