Mirror of https://github.com/NVIDIA/TensorRT-LLM.git, synced 2026-01-13 22:18:36 +08:00
Minor fixes for documents (#3577)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
parent fffb403125
commit f5f68ded26
@@ -1443,7 +1443,6 @@ trtllm-build --checkpoint_dir llama_3.1_405B_HF_FP8_model/trt_ckpts/tp8-pp1/ \
To run inference on the 405B model, we often need multiple nodes to accommodate the entire model. Here, we use Slurm to launch the job across nodes.

Notes:
* For the FP8 model, we can fit it on a single 8xH100 node, but we cannot support a 128k context due to memory limitations, so we test with a 64k context in this demonstration.
* For convenience, we use the Hugging Face tokenizer for tokenization.

The following script shows how to run evaluation on long context:
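The documentation hunk above describes launching the 405B evaluation across nodes with Slurm. Below is a minimal, hypothetical Python launcher sketching such a run; the node and GPU counts, the evaluation script path, its flags, and the engine directory are illustrative assumptions, not values taken from the documentation or this commit.

# Hypothetical multi-node launch sketch. Everything below is an assumption
# for illustration: node/GPU counts, the evaluation script path, and its
# command-line flags are not taken from the documentation in this commit.
import subprocess

NODES = 2            # assumed: two 8-GPU nodes to hold the 405B engine
GPUS_PER_NODE = 8    # assumed: one MPI rank per GPU

cmd = [
    "srun",
    "--nodes", str(NODES),
    "--ntasks-per-node", str(GPUS_PER_NODE),
    "--mpi=pmix",
    "python3", "examples/eval_long_context.py",                        # hypothetical script path
    "--engine_dir", "llama_3.1_405B_HF_FP8_model/trt_engines/tp8-pp1/",  # assumed engine directory
    "--max_input_length", "65536",                                     # 64k context, per the note above
]
# Run from inside a Slurm allocation (e.g., an sbatch script or salloc session).
subprocess.run(cmd, check=True)

In practice this command would typically live in an sbatch script so Slurm allocates the nodes before srun fans the ranks out across them.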
@@ -154,7 +154,7 @@ def parse_arguments():
         default=False,
         action="store_true",
         help=
-        'By default, we use dtype for KV cache. fp8_kv_cache chooses int8 quantization for KV'
+        'By default, we use dtype for KV cache. fp8_kv_cache chooses fp8 quantization for KV'
     )
     parser.add_argument(
         '--quant_ckpt_path',
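For context, here is a minimal sketch of the corrected argument definition. The flag name --fp8_kv_cache is inferred from the help text, and the surrounding parser setup is reconstructed for illustration, so it may differ from the actual file.

# Sketch of the corrected flag definition. The flag name --fp8_kv_cache is
# inferred from the help text; the rest of the parser setup is reconstructed
# for illustration and may not match the original file exactly.
import argparse

def parse_arguments():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--fp8_kv_cache',
        default=False,
        action="store_true",
        help='By default, we use dtype for KV cache. fp8_kv_cache chooses fp8 quantization for KV'
    )
    parser.add_argument('--quant_ckpt_path', type=str, default=None)  # next argument shown in the hunk
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_arguments()
    print(args.fp8_kv_cache, args.quant_ckpt_path)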