
# Evaluation scripts for LLM tasks

This folder includes code to use [LM-Eval-Harness](https://github.com/EleutherAI/lm-evaluation-harness), a unified framework for testing generative language models on a large number of different evaluation tasks. See the LM-Eval-Harness repository for the list of supported tasks.

The following instructions show how to evaluate TRT-LLM engines with the benchmark.

## Instructions

### TRT-LLM API

Build the TRT-LLM engine using `trtllm-build`.
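
A minimal build, assuming you have already converted an HF checkpoint with the model's `convert_checkpoint.py` script, might look like the following sketch (both directories are placeholder paths):

```bash
# Sketch only: build a TensorRT-LLM engine from a converted checkpoint.
# ./ckpt and ./engine_dir are placeholder paths.
trtllm-build --checkpoint_dir ./ckpt \
    --output_dir ./engine_dir \
    --gemm_plugin auto
```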

Install the `lm_eval` package and the other dependencies listed in the `requirements.txt` file in this folder:
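
```bash
pip install -r requirements.txt
```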

Run the evaluation script with the following command:

```bash
python lm_eval_tensorrt_llm.py --model trt-llm \
    --model_args tokenizer=<HF model folder>,model=<TRT-LLM engine dir>,chunk_size=<int> \
    --tasks <comma-separated tasks, e.g., gsm8k-cot, mmlu>
```

In the LM-Eval-Harness, model arguments are submitted as a comma-separated list of `arg=value` pairs. The `trt-llm` model supports the following `model_args`:

| Name | Description | Default Value |
|------|-------------|---------------|
| `tokenizer` | Directory containing the HF tokenizer. | |
| `model` | Directory containing the TRT-LLM engine or Torch model. | |
| `max_gen_toks` | Maximum number of tokens to generate (if not specified in `gen_kwargs`). | 256 |
| `chunk_size` | Number of async requests to send to the engine at once. | 200 |
| `max_tokens_kv_cache` | Maximum number of tokens in the paged KV cache. | None |
| `free_gpu_memory_fraction` | Fraction of free GPU memory given to the KV cache. | 0.9 |
| `trust_remote_code` | Trust remote code; use if necessary to set up the tokenizer. | False |
| `tp` | Tensor parallel size (for the Torch backend). | number of workers |
| `use_cuda_graph` | Enable CUDA graph. | True |
| `max_context_length` | Maximum context length for evaluation. | None |
| `moe_expert_parallel_size` | Expert parallel size for MoE models. | None |
| `moe_backend` | Backend for MoE models (e.g., `"TRTLLM"`). | `"TRTLLM"` |
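
For example, a hypothetical run that raises the generation budget and caps the KV-cache memory fraction (all paths below are placeholders) could look like:

```bash
# Hypothetical example: evaluate a local engine on GSM8K with a larger
# generation budget and a smaller KV-cache memory fraction.
python lm_eval_tensorrt_llm.py --model trt-llm \
    --model_args tokenizer=./llama-3-8b-hf,model=./llama-3-8b-engine,chunk_size=100,max_gen_toks=512,free_gpu_memory_fraction=0.8 \
    --tasks gsm8k
```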

### Torch backend

Install the `lm_eval` package from the `requirements.txt` file in this folder, as above.

Run the evaluation script with the same command as above, but include `backend=torch` in the `model_args`. For example:

```bash
python lm_eval_tensorrt_llm.py --model trt-llm \
    --model_args model=<HF model folder>,backend=torch,chunk_size=<int> \
    --tasks <comma-separated tasks, e.g., gsm8k-cot, mmlu>
```
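
Since the Torch backend loads the HF checkpoint directly, Torch-specific args such as `tp` apply here; a hypothetical two-GPU run (with placeholder paths) might be:

```bash
# Hypothetical example: Torch backend with tensor parallel size 2.
python lm_eval_tensorrt_llm.py --model trt-llm \
    --model_args model=./llama-3-8b-hf,backend=torch,tp=2,chunk_size=100 \
    --tasks mmlu
```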

### trtllm-serve

Build the TRT-LLM engine using `trtllm-build` and deploy it with `trtllm-serve`.
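
As a sketch, serving the engine on the port used in the command below might look like this (the engine and tokenizer paths are placeholders, and the exact `trtllm-serve` flags may vary by version):

```bash
# Sketch only: serve the built engine on port 8001 to match the base_url below.
# ./engine_dir and ./llama-3-8b-hf are placeholder paths.
trtllm-serve ./engine_dir --tokenizer ./llama-3-8b-hf --port 8001
```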

Install the `lm_eval` package from the `requirements.txt` file in this folder, as above.

Run the evaluation script with the following command:

```bash
python lm_eval_tensorrt_llm.py --model local-completions \
    --model_args base_url=http://${HOST_NAME}:8001/v1/completions,model=<model_name>,tokenizer=<tokenizer_dir> \
    --tasks <comma-separated tasks, e.g., gsm8k-cot, mmlu> \
    --batch_size <#>
```

Because `trtllm-serve` exposes an OpenAI-compatible API, we can use the `local-completions` model built into lm_eval; see the LM-Eval-Harness documentation for the `model_args` it supports.
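
Before launching a long evaluation, it can help to confirm the endpoint is reachable. A minimal smoke test against the OpenAI-style completions route (host, port, and model name are placeholders) might be:

```bash
# Hypothetical smoke test against the OpenAI-compatible completions endpoint.
curl http://${HOST_NAME}:8001/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "<model_name>", "prompt": "Hello", "max_tokens": 8}'
```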