TensorRT-LLMs/tests/integration/defs/accuracy
Enwei Zhu b2f69db507
test: Accuracy test improvement (Part 3.1): Extend accuracy test suite with LLM API and initial implementation of trtllm-eval (#3167)
* add eval_llmapi

Signed-off-by: Enwei Zhu <21126786+syuoni@users.noreply.github.com>

2025-04-01 22:20:29 +08:00
Directory contents (each entry was last changed by #3167 on 2025-04-01 22:20:29 +08:00 unless noted otherwise):

  • media
  • references
  • scripts
  • __init__.py (last changed by "Update (#2978)" on 2025-03-23 16:39:35 +08:00)
  • accuracy_core.py
  • README.md
  • test_cli_flow.py
  • test_llm_api_pytorch.py
  • test_llm_api.py

Accuracy Evaluation

Hypothesis testing methodology

Null hypothesis and alternative hypothesis

For a given dataset and model, the evaluated scores can be viewed as a population with mean \mu and variance \sigma^2. Note that the distribution is not necessarily normal.

After implementing a model, we need to set up an accuracy reference. By evaluating the model on a subset of n samples, we effectively draw n scores x_1, x_2, \dots, x_n from the population, so we can compute and record the sample average \bar{x} = \frac{1}{n} \sum_{i} x_i.

To test for an accuracy regression, we evaluate the model on n samples again, obtaining x'_1, x'_2, \dots, x'_n and the sample average \bar{x'} = \frac{1}{n} \sum_{i} x'_i. The question is whether these n samples are drawn from the same distribution as the reference. This can be formulated as a hypothesis testing problem:

  • Null Hypothesis (H_0): x'_1, x'_2, \dots, x'_n are drawn from the same distribution as the reference.
  • Alternative Hypothesis (H_1): x'_1, x'_2, \dots, x'_n are drawn from a different distribution than the reference.

Since we care only about accuracy regressions, this should be a one-tailed hypothesis testing problem:

  • Null Hypothesis (H_0): x'_1, x'_2, \dots, x'_n are drawn from a distribution with a mean equal to or higher than the reference.
  • Alternative Hypothesis (H_1): x'_1, x'_2, \dots, x'_n are drawn from a distribution with a mean lower than the reference.

Hypothesis Testing

Two-sample t-test

Following the two-sample t-test method, we compute the t-statistic t = \frac{\bar{x'} - \bar{x}}{\sqrt{2 \sigma^2 / n}}, where both samples have size n and share the population variance \sigma^2. According to the Central Limit Theorem (CLT), when the two means are equal, the t-statistic follows a distribution that converges to the standard normal distribution \mathcal{N} (0, 1).
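As a rough illustration of the statistic above, the following sketch computes t and the corresponding lower-tail probability under the standard normal approximation. The function names and all numeric inputs are hypothetical and are not taken from the test suite; \sigma is assumed to be known in advance.

```python
import math
from statistics import NormalDist


def t_statistic(ref_mean: float, new_mean: float, sigma: float, n: int) -> float:
    """Return (new_mean - ref_mean) / sqrt(2 * sigma^2 / n)."""
    return (new_mean - ref_mean) / math.sqrt(2.0 * sigma**2 / n)


def lower_tail_probability(t: float) -> float:
    """P(T <= t) for T ~ N(0, 1), the CLT approximation used in the text."""
    return NormalDist().cdf(t)


# Hypothetical values for illustration only.
t_example = t_statistic(ref_mean=0.70, new_mean=0.66, sigma=0.45, n=500)
p_example = lower_tail_probability(t_example)
print(f"t = {t_example:.3f}, lower-tail probability = {p_example:.3f}")
```

A strongly negative t (small lower-tail probability) suggests that the new sample average is lower than the reference by more than sampling noise would explain.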

Given the threshold \gamma, the false positive (type I error) rate \alpha can be formulated as:


\begin{equation*}
\begin{aligned}
\alpha &= P \left(\bar{x'} \leq \gamma \mid t \sim \mathcal{N} (0, 1) \right) \\
&= P \left(t \leq \frac{\gamma - \bar{x}}{\sqrt{2 \sigma^2 / n}} \mid t \sim \mathcal{N} (0, 1) \right).
\end{aligned}
\end{equation*}

In practice, we set \alpha (e.g., 0.05) and then compute the threshold \gamma:


\begin{equation*}
\gamma = \Phi^{-1} (\alpha) \cdot \sqrt{2 \sigma^2 / n} + \bar{x}.
\end{equation*}

Note that \alpha is typically smaller than 0.5, so \Phi^{-1} (\alpha) < 0 and therefore \gamma < \bar{x}.
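The threshold formula can be evaluated with the Python standard library alone. The sketch below is a minimal example under assumed inputs; `accuracy_threshold` is a hypothetical helper name, and the reference mean, \sigma and n are made-up numbers.

```python
import math
from statistics import NormalDist


def accuracy_threshold(ref_mean: float, sigma: float, n: int, alpha: float = 0.05) -> float:
    """gamma = Phi^{-1}(alpha) * sqrt(2 * sigma^2 / n) + ref_mean."""
    standard_error = math.sqrt(2.0 * sigma**2 / n)
    return NormalDist().inv_cdf(alpha) * standard_error + ref_mean


# Hypothetical values for illustration only.
gamma_example = accuracy_threshold(ref_mean=0.70, sigma=0.45, n=500, alpha=0.05)
print(f"gamma = {gamma_example:.4f}")
```

With these inputs the threshold lands a few points below the reference mean; a larger n or a smaller \sigma tightens the gap.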

Given the minimum detectable effect \theta, the false negative (type II error) rate \beta can be formulated as:


\begin{equation*}
\begin{aligned}
\beta &= P \left(\bar{x'} > \gamma \mid t \sim \mathcal{N} (-\frac{\theta}{\sqrt{2 \sigma^2 / n}}, 1) \right) \\
&= P \left(t > \frac{\gamma - \bar{x}}{\sqrt{2 \sigma^2 / n}} \mid t \sim \mathcal{N} (-\frac{\theta}{\sqrt{2 \sigma^2 / n}}, 1) \right) \\
&= P \left(t + \frac{\theta}{\sqrt{2 \sigma^2 / n}} > \frac{\gamma - \bar{x} + \theta}{\sqrt{2 \sigma^2 / n}} \mid t + \frac{\theta}{\sqrt{2 \sigma^2 / n}} \sim \mathcal{N} (0, 1) \right) \\
&= P \left(t + \frac{\theta}{\sqrt{2 \sigma^2 / n}} > \Phi^{-1} (\alpha) + \frac{\theta}{\sqrt{2 \sigma^2 / n}} \mid t + \frac{\theta}{\sqrt{2 \sigma^2 / n}} \sim \mathcal{N} (0, 1) \right) \\
&= 1 - \Phi \left(\Phi^{-1} (\alpha) + \frac{\theta}{\sqrt{2 \sigma^2 / n}} \right).
\end{aligned}
\end{equation*}
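Because the shifted statistic in the last line of the derivation is standard normal, the probability equals 1 - \Phi (\Phi^{-1} (\alpha) + \theta / \sqrt{2 \sigma^2 / n}). The sketch below evaluates that expression; the helper name and the numbers are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def false_negative_rate(theta: float, sigma: float, n: int, alpha: float = 0.05) -> float:
    """beta = 1 - Phi(Phi^{-1}(alpha) + theta / sqrt(2 * sigma^2 / n))."""
    standard_error = math.sqrt(2.0 * sigma**2 / n)
    z = NormalDist().inv_cdf(alpha) + theta / standard_error
    return 1.0 - NormalDist().cdf(z)


# Hypothetical values for illustration only.
beta_example = false_negative_rate(theta=0.05, sigma=0.45, n=500, alpha=0.05)
print(f"beta = {beta_example:.3f}")
```

When \theta is 0 the expression reduces to 1 - \alpha, and \beta decreases as the true regression \theta grows.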

In practice, we set \beta (e.g., 0.2) and then compute \theta:


\begin{equation*}
\begin{aligned}
\theta &= (\Phi^{-1} (1-\beta) - \Phi^{-1} (\alpha)) \cdot \sqrt{2 \sigma^2 / n} \\
&= - (\Phi^{-1} (\alpha) + \Phi^{-1} (\beta)) \cdot \sqrt{2 \sigma^2 / n}
\end{aligned}
\end{equation*}

Note that \alpha and \beta are typically smaller than 0.5, so \Phi^{-1} (\alpha) and \Phi^{-1} (\beta) are both negative and \theta > 0.
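The minimum detectable effect can be computed in the same way. This is a small sketch with a hypothetical helper name and made-up inputs; it also checks numerically that the two forms of the formula agree.

```python
import math
from statistics import NormalDist


def minimum_detectable_effect(sigma: float, n: int, alpha: float = 0.05, beta: float = 0.2) -> float:
    """theta = (Phi^{-1}(1 - beta) - Phi^{-1}(alpha)) * sqrt(2 * sigma^2 / n)."""
    nd = NormalDist()
    standard_error = math.sqrt(2.0 * sigma**2 / n)
    return (nd.inv_cdf(1.0 - beta) - nd.inv_cdf(alpha)) * standard_error


# Hypothetical values for illustration only.
theta_example = minimum_detectable_effect(sigma=0.45, n=500, alpha=0.05, beta=0.2)

# Second form of the formula, using Phi^{-1}(1 - beta) = -Phi^{-1}(beta).
nd = NormalDist()
theta_alt = -(nd.inv_cdf(0.05) + nd.inv_cdf(0.2)) * math.sqrt(2.0 * 0.45**2 / 500)
print(f"theta = {theta_example:.4f}, alternative form = {theta_alt:.4f}")
```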


Steps to add accuracy tests

  • Estimate \sigma from the full dataset.
  • Choose a target minimum detectable effect \theta based on the nature of the dataset and the corresponding accuracy metric.
  • Choose \alpha and \beta based on the importance of the model.
  • Increase the sample size n from small to large, computing \theta at each step, until it is equal to or lower than the target \theta.
  • Evaluate the model on a subset of n samples to obtain the reference accuracy.
  • The threshold \gamma is then set automatically from \alpha, \sigma, n and the reference accuracy.
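The steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: the per-sample scores are synthetic binary values standing in for a real full-dataset evaluation, the helper names are hypothetical, and the actual test suite may organize this differently.

```python
import math
import random
import statistics
from statistics import NormalDist


def minimum_detectable_effect(sigma: float, n: int, alpha: float, beta: float) -> float:
    """theta = (Phi^{-1}(1 - beta) - Phi^{-1}(alpha)) * sqrt(2 * sigma^2 / n)."""
    nd = NormalDist()
    return (nd.inv_cdf(1.0 - beta) - nd.inv_cdf(alpha)) * math.sqrt(2.0 * sigma**2 / n)


def smallest_sample_size(sigma: float, target_theta: float, alpha: float, beta: float, max_n: int) -> int:
    """Increase n until the computed theta is at or below the target."""
    for n in range(1, max_n + 1):
        if minimum_detectable_effect(sigma, n, alpha, beta) <= target_theta:
            return n
    raise ValueError("target theta is not reachable within max_n samples")


# Synthetic per-sample scores (assumption): 70% correct on a 10,000-sample dataset.
full_scores = [1.0] * 7000 + [0.0] * 3000
alpha, beta, target_theta = 0.05, 0.2, 0.05

# Estimate sigma from the full dataset.
sigma = statistics.pstdev(full_scores)
# Find the smallest n whose theta meets the target.
n = smallest_sample_size(sigma, target_theta, alpha, beta, max_n=len(full_scores))
# Evaluate on a subset of n samples to get the reference accuracy.
subset = random.Random(0).sample(full_scores, n)
ref_accuracy = statistics.fmean(subset)
# Derive the threshold gamma from alpha, sigma, n and the reference accuracy.
gamma = NormalDist().inv_cdf(alpha) * math.sqrt(2.0 * sigma**2 / n) + ref_accuracy

print(f"sigma={sigma:.4f}, n={n}, reference={ref_accuracy:.4f}, gamma={gamma:.4f}")
```

The linear search mirrors the "small to large" wording of the steps; since \theta shrinks proportionally to 1 / \sqrt{n}, a closed-form solution for n would work equally well.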