# Inference Request
The main class to describe requests to `GptManager` is `InferenceRequest`. This is structured as a map of tensors and a `uint64_t requestId`.
The mandatory tensors required to create a valid `InferenceRequest` object are described below; a minimal construction sketch follows the table. Sampling Config parameters are documented in more detail [here](gpt_runtime.md#sampling-parameters), so their descriptions are omitted from the tables below:
| Name | Shape | Type | Description |
| :----------------------: | :----------------------------: | :-----------------------------: | :-----------------------------: |
| `request_output_len` | [1,1] | `int32_t` | Max number of output tokens |
| `input_ids` | [1, num_input_tokens] | `int32_t` | Tensor of input tokens |
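
To illustrate the map-of-tensors structure, the sketch below assembles the two mandatory tensors for a single request. Note that the `Tensor` struct and the `makeMinimalRequestTensors` helper are hypothetical placeholders used purely for illustration; they are not the actual TensorRT-LLM tensor classes or API.

```cpp
// Conceptual sketch only: `Tensor` and `makeMinimalRequestTensors` stand in for
// the runtime's actual tensor types and are NOT part of the TensorRT-LLM API.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct Tensor
{
    std::vector<int64_t> shape; // e.g. {1, numInputTokens}
    std::vector<int32_t> data;  // flattened int32 payload
};

// Build the two mandatory tensors for a request.
std::map<std::string, Tensor> makeMinimalRequestTensors(
    std::vector<int32_t> const& inputIds, int32_t maxOutputLen)
{
    std::map<std::string, Tensor> tensors;
    // "input_ids": shape [1, num_input_tokens]
    tensors["input_ids"] = Tensor{{1, static_cast<int64_t>(inputIds.size())}, inputIds};
    // "request_output_len": shape [1, 1]
    tensors["request_output_len"] = Tensor{{1, 1}, {maxOutputLen}};
    return tensors;
}

int main()
{
    uint64_t const requestId = 42; // unique per request
    auto tensors = makeMinimalRequestTensors({1, 2, 3, 4}, /*maxOutputLen=*/16);
    // An InferenceRequest pairs `requestId` with this tensor map before it is
    // handed to GptManager.
    (void) requestId;
    (void) tensors;
    return 0;
}
```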
Optional tensors that can be supplied to `InferenceRequest` are shown below. Default values, where applicable, are specified. A sketch showing how a few of these optional tensors extend the request follows the table:
| Name | Shape | Type | Description |
| :----------------------: | :----------------------------: | :-----------------------------: | :-----------------------------: |
| `streaming` | [1] | `bool` | (Default=`false`). When `true`, stream out tokens as they are generated. When `false`, return only when the full generation has completed. |
| `beam_width` | [1] | `int32_t` | (Default=1) Beam width for this request; set to 1 for greedy sampling |
| `temperature` | [1] | `float` | Sampling Config param: `temperature` |
| `runtime_top_k` | [1] | `int32_t` | Sampling Config param: `topK` |
| `runtime_top_p` | [1] | `float` | Sampling Config param: `topP` |
| `len_penalty` | [1] | `float` | Sampling Config param: `lengthPenalty` |
| `repetition_penalty` | [1] | `float` | Sampling Config param: `repetitionPenalty` |
| `min_length` | [1] | `int32_t` | Sampling Config param: `minLength` |
| `presence_penalty` | [1] | `float` | Sampling Config param: `presencePenalty` |
| `frequency_penalty` | [1] | `float` | Sampling Config param: `frequencyPenalty` |
| `random_seed` | [1] | `uint64_t` | Sampling Config param: `randomSeed` |
| `end_id` | [1] | `int32_t` | End token Id |
| `pad_id` | [1] | `int32_t` | Pad token Id |
| `embedding_bias` | [1] | `float` | Embedding bias |
| `bad_words_list` | [2, num_bad_words] | `int32_t` | Bad words list |
| `stop_words_list` | [2, num_stop_words] | `int32_t` | Stop words list |
| `prompt_embedding_table` | [1] | `float16` | P-tuning prompt embedding table |
| `prompt_vocab_size` | [1] | `int32_t` | P-tuning prompt vocab size |
| `lora_weights` | [ num_lora_modules_layers, D x Hi + Ho x D ] | `float` (model data type) | Weights for a LoRA adapter. See the [LoRA docs](lora.md#lora-tensor-format-details) for more details. |
| `lora_config` | [3] | `int32_t` | LoRA configuration tensor: `[ module_id, layer_idx, adapter_size (D aka R value) ]`. See the [LoRA docs](lora.md#lora-tensor-format-details) for more details. |
| `return_log_probs` | [1] | `bool` | When `true`, include log probs in the output |
| `return_context_logits` | [1] | `bool` | When `true`, include context logits in the output |
| `return_generation_logits` | [1] | `bool` | When `true`, include generation logits in the output |
| `draft_input_ids` | [num_draft_tokens] | `int32_t` | Draft tokens to be leveraged in the generation phase to potentially generate multiple output tokens in a single in-flight batching iteration |
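
Continuing the same illustrative style (again with hypothetical placeholder types rather than the actual TensorRT-LLM tensor classes), the sketch below shows how a few optional scalar tensors such as `streaming`, `beam_width`, and `temperature` could be added alongside the mandatory ones.

```cpp
// Conceptual sketch only: these helpers are hypothetical and NOT part of the
// TensorRT-LLM API; they illustrate how optional tensors extend the request map.
#include <cstdint>
#include <map>
#include <string>
#include <variant>
#include <vector>

using Scalar = std::variant<bool, int32_t, float>;

struct OptionalTensor
{
    std::vector<int64_t> shape; // optional scalars use shape [1]
    Scalar value;
};

int main()
{
    std::map<std::string, OptionalTensor> optional;
    optional["streaming"]     = OptionalTensor{{1}, true};        // stream tokens as generated
    optional["beam_width"]    = OptionalTensor{{1}, int32_t{4}};  // beam search with width 4
    optional["temperature"]   = OptionalTensor{{1}, 0.7f};        // Sampling Config: temperature
    optional["runtime_top_k"] = OptionalTensor{{1}, int32_t{50}}; // Sampling Config: topK
    // These entries are merged with the mandatory tensors from the previous
    // sketch before the request is submitted to GptManager.
    return 0;
}
```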