(inference-request)=

# Inference Request

The main class to describe requests to `GptManager` is `InferenceRequest`. It is structured as a map of tensors and a `uint64_t requestId`.

The mandatory tensors needed to create a valid `InferenceRequest` object are described below. Sampling config parameters are documented in the {ref}`gpt-runtime` section; their detailed descriptions are therefore omitted from the tables.

| Name | Shape | Type | Description |
| :--: | :---: | :--: | :---------: |
| `request_output_len` | [1, 1] | `int32_t` | Max number of output tokens |
| `input_ids` | [1, num_input_tokens] | `int32_t` | Tensor of input tokens |

Optional tensors that can be supplied to `InferenceRequest` are shown below. Default values are specified where applicable:

| Name | Shape | Type | Description |
| :--: | :---: | :--: | :---------: |
| `streaming` | [1] | `bool` | (Default=`false`). When `true`, stream out tokens as they are generated; when `false`, return only when the full generation has completed |
| `beam_width` | [1] | `int32_t` | (Default=1) Beam width for this request; set to 1 for greedy sampling |
| `temperature` | [1] | `float` | Sampling Config param: `temperature` |
| `runtime_top_k` | [1] | `int32_t` | Sampling Config param: `topK` |
| `runtime_top_p` | [1] | `float` | Sampling Config param: `topP` |
| `len_penalty` | [1] | `float` | Sampling Config param: `lengthPenalty` |
| `early_stopping` | [1] | `int` | Sampling Config param: `earlyStopping` |
| `repetition_penalty` | [1] | `float` | Sampling Config param: `repetitionPenalty` |
| `min_length` | [1] | `int32_t` | Sampling Config param: `minLength` |
| `presence_penalty` | [1] | `float` | Sampling Config param: `presencePenalty` |
| `frequency_penalty` | [1] | `float` | Sampling Config param: `frequencyPenalty` |
| `random_seed` | [1] | `uint64_t` | Sampling Config param: `randomSeed` |
| `end_id` | [1] | `int32_t` | End token ID. If not specified, defaults to -1 |
| `pad_id` | [1] | `int32_t` | Pad token ID |
| `embedding_bias` | [1] | `float` | Embedding bias |
| `bad_words_list` | [2, num_bad_words] | `int32_t` | Bad words list |
| `stop_words_list` | [2, num_stop_words] | `int32_t` | Stop words list |
| `prompt_embedding_table` | [1] | `float16` | P-tuning prompt embedding table |
| `prompt_vocab_size` | [1] | `int32_t` | P-tuning prompt vocab size |
| `lora_task_id` | [1] | `uint64_t` | Task ID for the given `lora_weights`. This ID is expected to be globally unique. To perform inference with a specific LoRA for the first time, `lora_task_id`, `lora_weights`, and `lora_config` must all be given. The LoRA will be cached, so that subsequent requests for the same task only require `lora_task_id`. If the cache is full, the oldest LoRA will be evicted to make space for new ones. An error is returned if `lora_task_id` is not cached |
| `lora_weights` | [num_lora_modules_layers, D x Hi + Ho x D] | `float` (model data type) | Weights for a LoRA adapter. Refer to {ref}`lora` for more information |
| `lora_config` | [num_lora_modules_layers, 3] | `int32_t` | LoRA configuration tensor: `[module_id, layer_idx, adapter_size (D aka R value)]`. Refer to {ref}`lora` for more information |
| `return_log_probs` | [1] | `bool` | When `true`, include log probs in the output |
| `return_context_logits` | [1] | `bool` | When `true`, include context logits in the output |
| `return_generation_logits` | [1] | `bool` | When `true`, include generation logits in the output |
| `draft_input_ids` | [num_draft_tokens] | `int32_t` | Draft tokens to be leveraged in the generation phase to potentially generate multiple output tokens in one in-flight batching iteration |
| `draft_logits` | [num_draft_tokens, vocab_size] | `float` | Draft logits associated with `draft_input_ids`, to be leveraged in the generation phase to potentially generate multiple output tokens in one in-flight batching iteration |
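
To make the request layout concrete, the sketch below assembles such a map of named tensors plus a `uint64_t` request ID for a single request, using the two mandatory tensors and a few optional sampling tensors from the tables above. The `SimpleTensor` and `SimpleInferenceRequest` types and the `makeRequest` helper are hypothetical stand-ins for illustration only; they are not the actual `InferenceRequest` API, whose concrete tensor types and setters are defined by the batch manager headers.

```cpp
// Illustrative sketch only: SimpleTensor, SimpleInferenceRequest and makeRequest
// are hypothetical stand-ins that mirror the documented structure (a map of
// named tensors plus a uint64_t request ID), not the real TensorRT-LLM types.
#include <cstdint>
#include <map>
#include <string>
#include <variant>
#include <vector>

struct SimpleTensor
{
    std::vector<std::int64_t> shape;  // e.g. {1, num_input_tokens}
    std::variant<std::vector<std::int32_t>, std::vector<float>> data;
};

struct SimpleInferenceRequest
{
    std::uint64_t requestId;                       // unique per request
    std::map<std::string, SimpleTensor> tensors;   // keyed by the names in the tables above
};

// Build a minimal request: the two mandatory tensors plus a few optional ones.
SimpleInferenceRequest makeRequest(
    std::uint64_t requestId, std::vector<std::int32_t> inputIds, std::int32_t maxNewTokens)
{
    SimpleInferenceRequest req;
    req.requestId = requestId;

    auto const numInputTokens = static_cast<std::int64_t>(inputIds.size());

    // Mandatory: input tokens, shape [1, num_input_tokens].
    req.tensors["input_ids"] = {{1, numInputTokens}, std::move(inputIds)};
    // Mandatory: maximum number of output tokens, shape [1, 1].
    req.tensors["request_output_len"] = {{1, 1}, std::vector<std::int32_t>{maxNewTokens}};

    // Optional sampling parameters, shape [1] each.
    req.tensors["beam_width"] = {{1}, std::vector<std::int32_t>{1}};
    req.tensors["temperature"] = {{1}, std::vector<float>{0.8f}};
    req.tensors["runtime_top_k"] = {{1}, std::vector<std::int32_t>{50}};

    return req;
}

int main()
{
    auto request = makeRequest(/*requestId=*/42, /*inputIds=*/{1, 42, 13, 7}, /*maxNewTokens=*/64);
    return request.tensors.size() == 5 ? 0 : 1;
}
```

Keying the request by tensor name keeps the schema extensible: optional features such as LoRA (`lora_task_id`, `lora_weights`, `lora_config`) or speculative decoding (`draft_input_ids`, `draft_logits`) simply add more named entries to the same map, while omitted optional tensors fall back to their defaults where applicable.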