### Inference Request

The main class used to describe requests to `GptManager` is `InferenceRequest`. It is structured as a map of named tensors plus a `uint64_t` `requestId`. The mandatory tensors needed to create a valid `InferenceRequest` object are listed below (a construction sketch follows the table). Parameters that map to Sampling Config fields are documented in more detail in the Sampling Config documentation, so their descriptions are abbreviated in the tables:

| Name | Shape | Type | Description |
| :--- | :--- | :--- | :--- |
| `request_output_len` | [1, 1] | `int32_t` | Maximum number of output tokens |
| `input_ids` | [1, num_input_tokens] | `int32_t` | Tensor of input tokens |
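
For illustration, here is a minimal sketch of assembling a valid request from the two mandatory tensors. Only the overall structure (a `uint64_t` request id plus a map of named tensors) comes from this document; the `NamedTensor` helper type and function names below are assumptions for readability, not the exact TensorRT-LLM API.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the runtime's tensor type; the real API uses
// TensorRT-LLM's own tensor class, so treat this as illustration only.
struct NamedTensor
{
    std::vector<int64_t> shape;
    std::vector<int32_t> data; // this sketch only needs int32_t tensors
};

// InferenceRequest as described above: a uint64_t requestId plus a map of
// named tensors (a sketch, not the library's actual class definition).
struct InferenceRequestSketch
{
    uint64_t requestId;
    std::map<std::string, NamedTensor> tensors;
};

InferenceRequestSketch makeMinimalRequest(uint64_t requestId,
                                          std::vector<int32_t> inputTokens,
                                          int32_t maxOutputLen)
{
    InferenceRequestSketch req{requestId, {}};

    // input_ids: shape [1, num_input_tokens], the prompt tokens.
    auto const numTokens = static_cast<int64_t>(inputTokens.size());
    req.tensors["input_ids"] = NamedTensor{{1, numTokens}, std::move(inputTokens)};

    // request_output_len: shape [1, 1], maximum number of output tokens.
    req.tensors["request_output_len"] = NamedTensor{{1, 1}, {maxOutputLen}};

    return req;
}
```

For example, `makeMinimalRequest(1, {1, 2, 3}, 64)` builds a request with id 1 that asks for at most 64 new tokens for a three-token prompt.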

Optional tensors that can be supplied to `InferenceRequest` are shown below; default values are specified where applicable. A sketch of populating a few of these follows the table:

| Name | Shape | Type | Description |
| :--- | :--- | :--- | :--- |
| `streaming` | [1] | `bool` | (Default=`false`) When `true`, stream out tokens as they are generated; when `false`, return only once the full generation has completed |
| `beam_width` | [1] | `int32_t` | (Default=1) Beam width for this request; set to 1 for greedy sampling |
| `temperature` | [1] | `float` | Sampling Config param: `temperature` |
| `runtime_top_k` | [1] | `int32_t` | Sampling Config param: `topK` |
| `runtime_top_p` | [1] | `float` | Sampling Config param: `topP` |
| `len_penalty` | [1] | `float` | Sampling Config param: `lengthPenalty` |
| `repetition_penalty` | [1] | `float` | Sampling Config param: `repetitionPenalty` |
| `min_length` | [1] | `int32_t` | Sampling Config param: `minLength` |
| `presence_penalty` | [1] | `float` | Sampling Config param: `presencePenalty` |
| `frequency_penalty` | [1] | `float` | Sampling Config param: `frequencyPenalty` |
| `random_seed` | [1] | `uint64_t` | Sampling Config param: `randomSeed` |
| `end_id` | [1] | `int32_t` | End token id |
| `pad_id` | [1] | `int32_t` | Pad token id |
| `embedding_bias` | [1] | `float` | Embedding bias |
| `bad_words_list` | [2, num_bad_words] | `int32_t` | Bad words list |
| `stop_words_list` | [2, num_stop_words] | `int32_t` | Stop words list |
| `prompt_embedding_table` | [1] | `float16` | P-tuning prompt embedding table |
| `prompt_vocab_size` | [1] | `int32_t` | P-tuning prompt vocab size |
| `lora_weights` | [num_lora_modules_layers, D x Hi + Ho x D] | `float` (model data type) | Weights for a LoRA adapter; see the LoRA docs for more details |
| `lora_config` | [3] | `int32_t` | LoRA configuration tensor: [module_id, layer_idx, adapter_size (D, a.k.a. R value)]; see the LoRA docs for more details |
| `return_log_probs` | [1] | `bool` | When `true`, include log probs in the output |
| `return_context_logits` | [1] | `bool` | When `true`, include context logits in the output |
| `return_generation_logits` | [1] | `bool` | When `true`, include generation logits in the output |
| `draft_input_ids` | [num_draft_tokens] | `int32_t` | Draft tokens to be leveraged in the generation phase to potentially produce multiple output tokens per in-flight batching iteration |
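
Continuing the sketch above, optional tensors go into the same map under the names from the table. The tensor names, shapes, element types, and default values are taken from the table; the `NamedTensor` helper and function below are again assumptions for illustration, not the library's API.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <variant>
#include <vector>

// Hypothetical tensor type covering the element types used by the optional
// tensors above (illustration only; the real API uses TensorRT-LLM's tensor class).
struct NamedTensor
{
    std::vector<int64_t> shape;
    std::variant<std::vector<bool>, std::vector<int32_t>,
                 std::vector<float>, std::vector<uint64_t>> data;
};

using TensorMap = std::map<std::string, NamedTensor>;

// Attach a few of the optional tensors from the table; each is a shape-[1] scalar.
void addSamplingOptions(TensorMap& tensors)
{
    tensors["streaming"]   = NamedTensor{{1}, std::vector<bool>{true}};      // stream tokens as they are generated
    tensors["beam_width"]  = NamedTensor{{1}, std::vector<int32_t>{1}};      // 1 => greedy sampling
    tensors["temperature"] = NamedTensor{{1}, std::vector<float>{0.7f}};     // Sampling Config: temperature
    tensors["random_seed"] = NamedTensor{{1}, std::vector<uint64_t>{42ULL}}; // Sampling Config: randomSeed
}
```

Note that `bad_words_list` and `stop_words_list` are the only 2-D optional tensors, shaped [2, num_words]; the exact packing of word lists into those two rows is not reproduced here.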