### Inference Request

The main class used to describe requests to `GptManager` is `InferenceRequest`. It is structured as a map of named tensors plus a `uint64_t` `requestId`. The mandatory tensors needed to create a valid `InferenceRequest` object are listed below (a construction sketch follows the table). Parameters that map to Sampling Config fields are documented in more detail in the Sampling Config documentation, so their descriptions are abbreviated in the tables:

| Name | Shape | Type | Description |
| :--- | :--- | :--- | :--- |
| `request_output_len` | [1, 1] | `int32_t` | Maximum number of output tokens |
| `input_ids` | [1, num_input_tokens] | `int32_t` | Tensor of input tokens |
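
For illustration, here is a minimal sketch of assembling a valid request from the two mandatory tensors. Only the overall structure (a `uint64_t` request id plus a map of named tensors) comes from this document; the `NamedTensor` helper type and function names below are assumptions for readability, not the exact TensorRT-LLM API.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the runtime's tensor type; the real API uses
// TensorRT-LLM's own tensor class, so treat this as illustration only.
struct NamedTensor
{
    std::vector<int64_t> shape;
    std::vector<int32_t> data; // this sketch only needs int32_t tensors
};

// InferenceRequest as described above: a uint64_t requestId plus a map of
// named tensors (a sketch, not the library's actual class definition).
struct InferenceRequestSketch
{
    uint64_t requestId;
    std::map<std::string, NamedTensor> tensors;
};

InferenceRequestSketch makeMinimalRequest(uint64_t requestId,
                                          std::vector<int32_t> inputTokens,
                                          int32_t maxOutputLen)
{
    InferenceRequestSketch req{requestId, {}};

    // input_ids: shape [1, num_input_tokens], the prompt tokens.
    auto const numTokens = static_cast<int64_t>(inputTokens.size());
    req.tensors["input_ids"] = NamedTensor{{1, numTokens}, std::move(inputTokens)};

    // request_output_len: shape [1, 1], maximum number of output tokens.
    req.tensors["request_output_len"] = NamedTensor{{1, 1}, {maxOutputLen}};

    return req;
}
```

For example, `makeMinimalRequest(1, {1, 2, 3}, 64)` builds a request with id 1 that asks for at most 64 new tokens for a three-token prompt.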

Optional tensors that can be supplied to `InferenceRequest` are shown below; default values are specified where applicable. A sketch of populating a few of these follows the table:

| Name | Shape | Type | Description |
| :--- | :--- | :--- | :--- |
| `streaming` | [1] | `bool` | (Default=`false`) When `true`, stream out tokens as they are generated; when `false`, return only once the full generation has completed |
| `beam_width` | [1] | `int32_t` | (Default=1) Beam width for this request; set to 1 for greedy sampling |
| `temperature` | [1] | `float` | Sampling Config param: `temperature` |
| `runtime_top_k` | [1] | `int32_t` | Sampling Config param: `topK` |
| `runtime_top_p` | [1] | `float` | Sampling Config param: `topP` |
| `len_penalty` | [1] | `float` | Sampling Config param: `lengthPenalty` |
| `repetition_penalty` | [1] | `float` | Sampling Config param: `repetitionPenalty` |
| `min_length` | [1] | `int32_t` | Sampling Config param: `minLength` |
| `presence_penalty` | [1] | `float` | Sampling Config param: `presencePenalty` |
| `frequency_penalty` | [1] | `float` | Sampling Config param: `frequencyPenalty` |
| `random_seed` | [1] | `uint64_t` | Sampling Config param: `randomSeed` |
| `end_id` | [1] | `int32_t` | End token id |
| `pad_id` | [1] | `int32_t` | Pad token id |
| `embedding_bias` | [1] | `float` | Embedding bias |
| `bad_words_list` | [2, num_bad_words] | `int32_t` | Bad words list |
| `stop_words_list` | [2, num_stop_words] | `int32_t` | Stop words list |
| `prompt_embedding_table` | [1] | `float16` | P-tuning prompt embedding table |
| `prompt_vocab_size` | [1] | `int32_t` | P-tuning prompt vocab size |
| `lora_weights` | [num_lora_modules_layers, D x Hi + Ho x D] | `float` (model data type) | Weights for a LoRA adapter; see the LoRA docs for more details |
| `lora_config` | [3] | `int32_t` | LoRA configuration tensor: [module_id, layer_idx, adapter_size (D, a.k.a. R value)]; see the LoRA docs for more details |
| `return_log_probs` | [1] | `bool` | When `true`, include log probs in the output |
| `return_context_logits` | [1] | `bool` | When `true`, include context logits in the output |
| `return_generation_logits` | [1] | `bool` | When `true`, include generation logits in the output |
| `draft_input_ids` | [num_draft_tokens] | `int32_t` | Draft tokens to be leveraged in the generation phase to potentially produce multiple output tokens per in-flight batching iteration |
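
Continuing the sketch above, optional tensors go into the same map under the names from the table. The tensor names, shapes, element types, and default values are taken from the table; the `NamedTensor` helper and function below are again assumptions for illustration, not the library's API.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <variant>
#include <vector>

// Hypothetical tensor type covering the element types used by the optional
// tensors above (illustration only; the real API uses TensorRT-LLM's tensor class).
struct NamedTensor
{
    std::vector<int64_t> shape;
    std::variant<std::vector<bool>, std::vector<int32_t>,
                 std::vector<float>, std::vector<uint64_t>> data;
};

using TensorMap = std::map<std::string, NamedTensor>;

// Attach a few of the optional tensors from the table; each is a shape-[1] scalar.
void addSamplingOptions(TensorMap& tensors)
{
    tensors["streaming"]   = NamedTensor{{1}, std::vector<bool>{true}};      // stream tokens as they are generated
    tensors["beam_width"]  = NamedTensor{{1}, std::vector<int32_t>{1}};      // 1 => greedy sampling
    tensors["temperature"] = NamedTensor{{1}, std::vector<float>{0.7f}};     // Sampling Config: temperature
    tensors["random_seed"] = NamedTensor{{1}, std::vector<uint64_t>{42ULL}}; // Sampling Config: randomSeed
}
```

Note that `bad_words_list` and `stop_words_list` are the only 2-D optional tensors, shaped [2, num_words]; the exact packing of word lists into those two rows is not reproduced here.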