## Inference Request
The main class used to describe requests to `GptManager` is `InferenceRequest`. This is structured as a map of tensors and a `uint64_t` `requestId`.
The mandatory tensors required to create a valid `InferenceRequest` object are described below. Sampling Config params are documented in more detail here, so their descriptions are omitted from the table:
| Name | Shape | Type | Description |
|---|---|---|---|
| `request_output_len` | [1,1] | `int32_t` | Max number of output tokens |
| `input_ids` | [1, num_input_tokens] | `int32_t` | Tensor of input tokens |
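To make the request layout concrete, here is a minimal sketch of assembling the two mandatory tensors. The `Tensor` struct and the example token values are hypothetical stand-ins (`GptManager` uses its own tensor class); only the tensor names, shapes, and element types come from the table above.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical minimal tensor holder; the real runtime uses its own
// tensor class. Names, shapes, and dtypes follow the table above.
struct Tensor {
    std::vector<int64_t> shape;
    std::vector<int32_t> data;  // int32_t covers both mandatory tensors
};

int main() {
    uint64_t const requestId = 42;  // unique id identifying this request

    std::map<std::string, Tensor> request;

    // "input_ids": shape [1, num_input_tokens], the prompt token ids
    request["input_ids"] = Tensor{{1, 4}, {101, 7592, 2088, 102}};

    // "request_output_len": shape [1, 1], max number of output tokens
    request["request_output_len"] = Tensor{{1, 1}, {64}};

    // (requestId, request) together describe one inference request
    (void)requestId;
    return 0;
}
```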
Optional tensors that can be supplied to `InferenceRequest` are shown below. Default values, where applicable, are specified:
| Name | Shape | Type | Description |
|---|---|---|---|
| `streaming` | [1] | `bool` | (Default=`false`) When `true`, stream out tokens as they are generated; when `false`, return only when the full generation has completed |
| `beam_width` | [1] | `int32_t` | (Default=1) Beam width for this request; set to 1 for greedy sampling |
| `temperature` | [1] | `float` | Sampling Config param: `temperature` |
| `runtime_top_k` | [1] | `int32_t` | Sampling Config param: `topK` |
| `runtime_top_p` | [1] | `float` | Sampling Config param: `topP` |
| `len_penalty` | [1] | `float` | Sampling Config param: `lengthPenalty` |
| `repetition_penalty` | [1] | `float` | Sampling Config param: `repetitionPenalty` |
| `min_length` | [1] | `int32_t` | Sampling Config param: `minLength` |
| `presence_penalty` | [1] | `float` | Sampling Config param: `presencePenalty` |
| `frequency_penalty` | [1] | `float` | Sampling Config param: `frequencyPenalty` |
| `random_seed` | [1] | `uint64_t` | Sampling Config param: `randomSeed` |
| `end_id` | [1] | `int32_t` | End token Id |
| `pad_id` | [1] | `int32_t` | Pad token Id |
| `embedding_bias` | [1] | `float` | Embedding bias |
| `bad_words_list` | [2, num_bad_words] | `int32_t` | Bad words list |
| `stop_words_list` | [2, num_stop_words] | `int32_t` | Stop words list |
| `prompt_embedding_table` | [1] | `float16` | P-tuning prompt embedding table |
| `prompt_vocab_size` | [1] | `int32_t` | P-tuning prompt vocab size |
| `lora_weights` | [num_lora_modules_layers, D x Hi + Ho x D] | model data type | Weights for a LoRA adapter. See the LoRA docs for more details |
| `lora_config` | [3] | `int32_t` | LoRA configuration tensor |
| `return_log_probs` | [1] | `bool` | When `true`, include log probs in the output |
| `return_context_logits` | [1] | `bool` | When `true`, include context logits in the output |
| `return_generation_logits` | [1] | `bool` | When `true`, include generation logits in the output |
| `draft_input_ids` | [num_draft_tokens] | `int32_t` | Draft tokens to be leveraged in the generation phase to potentially generate multiple output tokens in one inflight batching iteration |
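Continuing the hypothetical sketch from above, optional tensors are added to the same map under the names in the table; entries left unset fall back to their documented defaults. Only a few of the `int32_t`-typed tensors are shown, and all values here are illustrative; float- and bool-typed entries (`temperature`, `streaming`, ...) would need holders matching the dtype column.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Same hypothetical int32 tensor holder as in the previous sketch.
struct Tensor {
    std::vector<int64_t> shape;
    std::vector<int32_t> data;
};

// Populate a few of the optional, int32_t-typed tensors from the table.
void addOptionalTensors(std::map<std::string, Tensor>& request) {
    // "beam_width": shape [1]; defaults to 1 (greedy sampling) if omitted
    request["beam_width"] = Tensor{{1}, {4}};

    // "end_id" / "pad_id": shape [1]; model-specific special token ids
    request["end_id"] = Tensor{{1}, {2}};
    request["pad_id"] = Tensor{{1}, {0}};

    // "draft_input_ids": shape [num_draft_tokens]; draft tokens used to
    // speculatively produce several tokens in one inflight batching step
    request["draft_input_ids"] = Tensor{{3}, {1012, 2023, 2003}};
}

int main() {
    std::map<std::string, Tensor> request;
    addOptionalTensors(request);
    return 0;
}
```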