From 5347b4949332682dca3811dc4d0d17208be40d2a Mon Sep 17 00:00:00 2001
From: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Date: Tue, 5 Nov 2024 14:01:36 +0800
Subject: [PATCH] update llm api reference page. (#2410)

---
 llm-api/reference.html | 1236 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 1233 insertions(+), 3 deletions(-)

diff --git a/llm-api/reference.html b/llm-api/reference.html
index ad68b6ef3f..01fed0683f 100644
--- a/llm-api/reference.html
+++ b/llm-api/reference.html
@@ -17,7 +17,7 @@
@@ -64,7 +64,196 @@
LLM API
Contents

- LLM
- RequestOutput
- SamplingParams: __init__(), add_special_tokens, bad, bad_token_ids, beam_search_diversity_rate, beam_width, early_stopping, embedding_bias, end_id, exclude_input_from_output, external_draft_tokens_config, frequency_penalty, include_stop_str_in_output, length_penalty, logits_post_processor_name, max_new_tokens, max_tokens, min_length, min_tokens, no_repeat_ngram_size, pad_id, presence_penalty, prompt_tuning_config, random_seed, repetition_penalty, return_context_logits, return_encoder_output, return_generation_logits, return_log_probs, seed, setup(), stop, stop_token_ids, temperature, top_k, top_p, top_p_decay, top_p_min, top_p_reset_ids
- KvCacheConfig
- SchedulerConfig
- CapacitySchedulerPolicy
- BuildConfig: __init__(), auto_parallel_config, dry_run, enable_debug_output, force_num_profiles, from_dict(), from_json_file(), gather_context_logits, gather_generation_logits, input_timing_cache, kv_cache_type, lora_config, max_batch_size, max_beam_width, max_draft_len, max_encoder_input_len, max_input_len, max_num_tokens, max_prompt_embedding_table_size, max_seq_len, opt_batch_size, opt_num_tokens, output_timing_cache, plugin_config, profiling_verbosity, speculative_decoding_mode, strongly_typed, to_dict(), update(), update_from_dict(), update_kv_cache_type(), use_fused_mlp, use_refit, use_strip_plan, visualize_network, weight_sparsity, weight_streaming
- QuantConfig: __init__(), clamp_val, exclude_modules, from_dict(), get_modelopt_kv_cache_dtype(), get_modelopt_qformat(), get_quant_cfg(), group_size, has_zero_point, kv_cache_quant_algo, layer_quant_mode, pre_quant_scale, quant_algo, quant_mode, requires_calibration, requires_modelopt_quantization, smoothquant_val, to_dict(), use_plugin_sq
- QuantAlgo: FP8, FP8_PER_CHANNEL_PER_TOKEN, INT8, MIXED_PRECISION, NO_QUANT, W4A16, W4A16_AWQ, W4A16_GPTQ, W4A8_AWQ, W8A16, W8A8_SQ_PER_CHANNEL, W8A8_SQ_PER_CHANNEL_PER_TENSOR_PLUGIN, W8A8_SQ_PER_CHANNEL_PER_TOKEN_PLUGIN, W8A8_SQ_PER_TENSOR_PER_TOKEN_PLUGIN, W8A8_SQ_PER_TENSOR_PLUGIN
- CalibConfig
- BuildCacheConfig
- RequestError

LLM
Bases: object
The LLM class is the main class for running an LLM model.
model (str) – The model name or a local path to the model directory. It can be a Hugging Face (HF) model name, a local path to an HF model, or a local path to a TRT-LLM engine or checkpoint.
tokenizer (Optional[Union[str, Path, TokenizerBase, PreTrainedTokenizerBase]]) – The tokenizer name or a local path to the tokenizer directory.
skip_tokenizer_init (bool) – If True, skip initialization of the tokenizer and detokenizer; generate and generate_async will then accept prompt token ids as input only.
tensor_parallel_size (int) – The number of processes for tensor parallelism.
dtype (str) – The data type for the model weights and activations.
trust_remote_code (bool, default=False) – Whether to trust remote code when downloading the model and tokenizer from a hub such as Hugging Face.
revision (Optional[str]) – The revision of the model.
tokenizer_revision (Optional[str]) – The revision of the tokenizer.
auto_parallel (bool, default=False) – Enable auto parallel mode.
pipeline_parallel_size (int, default=1) – The pipeline parallel size.
enable_lora (bool, default=False) – Enable LoRA adapters.
max_lora_rank (int, default=None) – Maximum LoRA rank. If specified, it overrides build_config.lora_config.max_lora_rank.
max_loras (int, default=4) – Maximum number of LoRA adapters to be stored in GPU memory.
max_cpu_loras (int, default=4) – Maximum number of LoRA adapters to be stored in CPU memory.
build_config (BuildConfig, default=BuildConfig()) – The build configuration for the model. Default is an empty BuildConfig instance.
quant_config (QuantConfig, default=QuantConfig()) – The quantization configuration for the model. Default is an empty QuantConfig instance.
calib_config (CalibConfig, default=CalibConfig()) – The calibration configuration for the model.
embedding_parallel_mode (str, default="SHARDING_ALONG_VOCAB") – The parallel mode for embeddings.
share_embedding_table (bool, default=False) – Whether to share the embedding table.
kv_cache_config (KvCacheConfig, optional) – The key-value cache configuration for the model. Default is None.
peft_cache_config (PeftCacheConfig, optional) – The PEFT cache configuration for the model. Default is None.
decoding_config (DecodingConfig, optional) – The decoding configuration for the model. Default is None.
logits_post_processor_map (Dict[str, Callable], optional) – A map of logits post-processing functions. Default is None.
scheduler_config (SchedulerConfig, default=SchedulerConfig()) – The scheduler configuration for the model. Default is an empty SchedulerConfig instance.
normalize_log_probs (bool, default=False) – Whether to normalize log probabilities for the model.
iter_stats_max_iterations (int, optional) – The maximum number of iterations for iteration statistics. Default is None.
request_stats_max_iterations (int, optional) – The maximum number of iterations for request statistics. Default is None.
batching_type (BatchingType, optional) – The batching type for the model. Default is None.
enable_build_cache (bool or BuildCacheConfig, optional) – Whether to enable build caching for the model. Default is None.
enable_tqdm (bool, default=False) – Whether to display a progress bar during model building.
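As a quick illustration of how the constructor arguments above fit together, here is a minimal sketch of building an LLM from a Hugging Face model name and running synchronous generation. The model name is only an example; any HF model, local checkpoint, or prebuilt engine path works for model.

    from tensorrt_llm import LLM, SamplingParams

    # Build (or load) the engine; unspecified arguments keep the defaults above.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
              tensor_parallel_size=1)

    # A default SamplingParams is used if none is passed; set a few fields here.
    params = SamplingParams(max_tokens=32, temperature=0.8)

    for output in llm.generate(["Hello, my name is"], params):
        print(output.outputs[0].text)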
LLM.generate()

Generate output for the given prompts in synchronous mode. Synchronous generation accepts either a single prompt or a batch of prompts.
inputs (Union[PromptInputs, Sequence[PromptInputs]]) – The prompt text or token ids. Note that it must be a single prompt or a batch of prompts.
sampling_params (Optional[Union[SamplingParams, List[SamplingParams]]]) – The sampling params for the generation; a default one will be used if not provided.
use_tqdm (bool) – Whether to use tqdm to display the progress bar.
lora_request (Optional[Union[LoRARequest, Sequence[LoRARequest]]]) – The LoRA request to use for generation, if any.
Returns: The output data of the completion request to the LLM.

Return type: Union[RequestOutput, List[RequestOutput]]
LLM.generate_async()

Generate output for the given prompt in asynchronous mode. Asynchronous generation accepts a single prompt only.
inputs (PromptInputs) – The prompt text or token ids; must be a single prompt.
sampling_params (Optional[SamplingParams]) – The sampling params for the generation; a default one will be used if not provided.
lora_request (Optional[LoRARequest]) – The LoRA request to use for generation, if any.
streaming (bool) – Whether to use streaming mode for the generation.
Returns: The output data of the completion request to the LLM.
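A sketch of streaming generation with generate_async. It assumes, based on the released LLM API examples, that with streaming=True the returned object can be iterated with async for, yielding progressively longer RequestOutput snapshots; verify against your version.

    import asyncio

    from tensorrt_llm import LLM, SamplingParams

    async def main() -> None:
        llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
        # streaming=True is assumed to make the result async-iterable.
        async for output in llm.generate_async(
                "Hello, my name is",
                sampling_params=SamplingParams(max_tokens=32),
                streaming=True):
            print(output.outputs[0].text)

    asyncio.run(main())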
LLM.save()

Save the built engine to the given path.

engine_dir (str) – The path to save the engine.

Returns: None
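Pairing save with the model argument above: once an engine has been built, it can be written out and reloaded directly, skipping the build step on later runs. The directory name is illustrative.

    from tensorrt_llm import LLM

    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    llm.save("./tinyllama_engine")   # persist the built engine

    # Later runs can point `model` at the saved engine directory.
    llm2 = LLM(model="./tinyllama_engine")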
RequestOutput

Bases: GenerationResult

The output data of a completion request to the LLM.

request_id (int) – The unique ID of the request.
prompt (str) – The prompt string of the request.
prompt_token_ids (List[int]) – The token ids of the prompt.
outputs (List[CompletionOutput]) – The output sequences of the request.
context_logits (torch.Tensor) – The logits on the prompt token ids.
finished (bool) – Whether the whole request is finished.
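Accessing these attributes on the objects returned by generate(); the CompletionOutput.text field is assumed from the usual completion-output layout.

    from tensorrt_llm import LLM, SamplingParams

    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    for output in llm.generate(["The capital of France is"],
                               SamplingParams(max_tokens=8)):
        print(output.request_id)     # unique ID of the request
        print(output.prompt)         # the prompt string
        for seq in output.outputs:   # List[CompletionOutput]
            print(seq.text)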
SamplingParams

Bases: object
Sampling parameters for text generation.
end_id (int) – The end token id.
pad_id (int) – The pad token id.
max_tokens (int) – The maximum number of tokens to generate.
max_new_tokens (int) – The maximum number of tokens to generate. This argument is being deprecated; please use max_tokens instead.
bad (Union[str, List[str]]) – A string or a list of strings that redirect the generation when they are generated, so that the bad strings are excluded from the returned output.
bad_token_ids (List[int]) – A list of token ids that redirect the generation when they are generated, so that the bad ids are excluded from the returned output.
stop (Union[str, List[str]]) – A string or a list of strings that stop the generation when they are generated. The returned output will not contain the stop strings unless include_stop_str_in_output is True.
stop_token_ids (List[int]) – A list of token ids that stop the generation when they are generated.
include_stop_str_in_output (bool) – Whether to include the stop strings in the output text. Defaults to False.
embedding_bias (torch.Tensor) – The embedding bias tensor. Expected type is kFP32 and shape is [vocab_size].
external_draft_tokens_config (ExternalDraftTokensConfig) – The speculative decoding configuration.
prompt_tuning_config (PromptTuningConfig) – The prompt tuning configuration.
logits_post_processor_name (str) – The logits postprocessor name. Must correspond to one of the logits postprocessor names provided to the ExecutorConfig.
beam_width (int) – The beam width. Default is 1, which disables beam search.
top_k (int) – Controls the number of logits to sample from. Default is 0 (all logits).
top_p (float) – Controls the top-P probability to sample from. Default is 0.0.
top_p_min (float) – Controls decay in the top-P algorithm; top_p_min is the lower bound. Default is 1.e-6.
top_p_reset_ids (int) – Controls decay in the top-P algorithm. Indicates where to reset the decay. Default is 1.
top_p_decay (float) – Controls decay in the top-P algorithm. The decay value. Default is 1.0.
seed (int) – Controls the random seed used by the random number generator in sampling.
random_seed (int) – Controls the random seed used by the random number generator in sampling. This argument is being deprecated; please use seed instead.
temperature (float) – Controls the modulation of logits when sampling new tokens. It can have values > 0.0. Default is 1.0.
min_tokens (int) – Lower bound on the number of tokens to generate. Values < 1 have no effect. Default is 1.
min_length (int) – Lower bound on the number of tokens to generate. Values < 1 have no effect. Default is 1. This argument is being deprecated; please use min_tokens instead.
beam_search_diversity_rate (float) – Controls the diversity in beam search.
repetition_penalty (float) – Used to penalize tokens based on how often they appear in the sequence. It can have any value > 0.0. Values < 1.0 encourage repetition, values > 1.0 discourage it. Default is 1.0.
presence_penalty (float) – Used to penalize tokens already present in the sequence (irrespective of the number of appearances). It can have any value. Values < 0.0 encourage repetition, values > 0.0 discourage it. Default is 0.0.
frequency_penalty (float) – Used to penalize tokens already present in the sequence (dependent on the number of appearances). It can have any value. Values < 0.0 encourage repetition, values > 0.0 discourage it. Default is 0.0.
length_penalty (float) – Controls how to penalize longer sequences in beam search. Default is 0.0.
early_stopping (int) – Controls whether the generation process finishes once beam_width sentences are generated (sentences that end with an end_token).
no_repeat_ngram_size (int) – Controls how large repeated n-grams may be. Default is 1 << 30.
return_log_probs (bool) – Controls whether the Result should contain log probabilities. Default is False.
return_context_logits (bool) – Controls whether the Result should contain the context logits. Default is False.
return_generation_logits (bool) – Controls whether the Result should contain the generation logits. Default is False.
exclude_input_from_output (bool) – Controls whether the input tokens are excluded from the output tokens in the Result. Default is True.
return_encoder_output (bool) – Controls whether the Result should contain encoder output hidden states (for encoder-only and encoder-decoder models). Default is False.
add_special_tokens (bool) – Whether to add special tokens to the prompt.
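A construction sketch using a handful of the fields above; the values are illustrative.

    from tensorrt_llm import SamplingParams

    params = SamplingParams(
        max_tokens=64,     # stop after 64 generated tokens
        temperature=0.7,   # > 0.0; 1.0 leaves logits unmodulated
        top_k=40,          # sample from the 40 highest logits (0 = all)
        top_p=0.9,         # nucleus sampling threshold
        stop=["\n\n"],     # strings that end generation
        seed=1234,         # reproducible sampling
    )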
KvCacheConfig

Bases: pybind11_object
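KvCacheConfig's fields are not expanded on this page; as an illustrative sketch, the import path and the free_gpu_memory_fraction field below are assumptions drawn from the executor bindings, so check your version.

    from tensorrt_llm import LLM
    from tensorrt_llm.llmapi import KvCacheConfig  # assumed import path

    # Cap the KV cache at 80% of free GPU memory (assumed field name).
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
              kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.8))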
SchedulerConfig

Bases: pybind11_object

__init__()

Overloaded function.

1. __init__(self: tensorrt_llm.bindings.executor.SchedulerConfig, capacity_scheduler_policy: tensorrt_llm.bindings.executor.CapacitySchedulerPolicy = CapacitySchedulerPolicy.GUARANTEED_NO_EVICT) -> None

2. __init__(self: tensorrt_llm.bindings.executor.SchedulerConfig, capacity_scheduler_policy: tensorrt_llm.bindings.executor.CapacitySchedulerPolicy, context_chunking_policy: Optional[tensorrt_llm.bindings.executor.ContextChunkingPolicy]) -> None
CapacitySchedulerPolicy

Bases: pybind11_object

Members:

MAX_UTILIZATION
GUARANTEED_NO_EVICT
STATIC_BATCH
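Combining the two classes above: a SchedulerConfig built via the first overload, selecting one of the CapacitySchedulerPolicy members, and handed to the LLM constructor. The import path follows the signatures shown above.

    from tensorrt_llm import LLM
    from tensorrt_llm.bindings.executor import (CapacitySchedulerPolicy,
                                                SchedulerConfig)

    # First overload: only the capacity scheduler policy is supplied.
    scheduler_config = SchedulerConfig(
        capacity_scheduler_policy=CapacitySchedulerPolicy.MAX_UTILIZATION)

    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
              scheduler_config=scheduler_config)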
BuildConfig

Bases: object
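BuildConfig's fields are listed in the contents at the top of this page; here is a sketch of setting a few common build limits, assuming the top-level export of BuildConfig and illustrative values.

    from tensorrt_llm import LLM, BuildConfig

    # Field names follow the BuildConfig attributes in the contents above.
    build_config = BuildConfig(max_batch_size=8,
                               max_input_len=1024,
                               max_seq_len=2048)
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
              build_config=build_config)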
QuantConfig

Bases: object

Serializable quantization configuration class, part of the PretrainedConfig.
QuantAlgo

Bases: StrEnum

An enumeration of the supported quantization algorithms.
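Putting QuantConfig and QuantAlgo together: FP8 weight and KV-cache quantization, using the quant_algo and kv_cache_quant_algo fields from the contents above. The tensorrt_llm.llmapi import path is assumed; adjust for your version.

    from tensorrt_llm import LLM
    from tensorrt_llm.llmapi import QuantAlgo, QuantConfig  # assumed path

    quant_config = QuantConfig(quant_algo=QuantAlgo.FP8,
                               kv_cache_quant_algo=QuantAlgo.FP8)
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
              quant_config=quant_config)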
CalibConfig

Bases: object

Calibration configuration.
device (Literal['cuda', 'cpu'], default='cuda') – The device to run calibration on.
calib_dataset (str, default='cnn_dailymail') – The name or local path of the calibration dataset.
calib_batches (int, default=512) – The number of batches to run calibration on.
calib_batch_size (int, default=1) – The batch size used for calibration.
calib_max_seq_length (int, default=512) – The maximum sequence length used for calibration.
random_seed (int, default=1234) – The random seed used for calibration.
tokenizer_max_seq_length (int, default=2048) – The maximum sequence length used to initialize the tokenizer for calibration.
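Calibration settings matter for quantization modes that require calibration (see QuantConfig.requires_calibration in the contents above). A sketch pairing a CalibConfig with a W4A16_AWQ QuantConfig, with illustrative values and the assumed tensorrt_llm.llmapi import path:

    from tensorrt_llm import LLM
    from tensorrt_llm.llmapi import CalibConfig, QuantAlgo, QuantConfig

    calib_config = CalibConfig(calib_batches=256,
                               calib_max_seq_length=1024)
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
              quant_config=QuantConfig(quant_algo=QuantAlgo.W4A16_AWQ),
              calib_config=calib_config)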
BuildCacheConfig

Bases: object

Configuration for the build cache.
The root directory for the build cache.

Type: str

The maximum number of records to store in the cache.

Type: int

The maximum amount of storage (in GB) to use for the cache.

Type: float
Note

The build cache assumes the weights of the model are not changed during execution. If the weights are changed, you should remove the caches manually.
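The LLM constructor's enable_build_cache parameter (see the parameter list above) accepts either a bool or a BuildCacheConfig; the simplest form is a sketch like:

    from tensorrt_llm import LLM

    # Cache built engines so identical later runs skip the build step.
    # Per the note above, clear the cache manually if the weights change.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
              enable_build_cache=True)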
RequestError

Bases: RuntimeError

The error raised when a request fails.
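Since RequestError derives from RuntimeError, failed requests can be caught around generation calls; a sketch (the tensorrt_llm.llmapi import path is an assumption):

    from tensorrt_llm import LLM, SamplingParams
    from tensorrt_llm.llmapi import RequestError  # assumed import path

    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    try:
        outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
    except RequestError as err:
        # The request failed inside the runtime; surface the reason.
        print(f"generation request failed: {err}")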