mirror of
https://github.com/ggml-org/llama.cpp.git
synced 2026-06-29 15:50:22 +00:00
b7666
7666 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
5b8844ae53 | scripts : fix repos cloned with .git extension (#18669) b7666 | ||
|
|
7e16fef085 | convert : more variants of rope_theta config entries (#18668) | ||
|
|
f5245b5e4e |
cuda : fix build on cuda 12.8 (#18672)
compute121 requires 12.9
Signed-off-by: Oliver Walsh <owalsh@redhat.com>
b7664 |
||
|
|
ae9f8df778 |
fix(docker): add missing libglvnd libraries to Vulkan image (#18664)
Add libglvnd0, libgl1, libglx0, libegl1, libgles2 to the Vulkan Dockerfile base image. These libraries are required by mesa-vulkan-drivers to properly initialize the Vulkan ICD and detect GPU devices. Without these libraries, vkEnumeratePhysicalDevices() returns an empty list, resulting in "ggml_vulkan: No devices found." error. Fixes #17761 |
||
|
|
56d2fed2b3 |
tools : remove llama-run (#18661)
* tools : remove llama-run
* Remove licenses/LICENSE-linenoise
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b7662 |
||
|
|
56426673cb |
scripts : add pr2wt.sh (#18644)
* scripts : add pr2wt.sh * script : shebang Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> |
||
|
|
bb77764c2d |
convert : clarify sentence-transformers-dense-modules help [no ci] (#18662)
* convert : clarify sentence-transformers-dense-modules help [no ci]
This commit updates this option's help message, which currently looks
like this:
```console
--sentence-transformers-dense-modules
Whether to include sentence-transformers dense modules.It can be used for sentence-transformers models, like
google/embeddinggemma-300mDefault these modules are not included.
```
|
||
|
|
9dfa8ee950 | ci : run cann build unconditionally [no ci] (#18659) | ||
|
|
ca4a8370bc | vulkan: reject ops when a tensor is too large to allocate (#18646) b7658 | ||
|
|
03023296cf |
vulkan: Warptile tuning for Intel Xe2/Xe3 (#18178)
* modify warptile tuning for xe3
* intel vendor check w/ coopmat support
* fix back formatting
* fix formatting change 2
* move intel check to chip specific tuning part
* Change to support both windows and linux
* modify m_warptile to l_warptile for intel
* modify warptile tuning for bf16 matmuls to fix regression (m_warptile to l_warptile)
* Code style changes
* Code style changes (2)
* Code style changes (3)
b7657 |
||
|
|
8c77a04cc7 |
vulkan: more mul mat optimizations (#18533)
* q4_k
* q5_k
* q2_k
* q4_1
* q5_1
* better buf index
b7656 |
||
|
|
ffba4f29e6 |
examples : add debug utility/example (#18464)
* examples : add debug utility/example
This commit introduces a new example named llama-debug which is a
utility that is intended to be used to assist with developing/debugging
a converted model.
The motivation for this utility is to assist in model conversion work
to verify that the model produces the expected outputs. It is intended
to replace logits.cpp in examples/model-conversion.
Example usage:
```console
./build/bin/llama-debug \
-m models/Qwen2.5-0.5B-Instruct.gguf \
--prompt "Hello, my name is" \
--save-logits
...
Model add_bos: false
Input prompt: "Hello, my name is"
Token ids (5):
Hello(9707) ,(11) my(847) name(829) is(374)
Data saved to data/llamacpp-Qwen2.5-0.5B-Instruct.bin
Data saved to data/llamacpp-Qwen2.5-0.5B-Instruct.txt
Prompt saved to data/llamacpp-Qwen2.5-0.5B-Instruct-prompt.txt
Tokens saved to data/llamacpp-Qwen2.5-0.5B-Instruct-tokens.bin
```
For more details about the options available for this example, please
refer to examples/debug/README.md.
* throw runtime error instead of logging error
* remove params.warmup and enable the warmup/nowarmup option
* model-conversion : remove logits.cpp
This commit removes logits.cpp in favor of using llama-debug for
generating logits and embeddings.
* examples : remove model-conversion directory
This was missed in the previous commit.
* model-conversion : add support for saving prompt and token ids
This commit adds support for storing the prompt and the token ids for the
prompt when running the original models.
The motivation is that this allows us to compare the prompt and the
tokens generated for the prompt when verifying the converted model.
Currently, even if the same prompt is used, the generated tokens can
differ if the tokenization differs between the original and converted
model, and this would go unnoticed (the verification will most likely
fail, but it might not be obvious why).
* squash! model-conversion : add support for saving prompt and token ids
fix pyright errors.
* model-conversion : add compare_tokens utility
This commit adds a script to compare token outputs between original and
converted models.
Example usage:
```console
(venv) $ ./scripts/utils/compare_tokens.py pytorch-gemma-3-270m-it llamacpp-gemma-3-270m-it-bf16
Comparing tokens between:
Original : pytorch-gemma-3-270m-it (6 tokens)
Converted: llamacpp-gemma-3-270m-it-bf16 (6 tokens)
✅ All 6 tokens match!
```
And there is a verbose flag that will also print out the prompts:
```console
(venv) $ ./scripts/utils/compare_tokens.py pytorch-gemma-3-270m-it llamacpp-gemma-3-270m-it-bf16 -v
Original model prompt (pytorch-gemma-3-270m-it):
prompt: Hello, my name is
n_tokens: 6
token ids: 2, 9259, 236764, 1041, 1463, 563
Converted model prompt (llamacpp-gemma-3-270m-it-bf16):
prompt: Hello, my name is
n_tokens: 6
token ids: 2, 9259, 236764, 1041, 1463, 563
Comparing tokens between:
Original : pytorch-gemma-3-270m-it (6 tokens)
Converted: llamacpp-gemma-3-270m-it-bf16 (6 tokens)
✅ All 6 tokens match!
```
* model-conversion : add token comparison to verification scripts
This commit adds a call to the compare_tokens function in
compare-logits.py and semantic_check.py to ensure that the token ids
that the tokenizers produce are the same before proceeding with
verifying the logits/embeddings.
Placing the calls in the existing scripts instead of calling them separately
ensures that the token comparison is always done prior to the
logit/embedding verifications.
A follow-up commit/PR could refactor the causal logits verification into
a single script instead of the two that exist now. This would reduce the
code and make it consistent with the embeddings verification, which only
has a single script.
* debug : use llama_model_n_embd_out
This commit updates the debug example to use the new function
llama_model_n_embd_out instead of llama_model_n_embd.
The motivation for this change is to support late interaction retriever
models, like LFM2-ColBert-350M, where the output embeddings are down
projected to a lower dimension.
* debug : add print_usage function
This commit adds a print_usage function that is passed to the
common_params_parse.
The motivation for this is that this enables a specific usage message
which will be printed after all the options, for example:
```console
example usage:
Print tensors:
./build/bin/llama-debug -m model.gguf -p "Hello my name is" --verbose
The tensors to be printed can be filtered with --tensor-filter option.
Save logits/embeddings:
./build/bin/llama-debug -m model.gguf -p "Hello my name is" --save-logits
Add --embedding to save embeddings
```
b7655
|
||
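The token comparison described in this commit can be pictured with a minimal Python sketch. This is not the repository's `compare_tokens.py`; the helper name `first_token_mismatch` is hypothetical, and the ids are the ones shown in the verbose output above.

```python
def first_token_mismatch(original, converted):
    """Return the index of the first differing token id, or None if the
    two sequences are identical (hypothetical helper, for illustration)."""
    for i, (a, b) in enumerate(zip(original, converted)):
        if a != b:
            return i
    # Same common prefix but different lengths: mismatch at the shorter length.
    if len(original) != len(converted):
        return min(len(original), len(converted))
    return None


original = [2, 9259, 236764, 1041, 1463, 563]
converted = [2, 9259, 236764, 1041, 1463, 563]

if first_token_mismatch(original, converted) is None:
    print(f"All {len(original)} tokens match")
```

Running a check like this before comparing logits makes a tokenizer difference show up as its own failure, rather than as an unexplained logits mismatch.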
|
|
3333951d86 |
CANN: Fix rename for get_env (#18652)
In #18624, get_env in ggml-cann was renamed to get_env_as_lowercase to accurately reflect the function’s behavior and reduce the chance of misuse. However, the update missed renaming call sites in other files. This commit fixes that oversight.
b7654 |
||
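To illustrate why the new name is clearer, here is a small Python analogue of an environment getter that lowercases its result. It is only a sketch of the behavior the name describes, not the CANN implementation; the variable name `MY_FLAG` is made up for the example.

```python
import os


def get_env_as_lowercase(name):
    """Return the environment variable's value in lowercase, or None if unset."""
    value = os.environ.get(name)
    return value.lower() if value is not None else None


os.environ["MY_FLAG"] = "ON"  # hypothetical variable, for illustration only
print(get_env_as_lowercase("MY_FLAG"))
```

A caller that compares the result against a mixed-case string would silently fail, which is the kind of misuse the rename is meant to discourage.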
|
|
193ee38a1b |
CANN: Rename get_env to get_env_as_lowercase (#18624)
b7653
|
||
|
|
95ea9e0861 |
Hexagon add support for f16/f32 flash attention, scale, set-rows and improve f16/32 matmul (#18611)
* hexagon: improve fp16 matmul and add fp32/fp16 flash-attention
* hexagon: add support for set-rows fp32 -> fp16 with i32/i64 row-idx
* hexagon: add support for SCALE fp32
* hexagon: replace scalar fp32 -> fp16 copy with HVX
* hexagon: optimize flash_atten_ext with aligned VTCM buffers and DMA
- Implements double-buffered DMA prefetching for K, V, and Mask tensors.
- Ensures K and V rows in VTCM are padded to 128 bytes to support aligned HVX operations.
- Correctly synchronizes DMA transfers to prevent race conditions.
- Uses `FLASH_ATTN_BLOCK_SIZE` of 128 for efficient chunking.
* hexagon: use aligned mad_f16
* hexagon: flash_atten more aligned ops
* hexagon: optimize scale_f32 hvx helpers
* hexagon: unroll fa loops
* hexagon: remove unused set-rows log
* hexagon: flash_attn_ext add support for DMAing Q
- Update `op_flash_attn_ext` to include Q row size in scratchpad allocation.
- Pad Q row size to 128 bytes for alignment.
- Implement DMA transfer for Q tensor in `flash_attn_ext_f16_thread`.
- Update dot product computations to use VTCM-buffered Q data.
* hexagon: fix handling of NANs hvx dotproducts
* hexagon: cleanup spad allocation in flash-atten
* hexagon: improve fp16/fp32 matmul
- Introduced `vec_dot_f16_f16` and `vec_dot_f16_f16_rx2` kernels using efficient HVX dot product intrinsics.
- Added `quantize_fp32_f16` to copy/convert weights from DDR to VTCM
- Updated `op_matmul` to use the optimized path when VTCM capacity allows and broadcasting requirements are compatible.
- Implemented fallback logic to the original implementation for complex broadcasting scenarios.
* hexagon: fix HVX_ARCH check
* hexagon: matmul cleanup and fp16 fixes
Use aligned vec_dot_f16 for 2d matmuls and unaligned version for 4d.
* hexagon: fix fp16 x fp16 matmuls and some minor refactoring
* hexagon: add support for GET_ROWS f32 -> f32
Also optimize SET_ROWS threading a bit when we have just a few rows to process.
* hexagon: optimize set-rows threading
* hexagon: update adb/run-bench.sh to properly support experimental and verbose options
* hexagon: flash_atten use aligned vectors for dot products
b7652 |
||
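The 128-byte row padding mentioned in this commit comes down to rounding a size up to the next multiple of the alignment. A minimal Python sketch of that arithmetic follows; the helper name `pad_to` and the example row sizes are assumptions, not taken from the Hexagon code.

```python
def pad_to(size_bytes, align=128):
    """Round size_bytes up to the next multiple of align (align > 0)."""
    return (size_bytes + align - 1) // align * align


# Example: a row of 80 fp16 values occupies 160 bytes and is padded to 256.
row_bytes = 80 * 2
print(pad_to(row_bytes))
```

Padding every row this way means each row starts on an aligned boundary, so aligned vector loads can be used without a special case for the row tail.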
|
|
ccbc84a537 |
mtmd: mtmd_audio_streaming_istft (#18645)
Change is decoupled from https://github.com/ggml-org/llama.cpp/pull/18641.
[LFM2.5-Audio-1.5B](https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B) needs streaming istft for generating output audio.
* add streaming ISTFT class (`mtmd_audio_streaming_istft`) with overlap-add for audio reconstruction
* replace global audio cache with per-instance cache, the model requires two independent caches, for preprocessing (audio input) and for istft (audio output).
* unified templated FFT/IFFT implementation supporting both forward and inverse transforms
b7651 |
||
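The overlap-add step of a streaming ISTFT can be sketched in a few lines of Python. This is only an illustration of the buffering idea and not the `mtmd_audio_streaming_istft` class: it assumes each incoming frame is already in the time domain (after the inverse FFT), and it omits the synthesis window and normalization that a real ISTFT applies. The class name `StreamingOverlapAdd` is hypothetical.

```python
class StreamingOverlapAdd:
    """Accumulate overlapping time-domain frames and emit `hop` samples per frame."""

    def __init__(self, frame_len, hop):
        self.frame_len = frame_len
        self.hop = hop
        self.buf = [0.0] * frame_len  # running sum of overlapping frames

    def push(self, frame):
        if len(frame) != self.frame_len:
            raise ValueError("unexpected frame length")
        # Add the new frame on top of what earlier frames left behind.
        for i, x in enumerate(frame):
            self.buf[i] += x
        # The first `hop` samples receive no further contributions: emit them.
        out = self.buf[:self.hop]
        # Shift the buffer left by `hop` and zero-fill the tail.
        self.buf = self.buf[self.hop:] + [0.0] * self.hop
        return out


ola = StreamingOverlapAdd(frame_len=4, hop=2)
first = ola.push([1.0, 1.0, 1.0, 1.0])
second = ola.push([1.0, 1.0, 1.0, 1.0])
print(first, second)
```

Keeping the buffer inside the object, rather than in a global, mirrors the per-instance cache described in the commit: two independent streams need two independent buffers.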
|
|
68b4d516c3 | llama-params-fit: fix last devices with low VRAM (#18494) b7650 | ||
|
|
24af22fc36 |
ggml : optimize cuda ssm_scan using warp-level reduction (#18505)
* ggml : optimize cuda ssm_scan using warp-level reduction
* ggml : apply code review suggestions (style, const, constexpr)
* ggml : add TODO regarding stride consistency
b7649 |
||
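A warp-level reduction combines the values held by the lanes of a warp in log2(N) steps, halving the stride each time. The Python simulation below sketches that access pattern for a sum; it is not the CUDA kernel from this commit. Lanes that would read past the end of the warp add 0 here, whereas a CUDA shuffle would return the lane's own value; in both cases lane 0 ends up holding the full sum.

```python
def warp_reduce_sum(lanes):
    """Simulate a shuffle-down style sum over a power-of-two number of lanes."""
    vals = list(lanes)
    n = len(vals)
    offset = n // 2
    while offset > 0:
        # Every lane i adds the value currently held by lane i + offset.
        vals = [vals[i] + (vals[i + offset] if i + offset < n else 0)
                for i in range(n)]
        offset //= 2
    return vals[0]  # lane 0 holds the reduced value


print(warp_reduce_sum(range(32)))
```

With 32 lanes the loop runs five times (offsets 16, 8, 4, 2, 1), compared with 31 additions for a sequential sum on a single lane.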
|
|
07fbe19f1f |
arg: use CSV escape style for multiple-value args (#18643)
* arg: use CSV escape style for multiple-value args
* add test
b7648 |
||
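As a rough illustration of CSV-style escaping for a multiple-value argument, the sketch below splits one argument string with Python's standard `csv` module. It assumes conventional CSV quoting (double quotes around a value, doubled quotes inside it); the exact rules implemented by this commit may differ, and `split_multi_value` is a hypothetical helper.

```python
import csv


def split_multi_value(arg):
    """Split a comma-separated argument, honoring CSV-style double quotes."""
    return next(csv.reader([arg]))


print(split_multi_value('a,"b,c",d'))
print(split_multi_value('"say ""hi""",x'))
```

The benefit of this style is that a value containing a comma can be passed without inventing a second separator character.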
|
|
ea13cba850 |
vulkan: support buffer_from_host_ptr (#18467)
* vulkan: support buffer_from_host_ptr
* hacky use of buffer_from_host_ptr for directio
* disable buffer_from_host_ptr cap
* use external memory for ggml_vk_host_malloc, revert model loader changes
* disable external_memory_host for MoltenVK
* take buffer memory types into account
* don't use external_memory_host for ggml_vk_host_malloc
b7647 |
||
|
|
090b137e56 |
ggml-cuda: refactor cuda graph usage (#18637)
* ggml-cuda: refactor cuda graph usage
* use is_enabled() instead of enabled
b7646 |
||
|
|
968929528c |
mmq.cu: tune mmq/rocblas switching for RDNA (#18537)
* Patch perf regression for mmq kernels in ROCm
recover performance regression for https://github.com/ggml-org/llama.cpp/issues/17917
* add n_experts branch like the cdna path
* mmq.cu: tune mmq/wmma switching for RDNA
* mmq.cu: move amd wmma mmq/wmma switching behind IS_RDNA3
* Update ggml/src/ggml-cuda/mmq.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Jiacheng (Jason) Chen <76919340+jiachengjason@users.noreply.github.com>
Co-authored-by: jiachengjason <jasonchen.jiacheng@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
b7645 |
||
|
|
3d26a09dc7 |
server : add thinking content blocks to Anthropic Messages API (#18551)
* server : add thinking content blocks to Anthropic Messages API
Add support for returning reasoning/thinking content in Anthropic API responses when using models with --reasoning-format deepseek and the thinking parameter enabled.
- Non-streaming: adds thinking block before text in content array
- Streaming: emits thinking_delta events with correct block indices
- Partial streaming: tracks reasoning state across chunks via anthropic_has_reasoning member variable
Tested with bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF model.
* server : fix Anthropic API streaming for thinking content blocks
Add signature field and fix duplicate content_block_start events in Anthropic Messages API streaming responses for reasoning models.
* server: refactor Anthropic streaming state to avoid raw pointer
Replace raw pointer to task_result_state with direct field copies:
- Copy state fields in update() before processing chunk
- Use local copies in to_json_anthropic() instead of dereferencing
- Pre-compute state updates for next chunk in update()
This makes the data flow clearer and avoids unsafe pointer patterns.
b7644 |
||
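For the non-streaming case, the ordering described above (thinking block first, then text) can be sketched as follows. This is an illustrative Python helper, not the server code: `build_content` is a hypothetical name, and the empty default for `signature` is an assumption made for the example.

```python
def build_content(text, reasoning=None, signature=""):
    """Build a Messages-style content array with an optional thinking block."""
    content = []
    if reasoning:
        # The thinking block goes before the text block.
        content.append({
            "type": "thinking",
            "thinking": reasoning,
            "signature": signature,
        })
    content.append({"type": "text", "text": text})
    return content


blocks = build_content("Stockholm.", reasoning="The user asks for a capital city.")
print([block["type"] for block in blocks])
```

In the streaming case the same ordering determines the block indices: when reasoning is present, the thinking block takes index 0 and the text block index 1.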
|
|
bd2a93d475 | gguf-py : add requests to dependencies (#18629) | ||
|
|
e75ee11024 |
ggml : fix avx512bf16 build (#18623)
- include `immintrin.h` when required
- remove unused m512bh
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
b7642 |
||
|
|
da9b8d3300 |
CANN: Make valid_values variable static const (#18627)
b7641
|
||
|
|
e443fbcfa5 |
ggml webgpu: add CEIL operation support (#18605)
* ggml-webgpu: add CEIL operation support
Add support for the CEIL unary operation in the WebGPU backend:
- Add CEIL_FUNC shader template in unary_op.wgsl
- Add 4 shader variants (f32, f16, inplace versions)
- Initialize CEIL pipelines in ggml-webgpu.cpp
- Register CEIL in supports_op function
* docs: update WebGPU ops support for CEIL
b7640
|
||
|
|
73d284a250 |
model : add LFM2-ColBert-350M (#18607)
* model : add LFM2-ColBert-350M
* llama_model_n_embd_out() - returns `hparams.n_embd_out` if set and falls back to `hparams.n_embd` otherwise
b7639 |
||
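The fallback behavior of `llama_model_n_embd_out()` can be sketched in Python as below. The `HParams` class is a stand-in for illustration, and treating `0` as "not set" is an assumption of this sketch rather than something stated in the commit; the sizes are arbitrary.

```python
from dataclasses import dataclass


@dataclass
class HParams:
    n_embd: int          # hidden size of the model
    n_embd_out: int = 0  # output embedding size; 0 means "not set" (assumption)


def model_n_embd_out(hparams):
    """Return n_embd_out when set, otherwise fall back to n_embd."""
    return hparams.n_embd_out if hparams.n_embd_out > 0 else hparams.n_embd


print(model_n_embd_out(HParams(n_embd=1024)))
print(model_n_embd_out(HParams(n_embd=1024, n_embd_out=128)))
```

Callers that size output buffers from this value work unchanged for ordinary models and pick up the smaller dimension for models that down-project their output embeddings.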
|
|
df17a4c94f | CUDA: fix FA FP16 accumulator overflow for Granite (#18614) b7638 | ||
|
|
1871f0ba56 |
add YoutuVLForConditionalGeneration architectures (#18620)
* Support Youtu-VL Model --------- Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> |
||
|
|
f47edb8c19 |
ggml-cuda: check for srcs outside the cgraph (#18583)
* ggml-cuda: check for srcs outside the cgraph
* review: use leafs instead
b7636 |
||
|
|
da143b9940 | server : fix router child env in containerized environments (#18562) b7635 | ||
|
|
f1768d8f03 | vulkan: fix topk_moe_sigmoid_norm_bias failures in GLM-4.6 (#18582) b7634 | ||
|
|
2da64a2f8a |
models : fix backend assignment for Granite/Nemotron graphs (#18599)
* models : fix backend assignment for Granite/Nemotron graphs
* cont : add ref
* cont : move call to build_inp_embd()
b7633 |
||
|
|
b37124d2d2 |
vulkan: handle quantize_q8_1 overflowing the max workgroup count (#18515)
* vulkan: handle quantize_q8_1 overflowing the max workgroup count
* vulkan: Fix small tile size matmul on lavapipe
* fix mul_mat_id failures
b7632 |
||
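One common way to cope with a dispatch that exceeds a device's maximum workgroup count is to split it into several smaller dispatches. The Python sketch below shows only that splitting arithmetic; whether this commit uses chunking or a different strategy is not stated here, and `split_dispatch` is a hypothetical helper.

```python
def split_dispatch(total_groups, max_groups):
    """Split a dispatch into (first_group, group_count) chunks within the limit."""
    if max_groups <= 0:
        raise ValueError("max_groups must be positive")
    return [(start, min(max_groups, total_groups - start))
            for start in range(0, total_groups, max_groups)]


print(split_dispatch(10, 4))
```

Each chunk carries its starting offset so that a kernel can still compute the global index of the element it works on.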
|
|
eadc4184ca |
llama : refactor rope_freq_base/scale_swa conversion and init (#18553)
* refactor rope_freq_base/scale_swa conversion and init
* safe defaults for unknowns
* update relevant models
* grammar
* add get_rope_freq_scale to modern-bert
* const
* const
* log swa info
b7631 |
||
|
|
67e3f6f601 |
CANN: add operator fusion support for ADD + RMS_NORM (#17512)
This commit implements operator fusion for ADD + RMS_NORM operations in the CANN backend to reduce memory access overhead and improve performance. The fusion is controlled by the GGML_CANN_OPERATOR_FUSION environment variable (default: false).
Changes:
- Implement ggml_cann_op_add_rms_norm_fused() using ACLNN AddRmsNorm
- Add ggml_cann_can_fuse() to check fusion eligibility
- Integrate fusion logic into computation graph evaluation
- Add test cases for ADD + RMS_NORM fusion
- Update documentation with new environment variable
The fusion combines ADD and RMS_NORM into a single kernel call, which is more efficient than executing them separately.
b7630 |
||
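What the fusion computes can be sketched in plain Python. This is not the CANN implementation: the learned RMS_NORM weight is left out for brevity, and the `eps` value is an assumption. The point is only that the fused version produces the same result while touching the intermediate sum once.

```python
import math


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def rms_norm(x, eps=1e-6):
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * scale for v in x]


def add_rms_norm_fused(a, b, eps=1e-6):
    """Compute rms_norm(a + b) while accumulating the sum of squares in the same pass."""
    summed = []
    sum_sq = 0.0
    for x, y in zip(a, b):
        v = x + y
        summed.append(v)
        sum_sq += v * v
    scale = 1.0 / math.sqrt(sum_sq / len(summed) + eps)
    return [v * scale for v in summed]


a = [1.0, 2.0, 3.0]
b = [0.5, -1.0, 2.0]
print(add_rms_norm_fused(a, b))
```

In the unfused form the sum is written out by ADD and read back by RMS_NORM; avoiding that round trip is the memory-access saving the commit message refers to.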
|
|
92ac1e016b |
doc: clarify that steps also apply to linux for opencl (#18002)
* Clarify setup steps for Linux Added note that setup steps apply to Linux as well. * Added note for backtick replacement * clarify that backtick replacement only applies on linux * clarified Linux specific steps So actually some changes are needed for Linux but they are minor. * clarify change execution * clarify by placing info after steps * clarify which steps * Make instructions consistent across OSes * Rm whitespace * Update docs/backend/OPENCL.md Co-authored-by: Aaron Teo <taronaeo@gmail.com> * Update docs/backend/OPENCL.md Co-authored-by: Aaron Teo <taronaeo@gmail.com> * Update docs/backend/OPENCL.md Co-authored-by: Aaron Teo <taronaeo@gmail.com> --------- Co-authored-by: Aaron Teo <taronaeo@gmail.com> |
||
|
|
8e3a761189 |
ci : init git lfs in every build for RISC-V (#18590)
* Initialized git lfs in every test
* Added git-lfs in dependencies to install
b7628 |
||
|
|
d3dce4e0a5 |
sampling : add support for backend sampling (#17004)
* sampling : add support for backend sampling
This commit adds support for performing sampling operations on the backend (e.g. GPU) as part of the model computation graph.
The motivation for this feature is to enable sampling to be performed directly on the backend as part of the computation graph being executed, allowing for some or all of the sampling to be done on the backend.
For example, the backend sampler chain might select/sample a token directly, in which case only the sampled token needs to be transferred from device memory to host memory.
It is also possible for the backend samplers to perform filtering of the logits, or compute and filter the probability distribution, in which case only the filtered logits or probabilities need to be transferred back to system memory for further processing by CPU samplers.
Currently the backend sampling works in a similar manner to how pooling works: it is a function that is called by build_graph, and the sampler operations become part of the model's computation graph.
* llama-cli : add backend sampler configuration
* server : add backend sampling options/configuration
* webui : add backend sampling options
* ggml : add initial cumsum implementation for CUDA
* sampling : enable all backend sampler tests
This commit enables all existing backend sampler tests in the test-backend-sampler. Previously, some tests were disabled because there were missing ggml operation implementations.
* graph : do not include llama-model.h
* sampling : always expose sampled_ids
This commit precomputes and caches the full-vocab token id list in llama_context's constructor, so llama_get_backend_sampled_token_ids_ith always returns a valid pointer.
The motivation for this is that it enables both common/sampling.cpp and src/llama-sampling.cpp to simplify their logic.
Not all backend samplers that process logits need to set the sampled_tokens_id, as they may not change the order of the logits. For example, the temperature sampler only scales the logits but does not change their order. Similarly, the logit bias sampler only adds bias to specific token ids but does not change the order of the logits. In these cases there will not be a device to host copy of the sampled token ids, and this is the use case where having this precomputed list is useful.
* sampling : ensure at most one output token per seq
This commit adds a check in the batch allocator to ensure that when backend sampling is enabled, at most one output token is specified per sequence.
* CUDA: Optimize argsort for gpu-based token sampling
Argsort is used for top-k currently. We optimize argsort in two ways:
1. Use `DeviceRadixSort` for single-row/sequence to parallelize it across our SMs
2. Use `DeviceSegmentedSort` for multi-row/sequence as this is the correct entrypoint (the function chooses different execution paths, it contains `DeviceSegmentedRadixSort` as one of the paths and will choose the best one according to heuristics). https://nvidia.github.io/cccl/cub/api/structcub_1_1DeviceSegmentedSort.html#overview
Some perf numbers for a RTX PRO 6000:
On the kernel level, tested with `GGML_CUDA_DISABLE_GRAPHS=1 ./test-backend-ops -o ARGSORT perf`
Before:
```
ARGSORT(type=f32,ne=[65000,16,1,1],order=0): 4130 runs - 359.24 us/run
ARGSORT(type=f32,ne=[200000,1,1,1],order=0): 8192 runs - 861.34 us/run
ARGSORT(type=f32,ne=[200000,16,1,1],order=0): 1343 runs - 1020.01 us/run
```
After:
```
ARGSORT(type=f32,ne=[65000,16,1,1],order=0): 4130 runs - 312.41 us/run
ARGSORT(type=f32,ne=[200000,1,1,1],order=0): 16384 runs - 63.48 us/run
ARGSORT(type=f32,ne=[200000,16,1,1],order=0): 1343 runs - 874.36 us/run
```
---
On the model level, tested with `llama-cli -m gpt-oss-20b-mxfp4.gguf -n 200 -p "What is the Capital of Sweden?" -no-cnv -fa 1 --backend-sampling`
Before:
```
llama_perf_sampler_print: sampling time = 0.25 ms / 207 runs ( 0.00 ms per token, 824701.20 tokens per second)
llama_perf_context_print: load time = 18215.58 ms
llama_perf_context_print: prompt eval time = 28.20 ms / 7 tokens ( 4.03 ms per token, 248.19 tokens per second)
llama_perf_context_print: eval time = 714.79 ms / 199 runs ( 3.59 ms per token, 278.40 tokens per second)
llama_perf_context_print: total time = 857.62 ms / 206 tokens
```
After:
```
llama_perf_sampler_print: sampling time = 0.25 ms / 207 runs ( 0.00 ms per token, 828000.00 tokens per second)
llama_perf_context_print: load time = 18366.92 ms
llama_perf_context_print: prompt eval time = 35.92 ms / 7 tokens ( 5.13 ms per token, 194.87 tokens per second)
llama_perf_context_print: eval time = 532.79 ms / 199 runs ( 2.68 ms per token, 373.50 tokens per second)
llama_perf_context_print: total time = 683.65 ms / 206 tokens
```
* sampling : remove version from sampler chain
This commit removes the version field from the sampler chain and instead uses the sampler pointer itself for change detection.
* sampling : always populate logits for sampled probs
This commit updates common/sampler.cpp set_logits and src/llama-sampling.cpp llama_sampler_sample to always populate the logits field when backend sampled probabilities are available.
The motivation for this is that it ensures that CPU samplers always have access to the logits values even when probabilities have been produced by backend samplers.
* sampling : simplify backend sampling logic decode
This commit tries to simplify the backend sampling logic in llama_context::decode.
* squash! sampling : simplify backend sampling logic decode
Fix condition to check if backend actually sampled tokens, not just that backend samplers are available.
* common : fix regression caused by extra memory allocations during sampling
* squash! sampling : simplify backend sampling logic decode
The commit fixes a variable shadowing issue in the `llama_context::decode` function which was introduced in a previous refactoring.
* squash! common : fix regression caused by extra memory allocations during sampling
Apply the same changes to llama-sampling.cpp, llama_sampler_sample as were applied in commit |
||
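Since the commit notes that argsort is currently used for top-k, a short CPU-side Python sketch may help show why a top-k sampler has to hand back token ids alongside values. It is illustrative only and unrelated to the CUDA/CUB code path; `top_k` is a hypothetical helper and the logits are made up.

```python
def top_k(logits, k):
    """Return (token_ids, values) for the k largest logits, largest first."""
    # argsort: token ids ordered by descending logit value
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    ids = order[:k]
    return ids, [logits[i] for i in ids]


ids, values = top_k([0.1, 2.0, -1.0, 1.5], 2)
print(ids, values)
```

After top-k, position 0 of the output no longer corresponds to token id 0, so the ids must be transferred together with the filtered logits. A sampler that leaves the array in place needs no such list, which is where the precomputed full-vocab id list described above comes in.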
|
|
4974bf53cf |
model : mtmd : make input norm optional in LFM2-VL (#18594)
Upcoming LFM2-VL releases will have configurable input norm. See https://github.com/huggingface/transformers/pull/43087 for details.
b7626 |
||
|
|
908a9e5a1e |
CUDA: disable cuda graph when using n-cpu-moe (#18593)
* CUDA: disable cuda graph when using n-cpu-moe
* call ggml_cuda_set_device
b7625 |
||
|
|
5126c41c1c | ggml-cuda: remove unused params in ggml_cuda_graph (#18579) b7624 | ||
|
|
cef1d23c5a |
common/grammar : replace problematic backtracking regex [\s\S]* (#18342)
* grammar : add support for std::regex_search() with trigger patterns
* common : update hermes2 pro trigger to search instead of match
* common : use regex_search with anchoring for partial matching
* common : adjust regex partial tests to use new pattern
* grammar : check pattern directly instead of adding a type
* common : adjust existing patterns to match new semantics
b7623 |
||
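The difference between matching a whole string with `[\s\S]*` padding and searching for the trigger directly can be shown with a Python analogue. The commit itself concerns C++ `std::regex`; the trigger string below is only an illustrative choice, and both helper names are hypothetical.

```python
import re

TRIGGER = r"<tool_call>"  # illustrative trigger pattern

# Full-match style: the trigger has to be padded so the pattern spans the whole text.
padded = re.compile(r"[\s\S]*" + TRIGGER + r"[\s\S]*")
# Search style: the bare trigger is located anywhere in the text.
bare = re.compile(TRIGGER)


def triggered_by_match(text):
    return padded.fullmatch(text) is not None


def triggered_by_search(text):
    return bare.search(text) is not None


sample = "Let me check.\n<tool_call>{}"
print(triggered_by_match(sample), triggered_by_search(sample))
```

Both helpers agree on the result, but the search form avoids the leading `[\s\S]*`, which forces a backtracking engine to try many split points on long inputs.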
|
|
c69c7ebc90 |
graph : fix graph reuse logic when n_pos_per_embd > 1 (#18566)
b7622
|
||
|
|
e57f52334b | ggml-cuda: fixes for concurrent streams (#18496) b7621 | ||
|
|
a554a1ecc7 | context : fix reserve token padding to n_seqs (#18536) b7620 | ||
|
|
0f2e42ca1d | CUDA: only allocate FA tmp buffer if needed (#18564) b7619 | ||
|
|
9dba9f5352 |
(Bugfix, ggml-cuda) Pool alloc count fix + small size computation type adjustment (#18559)
* CUDA: Fixed obj byte size instead of obj count being passed to pool alloc (fattn-common, dst_tmp_meta)
* CUDA: Explicitly casted some of the int alloc counts before multiplication in argsort
---------
Co-authored-by: pl752 <maximpl752@gmail.com>
b7618 |
||
|
|
bcfc8c3cec |
ggml-hexagon: optimize activation function (#18393)
* refactor: refactor silu
* refactor: optimize swiglu
* refactor: remove unnecessary if in swiglu
* refactor: refactor swiglu_oai
* chore: fix formatting issue
b7617 |