llama.cpp/ggml/src at b8680

kanshan/llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2026-06-29 07:40:21 +00:00


Gaurav Garg 15f786e658 [CUDA] Write an optimized flash_attn_stream_k_fixup kernel (#21159)

* Write an optimized flash_attn_stream_k_fixup kernel

Write a specialized, more optimized kernel for cases where nblocks_stream_k is a multiple of ntiles_dst.
Make nblocks_stream_k a multiple of ntiles_dst if nblocks_stream_k > 2 * ntiles_dst

* Use the new kernel only for nblocks_stream_k_raw > 4 * ntiles_dst to make sure we have enough concurrency on GPUs

* Address review comments

* Address review comments

* Revert variable names to original

2026-04-06 20:34:29 +02:00
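The block-count heuristic described in the commit message above can be sketched as follows. This is a minimal illustration in C and is not the launch code from the commit. The helper name `choose_nblocks_stream_k`, the `use_fast_fixup` flag, and the choice to round down rather than up to a multiple of `ntiles_dst` are all assumptions. Only the `4 * ntiles_dst` threshold and the "multiple of `ntiles_dst`" condition come from the message.

```c
#include <stdbool.h>

/* Hypothetical helper: name, signature, and round-down policy are assumptions.
 * Picks how many stream-k blocks to launch and whether the specialized fixup
 * (every destination tile covered by the same number of blocks) applies. */
static int choose_nblocks_stream_k(int nblocks_stream_k_raw, int ntiles_dst,
                                   bool *use_fast_fixup) {
    /* Only switch when there are comfortably more blocks than tiles, so that
     * rounding to a multiple still leaves enough concurrency on the GPU. */
    if (ntiles_dst > 0 && nblocks_stream_k_raw > 4 * ntiles_dst) {
        *use_fast_fixup = true;
        /* Round down to a multiple of ntiles_dst (assumed policy). */
        return (nblocks_stream_k_raw / ntiles_dst) * ntiles_dst;
    }
    *use_fast_fixup = false;
    return nblocks_stream_k_raw;
}
```

Take 1000 raw blocks and 64 destination tiles as an example. The sketch returns 960 blocks, so each tile is covered by exactly 15 blocks. That uniform split is what lets a specialized fixup pass avoid per-tile bookkeeping.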


ggml-blas: set mkl threads from thread context (#20602)

2026-03-18 01:16:49 +08:00

CANN: fix multi-thread set_tensor race conditions (#20151)

2026-03-31 17:00:51 +03:00

ggml : fix RWKV ops thread assignment (#21226)

2026-04-01 11:10:25 +03:00

[CUDA] Write an optimized flash_attn_stream_k_fixup kernel (#21159)

2026-04-06 20:34:29 +02:00

hexagon: slight optimization for argosrt output init (#21463)

2026-04-05 18:30:25 -07:00

ggml-cuda: native bf16 flash attention for vec kernel (#20525)

2026-03-22 11:05:51 +01:00

metal : Fix dimension constraint violation in matmul2d descriptor (#21048)

2026-03-27 09:05:21 +02:00

ggml-cuda: native bf16 flash attention for vec kernel (#20525)

2026-03-22 11:05:51 +01:00

opencl: fix leak in Adreno q8_0 path (#21212)

2026-04-01 12:54:58 -07:00

fix(openvino): explicit memset in buffer_context allocation (#20857)

2026-03-23 08:05:37 +02:00

rpc : reuse compute graph buffers (#21299)

2026-04-03 10:28:09 +03:00

sycl : handle other FA case (#21377)

2026-04-06 13:28:00 +03:00

ggml-virtgpu: improve the reliability of the code (#19846)

2026-02-26 20:00:57 +08:00

vulkan: add noncontiguous GLU support (#21081)

2026-03-28 08:44:56 +01:00

ggml-webgpu: move from parameter buffer pool to single buffer with offsets (#21278)

2026-04-03 11:40:14 -07:00

ggml-zdnn : mark zDNN buffers as non-host (#18967)

2026-01-22 01:16:21 +01:00

ggml-zendnn : add MUL_MAT_ID op support for MoE models (#21315)

2026-04-03 12:19:08 +03:00

CMakeLists.txt

ggml : add OpenVINO backend (#15307)

2026-03-14 07:56:55 +02:00

ggml-alloc.c

ggml : make ggml_is_view as API (#19539)

2026-02-16 17:43:34 +02:00

ggml-backend-dl.cpp

hexagon: enable offloading to Hexagon on Windows on Snapdragon (#19150)

2026-01-29 12:33:21 -08:00

ggml-backend-dl.h

hexagon: enable offloading to Hexagon on Windows on Snapdragon (#19150)

2026-01-29 12:33:21 -08:00

ggml-backend-impl.h

llama: use host memory if device reports 0 memory (#18587)

2026-01-09 05:34:56 +08:00

ggml-backend-reg.cpp

ggml : add OpenVINO backend (#15307)

2026-03-14 07:56:55 +02:00

ggml-backend.cpp

llama : disable graph reuse with pipeline parallelism (#20463)

2026-03-12 21:04:13 +02:00

ggml-common.h

ggml : add NVFP4 quantization type support (#19769)

2026-03-11 21:02:54 +01:00

ggml-impl.h

llama: fix llama-model-saver (#20503)

2026-03-25 12:53:16 +02:00

ggml-opt.cpp

finetune: SGD optimizer, more CLI args (#13873)

2025-08-14 12:03:57 +02:00

ggml-quants.c

ggml : guard against sumq2 being 0 in IQ4_NL (#20460)

2026-03-15 10:47:28 +02:00

ggml-quants.h

ggml : add NVFP4 quantization type support (#19769)

2026-03-11 21:02:54 +01:00

ggml-threading.cpp

ggml : build backends as libraries (#10256)

2024-11-14 18:04:35 +01:00

ggml-threading.h

remove CMAKE_WINDOWS_EXPORT_ALL_SYMBOLS (#10797)

2024-12-12 19:02:49 +01:00

ggml.c

mtmd: Add DeepSeekOCR Support (#17400)

2026-03-25 19:57:40 +01:00

ggml.cpp

ggml : Print backtrace on uncaught C++ exceptions (ggml/1232)

2025-06-01 13:43:57 +03:00

gguf.cpp

llama: fix llama-model-saver (#20503)

2026-03-25 12:53:16 +02:00