Reduce the number of parallel jobs in server-self-hosted.yml by stacking
test configurations as sequential steps within a single job, following the
pattern from #23927.
- server-metal: 4 matrix jobs -> 1 job with 4 sequential test steps
- server-cuda: 2 matrix jobs -> 1 job with 2 sequential test steps
- server-kleidiai: removed unnecessary single-entry matrix
- removed unused Setup Node.js step from server-metal
Total: 7 parallel jobs -> 3 parallel jobs
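A minimal sketch of the stacking pattern, for illustration only. The runner labels, build command, script name, and TEST_CONFIG variable are assumptions and do not reflect the actual contents of server-self-hosted.yml:

```yaml
# Hypothetical sketch: one job builds once, then runs each former matrix
# entry as its own sequential step.
jobs:
  server-cuda:
    runs-on: [self-hosted, Linux, CUDA]     # assumed labels
    steps:
      - uses: actions/checkout@v4
      - name: Build
        run: cmake -B build -DGGML_CUDA=ON && cmake --build build -j
      - name: Tests (first configuration)
        env:
          TEST_CONFIG: first                # hypothetical variable
        run: ./run-server-tests.sh          # hypothetical script
      - name: Tests (second configuration)
        env:
          TEST_CONFIG: second               # hypothetical variable
        run: ./run-server-tests.sh
```

Note that, unlike matrix jobs, sequential steps stop at the first failing step by default, so a later configuration does not run if an earlier one fails.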
Assisted-by: llama.cpp:local pi
* speculative : add common_speculative_n_max helper function
Extract the speculative max-draft-size logic from server_n_outputs_max
into a reusable common_speculative_n_max() function in common/speculative.
Assisted-by: llama.cpp:local pi
* cont : draft context always has n_parallel outputs
* llama : log n_outputs_max
* speculative : remove draft-simple auto-enable
* ci : enable server tests on PRs
Fixes: https://github.com/ggml-org/llama.cpp/pull/23927#discussion_r3332213086
The cpu-x64-high-perf job was missing the Linux label in its runs-on
specification, causing the runner to not be discovered. All other
self-hosted Linux jobs include this label.
Assisted-by: llama.cpp:local pi
* remove redundant apple job
openvino gpu and cpu test can share the same build and machine
Update build-rpc.yml
Update build-openvino.yml
cpu any doesn't make sense as we already have an arm job, so do high perf on both x86 and arm
remove duplicate x86 vulkan
combine backend sampling
Update server.yml
run server on arm as windows is x86
* emdawn on one machine only
* fix openvino, remove cpu tag as we don't have many x64 machines with that tag
* ci : disable libcommon build from xcframework
* ocd : fix name
* ci : ios-xcode change to macos-26
* cont : pin xcode
* cont : pin xcode to minor version
* ci : ios use macos-15 again
* ci : add and test ccache-clear
* cont : fix
* cont : set permission
* cont : another permission
* cont : token
* cont : print key
* cont : bring back perms
* cont : test windows
* cont : add token
* cont : cleanup
* ci : make release jobs clean-up their ccache
* ci : separate CUDA windows workflow + fix names
* ci : rename workflow
* ci : prefix cache names with workflow name
* ci : rename build.yml -> build-cpu.yml
* ci : cache keys
* ci : fix windows cuda/hip concurrency of release workflow
* ci : fix apple cache names
* ci : add TODOs
* cont : keep just the last cache
* ci : update release concurrency to queue
* ci : move the release trigger to ubuntu-slim
* ci : hip add TODO
* cont : improve words
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* ci : server windows set build type explicitly
* cont : try windows-2025
* ci : use llvm
* cont : use ninja
* cont : fix shell
* ci : set number of jobs correctly
* ci : fix windows with vulkan ccache by using llvm
* ci : server ccache only on master
* ocd : fix job names
[no release]
* ci : fix undefined sanitizer build to use Debug build type only
* ci : ccache the server builds
* cont : remove ui dependency + reuse ccache for both ubuntu jobs
* tmp : force ccache save
* Revert "tmp : force ccache save"
This reverts commit a857b03a10.
* cont : no need for node.js
* ci : move [no release] check to dedicated check_release job
Move the workflow-level `if` condition that skips builds when the commit
message contains `[no release]` into a lightweight `check_release` job.
All build jobs now depend on it via `needs` and check its output.
This ensures the skip logic is evaluated at the job level rather than at
the workflow level, which is the recommended approach for conditional jobs.
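A rough sketch of how such a gate job could look, assuming a push-triggered workflow. The output name should_release, the step id, and the build job body are hypothetical and not taken from the actual workflow:

```yaml
# Hypothetical sketch of a gate job whose output the build jobs check.
jobs:
  check_release:
    runs-on: ubuntu-latest
    outputs:
      should_release: ${{ steps.check.outputs.should_release }}
    steps:
      - id: check
        env:
          # Empty for events that carry no head commit (e.g. manual runs).
          COMMIT_MESSAGE: ${{ github.event.head_commit.message }}
        run: |
          if [[ "$COMMIT_MESSAGE" == *"[no release]"* ]]; then
            echo "should_release=false" >> "$GITHUB_OUTPUT"
          else
            echo "should_release=true" >> "$GITHUB_OUTPUT"
          fi

  build:
    needs: check_release
    if: needs.check_release.outputs.should_release == 'true'
    runs-on: ubuntu-latest
    steps:
      - run: echo "building release"        # placeholder build step
```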
Assisted-by: llama.cpp:local pi
* cont : use `fast` runner
* ci : skip release workflow on master when commit message contains [no release]
Assisted-by: llama.cpp:local pi
* ci : restrict sanitizer builds to x86_64 + fix build type
the spark is apparently too slow for some reason
* tests : fix undefined warning
[no ci]
* ci : disable SYCL f16 builds
* ci : extract android and hip into separate workflows
* ci : move webgpu to separate workflow
* ci : move the rpc to a separate workflow
* ci : extract s390x and ppcl jobs
* ci : extract opencl job into a separate workflow
* fix(action): update SpacemiT toolchain URL and version
Change-Id: If4cc1c738a855274103f8c3ad52daa33528acd0c
* fix(action): add -L flag to curl command for URL redirection
Change-Id: I9b6c37390f0c7a733a36308c8fb53d22d234ab06
* ci : remove tag from build-self-hosted.yml
* ci : slim -> self-hosted
* ci : prevent heavy CPU jobs from running on fast runners
* ci : prevent cmake pkg to run on dedicated fast runners
* ci : try to bump 3.11 -> 3.13
* ci : move lint back to 3.11
* ci : back to 3.11
* ci : add comment about UI jobs
* ci : move python requirements check to CPU runners
this job is a bit slow for a dedicated "fast" runner
* ci : add self-hosted ui workflow
* ci : fix UI naming
* tmp to check if arm64 fast is compatible with all jobs
* revert last commit
* pi : update
* ci : fix ios build
* ci : fix android
* ci : fix apple builds
* cmake : add install() for impl libraries
Add install(TARGETS <target> LIBRARY) for all -impl libraries that were
changed from STATIC to shared (controlled by BUILD_SHARED_LIBS) in
commit bb28c1fe2. Without this, cmake --install fails to copy the shared
libraries, causing runtime errors like:
llama-server: error while loading shared libraries: libllama-server-impl.so
Ref: https://github.com/ggml-org/llama.cpp/issues/23494#issuecomment-4512912515
Assisted-by: llama.cpp:local pi
* ci : fix xcframework build
* cmake : remove STATIC from impl libraries, allow BUILD_SHARED_LIBS control
Remove explicit STATIC from all -impl libraries (server, cli, completion, bench,
batched-bench, fit-params, quantize, perplexity) so BUILD_SHARED_LIBS controls
shared vs static linkage.
Add WINDOWS_EXPORT_ALL_SYMBOLS ON for proper DLL export on Windows.
Assisted-by: llama.cpp:local pi
* cmake : enable LLAMA_BUILD_APP by default
Assisted-by: llama.cpp:local pi
* ci : disable app in build-cmake-pkg.yml
* snapdragon: update compiler flags to enable all CPU features
* snapdragon: update readme to point to toolchain v0.6
* snapdragon: bump toolchain docker to v0.6
* docker: add OCI image labels to all published images
* docker: propagate OCI labels as manifest and index annotations
* docker: drop hardcoded org URL and revert accidental intel version bump
The OCI image url and source are now driven by build args with a sensible default. The workflow passes the actual repository url so fork builds get labels pointing at the fork instead of upstream. Also restores the IGC, compute runtime, and IGDGMM versions in the intel Dockerfile labeled stage which I accidentally bumped in the first commit.
* docker: add skip_s390x workflow_dispatch input for fast test runs
Lets maintainers and PR authors trigger the docker workflow without the s390x build target, which depends on the IBM Z runner and is by far the slowest job in the matrix. The flag filters the s390x row out of the build matrix before merge_matrix is derived, so the merge job sees a consistent shape too.
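One way the filtering could be wired up, as a sketch only. The two-row matrix, the tag and platforms fields, the prepare job, and the build_matrix output are all hypothetical; the real workflow's matrix shape and job names may differ:

```yaml
# Hypothetical sketch: drop the s390x row before any other job reads the matrix.
on:
  workflow_dispatch:
    inputs:
      skip_s390x:
        description: "Skip the s390x build target"
        type: boolean
        default: false

jobs:
  prepare:
    runs-on: ubuntu-latest
    outputs:
      build_matrix: ${{ steps.matrix.outputs.build_matrix }}
    steps:
      - id: matrix
        env:
          SKIP_S390X: ${{ inputs.skip_s390x }}
        run: |
          # Placeholder matrix; the real one has more rows and fields.
          matrix='[{"tag":"cpu","platforms":"linux/amd64"},{"tag":"s390x","platforms":"linux/s390x"}]'
          if [ "$SKIP_S390X" = "true" ]; then
            matrix=$(echo "$matrix" | jq -c 'map(select(.tag != "s390x"))')
          fi
          echo "build_matrix=$matrix" >> "$GITHUB_OUTPUT"
```

Because both the build and the merge matrices would be derived from the same filtered output, they stay consistent with each other.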
Signed-off-by: Samaresh Kumar Singh <ssam3003@gmail.com>
---------
Signed-off-by: Samaresh Kumar Singh <ssam3003@gmail.com>