4557 Commits

Author SHA1 Message Date
Jeff Bolz f35726c2fb build: apply MSVC /bigobj option to c/cpp files only (#11423) b4557 2025-01-26 03:10:03 +01:00
Jeff Bolz 4a75d19376 vulkan: compile shaders on-demand (#11406)
Reduce first-run startup time and memory consumption.

Should fix #11339.
2025-01-25 22:29:57 +01:00
uvos 26771a1491 Hip: disable VMM on hip as it seams that it dosent work in some configurations (#11420) 2025-01-25 21:01:12 +01:00
Jeff Bolz ca6baf76c1 build: add /bigobj to MSVC build (#11407) 2025-01-25 11:26:37 -06:00
Diego Devesa 6e264a905b docker : add GGML_CPU_ARM_ARCH arg to select ARM architecture to build for (#11419) 2025-01-25 17:22:41 +01:00
Xuan Son Nguyen 49b0e3cec4 server : fix cleaning up stream task (#11418)
* server : fix cleaning up stream task

* one more spot
b4552
2025-01-25 16:36:44 +01:00
Diego Devesa 20a758155b docker : fix CPU ARM build (#11403)
* docker : fix CPU ARM build

* add CURL to other builds
2025-01-25 15:22:29 +01:00
Georgi Gerganov 00c24acb2a ci : fix line breaks on windows builds (#11409)
* ci : fix line breaks on windows builds

* cont : another try

* ci : fix powershell line breaks
b4550
2025-01-25 13:36:48 +02:00
jiahao su 466ea66f33 CANN: Add Ascend CANN build ci (#10217)
* CANN: Add Ascend CANN build ci

* Update build.yml

* Modify cann image version

* Update build.yml

* Change to run on x86 system

* Update build.yml

* Update build.yml

* Modify format error

* Update build.yml

* Add 'Ascend NPU' label restrictions

* Exclude non PR event

Co-authored-by: Yuanhao Ji <jiyuanhao@apache.org>

* Update build.yml

---------

Co-authored-by: Yuanhao Ji <jiyuanhao@apache.org>
b4549
2025-01-25 00:26:01 +01:00
uvos 5f0db9522f hip : Add hipGraph and VMM support to ROCM (#11362)
* Add hipGraph support

* Enable VMM on rocm
b4548
2025-01-25 00:02:23 +01:00
Johannes Gäßler c5d9effb49 CUDA: fix FP16 cuBLAS GEMM (#11396) b4547 2025-01-24 21:02:43 +01:00
uvos 9fbadaef4f rocBLAS: Avoid fp32->fp16->fp32 conversion on cdna (#11356) b4546 2025-01-24 17:50:49 +01:00
Georgi Gerganov 9755129c27 release : pack /lib in the packages (#11392)
* release : pack /lib and /include in the packages

* cmake : put libs in /bin

* TMP : push artifacts

* Revert "TMP : push artifacts"

This reverts commit 4decf2c4df.

* ci : fix HIP cmake compiler options to be on first line

* ci : restore the original HIP commands

* ci : change ubuntu build from latest to 20.04

* ci : try to fix macos build rpaths

* ci : remove obsolete MacOS build

* TMP : push artifacts

* ci : change back to ubuntu latest

* ci : macos set build rpath to "@loader_path"

* ci : fix typo

* ci : change ubuntu package to 22.04

* Revert "TMP : push artifacts"

This reverts commit 537b09e70f.
b4545
2025-01-24 18:41:30 +02:00
Jafar Uruç a07c2c8a52 docs : Update readme to build targets for local docker build (#11368) 2025-01-24 14:30:13 +01:00
Johannes Gäßler 8137b4bb2b CPU/CUDA: fix (GQA) mul mat back, add CUDA support (#11380) b4543 2025-01-24 12:38:31 +01:00
Bernhard M. Wiedemann 1af6945eb0 cmake : avoid -march=native when reproducible build is wanted (#11366)
See https://reproducible-builds.org/ for why this is good
and https://reproducible-builds.org/specs/source-date-epoch/
for the definition of this variable.

Without this patch, compiling on different machines produced different binaries, which made verification of results difficult.

Fixes: #11317

This patch was done while working on reproducible builds for openSUSE.
b4542
2025-01-24 13:21:35 +02:00
Eric Curtin 01f37edf1a Update llama-run README.md (#11386)
For consistency

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-01-24 09:39:24 +00:00
stduhpf c07e87f38b server : (webui) put DeepSeek R1 CoT in a collapsible <details> element (#11364)
* webui : put DeepSeek R1 CoT in a collapsible <details> element

* webui: refactor split

* webui: don't use regex to split cot and response

* webui: format+qol

* webui: no loading icon if the model isn't generating

* ui fix, add configs

* add jsdoc types

* only filter </think> for assistant msg

* build

* update build

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-01-24 09:02:38 +01:00
Jeff Bolz 564804b79b tests: fix some mul_mat test gaps (#11375)
Now that we have batched mat-vec mul Vulkan shaders for up to n==8,
these tests weren't actually exercising the mat-mat mul path. Test
n==9 as well. Also, change to use all_types.
b4539
2025-01-23 14:51:24 -06:00
Eric Curtin 05f63cc9ee Update documentation (#11373)
To show -n, -ngl, --ngl is acceptable.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
b4538
2025-01-23 20:04:31 +00:00
Eric Curtin f7fb43cd0b Add -ngl (#11372)
Most other llama.cpp cli tools accept -ngl with a single dash.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
b4537
2025-01-23 16:16:18 +00:00
Xuan Son Nguyen 5845661640 server : add more clean up when cancel_tasks is called (#11340)
* server : add more clean up when cancel_tasks is called

* fix recv_with_timeout

* std::remove_if

* fix std::remove_if
b4536
2025-01-23 13:56:05 +01:00
Eric Curtin f211d1dc10 Treat hf.co/ prefix the same as hf:// (#11350)
ollama uses hf.co/ to specify huggingface prefix, like RamaLama
uses hf://

Treat them similarly.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
b4535
2025-01-23 10:38:20 +00:00
amd-dwang 955a6c2d91 Vulkan-run-test: fix mmq_wg_denoms (#11343)
There should be a copy-and-paste error here.

*mmq_wg_denoms should be used together with *warptile_mmq, instead of
wg_denoms.
b4534
2025-01-23 08:14:28 +01:00
Jeff Bolz 1971adf55e vulkan: sort shaders for more deterministic binary (#11315)
Fixes #11306.
b4533
2025-01-23 08:07:50 +01:00
Jeff Bolz 5245729e33 vulkan: fix diag_mask_inf (#11323)
With robustbufferaccess disabled, this shader was showing OOB stores. There
is a bounds check in the code, but the workgrouop dimensions were reversed vs
CUDA and it was running the wrong number of threads. So fix the workgroup
dimensions and disable robustness for this pipeline.
b4532
2025-01-23 08:01:17 +01:00
Diego Devesa 6152129d05 main : update README documentation for batch size (#11353)
* main : update README documentation for batch size

* fix formatting

* minor
2025-01-22 19:22:20 +01:00
Georgi Gerganov 16d3df7ab0 readme : add plugin links (#11355) 2025-01-22 19:44:26 +02:00
Diego Devesa 12c2bdf2de server : fix draft context not being released (#11354) b4529 2025-01-22 17:44:40 +01:00
Olivier Chafik c64d2becb1 minja: sync at https://github.com/google/minja/commit/0f5f7f2b3770eb682fbc11763266d45204173686 (#11352) b4528 2025-01-22 16:16:27 +00:00
Jiří Podivín 96f4053934 Adding logprobs to /v1/completions (#11344)
Signed-off-by: Jiri Podivin <jpodivin@redhat.com>
b4527
2025-01-22 12:51:32 +01:00
Olivier Chafik a94f3b2727 common: utils to split / join / repeat strings (from json converter) (#11342)
* Factor string_join, string_split, string_repeat into common

* json: refactor to surface a versatile builder

* Update common.cpp
b4526
2025-01-22 09:51:44 +00:00
tc-mb 3e3357fd77 llava : support Minicpm-omni (#11289)
* init

* add readme

* update readme

* no use make

* update readme

* update fix code

* fix editorconfig-checker

* no change convert py

* use clip_image_u8_free
b4525
2025-01-22 09:35:48 +02:00
Olivier Chafik 6171c9d258 Add Jinja template support (#11016)
* Copy minja from https://github.com/google/minja/commit/58f0ca6dd74bcbfbd4e71229736640322b31c7f9

* Add --jinja and --chat-template-file flags

* Add missing <optional> include

* Avoid print in get_hf_chat_template.py

* No designated initializers yet

* Try and work around msvc++ non-macro max resolution quirk

* Update test_chat_completion.py

* Wire LLM_KV_TOKENIZER_CHAT_TEMPLATE_N in llama_model_chat_template

* Refactor test-chat-template

* Test templates w/ minja

* Fix deprecation

* Add --jinja to llama-run

* Update common_chat_format_example to use minja template wrapper

* Test chat_template in e2e test

* Update utils.py

* Update test_chat_completion.py

* Update run.cpp

* Update arg.cpp

* Refactor common_chat_* functions to accept minja template + use_jinja option

* Attempt to fix linkage of LLAMA_CHATML_TEMPLATE

* Revert LLAMA_CHATML_TEMPLATE refactor

* Normalize newlines in test-chat-templates for windows tests

* Forward decl minja::chat_template to avoid eager json dep

* Flush stdout in chat template before potential crash

* Fix copy elision warning

* Rm unused optional include

* Add missing optional include to server.cpp

* Disable jinja test that has a cryptic windows failure

* minja: fix vigogne (https://github.com/google/minja/pull/22)

* Apply suggestions from code review

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Finish suggested renamings

* Move chat_templates inside server_context + remove mutex

* Update --chat-template-file w/ recent change to --chat-template

* Refactor chat template validation

* Guard against missing eos/bos tokens (null token otherwise throws in llama_vocab::impl::token_get_attr)

* Warn against missing eos / bos tokens when jinja template references them

* rename: common_chat_template[s]

* reinstate assert on chat_templates.template_default

* Update minja to https://github.com/google/minja/commit/b8437df626ac6cd0ce3b333b3c74ed1129c19f25

* Update minja to https://github.com/google/minja/pull/25

* Update minja from https://github.com/google/minja/pull/27

* rm unused optional header

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
b4524
2025-01-21 13:18:51 +00:00
Xuan Son Nguyen e28245f35f export-lora : fix tok_embd tensor (#11330) b4523 2025-01-21 14:07:12 +01:00
Radoslav Gerganov 6da5bec81c rpc : better caching of the base buffer pointer (#11331)
There is no need to use map, just store the base pointer in the buffer
context.
b4522
2025-01-21 15:06:41 +02:00
Eric Curtin 2e2f8f093c linenoise.cpp refactoring (#11301)
More RAII mainly

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
b4521
2025-01-21 09:32:35 +00:00
Georgi Gerganov 2139667ec4 metal : fix out-of-bounds write (#11314)
ggml-ci
b4520
2025-01-21 08:48:13 +02:00
Georgi Gerganov 80d0d6b4b7 common : add -hfd option for the draft model (#11318)
* common : add -hfd option for the draft model

* cont : fix env var

* cont : more fixes
b4519
2025-01-20 22:29:43 +02:00
Jeff Bolz aea8ddd516 vulkan: fix coopmat2 validation failures (#11284)
mul mat and flash attention shaders were loading f32 types directly into
A/B matrices, which happens to work but is technically invalid usage.
For FA, we can load it as an Accumulator matrix and convert and this
is not in the inner loop and is cheap enough. For mul mat, it's more
efficient to do this conversion in a separate pass and have the input(s)
be f16.

coopmat2 requires SPIR-V 1.6 (related using to LocalSizeId). LocalSizeId
requires maintenance4 be enabled, and SPIR-V 1.6 requires Vulkan 1.3.
b4518
2025-01-20 10:38:32 -06:00
Georgi Gerganov 9f7add1cde examples : fix add_special conditions (#11311) 2025-01-20 16:36:08 +02:00
Christopher Nielsen 90d987b105 mmap: add include for cerrno (#11296)
ggml-ci

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
b4516
2025-01-20 16:02:43 +02:00
Michael Podvitskiy a4251edd6f cmake: fix shell command quoting in build-info script (#11309) 2025-01-20 16:02:15 +02:00
Xuan Son Nguyen ec7f3ac9ab llama : add support for Deepseek-R1-Qwen distill model (#11310)
* llama : add support for Deepseek-R1-Qwen distill model

* coding style
b4514
2025-01-20 14:35:07 +01:00
Georgi Gerganov ef6dada60c cont : fix whitespaces (#11305) b4513 2025-01-20 09:29:32 +02:00
Kyle Bruene ae3c1db2f9 llama : re-add LLM_ARCH_PHIMOE (#11305)
Phi 3.5 MoE was partially removed during a refactor. The code was originally in llama.cpp and should be in llama-model.cpp after the refactor.
b4512
2025-01-20 09:21:01 +02:00
Georgi Gerganov 92bc493917 tests : increase timeout when sanitizers are enabled (#11300)
* tests : increase timeout when sanitizers are enabled

* tests : add DEFAULT_HTTP_TIMEOUT
2025-01-19 20:22:30 +02:00
Georgi Gerganov b9daaffe02 simple-chat : fix BOS being added to each message (#11278) b4510 2025-01-19 18:12:09 +02:00
Nicolò Scipione 99487b57d4 SYCL: Introducing memory host pool (#11251)
* Implement host pool for matrix_info

Creating a new memory pool on the host to store memory location for
matrix_info needed to launch gemm_batch from oneMKL/oneMath.
Removing complex support in gemm_batch since it is not used in llama.cpp

* Remove unnecessary headers and cast

* Reorder member variable to avoid warning on initialization

* Formatting

* Remove unused variable

* Address PR review feedback - remove warning

---------

Signed-off-by: nscipione <nicolo.scipione@codeplay.com>
b4509
2025-01-19 21:33:34 +08:00
Eric Curtin a1649cc13f Adding linenoise.cpp to llama-run (#11252)
This is a fork of linenoise that is C++17 compatible. I intend on
adding it to llama-run so we can do things like traverse prompt
history via the up and down arrows:

https://github.com/ericcurtin/linenoise.cpp

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
b4508
2025-01-18 14:42:31 +00:00