* hex-mm: new weight layout and fusion updates
* hvx-mm: unroll the new tiled vec_dots to optimize hvx register util
* hex-mm: optimize dyn.quant format for q8_0 and q8_1 to reduce overhead in vec_dots.
* hvx-mm: parallel quantizer per block for large rows
* hvx-mm: simplify and futher optimize dyn.quant and vec_dots
* hvx-mm: keep intermediate per tile accumulators in fp16
* hmx-mm: optimize weight dequant by aligning the repacked tiles with the DMA
* hmx-mm: remove qweight scratch and just use vtcm_weight
* hmx-mm: remove all unused and obsolete code
* hmx-mm: the new tiled repack format is here to stay -- rename all x4x2 to _tiled
* hmx-mm: improve activation processing with dma prefetch
* hex-mm: fix hmx/hvx fallback logic and MUL_MAT_ID allocation (unbreaks OLMoE)
* hex-mm: align the weight tiles with dma just like we did in hmx-mm
* hex-mm: factor out common mm bits into htp/matmul-ops.h
* hex-mm: start moving mm kernel selection to the host
* hex-mm: move all of the matmul param compute into the host
* hmx-mm: restore pipelined mode
* hmx-mm: unroll the dequant functions to optimize register usage
* hmx-mm: further improve activation process
* hex-mm: use vtcm_seq_alloc for all vtcm allocations and define more common functions
* hex-mm: improve mm optimizer to acount for number of activation threads
* hex-mm: fix matmul-id kernel params selection (unbreaks OLMoE and LFM)
* hexagon: remove support for arch < v73 since HMX is now required for most use-cases
* hex-mm: cleanup naming for consistency
* hex-mm: make sure matmul fusion accounts for vtcm allocation
* hex-mm: minor cleanup for kernel_params definition
* hex-mm: replace hardcoded limits with proper checks for vtcm requirements
* hex-mm: add support for non-tiled mm as a fallback option and factor out hvx kernels into separate header
* hex-mm: remove unused functions
* hex-mm: add shorthand for MM_SELECT in run-tool script
* hvx-mm: factor out hvx/hmx microkernels and unify matmul entry and dispatch
* hex-mm: further cleanup matmul fallback path
* hex-mm: refactor matmul entry point and dispatch a bit further
* hexagon: update cmake build to enable hmx for everything
* hex-ops: optimize kernel_param updates and include summary in the logs
* hex-mm: add support for GGML_HEXAGON_MM_SELECT
* hex-mm: add hex-common header
* hex-mm: pass correct number of tasks to workpool
* hex-mm: add proper checks for no-work in dyn.quant tasks
* hex-mm: convert all quantizers into a macro
* hex-mm: fix hvx-flat fallback to pass all MUL_MAT tests
* hex-mm: vectorize q8_1 quantizer
* hex-mm: improve fused ffn mm stride handling
* hex-mm: consistent use of n_threads and pipeline in kernel_params
* hexagon: minor formatting
* hex-mm: update MUL_MAT_ID kernel_param handling to make sure host/npu are in sync
* hvx-mm: go back to accumulating in fp32 in tiled hvx kernels, more accurate and same perf
* hvx-mm: unroll the loops and remove masking that is not needed for tiled accums
* hmx-mm: optimize activation processing (slit loops, some unrolling, etc)
* hmx-mm: minor optimization for output processing
* hex-mm: consistent use of uint32_t and size_t in mm kernels
* hex-mm: remove legacy restrictions for rows to be multiple of 256
* hexagon: replace sprintf with snprintf
* hex-mm: relax hardcoded nrows checks and rely on VTCM size requirements
* hexagon: minor alignment fix
* hexagon: fix trailing spaces
* hex-mm: relax padding from 256 to 128 (leftovers)
* hex-mm: remove redundant checks for weight align to 128
we always use 2D dma for the weights and align them properly
* hmx-mm: MUL_MAT_ID better work distribution between hvx threads and hmx tracing
* hex-mm: specialize per-token mmid activation handling
* hex-profile: update python scripts to handle kernel-params section in the logging output
* hex-mm: move n_prefetch (aka dma_depth) into kernel params and remove unused fields
* hex-trace: use easier to parse format, simply and fix post-proc scripts
* hmx-mm: relax 32 row limit for output processing which helps utilization
* hmx-mm: use start-chunk idx for tracing info
* hmx-mm: parameterize activation dma pipeline
* hexagon: add support for simple graph caching to avoid recomputing kernel-params
* hex-mm: remove left-over repack functions
* hex-mm: tighten n_prefetch asserts
* hex-mm: remove duplicate round/align_up helper
* hexagon: cleanup common header used in host/npu
* hexagon: update early wakeup threshold
* hmx-mm: define cost constants and update solver to assume that repacked ne[1] is padded to 32
* hmx-mm: make precompute_matmul a bit more readable (split into smaller functions, etc)
* hex-mm: remove n_threads constraint
* hex-mm: minor formatting updates
* hex-mm: remove obsolete profiling logs
* hex-mm: restore hardcode gate to refuse lm-head to avoid repacking that tensor
* snapdragon: update compiler flags to enable all CPU features
* snapdragon: update readme to point to toolchain v0.6
* snapdragon: bump toolchain docker to v0.6
* hexagon: restore HTP_OPMASK_QUEUE
* hexagon: honor OPMASK_SKIP_COMPUTE in hmx-matmul
* hex-prof: restore op profiling
* hex-prof: enable PMU
* hexagon: simplify and improve op-queuing with full profiling support
Add separate profile descriptors.
* hexagon: remove opsync and rename opmask into opstage
opsync is no longer needed since the profiler is fully async now.
opmask name was confusing and opstage is more accurate.
* hexagon: refactor opbatch queue handling
* hexagon: add iface hooks for enabling profiler from the host
Also move all the PMU setup stuff out of the hex-utils since it's not inteded for normal use.
* hexagon: make profiler mode configurable
On older devices getting PMU counters is expensive so it's now optional.
* hexagon: add support for setting profiler pmu events from env
* hexagon: simplify profiler output (no need to print buffs, etc)
* hexagon: simplify pmu counter formating
* hexagon: add a simple profile post-proc tool
* hex-prof: add support for reading logs from stdin
* hexagon: document GGML_HEXAGON_PROFILE
* hex-prof: update default width for dims field
* hex-prof: fix linter warnings and errors
* Update ggml/src/ggml-hexagon/htp/htp-ops.h
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update scripts/snapdragon/ggml-hexagon-profile.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* hexagon: introduce op request batching and rewrite buffer managment
The host now prepares batches of requests and dispatches them via a single dspqueue message.
Buffers are mapped explicitly by NPU while processing batches.
* hex-dma: disable l2 bypass since to work around new issue due to no flushes between Ops
* hex-utils: add explicit l2flush and l2clear helpers
* hex-opreq: use fine-grain per tensor l2 management
* hex-opreq: avoid redundant invalidates for tensors we already flushed
* hex-opreq: update debug messages
* htp-opreq: reuse ops_context
* hex-opreq: do not flush or invalidate cache lines beyond buffer boundry
* hex-opreq: fix errors in log message
* Revert "hex-opreq: do not flush or invalidate cache lines beyond buffer boundry"
This reverts commit 8b7f0a55a750a6430ce4eb1874c7feb3d720056d.
* hexagon: limit l2 flushes to 1MB which covers l2 cache
* hex-opreq: limit cache flush to 4MB
Looks like 4MB cont. vitual space should cover the 1MB cache.
* hexagon: drop cache flush size to 2MB
* hex-opreq: start reworking opreq packing
* hex-opreq: introduce new way of packing opbatch where tensors are stored separately
* hex-opreq: add a simple fastrpc call to force unmap all buffers
* hex-l2flush: somehow 2MB does not seem robust, also cleanup step size to use line-size
* hex-opreq: bump opreq batch size to 256
* hex-mm: place src1 spad at the top of vtcm for easy reuse
* hex-ops: introduce internal types and disable src1 reuse for now
Nothing new just formalizing the repack / qyn.quant types we've been using.
* htp-opreq: use tensor pointers instead of copies
* hex-opreq: introduce more robust way for tracking vtcm/spad reuse
This removes the SKIP_QUANTIZE flag that became fragile with the addition of HMX and other ops.
* hex-cumsum: fix error post opreq merge
* hex-opreq: move request batch handling into the session
Prepping everything for using dspqueue buffers and doing that inside the session is much cleaner.
* hex-mm: yet another fix for src1 reuse when we're mixing hmx/hvx
* hex-bufs: introduce pinned mmapings and use non-pinned ones for model buffers
* hex-buf: add support for allocating shared/pinned buffer for opreqs
* hex-opbatch: make opbatches configurable
* hex-naming: better name for ggml_hexagon_shared_buffer
* hex-naming: add session->c_name() helper
* hex-opbatch: start using shm but still copy for now
* hex-opbatch: use shared buffer for packing opbatch
* hex-opbatch: beter naming for opbatch related classes and code
* hex-opbatch: reuse batched tensors with same data/dims/strides
* hex-opbatch: update logging
* hex-opbatch: add support for vmem limit for op batching
* hex-opbatch: update htp side to properly support dynamic mmap/unmap
* hex-opbatch: add OB and OQ params for run-completion script and fix the asserts in batch processing
* hex-opbatch: fixed src1 handling in act ops
* hex-act: fix empty src1 handling in swiglu and friends
Simplify preamble macro while at it
* hex-mm: minor fix vtcm and dma handling in matmul
cleaning up some left-overs from merges
* hex-opbatch: allocate extra 1KB for dspqueue overhead
* hexagon: fix softmax for non-aligned tensors and cleanup vtcm alloc
* hex-mm: properly handle hmx_disabled flag
* hex-ops: update comments
* hex-ops: add debug output for get/set-rows
* hex-mmap: optimize un/mapping of buffers
* hex-opreq: global cache flush and invalidate beyond 128KB threshold
* hex-ops: add super simple opfilter regex for debugging
If an Op matches the regex hex backend will reject it.
* hex-opbatch: wireup newer ops missed in merge and update main switch to detect this in future
* hexagon: improved vtcm acquision to remove inter-op overhead
Fully compatible with QNN-HTP coex
* hex-mm: fixed hvx fallback path
* hex-mm: lower the vmem threshold a bit further to ~3GB
* hexagon: update debug & error logs
This also fixes an issue with newer llvm merging repack and non-repack
functions. We use those pointer to distinguish between buffer types.
* hexagon: move ops context into main context
Just a cleanup. We don't need separate contexts at this point.
* hex-opbatch: cleanup naming and headers for opbatch and related descriptors
* hex-fa: it's now better to enable FA during TG to reduce graph splits
* hexagon: remove GGML_HEXAGON_EXPERIMENTAL env var
It's no longer useful. Please use more flexible GGML_HEXAGON_OPFILTER to disable Ops
if needed for debugging or validation.
* hexagon: fixed editorconfig check
* Update ggml/src/ggml-hexagon/ggml-hexagon.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>