Conversation
Force-pushed 502532a to 23c10e8
Tested on macOS using `CMAKE_ARGS="-DGGML_METAL=on" pip3.14 install --force-reinstall --no-cache-dir "llama-cpp-python @ git+https://github.com/avion23/llama-cpp-python.git@update-llama-cpp-2026-01" --break-system-packages`
This will need at least one more (very important) change, as the layout of `mtmd_context_params` is now:

```python
class mtmd_context_params(Structure):
    _fields_ = [
        ("use_gpu", c_bool),
        ("print_timings", c_bool),
        ("n_threads", c_int),
        ("image_marker", c_char_p),
        ("media_marker", c_char_p),
        ("flash_attn_type", c_int),
        ("warmup", c_bool),
        ("image_min_tokens", c_int),
        ("image_max_tokens", c_int),
    ]
```
More changes needed as the layout of

Also the
Force-pushed 23c10e8 to a070f61
```python
@ctypes_function("llama_max_tensor_buft_overrides", [], ctypes.c_size_t)
def llama_max_tensor_buft_overrides() -> int:
    """Get maximum number of tensor buffer type overrides"""
    ...
```
Stray ellipsis operator (which does nothing, but still)
Sorry! The issue here isn't the ellipsis operator; it's that at some point there were two of them. You shouldn't change this to `pass`, because that implies the function returns `None`, which will cause type checking to fail.
Thank you, I understand now. Done.
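For context, here is a simplified sketch of the stub pattern being discussed (this is not the project's actual `ctypes_function` helper; `ctypes_function_sketch` and `libc_abs` are hypothetical names, and it binds libc's `abs` purely for illustration). The decorator returns the real ctypes callable, so the annotated stub keeps a bare `...` body; a `pass` body would make type checkers infer an implicit `None` return against the declared `-> int`.

```python
# Simplified, hypothetical sketch of a ctypes-binding decorator.
import ctypes
import ctypes.util
from typing import Any, Callable, List

_libc = ctypes.CDLL(ctypes.util.find_library("c"))

def ctypes_function_sketch(name: str, argtypes: List[Any], restype: Any) -> Callable:
    def decorator(stub: Callable) -> Callable:
        fn = getattr(_libc, name)
        fn.argtypes = argtypes
        fn.restype = restype
        return fn  # the decorated stub (and its `...` body) is never executed
    return decorator

@ctypes_function_sketch("abs", [ctypes.c_int], ctypes.c_int)
def libc_abs(x: int) -> int:
    """Absolute value via libc; the body is only a typing stub."""
    ...

print(libc_abs(-7))  # 7
```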
Force-pushed 5042296 to d14a24f
@dhdaines thanks for the review. I need some time to incorporate your comments, so I'm setting this to draft in the meantime.
Force-pushed 64b087c to 3ffec02
I think I have fixed this, could you check?
Force-pushed 6dbddac to 39a2ee8
Yes, this looks to me like a good way to handle it! We can see what the maintainer @abetlen thinks, though...
Force-pushed 39a2ee8 to 103f671
My intention was to sweep in like a hero and save the day. Didn't work as planned :/ I've rewritten the PR; it has much less whitespace noise and is cleaner. All review comments are incorporated.
Thank you so much, avion23, for your efforts to update the Python bindings to a recent llama.cpp version! I'm trying to use them in a Jupyter notebook (in Docker) on an Nvidia 5090 GPU. Although the latest locally built llama-cli runs in that same environment (see attached llama-cli.txt) and the problems discussed above are gone, the freshly built bindings produce a kernel crash when loading models onto the GPU (after loading weights to the GPU, maybe a context issue, see attached build.txt). I'm pretty sure it could be my mistake when installing your branch for GPU support: any ideas what I did wrong?

Edit (new findings): The above GPU build works with n_gpu_layers=0 (CPU only). This narrows the problem down to context handling in the GPU context code path.
@abetlen Thank you for your work! Please keep this repo alive and merge avion23's updates into the main branch.
Force-pushed e351642 to 235a3d4
@oss-roettger thank you for all the testing. I found a bug with
Force-pushed 235a3d4 to 17aae47
@avion23 once again, respect for your dedication. I have tested the new version after building it with `!CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=86";pip install --force-reinstall --upgrade git+https://github.com/avion23/llama-cpp-python@update-llama-cpp-2026-01`

Good news first: the update runs out of the box (with all values for flash_attn in the constructor: flash_attn=None/True/False/not present).

But: I think I have discovered an additional issue (with and without your latest update; on GPU and CPU): https://huggingface.co/ggml-org/Nemotron-Nano-3-30B-A3B-GGUF produces a KV cache issue on the second dialog turn. I guess there is a cache initialization parameter missing in the Python bindings, since the llama-cli command of the same llama build (same libllama.so) works on multi-turn dialogs with the same Nemotron model. (See Llama_test.txt for minimum code to reproduce the error.)

Edit: Log added.
Thank you. Could you attach the log as before? Then it'll be quicker for me and you'll get a nice solution. Let's just pray this gets merged; I don't want to maintain my fork.
👍 Model loading log.txt attached above. Error log in Llama_test.txt.
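For readers without the attachment, a minimal multi-turn repro in the spirit of Llama_test.txt might look like the sketch below. This is not a copy of the attached script; the model path, n_ctx, and prompts are assumptions, and it runs CPU-only so the GPU code path is not a factor. The reported failure appears on the second `create_chat_completion` call, when the KV cache from the first turn is reused.

```python
# Hypothetical repro sketch: two chat turns against a locally downloaded Nemotron GGUF.
from llama_cpp import Llama

llm = Llama(
    model_path="./Nemotron-Nano-3-30B-A3B-Q4_K_M.gguf",  # assumed local path
    n_ctx=4096,
    n_gpu_layers=0,  # CPU only
)

messages = [{"role": "user", "content": "Name three prime numbers."}]
first = llm.create_chat_completion(messages=messages, max_tokens=64)
messages.append(first["choices"][0]["message"])
messages.append({"role": "user", "content": "Now square each of them."})

# Second turn: this is where the KV cache / "discontinuity" error was reported.
second = llm.create_chat_completion(messages=messages, max_tokens=64)
print(second["choices"][0]["message"]["content"])
```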
- Update vendor/llama.cpp submodule to commit be47fb92 (2026-01-01)
- Bump version 0.3.16 -> 0.4.0

Critical fixes:
- Remove phantom flash_attn field from llama_context_params (caused segfaults)
- Add 3 missing params to llama_params_fit (margin, n_ctx_min, log_level)
- Migrate flash_attn bool -> flash_attn_type enum (BREAKING CHANGE)
- Add flash_attn_type to TYPE_CHECKING block
- Fix test: use flash_attn_type instead of removed flash_attn field
- FIX CRITICAL: kv_cache_seq_rm must preserve seq_id=-1 semantics (all sequences)
  * The wrapper was incorrectly converting -1 to 0, breaking context rewind
  * This caused 'discontinuity' errors on multi-turn conversations

API changes:
- flash_attn: bool field REMOVED from structs
- flash_attn_type: int enum ADDED (AUTO=-1, DISABLED=0, ENABLED=1)
- High-level API maintains backward compatibility via wrapper
- Server default changed: flash_attn=False -> flash_attn=None (AUTO mode)

New features:
- 20+ new functions (memory API, state management, samplers, vocab queries)
- 5 new enums (flash_attn_type, params_fit_status, model_meta_key, etc.)
- 6 new struct fields across llama_model_params, llama_context_params, mtmd_context_params

Deprecated removals:
- 11 llama_kv_self_* functions (replaced by llama_memory_*)
- llama_sampler_init_softmax
- verbosity field from mtmd_context_params
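A hedged sketch of the seq_id=-1 point above, assuming the low-level binding `llama_cpp.llama_memory_seq_rm` mirrors the C signature `llama_memory_seq_rm(mem, seq_id, p0, p1)` (the Python-side names here are illustrative, not the PR's exact wrapper). The essential property is that the -1 sentinels pass through unchanged.

```python
# Sketch of the corrected wrapper behaviour: -1 means "all sequences" for seq_id
# and "open-ended" for p0/p1, so none of them may be clamped before the C call.
import llama_cpp

def kv_cache_seq_rm(mem, seq_id: int = -1, p0: int = -1, p1: int = -1) -> bool:
    """Remove cached tokens in [p0, p1) for seq_id; -1 selects everything."""
    # Buggy old behaviour (do NOT do this): seq_id = max(seq_id, 0)
    return llama_cpp.llama_memory_seq_rm(mem, seq_id, p0, p1)
```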
Force-pushed 17aae47 to 831dbe5
I had to map
Critical Bugs Fixed (7 total)

Validation Completed

Testing Request

@oss-roettger - Could you retest the Nemotron multi-turn conversation scenario with the latest commit (831dbe5)? The seq_id=-1 fix should resolve the "discontinuity" error you encountered. The fix specifically addresses this error:
Sorry @avion23, the multi-turn conversation cache issue still persists with Nemotron-Nano-3-30B-A3B ❌ https://huggingface.co/unsloth/Nemotron-3-Nano-30B-A3B-GGUF/blob/main/Nemotron-3-Nano-30B-A3B-Q4_K_M.gguf

Although these two models work in multi-turn conversation with llama-cli (of the same build, with the same libllama.so, log in test2026-01-13.txt), they fail with the Python bindings on an RTX 5090 (even when forced onto the CPU by n_gpu_layers=0):

BTW: no problems with these models:
For my part, it still works well with my PR (#2109) with these models:
I found a bug with recurrent models like Mamba. I am working on it; it needs some time. Thanks again for all the testing.
After external code review (GPT-5.2), fixed 4 critical issues:

1. CRITICAL: Fixed tokens[:-1] bug in prefix matching
   - Was silently breaking prefix matching for ALL models
   - Caused false rewind detection and cache inefficiency
   - Impact: Transformers AND recurrent models
2. CRITICAL: Implement proper reset() for recurrent models
   - Now actually clears llama_memory backend state
   - Root cause fix for 'sequence positions not consecutive' crash
   - Without this, reset was a no-op for recurrent models
3. CRITICAL: Enforce strict append policy for recurrent models
   - Prevents KV cache rewinding that's impossible without state snapshots
   - Forces full reset on history edits instead of crashing
4. Performance: Cache _is_recurrent to avoid repeated FFI calls
5. Documentation: Simplified comments and updated docstring
6. Testing: All existing tests pass + Mistral-Small-3.2-24B validated

Resolves multi-turn crashes for Nemotron-A3B, Mamba, RWKV, Jamba models.

Reviewed-by: GPT-5.2 (OpenAI)
Tested-by: pytest + Mistral-Small-3.2-24B
Fixes: abetlen#2108 (recurrent model crashes)
Compatible-with: abetlen#2109 (Granite-Docling/SmolVLM special tokens)
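To make point 1 concrete, here is a small sketch of the prefix-matching step (illustrative only; `common_prefix_length` is a hypothetical helper, not the PR's internal name). The cache keeps the token ids of the previous evaluation, and only the non-matching suffix of the new prompt needs re-evaluation; comparing the new prompt against `tokens[:-1]` instead of the full cached list shortens every match by one token and makes an unchanged history look like a rewind.

```python
# Hypothetical helper: length of the shared token prefix between the cached
# prompt and the new prompt.
from typing import Sequence

def common_prefix_length(cached: Sequence[int], new: Sequence[int]) -> int:
    n = 0
    for a, b in zip(cached, new):  # compare the full cached list, not cached[:-1]
        if a != b:
            break
        n += 1
    return n

cached_tokens = [1, 15, 7, 42, 9]
new_tokens = [1, 15, 7, 42, 9, 3, 8]  # same history plus a new user turn
assert common_prefix_length(cached_tokens, new_tokens) == 5  # only 2 new tokens to evaluate
```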
I've implemented a fix for recurrent/hybrid models (Nemotron-A3B, Mamba, RWKV, Jamba). It's a bit of scope creep, though; the diff is becoming huge even though I only adapted some C bindings.
👍 @avion23: Thank you very much on behalf of the community! Tests successfully passed with Nemotron-3-Nano-30B-A3B on an RTX 5090 (CUDA). ✅ https://huggingface.co/unsloth/Nemotron-3-Nano-30B-A3B-GGUF/blob/main/Nemotron-3-Nano-30B-A3B-Q4_K_M.gguf

Also successfully retested on the RTX 5090 (CUDA): ✅ https://huggingface.co/bartowski/openai_gpt-oss-20b-GGUF?show_file_info=openai_gpt-oss-20b-Q4_K_M.gguf

🙋 @abetlen: Please appreciate the extensive work of @avion23 and merge it into the main branch. Thank you!
Thank you for retesting this. I am using it daily on an Apple M4 Max and it's working well enough. @abetlen, is there something I can improve so you can merge this with a good conscience?
I vouch for this, thanks @avion23. Saved my bacon on dottxt-ai/outlines#1812. I will test a few more models and will report back if anything pops up.
@avion23 Do you have a notebook for testing? I can't seem to run Nemotron yet, but it's most probably a mistake on my end.
@bartwesthoff-fyrm The PR is stable, but Nemotron is a tricky model (hybrid architecture) that requires specific initialization parameters to run correctly: n_batch=512, n_ubatch=512, flash_attn=True. A vibe-coded snippet is attached: test_pr_2108.py. This PR might never be merged; the project seems to be abandonware. Have a look at https://github.com/TheBigEye/guanaco-py
@avion23 It worked perfectly with the new repository. Thank you for your helpful responses in this PR.
Bindings were 5 months outdated, preventing newer model architectures from loading.
Updates bindings to llama.cpp commit be47fb92 (2026-01-01).
Removed

- `llama_kv_self_*` functions (use the `llama_memory_*` API)
- `llama_sampler_init_softmax()`

Added

Enums:

- `LLAMA_ROPE_TYPE_IMROPE`
- `llama_flash_attn_type`
- `llama_params_fit_status`
- `llama_model_meta_key`

Struct fields:

- `llama_model_params`: `no_host`, `no_alloc`
- `llama_context_params`: `flash_attn_type` (replaced `flash_attn` bool)

Functions:

- `llama_max_tensor_buft_overrides`, `llama_n_ctx_seq`, `llama_model_n_embd_inp`, `llama_model_is_hybrid`, `llama_flash_attn_type_name`, `llama_model_meta_key_str`, `llama_adapter_meta_*` (5 functions), `llama_log_get`, `llama_log_set`, `llama_memory_breakdown_print`

Breaking Changes

- flash_attn parameter: (see the migration sketch after this description)
- KV cache API:

Other

- `ggml_log_callback` typedef
- `LLAMA_INSTALL_VERSION` (before subdirectory include)

Tested: macOS ARM64 Metal, Python 3.14, Nemotron-3-Nano-30B
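A hedged migration sketch for the flash_attn breaking change, based on the enum values spelled out in the commit message above (AUTO=-1, DISABLED=0, ENABLED=1). The raw integers are used because the Python-side constant spellings are an assumption here; the high-level flash_attn keyword keeps working through the compatibility wrapper.

```python
# Low-level context params: the boolean field is gone, the enum replaces it.
import llama_cpp

params = llama_cpp.llama_context_default_params()
# Before this PR: params.flash_attn = True   (boolean field, now removed)
params.flash_attn_type = 1                   # 1 = enabled, 0 = disabled, -1 = auto

# High-level API: flash_attn=True/False/None is still accepted and mapped
# onto flash_attn_type internally (None selects AUTO).
```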