Conversation
Force-pushed 502532a to 23c10e8
Tested on macOS using `CMAKE_ARGS="-DGGML_METAL=on" pip3.14 install --force-reinstall --no-cache-dir "llama-cpp-python @ git+https://github.com/avion23/llama-cpp-python.git@update-llama-cpp-2026-01" --break-system-packages`
This will need at least one more (very important) change, as the layout of `mtmd_context_params` is now:

```python
class mtmd_context_params(Structure):
    _fields_ = [
        ("use_gpu", c_bool),
        ("print_timings", c_bool),
        ("n_threads", c_int),
        ("image_marker", c_char_p),
        ("media_marker", c_char_p),
        ("flash_attn_type", c_int),
        ("warmup", c_bool),
        ("image_min_tokens", c_int),
        ("image_max_tokens", c_int),
    ]
```
More changes needed as the layout of

Also the
Force-pushed 23c10e8 to a070f61
```python
@ctypes_function("llama_max_tensor_buft_overrides", [], ctypes.c_size_t)
def llama_max_tensor_buft_overrides() -> int:
    """Get maximum number of tensor buffer type overrides"""
    ...
```
Stray ellipsis operator (which does nothing, but still)
Sorry! The issue here isn't the ellipsis operator; it's that at some point there were two of them. You shouldn't change this to `pass`, because that implies the function returns `None`, which will cause type checking to fail.
Thank you, I understand now. Done.
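For context, here is a simplified sketch of the stub pattern being discussed (this is not the project's actual `ctypes_function` helper; `ctypes_function_sketch` and `libc_abs` are hypothetical names, and it binds libc's `abs` purely for illustration). The decorator returns the real ctypes callable, so the annotated stub keeps a bare `...` body; a `pass` body would make type checkers infer an implicit `None` return against the declared `-> int`.

```python
# Simplified, hypothetical sketch of a ctypes-binding decorator.
import ctypes
import ctypes.util
from typing import Any, Callable, List

_libc = ctypes.CDLL(ctypes.util.find_library("c"))

def ctypes_function_sketch(name: str, argtypes: List[Any], restype: Any) -> Callable:
    def decorator(stub: Callable) -> Callable:
        fn = getattr(_libc, name)
        fn.argtypes = argtypes
        fn.restype = restype
        return fn  # the decorated stub (and its `...` body) is never executed
    return decorator

@ctypes_function_sketch("abs", [ctypes.c_int], ctypes.c_int)
def libc_abs(x: int) -> int:
    """Absolute value via libc; the body is only a typing stub."""
    ...

print(libc_abs(-7))  # 7
```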
Force-pushed 5042296 to d14a24f
@dhdaines thanks for the review. I need some time to incorporate your comments, so I'm setting this to draft in the meantime.
Force-pushed 64b087c to 3ffec02
I think I have fixed this, could you check?
Force-pushed 6dbddac to 39a2ee8
Yes, this looks to me like a good way to handle it! We can see what the maintainer @abetlen thinks, though...
Force-pushed 39a2ee8 to 103f671
My intention was to sweep in like a hero and save the day. Didn't work as planned :/ I've rewritten the PR; it has much less whitespace noise and is cleaner. All review comments are incorporated.
Thank you so much, avion23, for your efforts to update the Python bindings to a recent llama.cpp version! I'm trying to use them in a Jupyter notebook (in Docker) on an Nvidia 5090 GPU. Although the latest locally built llama-cli runs in that same environment (see attached llama-cli.txt) and the problems discussed above are gone, the freshly built bindings produce a kernel crash when loading models onto the GPU (after loading weights to the GPU, maybe a context issue, see attached build.txt). I'm pretty sure it could be my mistake when installing your branch for GPU support: any ideas what I did wrong?

Edit (new findings): The above GPU build works with n_gpu_layers=0 (CPU only). This narrows the problem down to context handling in the GPU context code path.
@abetlen Thank you for your work! Please keep this repo alive and merge avion23's updates into the main branch.
Force-pushed e351642 to 235a3d4
@oss-roettger thank you for all the testing. I found a bug with
Force-pushed 235a3d4 to 17aae47
@avion23 once again, respect for your dedication. I have tested the new version after building it with `!CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=86";pip install --force-reinstall --upgrade git+https://github.com/avion23/llama-cpp-python@update-llama-cpp-2026-01`

Good news first: the update runs out of the box (with all values for flash_attn in the constructor: flash_attn=None/True/False/not present).

But: I think I have discovered an additional issue (with and without your latest update; on GPU and CPU): https://huggingface.co/ggml-org/Nemotron-Nano-3-30B-A3B-GGUF produces a KV cache issue on the second dialog turn. I guess there is a cache initialization parameter missing in the Python bindings, since the llama-cli command of the same llama build (same libllama.so) works on multi-turn dialogs with the same Nemotron model. (See Llama_test.txt for minimum code to reproduce the error.)

Edit: Log added.
Thank you. Could you attach the log as before? Then it'll be quicker for me and you'll get a nice solution. Let's just pray this gets merged; I don't want to maintain my fork.
👍 Model loading log.txt attached above. Error log in Llama_test.txt.
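For readers without the attachment, a minimal multi-turn repro in the spirit of Llama_test.txt might look like the sketch below. This is not a copy of the attached script; the model path, n_ctx, and prompts are assumptions, and it runs CPU-only so the GPU code path is not a factor. The reported failure appears on the second `create_chat_completion` call, when the KV cache from the first turn is reused.

```python
# Hypothetical repro sketch: two chat turns against a locally downloaded Nemotron GGUF.
from llama_cpp import Llama

llm = Llama(
    model_path="./Nemotron-Nano-3-30B-A3B-Q4_K_M.gguf",  # assumed local path
    n_ctx=4096,
    n_gpu_layers=0,  # CPU only
)

messages = [{"role": "user", "content": "Name three prime numbers."}]
first = llm.create_chat_completion(messages=messages, max_tokens=64)
messages.append(first["choices"][0]["message"])
messages.append({"role": "user", "content": "Now square each of them."})

# Second turn: this is where the KV cache / "discontinuity" error was reported.
second = llm.create_chat_completion(messages=messages, max_tokens=64)
print(second["choices"][0]["message"]["content"])
```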
- Update vendor/llama.cpp submodule to commit be47fb92 (2026-01-01)
- Bump version 0.3.16 -> 0.4.0

Critical fixes:
- Remove phantom flash_attn field from llama_context_params (caused segfaults)
- Add 3 missing params to llama_params_fit (margin, n_ctx_min, log_level)
- Migrate flash_attn bool -> flash_attn_type enum (BREAKING CHANGE)
- Add flash_attn_type to TYPE_CHECKING block
- Fix test: use flash_attn_type instead of removed flash_attn field
- FIX CRITICAL: kv_cache_seq_rm must preserve seq_id=-1 semantics (all sequences)
  * The wrapper was incorrectly converting -1 to 0, breaking context rewind
  * This caused 'discontinuity' errors on multi-turn conversations

API changes:
- flash_attn: bool field REMOVED from structs
- flash_attn_type: int enum ADDED (AUTO=-1, DISABLED=0, ENABLED=1)
- High-level API maintains backward compatibility via wrapper
- Server default changed: flash_attn=False -> flash_attn=None (AUTO mode)

New features:
- 20+ new functions (memory API, state management, samplers, vocab queries)
- 5 new enums (flash_attn_type, params_fit_status, model_meta_key, etc.)
- 6 new struct fields across llama_model_params, llama_context_params, mtmd_context_params

Deprecated removals:
- 11 llama_kv_self_* functions (replaced by llama_memory_*)
- llama_sampler_init_softmax
- verbosity field from mtmd_context_params
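A hedged sketch of the seq_id=-1 point above, assuming the low-level binding `llama_cpp.llama_memory_seq_rm` mirrors the C signature `llama_memory_seq_rm(mem, seq_id, p0, p1)` (the Python-side names here are illustrative, not the PR's exact wrapper). The essential property is that the -1 sentinels pass through unchanged.

```python
# Sketch of the corrected wrapper behaviour: -1 means "all sequences" for seq_id
# and "open-ended" for p0/p1, so none of them may be clamped before the C call.
import llama_cpp

def kv_cache_seq_rm(mem, seq_id: int = -1, p0: int = -1, p1: int = -1) -> bool:
    """Remove cached tokens in [p0, p1) for seq_id; -1 selects everything."""
    # Buggy old behaviour (do NOT do this): seq_id = max(seq_id, 0)
    return llama_cpp.llama_memory_seq_rm(mem, seq_id, p0, p1)
```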
Force-pushed 17aae47 to 831dbe5
I had to map
Critical Bugs Fixed (7 total)

Validation Completed

Testing Request

@oss-roettger - Could you retest the Nemotron multi-turn conversation scenario with the latest commit (831dbe5)? The seq_id=-1 fix should resolve the "discontinuity" error you encountered. The fix specifically addresses this error:
Sorry @avion23, the multi-turn conversation cache issue still persists with Nemotron-Nano-3-30B-A3B ❌ https://huggingface.co/unsloth/Nemotron-3-Nano-30B-A3B-GGUF/blob/main/Nemotron-3-Nano-30B-A3B-Q4_K_M.gguf

Although these two models work in multi-turn conversation with llama-cli (of the same build, with the same libllama.so, log in test2026-01-13.txt), they fail with the Python bindings on an RTX 5090 (even when forced onto the CPU by n_gpu_layers=0):

BTW: no problems with these models:
For my part, it still works well with my PR (#2109) with these models:
I found a bug with recurrent models like Mamba. I am working on it; it needs some time. Thanks again for all the testing.
After external code review (GPT-5.2), fixed 4 critical issues:

1. CRITICAL: Fixed tokens[:-1] bug in prefix matching
   - Was silently breaking prefix matching for ALL models
   - Caused false rewind detection and cache inefficiency
   - Impact: Transformers AND recurrent models
2. CRITICAL: Implement proper reset() for recurrent models
   - Now actually clears llama_memory backend state
   - Root cause fix for 'sequence positions not consecutive' crash
   - Without this, reset was a no-op for recurrent models
3. CRITICAL: Enforce strict append policy for recurrent models
   - Prevents KV cache rewinding that's impossible without state snapshots
   - Forces full reset on history edits instead of crashing
4. Performance: Cache _is_recurrent to avoid repeated FFI calls
5. Documentation: Simplified comments and updated docstring
6. Testing: All existing tests pass + Mistral-Small-3.2-24B validated

Resolves multi-turn crashes for Nemotron-A3B, Mamba, RWKV, Jamba models.

Reviewed-by: GPT-5.2 (OpenAI)
Tested-by: pytest + Mistral-Small-3.2-24B
Fixes: abetlen#2108 (recurrent model crashes)
Compatible-with: abetlen#2109 (Granite-Docling/SmolVLM special tokens)
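To make point 1 concrete, here is a small sketch of the prefix-matching step (illustrative only; `common_prefix_length` is a hypothetical helper, not the PR's internal name). The cache keeps the token ids of the previous evaluation, and only the non-matching suffix of the new prompt needs re-evaluation; comparing the new prompt against `tokens[:-1]` instead of the full cached list shortens every match by one token and makes an unchanged history look like a rewind.

```python
# Hypothetical helper: length of the shared token prefix between the cached
# prompt and the new prompt.
from typing import Sequence

def common_prefix_length(cached: Sequence[int], new: Sequence[int]) -> int:
    n = 0
    for a, b in zip(cached, new):  # compare the full cached list, not cached[:-1]
        if a != b:
            break
        n += 1
    return n

cached_tokens = [1, 15, 7, 42, 9]
new_tokens = [1, 15, 7, 42, 9, 3, 8]  # same history plus a new user turn
assert common_prefix_length(cached_tokens, new_tokens) == 5  # only 2 new tokens to evaluate
```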
I've implemented a fix for recurrent/hybrid models (Nemotron-A3B, Mamba, RWKV, Jamba). It's a bit of scope creep, though; the diff is becoming huge even though I only adapted some C bindings.
👍 @avion23: Thank you very much on behalf of the community! Tests successfully passed with Nemotron-3-Nano-30B-A3B on an RTX 5090 (CUDA). ✅ https://huggingface.co/unsloth/Nemotron-3-Nano-30B-A3B-GGUF/blob/main/Nemotron-3-Nano-30B-A3B-Q4_K_M.gguf

Also successfully retested on the RTX 5090 (CUDA): ✅ https://huggingface.co/bartowski/openai_gpt-oss-20b-GGUF?show_file_info=openai_gpt-oss-20b-Q4_K_M.gguf

🙋 @abetlen: Please appreciate the extensive work of @avion23 and merge it into the main branch. Thank you!
Thank you for retesting this. I am using it daily on an Apple M4 Max and it's working well enough. @abetlen, is there something I can improve so you can merge this with a good conscience?
I vouch for this, thanks @avion23. Saved my bacon on dottxt-ai/outlines#1812. I will test a few more models and will report back if anything pops up.
@avion23 Do you have a notebook for testing? I can't seem to run Nemotron yet, but it's most probably a mistake on my end.
@bartwesthoff-fyrm The PR is stable, but Nemotron is a tricky model (hybrid architecture) that requires specific initialization parameters to run correctly: n_batch=512, n_ubatch=512, flash_attn=True. A vibe-coded snippet is attached: test_pr_2108.py. This PR might never be merged; the project seems to be abandonware. Have a look at https://github.com/TheBigEye/guanaco-py
@avion23 It worked perfectly with the new repository. Thank you for your helpful responses in this PR.
Bindings were 5 months outdated, preventing newer model architectures from loading.
Updates bindings to llama.cpp commit be47fb92 (2026-01-01).
Removed

- `llama_kv_self_*` functions (use the `llama_memory_*` API)
- `llama_sampler_init_softmax()`

Added

Enums:

- `LLAMA_ROPE_TYPE_IMROPE`
- `llama_flash_attn_type`
- `llama_params_fit_status`
- `llama_model_meta_key`

Struct fields:

- `llama_model_params`: `no_host`, `no_alloc`
- `llama_context_params`: `flash_attn_type` (replaced `flash_attn` bool)

Functions:

- `llama_max_tensor_buft_overrides`, `llama_n_ctx_seq`, `llama_model_n_embd_inp`, `llama_model_is_hybrid`, `llama_flash_attn_type_name`, `llama_model_meta_key_str`, `llama_adapter_meta_*` (5 functions), `llama_log_get`, `llama_log_set`, `llama_memory_breakdown_print`

Breaking Changes

- flash_attn parameter: (see the migration sketch after this description)
- KV cache API:

Other

- `ggml_log_callback` typedef
- `LLAMA_INSTALL_VERSION` (before subdirectory include)

Tested: macOS ARM64 Metal, Python 3.14, Nemotron-3-Nano-30B
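A hedged migration sketch for the flash_attn breaking change, based on the enum values spelled out in the commit message above (AUTO=-1, DISABLED=0, ENABLED=1). The raw integers are used because the Python-side constant spellings are an assumption here; the high-level flash_attn keyword keeps working through the compatibility wrapper.

```python
# Low-level context params: the boolean field is gone, the enum replaces it.
import llama_cpp

params = llama_cpp.llama_context_default_params()
# Before this PR: params.flash_attn = True   (boolean field, now removed)
params.flash_attn_type = 1                   # 1 = enabled, 0 = disabled, -1 = auto

# High-level API: flash_attn=True/False/None is still accepted and mapped
# onto flash_attn_type internally (None selects AUTO).
```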