Description
After the CPU Inference Optimization update (commit 112f853), running inference with i2_s quantization on ARM (aarch64) produces completely incoherent output — random tokens with no relation to the prompt. Rolling back to commit 404980e (the last commit before the optimization merge) restores correct, coherent output.
Environment
- Hardware: Raspberry Pi 5 (8GB RAM), ARM Cortex-A76 (aarch64)
- OS: Raspberry Pi OS 64-bit (Debian 12 Bookworm)
- Compiler: Debian clang version 18.1.8
- CMake: 3.25.1
- Python: 3.9 (conda)
- Model: microsoft/BitNet-b1.58-2B-4T-gguf (ggml-model-i2_s.gguf)
- Quantization: i2_s
Steps to Reproduce
- Clone repo at current HEAD (01eb415):
      git clone --recursive https://github.com/microsoft/BitNet.git
      cd BitNet
- Generate kernels and build (following Adafruit guide):
      python utils/codegen_tl1.py --model bitnet_b1_58-3B --BM 160,320,320 --BK 64,128,64 --bm 32,64,32
      export CC=clang-18 CXX=clang++-18
      rm -rf build && mkdir build && cd build
      cmake .. -DCMAKE_BUILD_TYPE=Release
      make -j$(nproc)
      cd ..
- Download model:
      huggingface-cli download microsoft/BitNet-b1.58-2B-4T-gguf --local-dir models/BitNet-b1.58-2B-4T
- Run inference:
      python run_inference.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -p "You are a helpful assistant" -t 4 -cnv
Broken Output (HEAD - 01eb415)
> hi how are you
ri differentorefFly increase Hurtutar run following section underestimateAD Sachs weighedision
cann RICTS Reyn taskfir-ra mark filtr castWATCHB fr ret flatten missionuche purchase parameter
gramhit associatedyuraft runeded take compound sugar contrast unsubedom conveyuffanford...
Working Output (commit 404980e)
> hi
Hello! How can I assist you today?
> what is a raspi 5
The Raspberry Pi 5 is a next-generation model of the Raspberry Pi single-board computer series...
Performance on the working commit: 9.68 tokens/second (4 threads, ARM NEON).
Bisection
The regression was introduced in commit 112f853:
112f853 [feat] I2S kernels for weight & activation parallel on Intel & ARM machine;
[feat] I2S GEMV & GEMM(llama.cpp);
[feat] quantize activation & dequantize embedding(llama.cpp);
[fix] compile bug: cannot define __ARM_FEATURE_DOTPROD(llama.cpp)
The last known working commit is 404980e (one commit before 112f853).
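For anyone who wants to re-run the bisection automatically, here is a rough sketch of a pass/fail check that could back `git bisect run` (which treats exit code 0 as good and 1 as bad). It is not part of the repo. The stopword list, the 0.2 threshold, and the function names are all assumptions picked so that the broken and working outputs quoted above fall on opposite sides; it is a crude heuristic, not a real quality metric.

```python
import re

# Hypothetical, hand-picked list of very common English words.
# Coherent replies tend to contain many of these; random token soup does not.
STOPWORDS = {
    "the", "a", "an", "is", "are", "i", "you", "how", "can", "to",
    "of", "and", "in", "it", "that", "what", "for", "with", "on", "this",
}


def stopword_ratio(text):
    """Return the fraction of alphabetic words in `text` that are stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(w in STOPWORDS for w in words) / len(words)


def looks_coherent(text, threshold=0.2):
    """Heuristic only: True if enough of the words are common English words."""
    return stopword_ratio(text) >= threshold


def bisect_exit_code(text):
    """Map model output to a `git bisect run` exit code: 0 = good, 1 = bad."""
    return 0 if looks_coherent(text) else 1


# Samples copied from the outputs quoted in this issue.
BROKEN = "ri differentorefFly increase Hurtutar run following section underestimateAD Sachs weighedision"
WORKING = "Hello! How can I assist you today?"
```

A wrapper script would still need to rebuild at each step, capture the model's reply non-interactively, and pass it to `bisect_exit_code`; that part depends on the local setup and is left out here.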
Notes
- The build completes without errors on both commits — the issue is runtime behavior, not compilation.
- ggml-bitnet-mad.cpp is compiled and linked in both cases.
- NEON is detected and enabled (NEON = 1 in system_info output).
- DOTPROD detection: GGML_COMPILER_SUPPORT_DOTPROD - Failed, but COMPILER_SUPPORTS_ARMV82_DOTPROD - Success.
- This issue also appears to affect other ARM64 platforms (Ampere/Hetzner CAX servers), not just Raspberry Pi.
- The Adafruit BitNet on Raspberry Pi guide (published Sept 2025, before the optimization commit) confirms working output on Pi 4 and Pi 5 with the older codebase.
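Since the two DOTPROD checks disagree at configure time, it may help others reporting on ARM64 to confirm what the kernel itself says about the CPU. Below is a small sketch, assuming a Linux aarch64 system where `/proc/cpuinfo` has a `Features` line; on such systems the dot product extension is listed as `asimddp` and NEON as `asimd`. The function names and the sample text are illustrative, not taken from the BitNet code.

```python
def parse_cpu_features(cpuinfo_text):
    """Collect every flag listed on the 'Features' lines of /proc/cpuinfo text."""
    features = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "features":
            features.update(value.split())
    return features


def has_dotprod(cpuinfo_text):
    """True if the kernel reports the Advanced SIMD dot product extension."""
    return "asimddp" in parse_cpu_features(cpuinfo_text)


def has_neon(cpuinfo_text):
    """True if the kernel reports Advanced SIMD (NEON)."""
    return "asimd" in parse_cpu_features(cpuinfo_text)


def read_local_cpuinfo(path="/proc/cpuinfo"):
    """Read the local cpuinfo file (Linux only); not called automatically."""
    with open(path, "r", encoding="utf-8") as handle:
        return handle.read()


# Made-up sample for illustration, not captured from a real Pi 5.
SAMPLE = "processor\t: 0\nFeatures\t: fp asimd crc32 asimddp\n"
```

On an affected machine, `has_dotprod(read_local_cpuinfo())` reports whether the hardware advertises dot product support, which can be compared against the configure-time results listed above.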
Related to #411 — same root cause. Adding Pi 5 (Cortex-A76 with dotprod) as another confirmed affected platform.