ARM I2_S inference produces gibberish/garbage output after commit 112f853 (CPU Optimization update) #470

@aqn96

Description


After the CPU Inference Optimization update (commit 112f853), running inference with i2_s quantization on ARM (aarch64) produces completely incoherent output — random tokens with no relation to the prompt. Rolling back to commit 404980e (the last commit before the optimization merge) restores correct, coherent output.

Environment

  • Hardware: Raspberry Pi 5 (8GB RAM), ARM Cortex-A76 (aarch64)
  • OS: Raspberry Pi OS 64-bit (Debian 12 Bookworm)
  • Compiler: Debian clang version 18.1.8
  • CMake: 3.25.1
  • Python: 3.9 (conda)
  • Model: microsoft/BitNet-b1.58-2B-4T-gguf (ggml-model-i2_s.gguf)
  • Quantization: i2_s

Steps to Reproduce

  1. Clone repo at current HEAD (01eb415):

    git clone --recursive https://github.com/microsoft/BitNet.git
    cd BitNet
    
  2. Generate kernels and build (following the Adafruit guide's parameters):

    python utils/codegen_tl1.py --model bitnet_b1_58-3B --BM 160,320,320 --BK 64,128,64 --bm 32,64,32
    export CC=clang-18 CXX=clang++-18
    rm -rf build && mkdir build && cd build
    cmake .. -DCMAKE_BUILD_TYPE=Release
    make -j$(nproc)
    cd ..
    
  3. Download model:

    huggingface-cli download microsoft/BitNet-b1.58-2B-4T-gguf --local-dir models/BitNet-b1.58-2B-4T
    
  4. Run inference:

    python run_inference.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -p "You are a helpful assistant" -t 4 -cnv
    

Broken Output (HEAD - 01eb415)

> hi how are you
ri differentorefFly increase Hurtutar run following section underestimateAD Sachs weighedision
cann RICTS Reyn taskfir-ra mark filtr castWATCHB fr ret flatten missionuche purchase parameter
gramhit associatedyuraft runeded take compound sugar contrast unsubedom conveyuffanford...

Working Output (commit 404980e)

> hi
Hello! How can I assist you today?

> what is a raspi 5
The Raspberry Pi 5 is a next-generation model of the Raspberry Pi single-board computer series...

Performance on the working commit: 9.68 tokens/second (4 threads, ARM NEON).

Bisection

The regression was introduced in commit 112f853:

112f853 [feat] I2S kernels for weight & activation parallel on Intel & ARM machine;
        [feat] I2S GEMV & GEMM(llama.cpp);
        [feat] quantize activation & dequantize embedding(llama.cpp);
        [fix] compile bug: cannot define __ARM_FEATURE_DOTPROD(llama.cpp)

The last known working commit is 404980e (one commit before 112f853).

Notes

  • The build completes without errors on both commits — the issue is runtime behavior, not compilation.
  • ggml-bitnet-mad.cpp is compiled and linked in both cases.
  • NEON is detected and enabled (NEON = 1 in system_info output).
  • DOTPROD detection: GGML_COMPILER_SUPPORT_DOTPROD - Failed, but COMPILER_SUPPORTS_ARMV82_DOTPROD - Success.
  • This issue also appears to affect other ARM64 platforms (Ampere/Hetzner CAX servers), not just Raspberry Pi.
  • The Adafruit BitNet on Raspberry Pi guide (published Sept 2025, before the optimization commit) confirms working output on Pi 4 and Pi 5 with the older codebase.
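Regarding the DOTPROD note above: as a quick sanity check (not part of the original repro, and only meaningful on aarch64 Linux), you can confirm whether the kernel advertises the dot-product extension that the optimized i2_s kernels may dispatch on. The Cortex-A76 in the Pi 5 should list "asimddp":

```shell
# "asimddp" in /proc/cpuinfo Features means the ARMv8.2 dot-product
# instructions are available; on other platforms this falls through
# to the message instead.
grep -m1 -o 'asimddp' /proc/cpuinfo 2>/dev/null || echo "asimddp not advertised"
```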

Related to #411 — same root cause. Adding Pi 5 (Cortex-A76 with dotprod) as another confirmed affected platform.
