BitNet is a Microsoft project centered on 1-bit LLMs. The BitNet repo and technical report cover these models in depth. Their main benefit is lower CPU and energy usage, both helpful traits when running on a single-board computer (SBC) like the Raspberry Pi. bitnet.cpp is the official inference framework for 1-bit LLMs (e.g., BitNet b1.58). It offers a suite of optimized kernels that support fast and lossless inference of 1.58-bit models on CPU and GPU (NPU support is coming next). An inference framework is simply a tool for running LLMs locally, and this one is built specifically for 1-bit LLMs.
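For context on the "1.58-bit" name: BitNet b1.58 restricts each weight to one of three values (-1, 0, or +1), and a three-way choice carries log2(3), or about 1.58, bits of information. The short sketch below shows that arithmetic along with a toy rounding step. The `ternarize` helper is hypothetical and only illustrates the idea of scaling by the mean absolute value and rounding; it is not the code bitnet.cpp uses.

```python
import math

# Three possible weight values (-1, 0, +1) carry log2(3) bits each.
bits_per_weight = math.log2(3)
print(f"{bits_per_weight:.2f} bits per ternary weight")


def ternarize(weights):
    """Toy illustration: scale by mean |w|, then round and clip to {-1, 0, 1}."""
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    return [max(-1, min(1, round(w / scale))) for w in weights]


print(ternarize([0.9, -0.05, -1.3, 0.4]))
```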
The installation process for bitnet.cpp is rather involved. It must be built from source using Clang, and a handful of prerequisites have to be in place before the build will succeed. This blog post by Bijan Bowen documented the build process on a Raspberry Pi 4B. A similar process is documented below. It was tested successfully on the Pi 4 and Pi 5 with 8 GB of RAM.
Raspberry Pi OS Setup
Get the latest release of Raspberry Pi OS installed on your Pi and update all of the built-in software with apt. If you are comfortable with the Raspberry Pi imaging and setup process, you can follow the steps listed here under the quick start prerequisites. If you'd like more details, a more thorough guide page can be found here. Once the OS is ready, install the build prerequisites with apt, then install Clang 18 using the LLVM install script.
sudo apt update
sudo apt install python3-pip python3-dev cmake build-essential git software-properties-common
wget -O - https://apt.llvm.org/llvm.sh | sudo bash -s 18
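Before moving on, it can be handy to confirm that the tools landed on your PATH. The sketch below is one way to do that with Python's standard library. The tool names in `REQUIRED_TOOLS` are assumptions based on the install commands above and the compiler names used later in the build step, so adjust the list if your setup differs.

```python
import shutil

# Tool names assumed from the install commands in this guide; adjust if yours differ.
REQUIRED_TOOLS = ["git", "cmake", "make", "clang-18", "clang++-18"]


def find_missing(tools):
    """Return the tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing(REQUIRED_TOOLS)
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All build tools found.")
```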
Create Virtual Environment
Create and activate a Python virtual environment with the commands below. I like to store virtual environments inside of a venvs/ folder in the home directory, i.e., ~/venvs/. You can use this location or swap in a different one in the commands.
python3 -m venv ~/venvs/bitnet-venv
source ~/venvs/bitnet-venv/bin/activate
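If you are ever unsure whether the virtual environment is active, a quick check like the sketch below can tell you. It relies on the standard-library behavior that `sys.prefix` differs from `sys.base_prefix` inside a venv. The `in_virtualenv` helper name is just for illustration.

```python
import sys


def in_virtualenv():
    """True when the running interpreter belongs to a virtual environment."""
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtualenv())
```

When the environment is active, the interpreter path should point inside ~/venvs/bitnet-venv (or whichever location you chose).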
Clone BitNet & Install Python Requirements
Next, clone the BitNet repo, cd into it, and install the Python requirements from its requirements.txt file.
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet
pip install -r requirements.txt
Generate LUT Kernels Header & Config
This is a pre-build step that generates some headers and config files that the main build process will use. Skipping this step will result in errors about missing source files.
python utils/codegen_tl1.py \
    --model bitnet_b1_58-3B \
    --BM 160,320,320 \
    --BK 64,128,64 \
    --bm 32,64,32
Build bitnet.cpp
Now build bitnet.cpp with the commands below. Copy and run them one at a time.
export CC=clang-18 CXX=clang++-18
rm -rf build && mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
It takes about 3-4 minutes to complete the build on a Raspberry Pi 5 with 8 GB of RAM, and longer on the Pi 4. When it is completed successfully it will output the message [100%] Built target llama-server.
During the build process, there are several warnings about implicit conversions, unused parameters, anonymous types, and other issues that get printed out. These look kind of scary, but don't cause any trouble for the rest of the build process. They can be safely ignored.
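If you'd like to confirm that the build actually produced its binaries, the sketch below lists the executable files in a folder. The build/bin location is an assumption based on the llama.cpp-style build layout, so adjust the path if your build places them elsewhere. The path is relative to the BitNet repo root, one level above the build folder.

```python
import os
from pathlib import Path


def list_built_binaries(bin_dir):
    """Return names of executable files in bin_dir, or [] if the folder is missing."""
    bin_path = Path(bin_dir)
    if not bin_path.is_dir():
        return []
    return sorted(
        p.name for p in bin_path.iterdir()
        if p.is_file() and os.access(p, os.X_OK)
    )


# Assumed location, relative to the BitNet repo root.
print(list_built_binaries("build/bin"))
```

An empty list means the folder was not found or holds no executables, which usually points to a failed build or running the check from the wrong directory.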
Download the Model
The following commands will move up and out of the build folder and download the quantized model files.
cd ..
hf download microsoft/BitNet-b1.58-2B-4T-gguf --local-dir models/BitNet-b1.58-2B-4T
The model and associated files are around 1.2 GB. The download takes a few minutes, depending on your network speed. When completed successfully, it will output:
Fetching 3 files: 100%|███████████████████| 3/3
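As a quick sanity check on the download, the sketch below confirms that the model file exists and begins with the four-byte GGUF magic value that GGUF files start with. The `looks_like_gguf` helper is hypothetical, and the model path is the one used by the run command later in this guide. Run it from the BitNet repo root.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # GGUF files begin with these four bytes


def looks_like_gguf(path):
    """Return True if the file exists and starts with the GGUF magic bytes."""
    file_path = Path(path)
    if not file_path.is_file():
        return False
    with file_path.open("rb") as f:
        return f.read(4) == GGUF_MAGIC


model = Path("models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf")
if looks_like_gguf(model):
    print(f"{model.name}: {model.stat().st_size / 1e9:.2f} GB")
else:
    print(f"{model} is missing or is not a GGUF file")
```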
Run the BitNet-b1.58-2B-4T Model
Running models with bitnet.cpp is done by invoking a Python script called run_inference.py and passing in the model to run, the starting prompt, and an optional flag for interactive conversation mode. The following command will run the BitNet-b1.58-2B-4T model.
python run_inference.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -p "Hello from BitNet on Raspberry Pi!" -cnv
> please write a short poem about the Raspberry Pi. No more than one paragraph.
In a tiny world, a wonder stands,
A Raspberry Pi, in a humble command.
A tiny computer, with a heart so grand,
In the palm of your hand, it can expand.
From coding to coding, it's always ready,
A mini powerhouse, with the power to be.
An artist's canvas, a musician's stage,
A coding knight, in the digital age.
It's a journey into the world of code,
With endless possibilities, it's forever poised.
A little device, with big dreams,
>
Ctrl+C pressed, exiting...
llama_perf_sampler_print: sampling time = 15.24 ms / 128 runs ( 0.12 ms per token, 8397.30 tokens per second)
llama_perf_context_print: load time = 953.44 ms
llama_perf_context_print: prompt eval time = 6108.15 ms / 34 tokens ( 179.65 ms per token, 5.57 tokens per second)
llama_perf_context_print: eval time = 16174.37 ms / 104 runs ( 155.52 ms per token, 6.43 tokens per second)
llama_perf_context_print: total time = 23222.51 ms / 138 tokens
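The llama_perf lines printed at exit are the easiest way to compare speed between runs or between Pi models. The sketch below pulls the timing and token counts out of that text and recomputes tokens per second. The regular expression is an assumption built around the log format shown in the sample session above, and `tokens_per_second` is a hypothetical helper name.

```python
import re

# Matches lines such as:
# llama_perf_context_print: eval time = 16174.37 ms / 104 runs ( ... )
PERF_LINE = re.compile(
    r"llama_perf_\w+:\s*(?P<label>[a-z ]+?) time\s*=\s*(?P<ms>[\d.]+) ms"
    r"\s*/\s*(?P<count>\d+) (?:tokens|runs)"
)


def tokens_per_second(log_text):
    """Map each timing label to tokens per second, recomputed from ms and count."""
    rates = {}
    for match in PERF_LINE.finditer(log_text):
        seconds = float(match["ms"]) / 1000
        rates[match["label"]] = round(int(match["count"]) / seconds, 2)
    return rates


sample = """
llama_perf_context_print: prompt eval time = 6108.15 ms / 34 tokens ( 179.65 ms per token, 5.57 tokens per second)
llama_perf_context_print: eval time = 16174.37 ms / 104 runs ( 155.52 ms per token, 6.43 tokens per second)
"""
print(tokens_per_second(sample))
```

The "eval" figure is the generation speed, which is the number most people mean when they quote tokens per second.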
Optional Arguments
The run_inference.py script takes a few other arguments that can be used to control the model's behavior. Here is a short summary of what each one does.
- -n, --n-predict: Number of tokens to predict when generating text. The default is 128. If the model suddenly cuts off in the middle of a response, try a larger value like 256 or 512.
- -t, --threads: Number of threads to use for generating text. The default is 2. Raise it to 4 for faster output at the cost of more resources.
- -c, --ctx-size: Size of the prompt context. The default is 2048. If you have extra RAM to spare on your Pi, you can try increasing this to 4096. It can help if the model suddenly stops in the middle of a response.
- -temp, --temperature: Controls the randomness of the generated text. The default is 0.8. Higher values give more varied output, and lower values give more predictable output.
$ python run_inference.py --help
usage: run_inference.py [-h] [-m MODEL] [-n N_PREDICT] -p PROMPT [-t THREADS] [-c CTX_SIZE] [-temp TEMPERATURE] [-cnv]

Run inference

options:
  -h, --help            show this help message and exit
  -m MODEL, --model MODEL
                        Path to model file
  -n N_PREDICT, --n-predict N_PREDICT
                        Number of tokens to predict when generating text
  -p PROMPT, --prompt PROMPT
                        Prompt to generate text from
  -t THREADS, --threads THREADS
                        Number of threads to use
  -c CTX_SIZE, --ctx-size CTX_SIZE
                        Size of the prompt context
  -temp TEMPERATURE, --temperature TEMPERATURE
                        Temperature, a hyperparameter that controls the randomness of the generated text
  -cnv, --conversation  Whether to enable chat mode or not (for instruct models.)
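If you find yourself retyping long commands while experimenting with these options, a small helper can assemble them for you. The sketch below builds the argument list using only the flags from the help output above. `build_command` is a hypothetical helper, not part of the BitNet repo, and it only prints the command rather than running it.

```python
import shlex


def build_command(model, prompt, n_predict=None, threads=None,
                  ctx_size=None, temperature=None, conversation=False):
    """Build a run_inference.py argument list from the documented flags."""
    cmd = ["python", "run_inference.py", "-m", model, "-p", prompt]
    optional = [("-n", n_predict), ("-t", threads),
                ("-c", ctx_size), ("-temp", temperature)]
    for flag, value in optional:
        if value is not None:  # omitted flags fall back to the script defaults
            cmd += [flag, str(value)]
    if conversation:
        cmd.append("-cnv")
    return cmd


cmd = build_command(
    "models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf",
    "Hello from BitNet on Raspberry Pi!",
    n_predict=256, threads=4, conversation=True,
)
print(shlex.join(cmd))
```

To actually launch the model from Python, the same list could be passed to `subprocess.run(cmd, check=True)` from inside the BitNet folder with the virtual environment active.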
Page last edited September 10, 2025