I ran this bulky LLM on an SBC cluster, and it’s the most unhinged setup…
Ever since I jumped headfirst into the local LLM rabbit-hole, I’ve grown fond of revitalizing old PCs by turning them into reliable AI workstations. With the right tweaks, I’ve even managed to run powerful LLMs that can rival their cloud-based counterparts on something as outdated as a 10-year-old rig. That said, most of my hardcore LLM experiments involve full-fledged x86 gaming systems with dedicated graphics cards and excess RAM.
That said, a Raspberry Pi 5 can handle models of up to 4B parameters without buckling under the load, which makes it a surprisingly decent option for hosting embedding models and simple chatbots. But since I wanted to run models that wouldn’t otherwise fit on this SBC, I figured I could try clustering some spare boards. And well, it’s probably one of the most cursed projects I’ve worked on (but it still has some sliver of utility).

Starting with the SBCs I wanted to use as the guinea pigs for this project, I’d initially planned to spin up a cluster of three devices. However, I quickly realized that most of my ARM boards were already engaged in some experiment or another, leaving a Raspberry Pi 5, a Libre Computer Alta, and a La Frite as the only viable options. Unfortunately, the La Frite is far too weak for this project, and its USB 2.0 socket and 100Mbps Ethernet port would end up bottlenecking an already feeble setup. So, I went with a 2-node cluster involving a Raspberry Pi 5 (8GB) and a Libre Computer Alta (4GB), with llama.cpp’s RPC backend splitting the inference tasks between the two systems.
Fortunately, the setup process was a lot simpler than I’d anticipated, even though I had to compile llama.cpp from scratch. Once I’d armed both systems with a CLI distro (an older version of Ubuntu on the Alta and Raspberry Pi OS Lite on you-know-what) and configured openssh-server, I logged into them via PuTTY and installed the prerequisite packages by running sudo apt install -y git build-essential cmake pkg-config. Then, I cloned the llama.cpp repo with git clone https://github.com/ggml-org/llama.cpp.git and switched to its freshly created directory via the cd llama.cpp command. Finally, I created yet another folder called build-rpc via mkdir -p build-rpc before switching to it and compiling llama.cpp with its RPC backend enabled.
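The build itself boils down to a couple of CMake invocations along these lines (a minimal sketch, assuming a recent llama.cpp checkout, where the switch that enables the RPC backend is GGML_RPC=ON):

```bash
# From inside the llama.cpp/build-rpc directory:
# configure the project with the RPC backend enabled
cmake .. -DGGML_RPC=ON

# compile everything (rpc-server, llama-server, llama-cli) in Release mode
cmake --build . --config Release -j "$(nproc)"
```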
Since I wanted the Alta SBC to act as the secondary server rig, I ran ./bin/rpc-server -H 0.0.0.0 -p 50052 on it and let the RPC server remain active for a while. After using scp to move some LLMs from my main PC to the Raspberry Pi node, I ran the ./bin/llama-server -m /home/ayush/models/Qwen3.5-2B-Q4_K_M.gguf --rpc 192.168.0.150:50052 --host 0.0.0.0 --port 8080 command and waited for it to finish loading the model.
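To put the two roles side by side: the worker node only needs to expose an RPC endpoint, and the main node points llama-server at it. Roughly, the split looks like this (the IP address, port, and model path are from my setup, so swap in your own):

```bash
# On the Libre Computer Alta (worker): expose the RPC backend to the LAN
./bin/rpc-server -H 0.0.0.0 -p 50052

# On the Raspberry Pi 5 (main node): load the model, offload part of it to
# the worker over RPC, and serve the web UI/API on port 8080
./bin/llama-server -m /home/ayush/models/Qwen3.5-2B-Q4_K_M.gguf \
    --rpc 192.168.0.150:50052 --host 0.0.0.0 --port 8080
```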
Since I was using the fairly lightweight Gemma 3 4B, I expected my cluster to perform somewhat better than my Raspberry Pi alone. However, running a couple of prompts via llama-server’s web UI proved otherwise. And I’m not talking about complex prompts or inference tasks involving MCP servers, either. For something as simple as “Tell me something cool,” the cluster struggled to hit 2.20 tokens/second. So, I restarted my Raspberry Pi and ran the llama-server command once again, except I got rid of the --rpc flag this time. Sure enough, the inference engine managed to hit 4.37 t/s, which is almost twice as fast as the clustered setup!
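If you want harder numbers than the web UI offers, llama-server’s native /completion endpoint reports its own timings, so a quick curl call works as a crude benchmark (the jq filter below assumes the timings object that recent llama.cpp builds include in their responses):

```bash
# Send a simple prompt to the running llama-server and extract the reported
# generation speed (predicted tokens per second)
curl -s http://localhost:8080/completion \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Tell me something cool", "n_predict": 128}' \
    | jq '.timings.predicted_per_second'
```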

In theory, the cluster should either hit higher token generation rates or, at the very least, provide speeds comparable to a Raspberry Pi-only setup. But the slowdown makes perfect sense once I factor the network and storage bottlenecks into the equation. You see, both SBCs feature a 1GbE connection, which is rather slow for shuttling tensor data between nodes during inference. Worse still, I’d run out of SSDs in my home lab, so I had to make do with mere microSD cards, which certainly don’t help with loading speeds either. Toss in the fact that LLM operations are very sensitive to latency, and it’s clear why my cluster performs terribly. I was about to label this project a failure and wrap things up here, but I wanted to try one last experiment before dissolving the cluster…
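For what it’s worth, the network half of that theory is easy to sanity-check: a quick iperf3 run between the two nodes shows how much of that 1GbE link is actually usable (assuming iperf3 is installed on both boards; the IP below is just my Alta’s address):

```bash
# On the Alta: run iperf3 in server mode
iperf3 -s

# On the Raspberry Pi 5: run a 10-second throughput test against the Alta
iperf3 -c 192.168.0.150 -t 10
```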
While its lackluster performance was a total buzzkill, my main objective behind this wacky project was to run large models that a Raspberry Pi with merely 8GB of RAM wouldn’t be able to host. So, I spun up llama-server once again without the RPC flag and began working my way up in parameter count. Qwen 3.5 (9B) is where llama-server crashed, as the SBC couldn’t accommodate a model that large.
But when I ran it with the RPC flag pointing to the Alta, llama-server was able to load the LLM with relative ease. Just to satisfy my curiosity, I opened the web UI and began prompting the LLM. Well, it definitely worked, though it could only generate 1.27 tokens every second. That’s nowhere near a feasible number for my productivity tasks and coding workloads. But it’s still somewhat usable for automated tasks like generating tags for bookmarks or performing OCR scans on documents, especially considering that I can just leave my SBCs running all day without worrying about their energy consumption.
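Those background jobs are also easy to script against llama-server’s OpenAI-compatible API. As a purely illustrative sketch, here’s how a bookmark-tagging call could look (the hostname, prompt, and URL are made up; any HTTP client would do):

```bash
# Ask the clustered llama-server to suggest tags for a bookmark; slow,
# but fine as a fire-and-forget background job
curl -s http://raspberrypi.local:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "messages": [
            {"role": "system", "content": "Reply with 3 to 5 comma-separated tags."},
            {"role": "user", "content": "Tag this bookmark: https://example.com/sbc-cluster-notes"}
          ],
          "max_tokens": 64
        }' \
    | jq -r '.choices[0].message.content'
```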
And to be brutally honest, I figured I’d end up measuring seconds per token instead of tokens per second. So, a token generation rate of 1.27 t/s is somewhat surprising, especially for a model that my SBC couldn’t even load on its own in the first place. While I probably wouldn’t use this SBC cluster for serious inference tasks, RPC definitely sounds useful. In fact, I might just try using it for my current LXC-based LLM-hosting workstations, which feature full-fledged 10G NICs.