Running llama.cpp across multiple CPU nodes on Discoverer: possibilities and expectations


Most people running llama.cpp are familiar with its single-node CPU mode, where inference is spread across cores using multithreading. What is less commonly known is that llama.cpp can also be distributed across multiple machines — but understanding what that actually means in practice is essential before building a cluster setup around it.

The built-in RPC backend

llama.cpp includes an RPC feature that connects multiple nodes over TCP. A master node holds the model file and coordinates inference, while worker nodes contribute resources. Setting it up is relatively straightforward: build with RPC support enabled, start an rpc-server process on each worker, and point the master at them via the --rpc flag.
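As a minimal sketch of that setup (hostnames, port, and model path are placeholders; the CMake option and binary names follow current llama.cpp builds, so check your version's documentation if the layout differs):

```shell
# On each worker node: build llama.cpp with the RPC backend enabled,
# then start an rpc-server listening for the master.
cmake -B build -DGGML_RPC=ON
cmake --build build --config Release
./build/bin/rpc-server -H 0.0.0.0 -p 50052

# On the master node, which holds the model file: point inference at the
# workers with --rpc (comma-separated host:port list; names are placeholders).
./build/bin/llama-cli -m ./model.gguf -p "Hello" \
    --rpc worker1:50052,worker2:50052
```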

Memory pooling, not compute scaling

Here is where expectations need to be calibrated carefully. The RPC backend is designed to pool memory across nodes, allowing you to run models that are too large to fit in a single machine’s RAM. It is not designed to make inference faster by parallelizing computation. Users running large CPU clusters with llama.cpp’s RPC have reported that only a handful of cores end up being used across all worker nodes — a known limitation tied to how the ggml scheduler distributes work. If your model already fits on one node, adding RPC workers will likely slow things down due to network overhead.

When you actually want compute distribution: prima.cpp

For true compute distribution across CPU nodes, prima.cpp is the more appropriate tool. It is a fork of llama.cpp that implements pipeline parallelism, where each node is assigned a slice of the model’s layers and passes activations to the next node in sequence. This approach genuinely engages CPU resources on every node in the cluster, though it requires a fast network interconnect to keep inter-node communication from becoming the bottleneck.
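The layer-slicing idea is easy to see in miniature. The sketch below is a plain-Python simulation of pipeline parallelism, not prima.cpp code: a toy "model" of eight affine layers is split into contiguous slices across three "nodes", and each node applies its slice before handing the activation to the next. All names and numbers are illustrative.

```python
# Toy simulation of pipeline parallelism: each "node" owns a contiguous
# slice of the model's layers and forwards its activation to the next node.

def make_layer(weight, bias):
    """Stand-in for a transformer layer: a simple affine map."""
    return lambda x: weight * x + bias

def split_layers(layers, num_nodes):
    """Give each node a contiguous slice; the last node takes the remainder."""
    per_node = len(layers) // num_nodes
    slices = [layers[i * per_node:(i + 1) * per_node]
              for i in range(num_nodes - 1)]
    slices.append(layers[(num_nodes - 1) * per_node:])
    return slices

def run_pipeline(slices, x):
    """Pass the activation through each node's slice in sequence."""
    for node_id, node_layers in enumerate(slices):
        for layer in node_layers:
            x = layer(x)
        print(f"node {node_id} done, activation = {x}")
    return x

layers = [make_layer(w, b)
          for w, b in [(2, 1), (1, 3), (3, 0), (1, 1),
                       (2, 2), (1, 0), (1, 5), (2, 1)]]
slices = split_layers(layers, 3)
result = run_pipeline(slices, 1.0)
```

Note that the output is identical to running all eight layers on one node; pipelining changes where the work happens, not what is computed — which is also why single-request latency is bounded by the slowest node plus the network hops between stages.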

Other options worth knowing

LocalAI, which is built on top of llama.cpp, offers two distributed modes. Its federated mode load-balances requests across nodes that each hold a full copy of the model, improving throughput for concurrent users. Its worker mode shards model weights across nodes proportionally to available memory, similar in effect to llama.cpp’s RPC backend.
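As a rough sketch of how the two modes are started — the subcommand and flag names here are assumptions based on LocalAI's distributed-inference documentation and should be verified against `local-ai --help` for your version; the token and node roles are placeholders:

```shell
# Worker mode: shard model weights across nodes via the llama.cpp RPC backend.
# A shared token (printed by the first node) lets peers discover each other.
# NOTE: subcommand/flag names are assumptions -- verify against current docs.
TOKEN=<shared-p2p-token> local-ai worker p2p-llama-cpp-rpc   # on each worker
TOKEN=<shared-p2p-token> local-ai run --p2p                  # on the serving node

# Federated mode: every node holds a full model copy; a balancer spreads
# incoming requests across them for higher aggregate throughput.
TOKEN=<shared-p2p-token> local-ai federated                  # load balancer
TOKEN=<shared-p2p-token> local-ai run --p2p --federated      # on each node
```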

For containerized environments, llama.cpp can also be deployed using Kubernetes LeaderWorkerSet, with a leader pod distributing model layers to worker pods — all running on CPU.
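A minimal manifest for that pattern might look like the following sketch. The API group (leaderworkerset.x-k8s.io) and the leader/worker template split come from the LeaderWorkerSet project; the image name is a placeholder, and the container commands mirror the llama.cpp RPC roles described above rather than any particular published chart:

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: llamacpp-cpu
spec:
  replicas: 1                 # one inference group
  leaderWorkerTemplate:
    size: 3                   # 1 leader pod + 2 worker pods per group
    leaderTemplate:
      spec:
        containers:
        - name: leader
          image: <llama-cpp-image>        # placeholder image
          command: ["/app/llama-server"]  # holds the model, offloads via --rpc
    workerTemplate:
      spec:
        containers:
        - name: worker
          image: <llama-cpp-image>        # placeholder image
          command: ["/app/rpc-server"]    # contributes memory/CPU to the leader
```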

The practical conclusions

If the goal is to run a model that exceeds single-node memory, llama.cpp’s RPC backend works, provided you explicitly configure thread counts on worker nodes to avoid the low-utilization bug. If the goal is to harness CPU compute cluster-wide for faster inference, pipeline-parallel approaches like prima.cpp are the right direction, with the caveat that network latency will shape your results significantly.
