AI & Local LLMs on Community Servers
Why Put a Local LLM on the Relay Server?
Prototype 5 and 6 establish a local relay server model — a Raspberry Pi or similar SBC that handles message store-and-forward for a community. This same device has spare compute between message bursts.
Running a small language model or ML model on this server unlocks capabilities that would otherwise require cloud connectivity:
- Translation: Messages between users who don't share a language, processed locally without data leaving the community space
- Speech-to-text: Voice messages transcribed on-device, enabling accessibility features offline
- Semantic search: Users can search their own message history by meaning, not just keyword
- Spam / metadata filtering: Detecting coordinated inauthentic behavior via structural patterns without reading content
- Community knowledge base: A local Q&A assistant trained on community-curated documents
- Smart DTN routing: Predicting which relay node is most likely to deliver a message based on network topology history
Critically, these features operate on metadata and user-initiated opt-in content only. Messages are end-to-end encrypted — the relay cannot read them. AI capabilities apply to what the server can legitimately see (routing metadata, public posts, user-submitted files) or to data users explicitly submit for processing.
Hardware Options
Community relay servers don't need to be expensive. The following hardware tiers cover the realistic deployment spectrum, from a backpack relay at a protest to a fixed community space server.
Tier 1: ARM SBCs (Low Power, Portable)
| Device | CPU | RAM | NPU/GPU | Price | LLM Fit |
|---|---|---|---|---|---|
| Raspberry Pi 5 (8GB) | 4× Cortex-A76 @ 2.4 GHz | 8 GB LPDDR4X | None | ~$80 | 1–3B param models |
| Raspberry Pi 5 (16GB) | 4× Cortex-A76 @ 2.4 GHz | 16 GB LPDDR4X | None | ~$120 | 3–7B param models |
| Pi 5 + Hailo AI HAT+ | Pi 5 CPU | 8–16 GB | 26 TOPS NPU | ~$150–200 | Vision + small LLM together |
| Orange Pi 5 | RK3588 (4× A76 + 4× A55) | 8–16 GB | 6 TOPS NPU | ~$80–140 | 1–3B + NPU-accelerated tasks |
| Orange Pi 5 Max | RK3588 | 16–32 GB | 6 TOPS NPU | ~$150–200 | 7B param models feasible |
Raspberry Pi 5 benchmarks (llama.cpp, Q4_K_M quantization):
| Model | Pi 5 8GB | Pi 5 16GB |
|---|---|---|
| TinyLlama 1.1B | ~12–15 tok/s | ~15–18 tok/s |
| Gemma 3 1B | ~10–13 tok/s | ~12–15 tok/s |
| Phi-3 Mini 3.8B | ~3–5 tok/s | ~4–6 tok/s |
| Qwen2.5 1.5B | ~8–12 tok/s | ~10–14 tok/s |
Hailo AI HAT+ (26 TOPS) offloads vision tasks and accelerates transformer layers, particularly attention heads. For text-only LLMs, the bottleneck is memory bandwidth (LPDDR4X ~40 GB/s), not compute — the Hailo does not help much for pure text inference but enables simultaneous whisper.cpp transcription + LLM without saturating the CPU.
RK3588 NPU (6 TOPS): Rockchip's NPU runs models compiled with RKNN Toolkit. Performance is competitive for quantized small models (1–3B) but the toolchain requires converting models to .rknn format. The Mali GPU can accelerate via OpenCL. Total memory bandwidth on RK3588 is ~68 GB/s, giving a meaningful advantage over Pi 5 for 7B models.
Tier 2: Dedicated AI Edge Hardware
| Device | Accelerator | RAM | Power | Price | LLM Fit |
|---|---|---|---|---|---|
| Jetson Orin Nano (8GB) | 1024-core Ampere GPU + 32 TOPS DLA | 8 GB LPDDR5 | 7–15 W | ~$150–200 | 7B models at usable speeds |
| Jetson Orin NX (16GB) | 1024-core Ampere + 70 TOPS | 16 GB LPDDR5 | 10–25 W | ~$300–450 | 13B models feasible |
Jetson Orin Nano benchmarks (llama.cpp with CUDA offload):
| Model | Tokens/s |
|---|---|
| Gemma 3 1B (Q4) | ~40–55 tok/s |
| Phi-3 Mini 3.8B (Q4) | ~18–25 tok/s |
| Llama 3.2 3B (Q4) | ~20–28 tok/s |
| Mistral 7B (Q4) | ~8–12 tok/s |
The Jetson line is the best single-device option for running 7B+ models at interactive speeds in a low-power form factor (~7–15W idle to full load). The tradeoff is cost and the need for active cooling.
Tier 3: Mini PCs (Higher Power, Fixed Location)
| Device | CPU | GPU/NPU | RAM | Price | LLM Fit |
|---|---|---|---|---|---|
| Beelink SER7 / GMKtec | Ryzen 7840HS | Radeon 780M (iGPU, 12 CU) | 32–64 GB DDR5 | ~$300–400 | 13B–34B models |
| Intel N100 mini PCs | Intel Alder Lake-N | Intel UHD (32 EU) | 8–16 GB DDR4 | ~$80–150 | 1–3B models |
| Old Android phone | Snapdragon 855/865 | Adreno 640/650 | 8–12 GB LPDDR5 | ~$20–50 used | 1–2B via on-device framework |
Ryzen 7840HS + integrated Radeon 780M is a compelling choice for fixed community spaces. The iGPU shares system RAM (no discrete VRAM limit) and llama.cpp VULKAN/ROCm support allows 13B Q4 models at 15–25 tok/s, with all 32–64 GB of system RAM available as model context. Power draw is ~25–45W under LLM load.
Inference Software
llama.cpp
The reference implementation for efficient CPU/GPU inference of quantized LLMs. Supports GGUF format, all major quantization levels (Q2 through Q8, and KV cache quantization), and backends: CPU (AVX2/SVE), CUDA, Metal, Vulkan, SYCL, OpenCL.
# Build for CPU only (Pi 5 / ARM SBCs)
cmake -B build -DGGML_NATIVE=ON
cmake --build build --config Release -j$(nproc)
# Serve as OpenAI-compatible API
./build/bin/llama-server -m models/gemma-3-1b-q4_k_m.gguf \
--host 0.0.0.0 --port 8080 -ngl 0 --ctx-size 2048
Quantization tradeoffs on low-RAM devices:
| Quant | Size (1B model) | Quality | RAM (7B model) |
|---|---|---|---|
| Q2_K | ~0.5 GB | Degraded | ~2.8 GB |
| Q4_K_M | ~0.7 GB | Good | ~4.1 GB |
| Q5_K_M | ~0.9 GB | Better | ~5.1 GB |
| Q8_0 | ~1.1 GB | Near full | ~7.2 GB |
For Pi 5 (8GB): use Q4_K_M for 1–3B models; leave ~2 GB for OS and the message relay process.
Ollama
Higher-level wrapper around llama.cpp with automatic model downloading, management, and a simple REST API. Easiest to deploy on a community server — one command installs and serves a model.
curl -fsSL https://ollama.com/install.sh | sh
ollama run gemma3:1b # Downloads and runs immediately
# REST API available at http://localhost:11434
Ollama's /api/generate and /api/chat endpoints are drop-in compatible with OpenAI's API format. Phone apps can use any OpenAI-compatible client library pointed at the local server's IP.
llamafile
Mozilla's single-file executable format — model weights + inference engine in one file. Download and run, no installation. Ideal for community deployments where setup simplicity matters.
# Example: download and run a model as a server
wget https://huggingface.co/.../gemma-3-1b.Q4_K_M.llamafile
chmod +x gemma-3-1b.Q4_K_M.llamafile
./gemma-3-1b.Q4_K_M.llamafile --server --host 0.0.0.0 --port 8080
Moonshine (Speech-to-Text)
GitHub: usefulsensors/moonshine
Moonshine is an ASR (automatic speech recognition) model optimized for edge devices — specifically designed to run faster and more efficiently than Whisper on ARM CPUs without GPUs. Developed by Useful Sensors.
| Model | Size | Pi 5 Speed | Accuracy vs Whisper |
|---|---|---|---|
| Moonshine Tiny | 27 MB | ~25× RT | Comparable to Whisper tiny |
| Moonshine Base | 61 MB | ~18× RT | Comparable to Whisper base |
At ~25× real-time on Pi 5 (vs ~15× for Whisper tiny), Moonshine is meaningfully faster for our 30-second voice message transcription use case. It processes a 30-second clip in ~1.2 seconds vs ~2 seconds for Whisper tiny. The quality tradeoff relative to Whisper small is the main consideration — test both for your target languages.
Recommendation: Evaluate Moonshine Base as an alternative to Whisper Small for the /transcribe endpoint. If accuracy is acceptable for the community's primary languages, the speed improvement is worthwhile on Pi 5.
Piper (Text-to-Speech)
Piper is a fast, local neural TTS engine designed specifically for Raspberry Pi and similar SBCs. Uses a VITS (Variational Inference with adversarial learning for end-to-End Text-to-Speech) architecture.
- Speed: Real-time on Pi 4/5 (can synthesize speech faster than playback speed)
- Size: Models range from 28 MB (low quality) to 80 MB (high quality)
- Languages: 30+ languages with multiple voice variants
- RAM: ~100-150 MB per loaded model
Use case: Accessibility feature for community relay — phone app sends text to /speak endpoint, relay synthesizes and returns audio. Enables message-to-audio for users with visual impairments or low literacy, entirely offline.
echo "Community meeting tonight at 7pm" | \
piper --model en_US-lessac-medium.onnx --output_file announcement.wav
whisper.cpp
C++ port of OpenAI's Whisper speech recognition. Runs on Pi 5 at 4–8× real-time for small/medium models, meaning a 30-second voice message transcribes in ~4–8 seconds.
| Model | Size | Pi 5 Speed | Accuracy |
|---|---|---|---|
| tiny | 39 MB | ~15× RT | Low |
| base | 74 MB | ~10× RT | Decent |
| small | 244 MB | ~5× RT | Good |
| medium | 769 MB | ~2× RT | Very good |
| large-v3-turbo | 809 MB | ~1.5× RT | Best |
NLLB-200 language detection can be run before Whisper to auto-select the correct model language. On Pi 5, a small Whisper model fits comfortably alongside a 1B LLM within 8GB RAM.
LocalAI
Self-hosted, OpenAI API-compatible inference server. Supports text generation (LLaMA, Mistral, Phi), image generation (Stable Diffusion), speech-to-text (Whisper), and text-to-speech — all through a single OpenAI-compatible REST API. Can replace the individual llama.cpp + whisper.cpp setup with a unified service.
docker run -p 8080:8080 localai/localai:latest-aio-cpu
# Immediately provides: /v1/chat/completions, /v1/audio/transcriptions, /v1/audio/speech
Advantage over raw llama.cpp: Model management, automatic downloading, and a unified API mean the phone app can use any standard OpenAI client library. The /v1/audio/speech endpoint adds text-to-speech (Piper backend) without additional setup.
Tradeoff: Docker adds overhead (~200-400MB RAM). On Pi 5 8GB, prefer raw llama.cpp + whisper.cpp for tighter memory control. LocalAI shines on mini PC (Tier 3) deployments where memory is not the constraint.
Jan
Desktop application (Electron) for running LLMs locally — essentially a privacy-respecting ChatGPT replacement for the relay server operator's workstation. Not a server daemon but a GUI tool for model management and testing.
Relevance to the project: Jan provides a convenient interface for relay administrators to test and benchmark models before deploying them to the community server via llama.cpp/Ollama. Its model hub downloads GGUF models with one click — useful for evaluating new models.
On-Phone Inference Frameworks
For use cases where the phone should do its own local AI (bypassing the relay entirely):
| Framework | Platform | Models | Notes |
|---|---|---|---|
| MediaPipe LLM | Android + iOS | Gemma 2B, Phi-2 | Google; easiest integration, limited model choice |
| MLX | Apple Silicon (macOS/iOS) | Most GGUF/MLX models | Apple-native, very fast on M-series and A-series chips |
| ExecuTorch | Android + iOS | PyTorch-exported models | Meta/PyTorch; production-grade, complex setup |
| llama.cpp (JNI/JNA) | Android | Any GGUF | Can compile for arm64-v8a; power consumption is significant |
| TFLite / LiteRT | Android + iOS | Gemma, custom models | Google; mature for small models and embeddings |
On modern Android phones (Snapdragon 8 Gen 2/3 with Hexagon NPU, or Dimensity 9300), 1–3B models run at 20–40 tok/s on-device. This is a viable alternative to the relay server for private/personal AI tasks. The relay server model is better for shared community tools (translation for events, public knowledge base).
Recommended Models
| Model | Params | Size (Q4_K_M) | Best Use | Min RAM |
|---|---|---|---|---|
| Gemma 3 1B | 1B | ~700 MB | Translation, simple Q&A, classification | 2 GB |
| TinyLlama 1.1B | 1.1B | ~700 MB | Fast text generation, classification | 2 GB |
| Qwen2.5 1.5B | 1.5B | ~1.1 GB | Multilingual tasks, strong for size | 3 GB |
| Phi-3 Mini 3.8B | 3.8B | ~2.4 GB | Reasoning, knowledge Q&A | 5 GB |
| Llama 3.2 3B | 3B | ~2.0 GB | General purpose, good balance | 4 GB |
| Gemma 3 4B | 4B | ~2.8 GB | High quality multilingual | 6 GB |
For Pi 5 8GB: Gemma 3 1B or TinyLlama as the always-on model; Qwen2.5 1.5B for multilingual tasks.
For Pi 5 16GB: Phi-3 Mini or Llama 3.2 3B runs comfortably alongside the relay daemon.
For Jetson Orin Nano: Llama 3.2 3B or Gemma 3 4B at interactive speeds.
NLLB-200 (No Language Left Behind, Meta) is specifically for translation — 200 language pairs. Use via CTranslate2 for fast CPU inference, or via the Hugging Face transformers API. The 600M distilled model fits in 1.2 GB and achieves competitive translation quality.
Embedding models for semantic search: nomic-embed-text (137M, 274 MB) or all-MiniLM-L6-v2 (22M, 90 MB) are the right scale for a Pi-based semantic search index over local message archives.
Use Cases in Detail
Translation (NLLB-200 + CTranslate2)
┌─────────────┐ encrypted msg ┌──────────────────┐
│ Phone A │ ──────────────────► │ Relay Server │
│ (Spanish) │ │ │
└─────────────┘ │ User B requests │
│ translation of │
│ *their own* msg │
│ history via API │
│ │
│ NLLB-200 runs │
│ locally → EN │
└──────────────────┘
The relay server cannot translate end-to-end encrypted messages it cannot read. Translation applies to:
- Public/unencrypted community posts explicitly shared with the server
- User-submitted text sent directly to the
/translateAPI endpoint from the phone app
# CTranslate2 NLLB-200 inference (Python)
from ctranslate2 import Translator
translator = Translator("nllb-200-distilled-600M-ct2", device="cpu")
result = translator.translate_batch([["spa_Latn", "Hola, ¿cómo estás?"]],
target_prefix=[["eng_Latn"]])
At ~200–400 tokens/s on Pi 5 CPU, short messages translate in under a second.
Speech-to-Text (whisper.cpp)
Voice messages sent as audio blobs. The phone sends the audio to the relay's /transcribe endpoint (opt-in, user-initiated). The relay runs whisper.cpp, returns the transcript, discards the audio. The transcript never leaves the local WiFi network.
[Phone records voice message]
│
▼
[Phone sends audio to relay /transcribe]
│
▼
[whisper.cpp: audio → text] (3–8 sec on Pi 5 for 30-sec audio)
│
▼
[Relay returns transcript to phone]
│
▼
[Phone app displays transcript; relay discards audio]
Semantic Search (Embedding Model)
Users can search their own message archive by meaning. The embedding model runs on the relay (or on-device). Messages are embedded at write time; queries are embedded at search time; cosine similarity finds relevant results.
User query: "what did we decide about the meeting place?"
│
▼
[Embed query → vector]
│
▼
[Search local SQLite vector index (sqlite-vec or hnswlib)]
│
▼
[Return top-k semantically similar messages from user's archive]
The embedding index is user-specific and stays local to the relay or phone. The relay has no access to the index of other users' messages.
Metadata Spam Detection
The relay can observe structural message metadata even when content is encrypted: message frequency, burst patterns, recipient fan-out, message size distributions. A small classifier (scikit-learn, ONNX runtime) trained on these features can flag accounts exhibiting coordinated inauthentic patterns without reading any message content.
This is metadata-as-defense, not metadata-as-surveillance. The classifier output is boolean (flag/no-flag) and the relay acts only on the flag, not the metadata features. Transparency requires disclosing that this runs.
Community Knowledge Base
A community maintains a collection of documents (guides, maps, rules, resources) on the relay. A RAG (Retrieval-Augmented Generation) pipeline lets community members query this knowledge base via the phone app:
[Phone app]: "What are the community guidelines for the east wing?"
│
▼
[Relay: embed query → retrieve relevant doc chunks from vector store]
│
▼
[Relay: LLM generates answer grounded in retrieved chunks]
│
▼
[Phone displays answer with source references]
No sensitive personal data needed. The knowledge base is community-curated public content.
Smart DTN Routing
In delay-tolerant network scenarios (Prototype 6: BLE relay devices), routing decisions benefit from history. A simple ML model trained on relay encounter history (which relays meet which relays, how often, at what times) can predict optimal next-hop routing for a message to reach its destination.
This is the same approach used in Epidemic/PROPHET routing in DTN research. The model is a small time-series predictor or a lookup table, not a large LLM.
Privacy Implications
What AI Can and Cannot Do with E2E Encrypted Messages
What the relay sees What the relay CANNOT see
───────────────────── ─────────────────────────
• Sender public key hash • Message plaintext
• Recipient public key hash • Sender's name or identity
• Message size (bytes) • Message content or topic
• Timestamp • Who knows whom
• Delivery status • Social graph
• Message count / frequency • Any metadata sender/receiver
chose to hide
E2E encryption is the non-negotiable baseline. AI on the relay operates only on the left column. Any feature that requires reading message content must be processed on the user's own device.
Opt-in Content Processing
Translation, transcription, and knowledge base Q&A are all request-initiated by the user's phone app. The relay does not proactively process any messages. The phone sends a specific request with the data the user chooses to share. There is no ambient surveillance mode.
The phone app should display a clear indicator when a request is sent to the local AI server, analogous to how apps show a spinner when making a network request.
Metadata-as-Surveillance Risk
Even without reading content, metadata reveals a great deal. The spam detection use case walks this line carefully:
- Acceptable: Flagging accounts sending 500 messages/minute (obvious spam pattern)
- Unacceptable: Building social graphs from who-sends-to-whom patterns, tracking user movement via message timestamps, selling or sharing metadata with third parties
Community-run relays operated transparently with open-source software and local governance are meaningfully different from commercial surveillance systems. But the technical capability exists for misuse — community trust and governance matter as much as the technical design.
Federated Learning Risks
Federated learning (training a shared model on device data without centralizing data) is sometimes proposed as a privacy-preserving AI technique. It is not appropriate for this project:
- Federated learning still leaks information through gradient updates (membership inference attacks)
- It creates a coordination mechanism between users that doesn't exist by design in our system
- Model updates are a covert channel for metadata exfiltration
- The threat model for political/activist use cases requires stronger guarantees than federated learning provides
The correct approach is: no model training on user data at all. Models are pre-trained, downloaded as static artifacts, and run in inference-only mode.
On-Device Phone AI vs Server AI
| Dimension | On-Device (Phone) | Server (Relay) |
|---|---|---|
| Privacy | Maximum — data never leaves device | High — data leaves device only for specific user-initiated requests |
| Availability | Works offline, always | Requires WiFi connection to relay |
| Performance | Fast on flagship phones (Snapdragon 8 Gen 3, A17 Pro) | Slow-medium on Pi 5; fast on Jetson |
| Model size | Limited to 1–4B (thermal/battery) | Up to 7B+ depending on hardware |
| Battery impact | High (NPU helps; still drains fast) | None on phone |
| Shared functionality | Per-user only | Shared across all users (translation, knowledge base) |
| Update model | App store update required | Admin deploys new GGUF file |
| Cost | Zero (uses user's phone) | Server hardware cost shared by community |
Recommendation: Use on-device AI for personal features (private semantic search of user's own messages, private transcription). Use relay AI for community features (shared translation, community knowledge base). The two modes are complementary.
Architecture: Relay Server with AI Layer
Community WiFi Network
──────────────────────────────────────────────────────
┌──────────────────────────────────────────────────────┐
│ Relay Server (Pi 5 / Jetson) │
│ │
│ ┌─────────────────┐ ┌───────────────────────────┐ │
│ │ Message Relay │ │ AI Service Layer │ │
│ │ (existing) │ │ │ │
│ │ │ │ POST /translate │ │
│ │ - Store/fwd │ │ POST /transcribe │ │
│ │ - E2E blobs │ │ POST /search │ │
│ │ - Key registry │ │ POST /ask │ │
│ │ │ │ │ │
│ │ Port 8443 │ │ Port 8080 (local only) │ │
│ └─────────────────┘ └───────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────┐ │
│ │ Model Runtime │ │
│ │ llama.cpp / Ollama / whisper.cpp │ │
│ │ NLLB-200 (CTranslate2) │ │
│ │ Embedding model (nomic-embed-text) │ │
│ └────────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────┐ │
│ │ Vector Store │ │
│ │ sqlite-vec / hnswlib │ │
│ │ (per-user, isolated) │ │
│ └───────────────────────────┘ │
└──────────────────────────────────────────────────────┘
▲ ▲ ▲
│ │ │
Phone A Phone B Phone C
API Exposure
The AI service layer is exposed only on the local WiFi network — not on any internet-facing interface. Requests require the same authentication token as the message relay (derived from the user's key material, established during initial pairing). Rate limiting prevents a single user from monopolizing the compute.
# Example: Phone app sends translation request
curl -X POST http://192.168.4.1:8080/translate \
-H "Authorization: Bearer <user-token>" \
-H "Content-Type: application/json" \
-d '{"text": "Hola mundo", "source_lang": "spa_Latn", "target_lang": "eng_Latn"}'
Resource Sharing with Message Relay
The LLM inference process runs at low priority (Linux nice +15). The message relay process gets CPU priority. On Pi 5, idle LLM inference uses ~200–400 MB RAM with the model loaded; the relay daemon uses ~50–100 MB. Total overhead with Gemma 3 1B loaded: ~600–800 MB — well within the 8GB budget.
Model loading time is the main latency concern: cold-starting llama.cpp takes 3–8 seconds to load a 1B GGUF into memory. The solution is keep-alive: the model stays loaded in memory between requests. If the relay has been idle for 30 minutes, evict the model from memory to free RAM for message buffering.
Sneakernet Model Updates
In air-gapped or low-connectivity deployments, model updates arrive via USB drive or SD card — the same "sneakernet" data distribution mechanism described in the Willow protocol specification. A relay admin inserts a drive containing a new GGUF file; a systemd path unit triggers the model swap automatically.
# /etc/systemd/system/llm-model-watcher.path
[Path]
PathExistsGlob=/media/usb/*.gguf
[Install]
WantedBy=multi-user.target
This keeps model updates entirely offline, consistent with the project's offline-first design philosophy.
Related Pages
- Research Prototypes — Prototype 5 (WiFi relay server) and Prototype 6 (BLE relay) are the hardware context for this page
- Architecture — System design for the full Connect application
- Location APIs & Security — Privacy threat models relevant to metadata handling
Updates (2025–2026)
New Hardware: Raspberry Pi AI HAT+ 2 (Hailo-10H, 40 TOPS)
⚠ Clarification on existing content: The page states the Hailo AI HAT+ (26 TOPS) enables "Vision + small LLM together" but "does not help much for pure text inference." This is accurate for the original Hailo-8-based HAT+ (2024). In January 2026, Raspberry Pi released the AI HAT+ 2 (raspberrypi.com), built on the Hailo-10H chip at 40 TOPS INT4 with 8 GB of dedicated on-board RAM. The HAT+ 2 adds genuine LLM capability — approximately 9.45 tok/s on Qwen2-1.5B in early benchmarks — and removes the memory-bandwidth bottleneck that limits pure-CPU inference on the Pi 5. Priced at $130, it is the first Pi accessory that materially accelerates text LLM inference. The hardware tier table should be updated to distinguish the original HAT+ (CV-only) from the HAT+ 2 (LLM-capable).
New Models: Gemma 3 Family (March 2025)
Google released the full Gemma 3 family in March 2025 with sizes 270M, 1B, 4B, 12B, and 27B (blog.google). The 4B model fits in under 3 GB at Q4_K_M and adds 128K context with improved multilingual coverage. For community relay servers, the Gemma 3 4B is the preferred single-model choice on Pi 5 16 GB, while the Gemma 3 1B remains the best option for 8 GB deployments. Benchmarks from a Pi 5 evaluation (stratosphereips.org) confirm Gemma 3 1B has the highest throughput among tested models on Pi 5 CPU.
New Models: Qwen3 Small Series Supersedes Qwen2.5 (April 2025)
Alibaba released Qwen3 under Apache 2.0 in April 2025 (qwenlm.github.io), including 0.6B, 1.7B, and 4B dense models. Qwen3-1.7B matches or exceeds the earlier Qwen2.5-3B on standard benchmarks. All three small variants run on Ollama, llama.cpp, and MLX without modification. Qwen3-1.7B supersedes Qwen2.5-1.5B as the recommended 1–2B class model for community relay servers — slightly larger but substantially more capable.
New Models: Phi-4-Mini Supersedes Phi-3 Mini (February 2025)
Microsoft released Phi-4-Mini (3.8B) in February 2025 (azure.microsoft.com), which matches 8B-class models on reasoning and math while fitting in 4 GB RAM. A subsequent Phi-4-Mini-Flash-Reasoning variant achieves 10× higher throughput and 2–3× lower latency compared to Phi-4-Mini at the same parameter count. Phi-4-Mini supersedes Phi-3 Mini 3.8B as the recommended 3–4B model: same RAM footprint, meaningfully better benchmark scores.
Llama 4 (April 2025) — Not Recommended for Relay Servers
Meta released Llama 4 Scout (17B active / 16 experts MoE) and Maverick in April 2025 (ai.meta.com). These are multimodal mixture-of-experts models designed for datacenter inference. The smallest Llama 4 model activates 17B parameters — too large for the hardware tiers covered in this page. Llama 3.2 1B/3B remains the recommended Llama-family option for community relay servers on current Tier 1–2 hardware.
Ollama: MLX Backend and Minions Protocol (2025–2026)
In March 2026, Ollama v0.19 switched its Apple Silicon backend to MLX (ollama.com/blog/mlx), with decode speeds jumping from 58 to 112 tok/s for M-series Mac mini relay deployments. In February 2025, the Minions framework enabled small on-device models to collaborate with larger cloud models by shifting sub-tasks to local hardware — applicable to hybrid relay architectures where internet connectivity is intermittent.
whisper.cpp: Integrated-GPU Acceleration (2025)
whisper.cpp release 1.8.3 added integrated-GPU (iGPU) support for AMD and Intel graphics, reporting up to a 12× performance boost over CPU-only mode (phoronix.com). For Tier 3 mini PCs (Ryzen 7840HS with Radeon 780M iGPU) this means transcription overhead is largely removed from the CPU budget, allowing simultaneous LLM inference and STT without contention.