AI & Local LLMs on Community Servers

Why Put a Local LLM on the Relay Server?

Prototype 5 and 6 establish a local relay server model — a Raspberry Pi or similar SBC that handles message store-and-forward for a community. This same device has spare compute between message bursts.

Running a small language model or ML model on this server unlocks capabilities that would otherwise require cloud connectivity:

Translation: Messages between users who don't share a language, processed locally without data leaving the community space
Speech-to-text: Voice messages transcribed on-device, enabling accessibility features offline
Semantic search: Users can search their own message history by meaning, not just keyword
Spam / metadata filtering: Detecting coordinated inauthentic behavior via structural patterns without reading content
Community knowledge base: A local Q&A assistant trained on community-curated documents
Smart DTN routing: Predicting which relay node is most likely to deliver a message based on network topology history

Critically, these features operate on metadata and user-initiated opt-in content only. Messages are end-to-end encrypted — the relay cannot read them. AI capabilities apply to what the server can legitimately see (routing metadata, public posts, user-submitted files) or to data users explicitly submit for processing.

Hardware Options

Community relay servers don't need to be expensive. The following hardware tiers cover the realistic deployment spectrum, from a backpack relay at a protest to a fixed community space server.

Tier 1: ARM SBCs (Low Power, Portable)

Device	CPU	RAM	NPU/GPU	Price	LLM Fit
Raspberry Pi 5 (8GB)	4× Cortex-A76 @ 2.4 GHz	8 GB LPDDR4X	None	~$80	1–3B param models
Raspberry Pi 5 (16GB)	4× Cortex-A76 @ 2.4 GHz	16 GB LPDDR4X	None	~$120	3–7B param models
Pi 5 + Hailo AI HAT+	Pi 5 CPU	8–16 GB	26 TOPS NPU	~$150–200	Vision + small LLM together
Orange Pi 5	RK3588 (4× A76 + 4× A55)	8–16 GB	6 TOPS NPU	~$80–140	1–3B + NPU-accelerated tasks
Orange Pi 5 Max	RK3588	16–32 GB	6 TOPS NPU	~$150–200	7B param models feasible

Raspberry Pi 5 benchmarks (llama.cpp, Q4_K_M quantization):

Model	Pi 5 8GB	Pi 5 16GB
TinyLlama 1.1B	~12–15 tok/s	~15–18 tok/s
Gemma 3 1B	~10–13 tok/s	~12–15 tok/s
Phi-3 Mini 3.8B	~3–5 tok/s	~4–6 tok/s
Qwen2.5 1.5B	~8–12 tok/s	~10–14 tok/s

Hailo AI HAT+ (26 TOPS) offloads vision tasks and accelerates transformer layers, particularly attention heads. For text-only LLMs, the bottleneck is memory bandwidth (LPDDR4X ~40 GB/s), not compute — the Hailo does not help much for pure text inference but enables simultaneous whisper.cpp transcription + LLM without saturating the CPU.

RK3588 NPU (6 TOPS): Rockchip's NPU runs models compiled with RKNN Toolkit. Performance is competitive for quantized small models (1–3B) but the toolchain requires converting models to .rknn format. The Mali GPU can accelerate via OpenCL. Total memory bandwidth on RK3588 is ~68 GB/s, giving a meaningful advantage over Pi 5 for 7B models.

Tier 2: Dedicated AI Edge Hardware

Device	Accelerator	RAM	Power	Price	LLM Fit
Jetson Orin Nano (8GB)	1024-core Ampere GPU + 32 TOPS DLA	8 GB LPDDR5	7–15 W	~$150–200	7B models at usable speeds
Jetson Orin NX (16GB)	1024-core Ampere + 70 TOPS	16 GB LPDDR5	10–25 W	~$300–450	13B models feasible

Jetson Orin Nano benchmarks (llama.cpp with CUDA offload):

Model	Tokens/s
Gemma 3 1B (Q4)	~40–55 tok/s
Phi-3 Mini 3.8B (Q4)	~18–25 tok/s
Llama 3.2 3B (Q4)	~20–28 tok/s
Mistral 7B (Q4)	~8–12 tok/s

The Jetson line is the best single-device option for running 7B+ models at interactive speeds in a low-power form factor (~7–15W idle to full load). The tradeoff is cost and the need for active cooling.

Tier 3: Mini PCs (Higher Power, Fixed Location)

Device	CPU	GPU/NPU	RAM	Price	LLM Fit
Beelink SER7 / GMKtec	Ryzen 7840HS	Radeon 780M (iGPU, 12 CU)	32–64 GB DDR5	~$300–400	13B–34B models
Intel N100 mini PCs	Intel Alder Lake-N	Intel UHD (32 EU)	8–16 GB DDR4	~$80–150	1–3B models
Old Android phone	Snapdragon 855/865	Adreno 640/650	8–12 GB LPDDR5	~$20–50 used	1–2B via on-device framework

Ryzen 7840HS + integrated Radeon 780M is a compelling choice for fixed community spaces. The iGPU shares system RAM (no discrete VRAM limit) and llama.cpp VULKAN/ROCm support allows 13B Q4 models at 15–25 tok/s, with all 32–64 GB of system RAM available as model context. Power draw is ~25–45W under LLM load.

Inference Software

llama.cpp

The reference implementation for efficient CPU/GPU inference of quantized LLMs. Supports GGUF format, all major quantization levels (Q2 through Q8, and KV cache quantization), and backends: CPU (AVX2/SVE), CUDA, Metal, Vulkan, SYCL, OpenCL.

# Build for CPU only (Pi 5 / ARM SBCs)
cmake -B build -DGGML_NATIVE=ON
cmake --build build --config Release -j$(nproc)

# Serve as OpenAI-compatible API
./build/bin/llama-server -m models/gemma-3-1b-q4_k_m.gguf \
  --host 0.0.0.0 --port 8080 -ngl 0 --ctx-size 2048

Quantization tradeoffs on low-RAM devices:

Quant	Size (1B model)	Quality	RAM (7B model)
Q2_K	~0.5 GB	Degraded	~2.8 GB
Q4_K_M	~0.7 GB	Good	~4.1 GB
Q5_K_M	~0.9 GB	Better	~5.1 GB
Q8_0	~1.1 GB	Near full	~7.2 GB

For Pi 5 (8GB): use Q4_K_M for 1–3B models; leave ~2 GB for OS and the message relay process.

Ollama

Higher-level wrapper around llama.cpp with automatic model downloading, management, and a simple REST API. Easiest to deploy on a community server — one command installs and serves a model.

curl -fsSL https://ollama.com/install.sh | sh
ollama run gemma3:1b   # Downloads and runs immediately
# REST API available at http://localhost:11434

Ollama's /api/generate and /api/chat endpoints are drop-in compatible with OpenAI's API format. Phone apps can use any OpenAI-compatible client library pointed at the local server's IP.

llamafile

Mozilla's single-file executable format — model weights + inference engine in one file. Download and run, no installation. Ideal for community deployments where setup simplicity matters.

# Example: download and run a model as a server
wget https://huggingface.co/.../gemma-3-1b.Q4_K_M.llamafile
chmod +x gemma-3-1b.Q4_K_M.llamafile
./gemma-3-1b.Q4_K_M.llamafile --server --host 0.0.0.0 --port 8080

Moonshine (Speech-to-Text)

GitHub: usefulsensors/moonshine

Moonshine is an ASR (automatic speech recognition) model optimized for edge devices — specifically designed to run faster and more efficiently than Whisper on ARM CPUs without GPUs. Developed by Useful Sensors.

Model	Size	Pi 5 Speed	Accuracy vs Whisper
Moonshine Tiny	27 MB	~25× RT	Comparable to Whisper tiny
Moonshine Base	61 MB	~18× RT	Comparable to Whisper base

At ~25× real-time on Pi 5 (vs ~15× for Whisper tiny), Moonshine is meaningfully faster for our 30-second voice message transcription use case. It processes a 30-second clip in ~1.2 seconds vs ~2 seconds for Whisper tiny. The quality tradeoff relative to Whisper small is the main consideration — test both for your target languages.

Recommendation: Evaluate Moonshine Base as an alternative to Whisper Small for the /transcribe endpoint. If accuracy is acceptable for the community's primary languages, the speed improvement is worthwhile on Pi 5.

Piper (Text-to-Speech)

GitHub: rhasspy/piper

Piper is a fast, local neural TTS engine designed specifically for Raspberry Pi and similar SBCs. Uses a VITS (Variational Inference with adversarial learning for end-to-End Text-to-Speech) architecture.

Speed: Real-time on Pi 4/5 (can synthesize speech faster than playback speed)
Size: Models range from 28 MB (low quality) to 80 MB (high quality)
Languages: 30+ languages with multiple voice variants
RAM: ~100-150 MB per loaded model

Use case: Accessibility feature for community relay — phone app sends text to /speak endpoint, relay synthesizes and returns audio. Enables message-to-audio for users with visual impairments or low literacy, entirely offline.

echo "Community meeting tonight at 7pm" | \
  piper --model en_US-lessac-medium.onnx --output_file announcement.wav

whisper.cpp

C++ port of OpenAI's Whisper speech recognition. Runs on Pi 5 at 4–8× real-time for small/medium models, meaning a 30-second voice message transcribes in ~4–8 seconds.

Model	Size	Pi 5 Speed	Accuracy
tiny	39 MB	~15× RT	Low
base	74 MB	~10× RT	Decent
small	244 MB	~5× RT	Good
medium	769 MB	~2× RT	Very good
large-v3-turbo	809 MB	~1.5× RT	Best

NLLB-200 language detection can be run before Whisper to auto-select the correct model language. On Pi 5, a small Whisper model fits comfortably alongside a 1B LLM within 8GB RAM.

LocalAI

localai.io · GitHub

Self-hosted, OpenAI API-compatible inference server. Supports text generation (LLaMA, Mistral, Phi), image generation (Stable Diffusion), speech-to-text (Whisper), and text-to-speech — all through a single OpenAI-compatible REST API. Can replace the individual llama.cpp + whisper.cpp setup with a unified service.

docker run -p 8080:8080 localai/localai:latest-aio-cpu
# Immediately provides: /v1/chat/completions, /v1/audio/transcriptions, /v1/audio/speech

Advantage over raw llama.cpp: Model management, automatic downloading, and a unified API mean the phone app can use any standard OpenAI client library. The /v1/audio/speech endpoint adds text-to-speech (Piper backend) without additional setup.

Tradeoff: Docker adds overhead (~200-400MB RAM). On Pi 5 8GB, prefer raw llama.cpp + whisper.cpp for tighter memory control. LocalAI shines on mini PC (Tier 3) deployments where memory is not the constraint.

Jan

jan.ai · GitHub

Desktop application (Electron) for running LLMs locally — essentially a privacy-respecting ChatGPT replacement for the relay server operator's workstation. Not a server daemon but a GUI tool for model management and testing.

Relevance to the project: Jan provides a convenient interface for relay administrators to test and benchmark models before deploying them to the community server via llama.cpp/Ollama. Its model hub downloads GGUF models with one click — useful for evaluating new models.

On-Phone Inference Frameworks

For use cases where the phone should do its own local AI (bypassing the relay entirely):

Framework	Platform	Models	Notes
MediaPipe LLM	Android + iOS	Gemma 2B, Phi-2	Google; easiest integration, limited model choice
MLX	Apple Silicon (macOS/iOS)	Most GGUF/MLX models	Apple-native, very fast on M-series and A-series chips
ExecuTorch	Android + iOS	PyTorch-exported models	Meta/PyTorch; production-grade, complex setup
llama.cpp (JNI/JNA)	Android	Any GGUF	Can compile for arm64-v8a; power consumption is significant
TFLite / LiteRT	Android + iOS	Gemma, custom models	Google; mature for small models and embeddings

On modern Android phones (Snapdragon 8 Gen 2/3 with Hexagon NPU, or Dimensity 9300), 1–3B models run at 20–40 tok/s on-device. This is a viable alternative to the relay server for private/personal AI tasks. The relay server model is better for shared community tools (translation for events, public knowledge base).

Recommended Models

Model	Params	Size (Q4_K_M)	Best Use	Min RAM
Gemma 3 1B	1B	~700 MB	Translation, simple Q&A, classification	2 GB
TinyLlama 1.1B	1.1B	~700 MB	Fast text generation, classification	2 GB
Qwen2.5 1.5B	1.5B	~1.1 GB	Multilingual tasks, strong for size	3 GB
Phi-3 Mini 3.8B	3.8B	~2.4 GB	Reasoning, knowledge Q&A	5 GB
Llama 3.2 3B	3B	~2.0 GB	General purpose, good balance	4 GB
Gemma 3 4B	4B	~2.8 GB	High quality multilingual	6 GB

For Pi 5 8GB: Gemma 3 1B or TinyLlama as the always-on model; Qwen2.5 1.5B for multilingual tasks.

For Pi 5 16GB: Phi-3 Mini or Llama 3.2 3B runs comfortably alongside the relay daemon.

For Jetson Orin Nano: Llama 3.2 3B or Gemma 3 4B at interactive speeds.

NLLB-200 (No Language Left Behind, Meta) is specifically for translation — 200 language pairs. Use via CTranslate2 for fast CPU inference, or via the Hugging Face transformers API. The 600M distilled model fits in 1.2 GB and achieves competitive translation quality.

Embedding models for semantic search: nomic-embed-text (137M, 274 MB) or all-MiniLM-L6-v2 (22M, 90 MB) are the right scale for a Pi-based semantic search index over local message archives.

Use Cases in Detail

Translation (NLLB-200 + CTranslate2)

┌─────────────┐    encrypted msg    ┌──────────────────┐
│  Phone A    │ ──────────────────► │  Relay Server    │
│ (Spanish)   │                     │                  │
└─────────────┘                     │  User B requests │
                                    │  translation of  │
                                    │  *their own* msg │
                                    │  history via API │
                                    │                  │
                                    │  NLLB-200 runs   │
                                    │  locally → EN    │
                                    └──────────────────┘

The relay server cannot translate end-to-end encrypted messages it cannot read. Translation applies to:

Public/unencrypted community posts explicitly shared with the server
User-submitted text sent directly to the /translate API endpoint from the phone app

# CTranslate2 NLLB-200 inference (Python)
from ctranslate2 import Translator
translator = Translator("nllb-200-distilled-600M-ct2", device="cpu")
result = translator.translate_batch([["spa_Latn", "Hola, ¿cómo estás?"]],
    target_prefix=[["eng_Latn"]])

At ~200–400 tokens/s on Pi 5 CPU, short messages translate in under a second.

Speech-to-Text (whisper.cpp)

Voice messages sent as audio blobs. The phone sends the audio to the relay's /transcribe endpoint (opt-in, user-initiated). The relay runs whisper.cpp, returns the transcript, discards the audio. The transcript never leaves the local WiFi network.

[Phone records voice message]
       │
       ▼
[Phone sends audio to relay /transcribe]
       │
       ▼
[whisper.cpp: audio → text]  (3–8 sec on Pi 5 for 30-sec audio)
       │
       ▼
[Relay returns transcript to phone]
       │
       ▼
[Phone app displays transcript; relay discards audio]

Semantic Search (Embedding Model)

Users can search their own message archive by meaning. The embedding model runs on the relay (or on-device). Messages are embedded at write time; queries are embedded at search time; cosine similarity finds relevant results.

User query: "what did we decide about the meeting place?"
     │
     ▼
[Embed query → vector]
     │
     ▼
[Search local SQLite vector index (sqlite-vec or hnswlib)]
     │
     ▼
[Return top-k semantically similar messages from user's archive]

The embedding index is user-specific and stays local to the relay or phone. The relay has no access to the index of other users' messages.

Metadata Spam Detection

The relay can observe structural message metadata even when content is encrypted: message frequency, burst patterns, recipient fan-out, message size distributions. A small classifier (scikit-learn, ONNX runtime) trained on these features can flag accounts exhibiting coordinated inauthentic patterns without reading any message content.

This is metadata-as-defense, not metadata-as-surveillance. The classifier output is boolean (flag/no-flag) and the relay acts only on the flag, not the metadata features. Transparency requires disclosing that this runs.

Community Knowledge Base

A community maintains a collection of documents (guides, maps, rules, resources) on the relay. A RAG (Retrieval-Augmented Generation) pipeline lets community members query this knowledge base via the phone app:

[Phone app]: "What are the community guidelines for the east wing?"
     │
     ▼
[Relay: embed query → retrieve relevant doc chunks from vector store]
     │
     ▼
[Relay: LLM generates answer grounded in retrieved chunks]
     │
     ▼
[Phone displays answer with source references]

No sensitive personal data needed. The knowledge base is community-curated public content.

Smart DTN Routing

In delay-tolerant network scenarios (Prototype 6: BLE relay devices), routing decisions benefit from history. A simple ML model trained on relay encounter history (which relays meet which relays, how often, at what times) can predict optimal next-hop routing for a message to reach its destination.

This is the same approach used in Epidemic/PROPHET routing in DTN research. The model is a small time-series predictor or a lookup table, not a large LLM.

Privacy Implications

What AI Can and Cannot Do with E2E Encrypted Messages

What the relay sees               What the relay CANNOT see
─────────────────────             ─────────────────────────
• Sender public key hash          • Message plaintext
• Recipient public key hash       • Sender's name or identity
• Message size (bytes)            • Message content or topic
• Timestamp                       • Who knows whom
• Delivery status                 • Social graph
• Message count / frequency       • Any metadata sender/receiver
                                    chose to hide

E2E encryption is the non-negotiable baseline. AI on the relay operates only on the left column. Any feature that requires reading message content must be processed on the user's own device.

Opt-in Content Processing

Translation, transcription, and knowledge base Q&A are all request-initiated by the user's phone app. The relay does not proactively process any messages. The phone sends a specific request with the data the user chooses to share. There is no ambient surveillance mode.

The phone app should display a clear indicator when a request is sent to the local AI server, analogous to how apps show a spinner when making a network request.

Metadata-as-Surveillance Risk

Even without reading content, metadata reveals a great deal. The spam detection use case walks this line carefully:

Acceptable: Flagging accounts sending 500 messages/minute (obvious spam pattern)
Unacceptable: Building social graphs from who-sends-to-whom patterns, tracking user movement via message timestamps, selling or sharing metadata with third parties

Community-run relays operated transparently with open-source software and local governance are meaningfully different from commercial surveillance systems. But the technical capability exists for misuse — community trust and governance matter as much as the technical design.

Federated Learning Risks

Federated learning (training a shared model on device data without centralizing data) is sometimes proposed as a privacy-preserving AI technique. It is not appropriate for this project:

Federated learning still leaks information through gradient updates (membership inference attacks)
It creates a coordination mechanism between users that doesn't exist by design in our system
Model updates are a covert channel for metadata exfiltration
The threat model for political/activist use cases requires stronger guarantees than federated learning provides

The correct approach is: no model training on user data at all. Models are pre-trained, downloaded as static artifacts, and run in inference-only mode.

On-Device Phone AI vs Server AI

Dimension	On-Device (Phone)	Server (Relay)
Privacy	Maximum — data never leaves device	High — data leaves device only for specific user-initiated requests
Availability	Works offline, always	Requires WiFi connection to relay
Performance	Fast on flagship phones (Snapdragon 8 Gen 3, A17 Pro)	Slow-medium on Pi 5; fast on Jetson
Model size	Limited to 1–4B (thermal/battery)	Up to 7B+ depending on hardware
Battery impact	High (NPU helps; still drains fast)	None on phone
Shared functionality	Per-user only	Shared across all users (translation, knowledge base)
Update model	App store update required	Admin deploys new GGUF file
Cost	Zero (uses user's phone)	Server hardware cost shared by community

Recommendation: Use on-device AI for personal features (private semantic search of user's own messages, private transcription). Use relay AI for community features (shared translation, community knowledge base). The two modes are complementary.

Architecture: Relay Server with AI Layer

                     Community WiFi Network
    ──────────────────────────────────────────────────────

    ┌──────────────────────────────────────────────────────┐
    │              Relay Server (Pi 5 / Jetson)            │
    │                                                      │
    │  ┌─────────────────┐   ┌───────────────────────────┐ │
    │  │  Message Relay  │   │      AI Service Layer     │ │
    │  │  (existing)     │   │                           │ │
    │  │                 │   │  POST /translate           │ │
    │  │  - Store/fwd    │   │  POST /transcribe          │ │
    │  │  - E2E blobs    │   │  POST /search             │ │
    │  │  - Key registry │   │  POST /ask                │ │
    │  │                 │   │                           │ │
    │  │  Port 8443      │   │  Port 8080 (local only)   │ │
    │  └─────────────────┘   └───────────────────────────┘ │
    │                                                      │
    │  ┌────────────────────────────────────────────────┐  │
    │  │              Model Runtime                     │  │
    │  │   llama.cpp / Ollama / whisper.cpp             │  │
    │  │   NLLB-200 (CTranslate2)                       │  │
    │  │   Embedding model (nomic-embed-text)           │  │
    │  └────────────────────────────────────────────────┘  │
    │                                                      │
    │  ┌───────────────────────────┐                       │
    │  │  Vector Store             │                       │
    │  │  sqlite-vec / hnswlib     │                       │
    │  │  (per-user, isolated)     │                       │
    │  └───────────────────────────┘                       │
    └──────────────────────────────────────────────────────┘
              ▲              ▲              ▲
              │              │              │
          Phone A         Phone B        Phone C

API Exposure

The AI service layer is exposed only on the local WiFi network — not on any internet-facing interface. Requests require the same authentication token as the message relay (derived from the user's key material, established during initial pairing). Rate limiting prevents a single user from monopolizing the compute.

# Example: Phone app sends translation request
curl -X POST http://192.168.4.1:8080/translate \
  -H "Authorization: Bearer <user-token>" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hola mundo", "source_lang": "spa_Latn", "target_lang": "eng_Latn"}'

The LLM inference process runs at low priority (Linux nice +15). The message relay process gets CPU priority. On Pi 5, idle LLM inference uses ~200–400 MB RAM with the model loaded; the relay daemon uses ~50–100 MB. Total overhead with Gemma 3 1B loaded: ~600–800 MB — well within the 8GB budget.

Model loading time is the main latency concern: cold-starting llama.cpp takes 3–8 seconds to load a 1B GGUF into memory. The solution is keep-alive: the model stays loaded in memory between requests. If the relay has been idle for 30 minutes, evict the model from memory to free RAM for message buffering.

Sneakernet Model Updates

In air-gapped or low-connectivity deployments, model updates arrive via USB drive or SD card — the same "sneakernet" data distribution mechanism described in the Willow protocol specification. A relay admin inserts a drive containing a new GGUF file; a systemd path unit triggers the model swap automatically.

# /etc/systemd/system/llm-model-watcher.path
[Path]
PathExistsGlob=/media/usb/*.gguf

[Install]
WantedBy=multi-user.target

This keeps model updates entirely offline, consistent with the project's offline-first design philosophy.

Research Prototypes — Prototype 5 (WiFi relay server) and Prototype 6 (BLE relay) are the hardware context for this page
Architecture — System design for the full Connect application
Location APIs & Security — Privacy threat models relevant to metadata handling

Updates (2025–2026)

New Hardware: Raspberry Pi AI HAT+ 2 (Hailo-10H, 40 TOPS)

⚠ Clarification on existing content: The page states the Hailo AI HAT+ (26 TOPS) enables "Vision + small LLM together" but "does not help much for pure text inference." This is accurate for the original Hailo-8-based HAT+ (2024). In January 2026, Raspberry Pi released the AI HAT+ 2 (raspberrypi.com), built on the Hailo-10H chip at 40 TOPS INT4 with 8 GB of dedicated on-board RAM. The HAT+ 2 adds genuine LLM capability — approximately 9.45 tok/s on Qwen2-1.5B in early benchmarks — and removes the memory-bandwidth bottleneck that limits pure-CPU inference on the Pi 5. Priced at $130, it is the first Pi accessory that materially accelerates text LLM inference. The hardware tier table should be updated to distinguish the original HAT+ (CV-only) from the HAT+ 2 (LLM-capable).

New Models: Gemma 3 Family (March 2025)

Google released the full Gemma 3 family in March 2025 with sizes 270M, 1B, 4B, 12B, and 27B (blog.google). The 4B model fits in under 3 GB at Q4_K_M and adds 128K context with improved multilingual coverage. For community relay servers, the Gemma 3 4B is the preferred single-model choice on Pi 5 16 GB, while the Gemma 3 1B remains the best option for 8 GB deployments. Benchmarks from a Pi 5 evaluation (stratosphereips.org) confirm Gemma 3 1B has the highest throughput among tested models on Pi 5 CPU.

New Models: Qwen3 Small Series Supersedes Qwen2.5 (April 2025)

Alibaba released Qwen3 under Apache 2.0 in April 2025 (qwenlm.github.io), including 0.6B, 1.7B, and 4B dense models. Qwen3-1.7B matches or exceeds the earlier Qwen2.5-3B on standard benchmarks. All three small variants run on Ollama, llama.cpp, and MLX without modification. Qwen3-1.7B supersedes Qwen2.5-1.5B as the recommended 1–2B class model for community relay servers — slightly larger but substantially more capable.

New Models: Phi-4-Mini Supersedes Phi-3 Mini (February 2025)

Microsoft released Phi-4-Mini (3.8B) in February 2025 (azure.microsoft.com), which matches 8B-class models on reasoning and math while fitting in 4 GB RAM. A subsequent Phi-4-Mini-Flash-Reasoning variant achieves 10× higher throughput and 2–3× lower latency compared to Phi-4-Mini at the same parameter count. Phi-4-Mini supersedes Phi-3 Mini 3.8B as the recommended 3–4B model: same RAM footprint, meaningfully better benchmark scores.

Llama 4 (April 2025) — Not Recommended for Relay Servers

Meta released Llama 4 Scout (17B active / 16 experts MoE) and Maverick in April 2025 (ai.meta.com). These are multimodal mixture-of-experts models designed for datacenter inference. The smallest Llama 4 model activates 17B parameters — too large for the hardware tiers covered in this page. Llama 3.2 1B/3B remains the recommended Llama-family option for community relay servers on current Tier 1–2 hardware.

Ollama: MLX Backend and Minions Protocol (2025–2026)

In March 2026, Ollama v0.19 switched its Apple Silicon backend to MLX (ollama.com/blog/mlx), with decode speeds jumping from 58 to 112 tok/s for M-series Mac mini relay deployments. In February 2025, the Minions framework enabled small on-device models to collaborate with larger cloud models by shifting sub-tasks to local hardware — applicable to hybrid relay architectures where internet connectivity is intermittent.

whisper.cpp: Integrated-GPU Acceleration (2025)

whisper.cpp release 1.8.3 added integrated-GPU (iGPU) support for AMD and Intel graphics, reporting up to a 12× performance boost over CPU-only mode (phoronix.com). For Tier 3 mini PCs (Ryzen 7840HS with Radeon 780M iGPU) this means transcription overhead is largely removed from the CPU budget, allowing simultaneous LLM inference and STT without contention.