High-Level Architecture
A bird's-eye view of how every component connects, with the gateway as the central orchestrator.
Key Details
- Two physical machines connected via Tailscale VPN or LAN (also supports single-machine deployment)
- Machine 1 runs the web app, gateway, SQLite database, SearXNG, and Kokoro TTS
- Machine 2 runs GPU inference servers (supports MLX, Ollama, vLLM, llama.cpp, or OpenAI-compatible APIs)
- No Supabase dependency — all data stored locally in SQLite with sqlite-vec for vector search and FTS5 for full-text search
- All inter-service communication is HTTP; no message queues or gRPC
- Gateway is the central orchestrator — routes to inference, RAG, web search, TTS, memory, summarization, and database
Request Flow — Chat Completion
The main user-facing flow when sending a chat message, including web search, RAG, memory, and summarization with SSE streaming.
Key Details
- Parallel context enrichment: web search, RAG retrieval, and memory recall happen simultaneously
- SSE streaming delivers tokens incrementally for real-time rendering
- Background tasks (memory extraction, summarization) run after each response
- JWT authentication on every request with user-scoped data isolation
- Evidence from web, RAG, and memory is injected into the system prompt
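The parallel enrichment step can be sketched with `asyncio.gather`: the three lookups are independent, so they run concurrently before the prompt is assembled. The stub functions below are illustrative stand-ins; the real gateway calls SearXNG, the RAG retriever, and the memory store over HTTP.

```python
import asyncio

# Hypothetical enrichment stubs -- the real gateway makes HTTP calls here.
async def web_search(query: str) -> str:
    return f"web evidence for {query!r}"

async def rag_retrieve(query: str) -> str:
    return f"rag chunks for {query!r}"

async def memory_recall(query: str) -> str:
    return f"memories for {query!r}"

async def enrich_context(query: str) -> dict:
    # All three lookups run simultaneously; total latency is the slowest one.
    web, rag, memories = await asyncio.gather(
        web_search(query), rag_retrieve(query), memory_recall(query)
    )
    return {"web": web, "rag": rag, "memories": memories}

evidence = asyncio.run(enrich_context("latest sqlite-vec release"))
```

The gathered evidence is then injected into the system prompt before streaming begins.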
Authentication Flow
Gateway-managed authentication with JWT tokens. Supports single-user (password-only) and multi-user (email + password) modes.
Key Details
- Gateway-managed auth — no external auth provider
- JWTs signed with HS256 using a configurable secret
- Single-user mode: password-only login, default user ID
- Multi-user mode: email + password registration and login
- Token expiry: 7 days; a 401 response triggers automatic logout and redirect
- All database queries filter by user_id for data isolation
Inference Mode Routing
Three inference modes (instant, thinking, thinking_harder) with automatic fallback. Supports MLX, Ollama, vLLM, llama.cpp, and OpenAI-compatible backends.
Key Details
- THINKING_DAILY_REQUEST_LIMIT caps heavy inference per user per day (default: 100)
- THINKING_MAX_CONCURRENT limits parallel thinking requests (default: 2)
- Fallback to instant tier is configurable
- Cold-start timeout: 60s (configurable)
- Supported backends: MLX (Apple Silicon), Ollama, vLLM, llama.cpp, OpenAI API
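The two cost controls compose naturally as a per-user daily counter plus an `asyncio.Semaphore`. This is a minimal sketch, assuming an async gateway; the backend-call and instant-tier functions are stand-ins, not the project's actual code.

```python
import asyncio
from collections import defaultdict
from datetime import date

THINKING_DAILY_REQUEST_LIMIT = 100  # heavy inference cap per user per day
THINKING_MAX_CONCURRENT = 2         # parallel thinking requests

_semaphore = asyncio.Semaphore(THINKING_MAX_CONCURRENT)
_daily_counts: dict = defaultdict(int)  # (user_id, date) -> request count

async def run_instant(prompt: str) -> str:
    # Stand-in for the fast, no-chain-of-thought tier.
    return f"instant: {prompt}"

async def call_thinking_backend(prompt: str) -> str:
    await asyncio.sleep(0)  # stand-in for the real HTTP inference call
    return f"thinking: {prompt}"

async def run_thinking(user_id: str, prompt: str) -> str:
    key = (user_id, date.today())
    if _daily_counts[key] >= THINKING_DAILY_REQUEST_LIMIT:
        return await run_instant(prompt)  # configurable fallback tier
    _daily_counts[key] += 1
    async with _semaphore:  # at most 2 thinking requests in flight
        return await call_thinking_backend(prompt)
```

Requests past the daily cap degrade gracefully to the instant tier rather than failing.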
Web Search Pipeline
LLM-powered intent classification triggers self-hosted SearXNG search with full content extraction, deduplication, and trust scoring.
Key Details
- Powered by SearXNG — self-hosted, privacy-respecting, no API keys needed
- LLM intent classification decides whether a query needs web results
- Fetches full page content for top results via trafilatura (up to 2000 chars each)
- Source trust classification: official, reference, forum, news, web
- Duplicate detection via Jaccard similarity on word sets
- Smart search triggers: temporal keywords, news, weather, commerce, explicit intent
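Jaccard similarity on word sets is simple to state precisely: intersection over union of the two results' word sets. A minimal sketch (the 0.8 drop threshold is an assumed value, not stated in the source):

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity on lowercase word sets: |A & B| / |A | B|."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

def dedupe(snippets: list[str], threshold: float = 0.8) -> list[str]:
    # Keep a snippet only if it is sufficiently different from everything kept so far.
    kept: list[str] = []
    for s in snippets:
        if all(jaccard(s, k) < threshold for k in kept):
            kept.append(s)
    return kept
```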
RAG Pipeline
Hybrid vector + BM25 retrieval with cross-encoder reranking. Documents are chunked, embedded, and indexed for accurate retrieval.
Key Details
- Hybrid search: vector similarity (sqlite-vec) + BM25 full-text (FTS5) fused via RRF
- Cross-encoder reranking for final relevance scoring (ms-marco-MiniLM-L6-v2)
- Contextual retrieval: LLM generates context prefixes for each chunk (+35-49% retrieval quality)
- LLM query rewriting to improve retrieval
- Embedding model: configurable (default: all-MiniLM-L6-v2, 384-dim)
- Match count: 20 candidates, threshold: 0.3, max context: 12000 chars
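Reciprocal Rank Fusion (RRF) merges the two ranked lists before reranking: each document scores the sum of 1 / (k + rank) across the lists it appears in. A minimal sketch, using the conventional k = 60 (the project's actual constant is not stated here):

```python
def rrf_fuse(vector_ids: list[str], bm25_ids: list[str], k: int = 60) -> list[str]:
    """Fuse two rankings: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores: dict[str, float] = {}
    for ranking in (vector_ids, bm25_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Documents found by both retrievers accumulate score from both lists.
    return sorted(scores, key=scores.get, reverse=True)
```

The fused candidate list (up to 20 here) is then passed to the cross-encoder for final relevance scoring.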
Voice Pipeline
Full voice conversation with wake-word detection, speech-to-text via Whisper, speaker verification, and streaming text-to-speech via Kokoro.
Key Details
- STT: faster-whisper (local Whisper inference, 'tiny' model ~75MB, CTranslate2 backend)
- TTS: Kokoro FastAPI (Docker, CPU) with 12 voice options (British/American, male/female)
- Speaker verification: MFCC-based voice embeddings, 3-sample enrollment, similarity threshold 0.82
- Wake-word detection: browser-side voice activity detection ('Hey Chat', 'Hello Chat', 'Hi Chat')
- Streaming TTS: sentences play as the LLM generates them (continuous audio)
- Voice settings: configurable voice, speed (0.5x-2.0x), auto-read toggle
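Speaker verification reduces to comparing embeddings against an enrolled voiceprint. A sketch under stated assumptions: MFCC extraction itself happens upstream (e.g. via an audio library) and the similarity metric is assumed to be cosine, with the 0.82 threshold from above.

```python
import math

SIMILARITY_THRESHOLD = 0.82  # from the voice pipeline configuration

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def enroll(samples: list[list[float]]) -> list[float]:
    # 3-sample enrollment: average the MFCC embeddings into one voiceprint.
    dims = len(samples[0])
    return [sum(s[i] for s in samples) / len(samples) for i in range(dims)]

def verify(voiceprint: list[float], embedding: list[float]) -> bool:
    return cosine(voiceprint, embedding) >= SIMILARITY_THRESHOLD
```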
Database Schema
SQLite database with WAL mode, sqlite-vec for vector search, and FTS5 for full-text search. All data stored locally with user isolation.
Key Details
- SQLite with WAL mode enabled for concurrent reads
- Foreign key constraints enforced for referential integrity
- sqlite-vec virtual tables for vector similarity search
- FTS5 virtual table for BM25 full-text search
- mode_used supports: instant, thinking, thinking_harder
- user_memories stores cross-session facts with categories: preference, background, project, instruction
- rag_query_metrics tracks RAG search performance for analytics
- Schema auto-initialized on first gateway start
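The initialization steps above can be sketched with Python's built-in sqlite3. Table and column names below are illustrative, not the project's actual schema; the sqlite-vec virtual tables are omitted because they require the loadable extension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA journal_mode=WAL")  # concurrent readers alongside one writer
conn.execute("PRAGMA foreign_keys=ON")   # enforce referential integrity

# Illustrative subset of the schema -- real tables and columns may differ.
conn.executescript("""
CREATE TABLE users (id TEXT PRIMARY KEY);
CREATE TABLE messages (
  id INTEGER PRIMARY KEY,
  user_id TEXT NOT NULL REFERENCES users(id),
  content TEXT NOT NULL,
  mode_used TEXT CHECK (mode_used IN ('instant', 'thinking', 'thinking_harder'))
);
-- FTS5 virtual table backing BM25 full-text search
CREATE VIRTUAL TABLE messages_fts USING fts5(content);
""")
```

WAL mode and the foreign-key pragma are per-connection settings, which is why the gateway applies them at startup before running the schema script.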
Gateway API Routes
Complete REST API surface of the FastAPI gateway. All routes require JWT authentication except /auth endpoints.
Key Details
- /auth — Authentication endpoints (login, register, mode detection)
- /health — Health checks for gateway, auth validation, and inference status
- /chat — Core chat operations: messaging, streaming, sessions, voice, TTS
- /documents — Document upload, listing, deletion, and session attachment for RAG
- All /chat and /documents routes require JWT Bearer token
- POST /chat/stream is the primary endpoint — returns SSE events
Deployment Topology
Physical deployment across one or two machines, connected via Tailscale VPN mesh or LAN. All IPs are configurable via the CLI wizard.
Key Details
- Two-machine: Gateway/webapp on server, inference on GPU node (Tailscale/LAN)
- Single-machine: Everything on localhost (configured via molebie-ai install)
- Auto-pull daemon: macOS LaunchAgent polls git and auto-updates on new commits
- IPs are configurable via CLI wizard (no hardcoded Tailscale IPs)
Frontend Page Structure
Next.js 16 App Router with React 19. Dark glass UI theme with responsive mobile design, voice mode, document panel, and rich markdown rendering.
Key Details
- Frontend stack: Next.js 16 (App Router), React 19, TypeScript, Tailwind CSS v4
- Voice conversation mode with wake-word detection and streaming TTS
- Document upload/attachment for RAG and per-session context
- Image upload via paste, drag-and-drop, or file picker (stored locally)
- KaTeX math rendering and syntax-highlighted code blocks
- Session pinning, search, export, and responsive mobile drawer sidebar
CLI Tool — molebie-ai
Python CLI built with Typer + Rich. Handles installation, configuration, service management, model downloads, and diagnostics.
Key Details
- Framework: Python + Typer + Rich
- Config storage: .molebie/config.json (version 2)
- Auto-generates .env.local from CLI config (including random JWT secret)
- Prerequisite checker: detects and offers to install missing dependencies
- Service manager: starts/stops all services via subprocess
- Model management: download, remove, start, and stop LLM models per backend
- Doctor: diagnose and optionally fix setup issues
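The .env.local generation step can be sketched with the standard library: read the JSON config, mint a random secret with `secrets`, and render key=value lines. The env-variable names below are illustrative assumptions; the real wizard writes its own set.

```python
import json
import secrets
from pathlib import Path

def generate_env(config_path: Path, env_path: Path) -> None:
    """Render .env.local from the CLI's JSON config, minting a JWT secret."""
    config = json.loads(config_path.read_text())
    lines = [
        f"GATEWAY_HOST={config.get('gateway_host', '127.0.0.1')}",
        f"GATEWAY_PORT={config.get('gateway_port', 8000)}",
        f"JWT_SECRET={secrets.token_hex(32)}",  # random per install
    ]
    env_path.write_text("\n".join(lines) + "\n")
```

Generating the secret at install time means no default secret ever ships in the repository.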
Memory & Summarization
Cross-session memory extracts and stores user facts/preferences. Rolling conversation summarization manages the context window.
Key Details
- Cross-session memory: extracts and stores user facts/preferences across conversations
- Categories: preference, background, project, instruction
- Deduplication via cosine similarity (threshold: 0.9)
- Retrieval: top 5 memories by vector similarity (threshold: 0.5)
- Auto-extraction every 6 messages (configurable)
- Rolling conversation summaries triggered at 16+ unsummarized messages
- Keeps last 10 messages raw (not summarized)
- Max 200 memories per user with access tracking for relevance decay
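The rolling-summary trigger logic can be sketched as pure bookkeeping: once 16+ messages sit past the last summary point, everything except the most recent 10 is folded into the summary. Function and variable names here are illustrative, not the project's actual API.

```python
SUMMARY_TRIGGER = 16  # unsummarized messages that trigger a rolling summary
KEEP_RAW = 10         # most recent messages are never summarized

def plan_summarization(messages: list[str], summarized_upto: int) -> tuple[list[str], int]:
    """Return (messages to fold into the summary, new summarized_upto index)."""
    unsummarized = messages[summarized_upto:]
    if len(unsummarized) < SUMMARY_TRIGGER:
        return [], summarized_upto          # not enough backlog yet
    cutoff = len(messages) - KEEP_RAW       # keep the last 10 raw
    return messages[summarized_upto:cutoff], cutoff
```

Keeping the tail raw preserves exact recent context for the model while the summary compresses the older history.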
Service Summary
| Service | Port | Purpose |
|---|---|---|
| Web App | 3000 | Chat UI, auth, voice, documents, images |
| Gateway | 8000 | Auth, routing, DB, inference proxy, RAG, web search, TTS, memory |
| SQLite DB | — | Local database with vector + full-text search |
| Thinking LLM | 8080 | Deep reasoning with chain-of-thought |
| Instant LLM | 8081 | Fast responses, no CoT |
| SearXNG | 8888 | Self-hosted web search (no API keys) |
| Kokoro TTS | 8880 | Text-to-speech (12 voices, CPU) |
| Tailscale | — | Connects server + GPU node (optional) |
| CLI | — | Setup wizard, service management, diagnostics |
The gateway is the central orchestrator: it authenticates every request, manages sessions in SQLite, routes to inference tiers, enriches context with web search and RAG results, retrieves cross-session memories, handles voice transcription and synthesis, manages image attachments, and applies cost controls.