High-Level Architecture
A bird's-eye view of how every component connects, with the gateway as the central orchestrator.
Key Details
- Two physical machines connected via Tailscale VPN or LAN (also supports single-machine deployment)
- Machine 1 runs the web app, gateway, SQLite database, SearXNG, and Kokoro TTS
- Machine 2 runs GPU inference servers (supports MLX, Ollama, vLLM, llama.cpp, or OpenAI-compatible APIs)
- No Supabase dependency — all data stored locally in SQLite with sqlite-vec for vector search and FTS5 for full-text search
- All inter-service communication is HTTP; no message queues or gRPC
- Gateway is the central orchestrator — routes to inference, RAG, web search, TTS, memory, summarization, and database
Request Flow — Chat Completion
The main user-facing flow when sending a chat message, including web search, RAG, memory, and summarization with SSE streaming.
Key Details
- Parallel context enrichment: web search, RAG retrieval, and memory recall happen simultaneously
- SSE streaming delivers tokens incrementally for real-time rendering
- Background tasks (memory extraction, summarization) run after each response
- JWT authentication on every request with user-scoped data isolation
- Evidence from web, RAG, and memory is injected into the system prompt
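The parallel enrichment step can be sketched with `asyncio.gather`: the three lookups are independent, so they run concurrently before the prompt is assembled. The stub functions below are illustrative stand-ins; the real gateway calls SearXNG, the RAG retriever, and the memory store over HTTP.

```python
import asyncio

# Hypothetical enrichment stubs -- the real gateway makes HTTP calls here.
async def web_search(query: str) -> str:
    return f"web evidence for {query!r}"

async def rag_retrieve(query: str) -> str:
    return f"rag chunks for {query!r}"

async def memory_recall(query: str) -> str:
    return f"memories for {query!r}"

async def enrich_context(query: str) -> dict:
    # All three lookups run simultaneously; total latency is the slowest one.
    web, rag, memories = await asyncio.gather(
        web_search(query), rag_retrieve(query), memory_recall(query)
    )
    return {"web": web, "rag": rag, "memories": memories}

evidence = asyncio.run(enrich_context("latest sqlite-vec release"))
```

The gathered evidence is then injected into the system prompt before streaming begins.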
Authentication Flow
Gateway-managed authentication with JWT tokens. Supports single-user (password-only) and multi-user (email + password) modes.
Key Details
- Gateway-managed auth — no external auth provider
- JWTs signed with HS256 using a configurable secret
- Single-user mode: password-only login, default user ID
- Multi-user mode: email + password registration and login
- Token expiry: 7 days; a 401 response triggers automatic logout and redirect
- All database queries filter by user_id for data isolation
Inference Mode Routing
Three inference modes (instant, thinking, thinking_harder) with automatic fallback. Supports MLX, Ollama, vLLM, llama.cpp, and OpenAI-compatible backends.
Key Details
- THINKING_DAILY_REQUEST_LIMIT caps heavy inference per user per day (default: 100)
- THINKING_MAX_CONCURRENT limits parallel thinking requests (default: 2)
- Fallback to instant tier is configurable
- Cold-start timeout: 60s (configurable)
- Supported backends: MLX (Apple Silicon), Ollama, vLLM, llama.cpp, OpenAI API
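The two cost controls compose naturally as a per-user daily counter plus an `asyncio.Semaphore`. This is a minimal sketch, assuming an async gateway; the backend-call and instant-tier functions are stand-ins, not the project's actual code.

```python
import asyncio
from collections import defaultdict
from datetime import date

THINKING_DAILY_REQUEST_LIMIT = 100  # heavy inference cap per user per day
THINKING_MAX_CONCURRENT = 2         # parallel thinking requests

_semaphore = asyncio.Semaphore(THINKING_MAX_CONCURRENT)
_daily_counts: dict = defaultdict(int)  # (user_id, date) -> request count

async def run_instant(prompt: str) -> str:
    # Stand-in for the fast, no-chain-of-thought tier.
    return f"instant: {prompt}"

async def call_thinking_backend(prompt: str) -> str:
    await asyncio.sleep(0)  # stand-in for the real HTTP inference call
    return f"thinking: {prompt}"

async def run_thinking(user_id: str, prompt: str) -> str:
    key = (user_id, date.today())
    if _daily_counts[key] >= THINKING_DAILY_REQUEST_LIMIT:
        return await run_instant(prompt)  # configurable fallback tier
    _daily_counts[key] += 1
    async with _semaphore:  # at most 2 thinking requests in flight
        return await call_thinking_backend(prompt)
```

Requests past the daily cap degrade gracefully to the instant tier rather than failing.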
Web Search Pipeline
LLM-powered intent classification triggers self-hosted SearXNG search with full content extraction, deduplication, and trust scoring.
Key Details
- Powered by SearXNG — self-hosted, privacy-respecting, no API keys needed
- LLM intent classification decides whether a query needs web results
- Fetches full page content for top results via trafilatura (up to 2000 chars each)
- Source trust classification: official, reference, forum, news, web
- Duplicate detection via Jaccard similarity on word sets
- Smart search triggers: temporal keywords, news, weather, commerce, explicit intent
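Jaccard similarity on word sets is simple to state precisely: intersection over union of the two results' word sets. A minimal sketch (the 0.8 drop threshold is an assumed value, not stated in the source):

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity on lowercase word sets: |A & B| / |A | B|."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

def dedupe(snippets: list[str], threshold: float = 0.8) -> list[str]:
    # Keep a snippet only if it is sufficiently different from everything kept so far.
    kept: list[str] = []
    for s in snippets:
        if all(jaccard(s, k) < threshold for k in kept):
            kept.append(s)
    return kept
```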
RAG Pipeline
Hybrid vector + BM25 retrieval with cross-encoder reranking. Documents are chunked, embedded, and indexed for accurate retrieval.
Key Details
- Hybrid search: vector similarity (sqlite-vec) + BM25 full-text (FTS5) fused via RRF
- Cross-encoder reranking for final relevance scoring (ms-marco-MiniLM-L6-v2)
- Contextual retrieval: LLM generates context prefixes for each chunk (+35-49% retrieval quality)
- LLM query rewriting to improve retrieval
- Embedding model: configurable (default: all-MiniLM-L6-v2, 384-dim)
- Match count: 20 candidates, threshold: 0.3, max context: 12000 chars
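Reciprocal Rank Fusion (RRF) merges the two ranked lists before reranking: each document scores the sum of 1 / (k + rank) across the lists it appears in. A minimal sketch, using the conventional k = 60 (the project's actual constant is not stated here):

```python
def rrf_fuse(vector_ids: list[str], bm25_ids: list[str], k: int = 60) -> list[str]:
    """Fuse two rankings: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores: dict[str, float] = {}
    for ranking in (vector_ids, bm25_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Documents found by both retrievers accumulate score from both lists.
    return sorted(scores, key=scores.get, reverse=True)
```

The fused candidate list (up to 20 here) is then passed to the cross-encoder for final relevance scoring.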
Voice Pipeline
Full voice conversation with wake-word detection, speech-to-text via Whisper, speaker verification, and streaming text-to-speech via Kokoro.
Key Details
- STT: faster-whisper (local Whisper inference, 'tiny' model ~75MB, CTranslate2 backend)
- TTS: Kokoro FastAPI (Docker, CPU) with 12 voice options (British/American, male/female)
- Speaker verification: MFCC-based voice embeddings, 3-sample enrollment, similarity threshold 0.82
- Wake-word detection: browser-side voice activity detection ('Hey Chat', 'Hello Chat', 'Hi Chat')
- Streaming TTS: sentences play as the LLM generates them (continuous audio)
- Voice settings: configurable voice, speed (0.5x-2.0x), auto-read toggle
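Speaker verification reduces to comparing embeddings against an enrolled voiceprint. A sketch under stated assumptions: MFCC extraction itself happens upstream (e.g. via an audio library) and the similarity metric is assumed to be cosine, with the 0.82 threshold from above.

```python
import math

SIMILARITY_THRESHOLD = 0.82  # from the voice pipeline configuration

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def enroll(samples: list[list[float]]) -> list[float]:
    # 3-sample enrollment: average the MFCC embeddings into one voiceprint.
    dims = len(samples[0])
    return [sum(s[i] for s in samples) / len(samples) for i in range(dims)]

def verify(voiceprint: list[float], embedding: list[float]) -> bool:
    return cosine(voiceprint, embedding) >= SIMILARITY_THRESHOLD
```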
Database Schema
SQLite database with WAL mode, sqlite-vec for vector search, and FTS5 for full-text search. All data stored locally with user isolation.
Key Details
- SQLite with WAL mode enabled for concurrent reads
- Foreign key constraints enforced for referential integrity
- sqlite-vec virtual tables for vector similarity search
- FTS5 virtual table for BM25 full-text search
- mode_used supports: instant, thinking, thinking_harder
- user_memories stores cross-session facts with categories: preference, background, project, instruction
- rag_query_metrics tracks RAG search performance for analytics
- Schema auto-initialized on first gateway start
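The initialization steps above can be sketched with Python's built-in sqlite3. Table and column names below are illustrative, not the project's actual schema; the sqlite-vec virtual tables are omitted because they require the loadable extension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA journal_mode=WAL")  # concurrent readers alongside one writer
conn.execute("PRAGMA foreign_keys=ON")   # enforce referential integrity

# Illustrative subset of the schema -- real tables and columns may differ.
conn.executescript("""
CREATE TABLE users (id TEXT PRIMARY KEY);
CREATE TABLE messages (
  id INTEGER PRIMARY KEY,
  user_id TEXT NOT NULL REFERENCES users(id),
  content TEXT NOT NULL,
  mode_used TEXT CHECK (mode_used IN ('instant', 'thinking', 'thinking_harder'))
);
-- FTS5 virtual table backing BM25 full-text search
CREATE VIRTUAL TABLE messages_fts USING fts5(content);
""")
```

WAL mode and the foreign-key pragma are per-connection settings, which is why the gateway applies them at startup before running the schema script.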
Gateway API Routes
Complete REST API surface of the FastAPI gateway. All routes require JWT authentication except /auth endpoints.
Key Details
- /auth — Authentication endpoints (login, register, mode detection)
- /health — Health checks for gateway, auth validation, and inference status
- /chat — Core chat operations: messaging, streaming, sessions, voice, TTS
- /documents — Document upload, listing, deletion, and session attachment for RAG
- All /chat and /documents routes require JWT Bearer token
- POST /chat/stream is the primary endpoint — returns SSE events
Deployment Topology
Physical deployment across one or two machines, connected via Tailscale VPN mesh or LAN. All IPs are configurable via the CLI wizard.
Key Details
- Two-machine: Gateway/webapp on server, inference on GPU node (Tailscale/LAN)
- Single-machine: Everything on localhost (configured via molebie-ai install)
- Auto-pull daemon: macOS LaunchAgent polls git and auto-updates on new commits
- IPs are configurable via CLI wizard (no hardcoded Tailscale IPs)
Frontend Page Structure
Next.js 16 App Router with React 19. Dark glass UI theme with responsive mobile design, voice mode, document panel, and rich markdown rendering.
Key Details
- Frontend stack: Next.js 16 (App Router), React 19, TypeScript, Tailwind CSS v4
- Voice conversation mode with wake-word detection and streaming TTS
- Document upload/attachment for RAG and per-session context
- Image upload via paste, drag-and-drop, or file picker (stored locally)
- KaTeX math rendering and syntax-highlighted code blocks
- Session pinning, search, export, and responsive mobile drawer sidebar
CLI Tool — molebie-ai
Python CLI built with Typer + Rich. Handles installation, configuration, service management, model downloads, and diagnostics.
Key Details
- Framework: Python + Typer + Rich
- Config storage: .molebie/config.json (version 2)
- Auto-generates .env.local from CLI config (including random JWT secret)
- Prerequisite checker: detects and offers to install missing dependencies
- Service manager: starts/stops all services via subprocess
- Model management: download, remove, start, and stop LLM models per backend
- Doctor: diagnose and optionally fix setup issues
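The .env.local generation step can be sketched with the standard library: read the JSON config, mint a random secret with `secrets`, and render key=value lines. The env-variable names below are illustrative assumptions; the real wizard writes its own set.

```python
import json
import secrets
from pathlib import Path

def generate_env(config_path: Path, env_path: Path) -> None:
    """Render .env.local from the CLI's JSON config, minting a JWT secret."""
    config = json.loads(config_path.read_text())
    lines = [
        f"GATEWAY_HOST={config.get('gateway_host', '127.0.0.1')}",
        f"GATEWAY_PORT={config.get('gateway_port', 8000)}",
        f"JWT_SECRET={secrets.token_hex(32)}",  # random per install
    ]
    env_path.write_text("\n".join(lines) + "\n")
```

Generating the secret at install time means no default secret ever ships in the repository.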
Memory & Summarization
Cross-session memory extracts and stores user facts/preferences. Rolling conversation summarization manages the context window.
Key Details
- Cross-session memory: extracts and stores user facts/preferences across conversations
- Categories: preference, background, project, instruction
- Deduplication via cosine similarity (threshold: 0.9)
- Retrieval: top 5 memories by vector similarity (threshold: 0.5)
- Auto-extraction every 6 messages (configurable)
- Rolling conversation summaries triggered at 16+ unsummarized messages
- Keeps last 10 messages raw (not summarized)
- Max 200 memories per user with access tracking for relevance decay
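The rolling-summary trigger logic can be sketched as pure bookkeeping: once 16+ messages sit past the last summary point, everything except the most recent 10 is folded into the summary. Function and variable names here are illustrative, not the project's actual API.

```python
SUMMARY_TRIGGER = 16  # unsummarized messages that trigger a rolling summary
KEEP_RAW = 10         # most recent messages are never summarized

def plan_summarization(messages: list[str], summarized_upto: int) -> tuple[list[str], int]:
    """Return (messages to fold into the summary, new summarized_upto index)."""
    unsummarized = messages[summarized_upto:]
    if len(unsummarized) < SUMMARY_TRIGGER:
        return [], summarized_upto          # not enough backlog yet
    cutoff = len(messages) - KEEP_RAW       # keep the last 10 raw
    return messages[summarized_upto:cutoff], cutoff
```

Keeping the tail raw preserves exact recent context for the model while the summary compresses the older history.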
Service Summary
| Service | Port | Purpose |
|---|---|---|
| Web App | 3000 | Chat UI, auth, voice, documents, images |
| Gateway | 8000 | Auth, routing, DB, inference proxy, RAG, web search, TTS, memory |
| SQLite DB | — | Local database with vector + full-text search |
| Thinking LLM | 8080 | Deep reasoning with chain-of-thought |
| Instant LLM | 8081 | Fast responses, no CoT |
| SearXNG | 8888 | Self-hosted web search (no API keys) |
| Kokoro TTS | 8880 | Text-to-speech (12 voices, CPU) |
| Tailscale | — | Connects server + GPU node (optional) |
| CLI | — | Setup wizard, service management, diagnostics |
The gateway is the central orchestrator: it authenticates every request, manages sessions in SQLite, routes to inference tiers, enriches context with web search and RAG results, retrieves cross-session memories, handles voice transcription and synthesis, manages image attachments, and applies cost controls.