System Architecture

Complete infrastructure and architecture of Molebie AI — covering all services, data flows, authentication, database schema, and deployment topology across 13 detailed diagrams.

01. High-Level Architecture

Bird's-eye view of how every component connects. Two physical machines connected via Tailscale VPN or LAN, with the gateway as the central orchestrator.

Key Details

  • Two physical machines connected via Tailscale VPN or LAN (also supports single-machine deployment)
  • Machine 1 runs the web app, gateway, SQLite database, SearXNG, and Kokoro TTS
  • Machine 2 runs GPU inference servers (supports MLX, Ollama, vLLM, llama.cpp, or OpenAI-compatible APIs)
  • No Supabase dependency — all data stored locally in SQLite with sqlite-vec for vector search and FTS5 for full-text search
  • All inter-service communication is HTTP; no message queues or gRPC
  • Gateway is the central orchestrator — routes to inference, RAG, web search, TTS, memory, summarization, and database
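Because every inter-service hop is plain HTTP, the gateway only needs a table of hosts and ports to reach its peers. A minimal sketch of such a registry, assuming a flat dict of configurable endpoints (the hostnames and entry names here are illustrative, not the real configuration keys):

```python
# Hypothetical service registry for the gateway; hostnames are placeholders
# for values supplied by the CLI configuration wizard.
SERVICES = {
    "thinking_llm": ("gpu-node.tailnet", 8080),
    "instant_llm":  ("gpu-node.tailnet", 8081),
    "searxng":      ("localhost", 8888),
    "kokoro_tts":   ("localhost", 8880),
}

def service_url(name: str, path: str = "/") -> str:
    """Build the plain-HTTP base URL for a downstream service."""
    host, port = SERVICES[name]
    return f"http://{host}:{port}{path}"
```

Pointing every entry at localhost yields the single-machine deployment; swapping hosts for Tailscale or LAN addresses yields the two-machine one.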

02. Request Flow — Chat Completion

The main user-facing flow when sending a chat message, including web search, RAG, memory, and summarization with SSE streaming.

Key Details

  • Parallel context enrichment: web search, RAG retrieval, and memory recall happen simultaneously
  • SSE streaming delivers tokens incrementally for real-time rendering
  • Background tasks (memory extraction, summarization) run after each response
  • JWT authentication on every request with user-scoped data isolation
  • Evidence from web, RAG, and memory is injected into the system prompt
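The parallel enrichment step above can be sketched with `asyncio.gather`; the three coroutines below are stand-ins for the real web-search, RAG, and memory calls (their names and return shapes are assumptions, not the gateway's actual API):

```python
import asyncio

# Toy stand-ins for the gateway's three context-enrichment calls.
async def web_search(query: str) -> str:
    await asyncio.sleep(0)          # placeholder for an HTTP call to SearXNG
    return f"[web evidence for {query!r}]"

async def rag_retrieve(query: str) -> str:
    await asyncio.sleep(0)          # placeholder for hybrid vector + BM25 search
    return f"[rag chunks for {query!r}]"

async def memory_recall(query: str) -> str:
    await asyncio.sleep(0)          # placeholder for vector search over user memories
    return f"[memories for {query!r}]"

async def enrich(query: str) -> str:
    # All three sources are awaited concurrently, then injected into the prompt.
    web, rag, mem = await asyncio.gather(
        web_search(query), rag_retrieve(query), memory_recall(query)
    )
    return "\n".join(["Use the following evidence:", web, rag, mem])

prompt = asyncio.run(enrich("what is RRF?"))
```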

03. Authentication Flow

Gateway-managed authentication with JWT tokens. Supports single-user (password-only) and multi-user (email + password) modes.

Key Details

  • Gateway-managed auth — no external auth provider
  • JWTs signed with HS256 using a configurable secret
  • Single-user mode: password-only login, default user ID
  • Multi-user mode: email + password registration and login
  • Token expiry: 7 days; a 401 response triggers automatic logout + redirect
  • All database queries filter by user_id for data isolation
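To make the HS256 mechanics concrete, here is a stdlib-only sketch of issuing and verifying such a token. The real gateway presumably uses a JWT library; `SECRET` is a placeholder for the configurable signing secret:

```python
import base64, hashlib, hmac, json, time

SECRET = b"change-me"  # placeholder; the real secret is configurable

def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def b64url_decode(s: str) -> bytes:
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

def issue_token(user_id: str, days: int = 7) -> str:
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(
        {"sub": user_id, "exp": int(time.time()) + days * 86400}).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = b64url(hmac.new(SECRET, signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"

def verify_token(token: str) -> dict:
    header, payload, sig = token.split(".")
    expected = hmac.new(SECRET, f"{header}.{payload}".encode(),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(sig, b64url(expected)):
        raise ValueError("bad signature")   # gateway would answer 401
    claims = json.loads(b64url_decode(payload))
    if claims["exp"] < time.time():
        raise ValueError("expired")          # 401 → automatic logout + redirect
    return claims
```

The `sub` claim carries the user ID that every downstream database query filters on.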

04. Inference Mode Routing

Three thinking modes with automatic fallback. Supports MLX, Ollama, vLLM, llama.cpp, and OpenAI-compatible backends.

Key Details

  • THINKING_DAILY_REQUEST_LIMIT caps heavy inference per user per day (default: 100)
  • THINKING_MAX_CONCURRENT limits parallel thinking requests (default: 2)
  • Fallback to instant tier is configurable
  • Cold-start timeout: 60s (configurable)
  • Supported backends: MLX (Apple Silicon), Ollama, vLLM, llama.cpp, OpenAI API
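The two cost controls compose naturally as a per-user daily counter plus a semaphore. A hedged sketch under the defaults above (function names and the fallback return shape are hypothetical):

```python
import asyncio
from collections import defaultdict
from datetime import date

DAILY_LIMIT = 100                 # THINKING_DAILY_REQUEST_LIMIT default
semaphore = asyncio.Semaphore(2)  # THINKING_MAX_CONCURRENT default
usage: dict[tuple[str, date], int] = defaultdict(int)

def allow_thinking(user_id: str) -> bool:
    """Consume one unit of today's thinking budget, if any remains."""
    key = (user_id, date.today())
    if usage[key] >= DAILY_LIMIT:
        return False              # caller falls back to the instant tier
    usage[key] += 1
    return True

async def route(user_id: str, prompt: str) -> str:
    if not allow_thinking(user_id):
        return f"instant:{prompt}"       # configurable fallback
    async with semaphore:                # caps parallel thinking requests
        return f"thinking:{prompt}"
```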

06. RAG Pipeline

Hybrid vector + BM25 retrieval with cross-encoder reranking. Documents are chunked, embedded, and indexed for accurate retrieval.

Key Details

  • Hybrid search: vector similarity (sqlite-vec) + BM25 full-text (FTS5) fused via RRF
  • Cross-encoder reranking for final relevance scoring (ms-marco-MiniLM-L6-v2)
  • Contextual retrieval: LLM generates context prefixes for each chunk (+35-49% retrieval quality)
  • LLM query rewriting to improve retrieval
  • Embedding model: configurable (default: all-MiniLM-L6-v2, 384-dim)
  • Match count: 20 candidates, threshold: 0.3, max context: 12000 chars
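The RRF fusion step fits in a few lines; `k = 60` is the conventional RRF constant and is an assumption here, not a value confirmed by the source:

```python
# Reciprocal Rank Fusion: each document scores 1/(k + rank) per list it
# appears in, and the fused ranking sorts by total score.
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["c3", "c1", "c7"]   # chunk ids from sqlite-vec, best first
bm25_hits   = ["c1", "c9", "c3"]   # chunk ids from FTS5, best first
fused = rrf_fuse([vector_hits, bm25_hits])
```

Chunks ranked highly by both retrievers (here `c1` and `c3`) rise to the top before the cross-encoder rerank pass.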

07. Voice Pipeline

Full voice conversation with wake-word detection, speech-to-text via Whisper, speaker verification, and streaming text-to-speech via Kokoro.

Key Details

  • STT: faster-whisper (local Whisper inference, 'tiny' model ~75MB, CTranslate2 backend)
  • TTS: Kokoro FastAPI (Docker, CPU) with 12 voice options (British/American, male/female)
  • Speaker verification: MFCC-based voice embeddings, 3-sample enrollment, similarity threshold 0.82
  • Wake-word detection: browser-side voice activity detection ('Hey Chat', 'Hello Chat', 'Hi Chat')
  • Streaming TTS: sentences play as the LLM generates them (continuous audio)
  • Voice settings: configurable voice, speed (0.5x-2.0x), auto-read toggle
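The speaker-verification decision reduces to a cosine-similarity test at the 0.82 threshold against an embedding averaged from the three enrollment samples. A toy sketch with stand-in vectors (real embeddings come from MFCC features):

```python
import math

THRESHOLD = 0.82  # similarity threshold from the pipeline settings

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def verify(enrolled: list[list[float]], utterance: list[float]) -> bool:
    # Average the 3 enrollment samples into one reference embedding.
    ref = [sum(col) / len(col) for col in zip(*enrolled)]
    return cosine(ref, utterance) >= THRESHOLD
```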

08. Database Schema

SQLite database with WAL mode, sqlite-vec for vector search, and FTS5 for full-text search. All data stored locally with user isolation.

Key Details

  • SQLite with WAL mode enabled for concurrent reads
  • Foreign key constraints enforced for referential integrity
  • sqlite-vec virtual tables for vector similarity search
  • FTS5 virtual table for BM25 full-text search
  • mode_used supports: instant, thinking, thinking_harder
  • user_memories stores cross-session facts with categories: preference, background, project, instruction
  • rag_query_metrics tracks RAG search performance for analytics
  • Schema auto-initialized on first gateway start
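A sketch of what that auto-initialization might look like. The DDL below is illustrative, inferred from the details above rather than copied from the production schema; the sqlite-vec and FTS5 virtual tables are omitted because they require the loadable extension and an FTS5-enabled build:

```python
import sqlite3

conn = sqlite3.connect(":memory:")           # production uses a file-backed DB
conn.execute("PRAGMA journal_mode=WAL")      # concurrent reads (file-backed DBs)
conn.execute("PRAGMA foreign_keys=ON")       # enforce referential integrity
conn.executescript("""
CREATE TABLE users (id TEXT PRIMARY KEY, email TEXT UNIQUE);
CREATE TABLE messages (
    id        INTEGER PRIMARY KEY,
    user_id   TEXT NOT NULL REFERENCES users(id),
    mode_used TEXT CHECK (mode_used IN ('instant','thinking','thinking_harder')),
    content   TEXT NOT NULL
);
CREATE TABLE user_memories (
    id       INTEGER PRIMARY KEY,
    user_id  TEXT NOT NULL REFERENCES users(id),
    category TEXT CHECK (category IN
        ('preference','background','project','instruction')),
    fact     TEXT NOT NULL
);
""")
```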

09. Gateway API Routes

Complete REST API surface of the FastAPI gateway. All routes require JWT authentication except /auth endpoints.

Key Details

  • /auth — Authentication endpoints (login, register, mode detection)
  • /health — Health checks for gateway, auth validation, and inference status
  • /chat — Core chat operations: messaging, streaming, sessions, voice, TTS
  • /documents — Document upload, listing, deletion, and session attachment for RAG
  • All /chat and /documents routes require JWT Bearer token
  • POST /chat/stream is the primary endpoint — returns SSE events
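SSE framing on /chat/stream is simple on the wire: each event is an `event:`/`data:` pair terminated by a blank line. The event names below (`token`, `done`) are illustrative, not the gateway's documented schema:

```python
# Minimal sketch of framing LLM tokens as Server-Sent Events.
def sse_event(event: str, data: str) -> str:
    return f"event: {event}\ndata: {data}\n\n"

def stream_tokens(tokens):
    """Yield one SSE frame per token, then a terminal 'done' frame."""
    for tok in tokens:
        yield sse_event("token", tok)
    yield sse_event("done", "")

frames = list(stream_tokens(["Hel", "lo"]))
```

In FastAPI, a generator like this would typically be wrapped in a `StreamingResponse` with `media_type="text/event-stream"`.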

10. Deployment Topology

Physical deployment across one or two machines, connected via Tailscale VPN mesh or LAN. All IPs are configurable via the CLI wizard.

Key Details

  • Two-machine: Gateway/webapp on server, inference on GPU node (Tailscale/LAN)
  • Single-machine: Everything on localhost (configured via molebie-ai install)
  • Auto-pull daemon: macOS LaunchAgent polls git and auto-updates on new commits
  • IPs are configurable via CLI wizard (no hardcoded Tailscale IPs)

11. Frontend Page Structure

Next.js 16 App Router with React 19. Dark glass UI theme with responsive mobile design, voice mode, document panel, and rich markdown rendering.

Key Details

  • Frontend stack: Next.js 16 (App Router), React 19, TypeScript, Tailwind CSS v4
  • Voice conversation mode with wake-word detection and streaming TTS
  • Document upload/attachment for RAG and per-session context
  • Image upload via paste, drag-and-drop, or file picker (stored locally)
  • KaTeX math rendering and syntax-highlighted code blocks
  • Session pinning, search, export, and responsive mobile drawer sidebar

12. CLI Tool — molebie-ai

Python CLI built with Typer + Rich. Handles installation, configuration, service management, model downloads, and diagnostics.

Key Details

  • Framework: Python + Typer + Rich
  • Config storage: .molebie/config.json (version 2)
  • Auto-generates .env.local from CLI config (including random JWT secret)
  • Prerequisite checker: detects and offers to install missing dependencies
  • Service manager: starts/stops all services via subprocess
  • Model management: download, remove, start, and stop LLM models per backend
  • Doctor: diagnose and optionally fix setup issues
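The .env.local generation step can be sketched as a small renderer over the JSON config, minting a random JWT secret per install. The config keys and env-var names here are hypothetical stand-ins, not the real schema:

```python
import secrets

def render_env(config: dict) -> str:
    """Render a minimal .env.local from CLI config (illustrative keys)."""
    lines = [
        f"GATEWAY_URL=http://{config['gateway_host']}:{config['gateway_port']}",
        f"JWT_SECRET={secrets.token_hex(32)}",   # fresh random secret
    ]
    return "\n".join(lines) + "\n"

env_text = render_env({"gateway_host": "localhost", "gateway_port": 8000})
```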

13. Memory & Summarization

Cross-session memory extracts and stores user facts/preferences. Rolling conversation summarization manages the context window.

Key Details

  • Cross-session memory: extracts and stores user facts/preferences across conversations
  • Categories: preference, background, project, instruction
  • Deduplication via cosine similarity (threshold: 0.9)
  • Retrieval: top 5 memories by vector similarity (threshold: 0.5)
  • Auto-extraction every 6 messages (configurable)
  • Rolling conversation summaries triggered at 16+ unsummarized messages
  • Keeps last 10 messages raw (not summarized)
  • Max 200 memories per user with access tracking for relevance decay
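The dedup and rolling-summary policies above fit in a few lines of pure Python. Embeddings are toy lists here; production uses the configured embedding model:

```python
import math

DEDUP_THRESHOLD = 0.9        # cosine similarity above which a memory is a duplicate
SUMMARIZE_AT, KEEP_RAW = 16, 10

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def is_duplicate(new_emb: list[float], existing: list[list[float]]) -> bool:
    """Skip storing a new memory if it nearly matches an existing one."""
    return any(cosine(new_emb, e) >= DEDUP_THRESHOLD for e in existing)

def messages_to_summarize(unsummarized: list[str]) -> list[str]:
    """Fold older messages into the rolling summary, keeping the last 10 raw."""
    if len(unsummarized) < SUMMARIZE_AT:
        return []                      # not enough backlog yet
    return unsummarized[:-KEEP_RAW]
```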

Service Summary

Service        Port   Purpose
Web App        3000   Chat UI, auth, voice, documents, images
Gateway        8000   Auth, routing, DB, inference proxy, RAG, web search, TTS, memory
SQLite DB      -      Local database with vector + full-text search
Thinking LLM   8080   Deep reasoning with chain-of-thought
Instant LLM    8081   Fast responses, no CoT
SearXNG        8888   Self-hosted web search (no API keys)
Kokoro TTS     8880   Text-to-speech (12 voices, CPU)
Tailscale      -      Connects server + GPU node (optional)
CLI            -      Setup wizard, service management, diagnostics

The gateway is the central orchestrator: it authenticates every request, manages sessions in SQLite, routes to inference tiers, enriches context with web search and RAG results, retrieves cross-session memories, handles voice transcription and synthesis, manages image attachments, and applies cost controls.