Voice Chat
Full bidirectional voice: streaming speech-to-text with faster-whisper; text-to-speech with Piper, Kokoro, or Qwen3-TTS; and a dedicated voice chat page with waveform visualization.
Speech-to-text
Voice activity detection
Silero VAD detects when you start and stop speaking, so there is no manual start/stop button. Just talk naturally.
Audio streaming
Raw 16-bit PCM audio streams over WebSocket from the dashboard or mobile app to the voice server in real time.
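As a rough illustration of what "raw 16-bit PCM" means on the wire, the sketch below converts floating-point samples to signed 16-bit bytes and back. Little-endian byte order and the helper names are assumptions made for this example; the page does not specify the exact encoding, sample rate, or channel layout.

```python
import struct


def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to signed 16-bit PCM bytes.

    Little-endian byte order is an assumption for this sketch.
    """
    ints = []
    for s in samples:
        s = max(-1.0, min(1.0, s))          # clip out-of-range samples
        ints.append(int(round(s * 32767)))  # scale to the int16 range
    return struct.pack("<%dh" % len(ints), *ints)


def pcm16_to_float(data):
    """Convert signed 16-bit little-endian PCM bytes back to floats."""
    count = len(data) // 2                  # two bytes per sample
    ints = struct.unpack("<%dh" % count, data[: count * 2])
    return [i / 32767 for i in ints]
```

Each chunk of bytes produced this way could be sent as one binary WebSocket frame.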
Whisper transcription
faster-whisper transcribes speech segments with language detection and confidence scores. Interim results appear as you speak.
NLP analysis
Transcriptions are analyzed for intent, entities, sentiment, and suggested actions. The results are displayed alongside the waveform in the dashboard voice chat.
Text-to-speech
Three TTS engines with per-user voice settings
Piper TTS
Fast, lightweight, CPU-friendly neural TTS. Multiple voice models with configurable speed and pitch. Great for low-latency responses.
Kokoro TTS
High-quality neural text-to-speech with natural prosody and expressive voices.
Qwen3-TTS
GPU-accelerated TTS from the Qwen3 model family. Highest-quality output of the three engines. Opt-in, because it is GPU-heavy.
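One way to picture per-user voice settings across the three engines is a small validated record per user. This is a hypothetical sketch: the `VoiceSettings` class, its field names, the engine identifiers, and the `settings_for` helper are all invented for illustration and are not the product's actual schema.

```python
from dataclasses import dataclass

# Hypothetical engine identifiers for the three engines named above.
ENGINES = ("piper", "kokoro", "qwen3-tts")


@dataclass
class VoiceSettings:
    engine: str = "piper"   # which TTS engine to use
    voice: str = ""         # engine-specific voice or model name
    speed: float = 1.0      # 1.0 = normal speaking rate

    def __post_init__(self):
        # Reject values no engine could honor.
        if self.engine not in ENGINES:
            raise ValueError(f"unknown TTS engine: {self.engine}")
        if self.speed <= 0:
            raise ValueError("speed must be positive")


def settings_for(user_id, per_user):
    """Look up a user's settings, falling back to the defaults."""
    return per_user.get(user_id, VoiceSettings())
```

For example, `settings_for("alice", {"alice": VoiceSettings(engine="kokoro")})` returns Alice's Kokoro settings, while an unknown user gets the defaults.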
Technical details
faster-whisper
CTranslate2-optimized Whisper inference. Configurable model size (default: medium.en), int8 quantization, and CPU or CUDA execution.
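The settings above map fairly directly onto the faster-whisper Python API. The following is a minimal sketch, not the voice server's actual code: the `whisper_settings` and `transcribe_file` helpers are assumptions for this example, while `WhisperModel`, its `device` and `compute_type` arguments, and `transcribe` come from the faster-whisper package.

```python
def whisper_settings(use_cuda=False, model_size="medium.en"):
    """Return constructor arguments for faster_whisper.WhisperModel."""
    return {
        "model_size_or_path": model_size,
        "device": "cuda" if use_cuda else "cpu",
        "compute_type": "int8",  # int8 quantization
    }


def transcribe_file(path, use_cuda=False):
    """Transcribe one audio file and return (language, segments)."""
    # Imported lazily so the settings helper works without the package.
    from faster_whisper import WhisperModel

    model = WhisperModel(**whisper_settings(use_cuda))
    segments, info = model.transcribe(path)
    # `segments` is a generator; decoding happens as it is consumed.
    return info.language, [(seg.start, seg.end, seg.text) for seg in segments]
```

The model weights are downloaded the first time `WhisperModel` is constructed and reused from the local cache afterwards.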
Silero VAD
Stateful voice activity detection with speech start/end events. Accurate speech boundary detection for clean segments.
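To show what "stateful" start/end events can look like, here is a simplified state machine that turns per-frame speech probabilities into events. It illustrates the idea only and is not Silero VAD's implementation: the `SpeechEventTracker` class, the thresholds, and the silence-frame count are assumptions chosen for the example.

```python
class SpeechEventTracker:
    """Turn per-frame speech probabilities into start/end events."""

    def __init__(self, start_threshold=0.5, end_threshold=0.35,
                 min_silence_frames=3):
        self.start_threshold = start_threshold
        self.end_threshold = end_threshold
        self.min_silence_frames = min_silence_frames
        self.speaking = False
        self.silence_run = 0

    def update(self, probability):
        """Return "start", "end", or None for one audio frame."""
        if not self.speaking:
            if probability >= self.start_threshold:
                self.speaking = True
                self.silence_run = 0
                return "start"
            return None
        if probability < self.end_threshold:
            # Count consecutive quiet frames before ending the segment.
            self.silence_run += 1
            if self.silence_run >= self.min_silence_frames:
                self.speaking = False
                self.silence_run = 0
                return "end"
        else:
            self.silence_run = 0  # speech resumed; cancel the countdown
        return None

    def reset(self):
        """Forget all state, e.g. when a session is reset."""
        self.speaking = False
        self.silence_run = 0
```

Using a lower threshold to end speech than to start it, plus a minimum run of quiet frames, keeps short pauses inside a sentence from splitting a segment.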
WebSocket streaming
Binary PCM audio in, JSON transcriptions out. Control messages for config, stop, and reset. Per-session processors.
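The split between binary audio frames and JSON control messages can be sketched as a per-session dispatcher. The message shapes here are hypothetical: the `type` and `settings` fields, the `ack` and `error` replies, and the `SessionProcessor` class are invented for illustration, since the page names only the config, stop, and reset controls.

```python
import json


class SessionProcessor:
    """Per-session state: buffered audio plus the current config."""

    def __init__(self):
        self.config = {}
        self.buffer = bytearray()
        self.stopped = False

    def handle(self, message):
        """Process one WebSocket frame; return a JSON reply or None."""
        if isinstance(message, (bytes, bytearray)):
            # Binary frame: raw PCM audio for this session.
            self.buffer.extend(message)
            return None
        # Text frame: a JSON control message.
        control = json.loads(message)
        kind = control.get("type")
        if kind == "config":
            self.config.update(control.get("settings", {}))
        elif kind == "stop":
            self.stopped = True
        elif kind == "reset":
            self.buffer.clear()
            self.stopped = False
        else:
            return json.dumps(
                {"type": "error", "message": f"unknown control type: {kind}"}
            )
        return json.dumps({"type": "ack", "control": kind})
```

Creating one `SessionProcessor` per connection keeps each client's audio buffer and settings isolated from every other session.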
100% local
All models run on your hardware. Audio never leaves your network. STT and TTS models are cached on first run.