Iter

Voice Chat

Full bidirectional voice — streaming speech-to-text with faster-whisper, text-to-speech with Piper, Kokoro, and Qwen3-TTS, and a dedicated voice chat page with waveform visualization.

Voice chat with waveform visualization

Speech-to-text

1. Voice activity detection

Silero VAD detects when you start and stop speaking, so no manual start/stop button is needed. Just talk naturally.

2. Audio streaming

Raw 16-bit PCM audio streams over WebSocket from the dashboard or mobile app to the voice server in real time.
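As a rough sketch of what "raw 16-bit PCM" means on the wire, the snippet below converts floating-point samples to signed 16-bit little-endian bytes and splits them into fixed-size frames. The 16 kHz mono sample rate, the 20 ms frame length, and the helper names are assumptions for illustration, not values taken from the voice server.

```python
import struct

SAMPLE_RATE = 16_000      # assumed: 16 kHz mono
FRAME_MS = 20             # assumed frame length in milliseconds
BYTES_PER_SAMPLE = 2      # 16-bit PCM
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * BYTES_PER_SAMPLE  # 640 bytes


def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to signed 16-bit little-endian PCM bytes."""
    clipped = (max(-1.0, min(1.0, s)) for s in samples)  # clip out-of-range input
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


def split_frames(pcm, frame_bytes=FRAME_BYTES):
    """Return (list of full frames, leftover bytes shorter than one frame)."""
    n_full = len(pcm) // frame_bytes
    frames = [pcm[i * frame_bytes:(i + 1) * frame_bytes] for i in range(n_full)]
    return frames, pcm[n_full * frame_bytes:]
```

Each full frame would be sent as one binary WebSocket message; the leftover bytes are kept and prepended to the next chunk from the microphone.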

3. Whisper transcription

faster-whisper transcribes speech segments with language detection and confidence scores. Interim results appear as you speak.

4. NLP analysis

Transcriptions are analyzed for intent, entities, sentiment, and suggested actions — displayed alongside the waveform in the dashboard voice chat.

Text-to-speech

Three TTS engines with per-user voice settings

Piper TTS

Fast, lightweight, CPU-friendly neural TTS. Multiple voice models with configurable speed and pitch. Great for low-latency responses.

Kokoro TTS

High-quality neural text-to-speech with natural prosody and expressive voices.

Qwen3-TTS

GPU-accelerated TTS from the Qwen3 family. Highest-quality output, but GPU-heavy, so it is opt-in.
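To make "per-user voice settings" concrete, here is a minimal sketch of how a settings record and engine selection could look. Everything in it is hypothetical: the `VoiceSettings` fields, the engine identifiers, and especially the fallback to Piper when no GPU is present are assumptions for illustration, not documented behavior.

```python
from dataclasses import dataclass

ENGINES = ("piper", "kokoro", "qwen3-tts")  # hypothetical engine identifiers


@dataclass
class VoiceSettings:
    """Hypothetical per-user TTS preferences."""
    engine: str = "piper"
    voice: str = "default"   # placeholder voice name, not a real model id
    speed: float = 1.0


def resolve_engine(settings, gpu_available):
    """Pick the engine to run; assumed fallback to Piper when no GPU is present."""
    if settings.engine not in ENGINES:
        raise ValueError(f"unknown TTS engine: {settings.engine!r}")
    if settings.engine == "qwen3-tts" and not gpu_available:
        return "piper"  # assumption: degrade to the CPU-friendly engine
    return settings.engine
```

Keeping the GPU engine behind an explicit user choice mirrors the opt-in note above: nobody pays the GPU cost unless they ask for it.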

Technical details

faster-whisper

CTranslate2-optimized Whisper inference. Configurable model size (medium.en default), int8 quantization, CPU or CUDA.
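For orientation, this is roughly how those options map onto the faster-whisper Python API. `WhisperModel`, `transcribe`, `avg_logprob`, and `language_probability` come from the library; the helper names and the `exp(avg_logprob)` confidence heuristic are assumptions, and the voice server may derive its confidence scores differently.

```python
import math


def logprob_to_confidence(avg_logprob):
    """Map a segment's average token log-probability to a 0..1 score (heuristic)."""
    return max(0.0, min(1.0, math.exp(avg_logprob)))


def transcribe_file(path, model_size="medium.en", device="cpu"):
    """Transcribe one audio file; requires `pip install faster-whisper`."""
    from faster_whisper import WhisperModel  # imported lazily: heavy dependency

    # int8 quantization keeps memory use low; device may be "cpu" or "cuda".
    model = WhisperModel(model_size, device=device, compute_type="int8")
    segments, info = model.transcribe(path)
    return {
        "language": info.language,
        "language_probability": info.language_probability,
        "segments": [
            {
                "start": seg.start,
                "end": seg.end,
                "text": seg.text.strip(),
                "confidence": logprob_to_confidence(seg.avg_logprob),
            }
            for seg in segments  # generator: decoding happens as we iterate
        ],
    }
```

The model weights are fetched on the first call and reused afterwards, which is why the first transcription is noticeably slower than later ones.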

Silero VAD

Stateful voice activity detection with speech start/end events. Accurate speech boundary detection for clean segments.
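The start/end events can be pictured as a small state machine over per-frame speech probabilities. The sketch below is not Silero's API: it assumes some model already yields one probability per audio frame, and the thresholds, the silence-frame count, and the class name are illustrative.

```python
class SpeechBoundaryDetector:
    """Turn per-frame speech probabilities into 'start' / 'end' events."""

    def __init__(self, start_threshold=0.5, end_threshold=0.35, min_silence_frames=3):
        self.start_threshold = start_threshold
        self.end_threshold = end_threshold          # lower than start: hysteresis
        self.min_silence_frames = min_silence_frames
        self.reset()

    def reset(self):
        """Return to the idle (not speaking) state."""
        self.in_speech = False
        self.silence_run = 0

    def push(self, prob):
        """Feed one frame's probability; return 'start', 'end', or None."""
        if not self.in_speech:
            if prob >= self.start_threshold:
                self.in_speech = True
                self.silence_run = 0
                return "start"
            return None
        if prob < self.end_threshold:
            self.silence_run += 1
            if self.silence_run >= self.min_silence_frames:
                self.reset()
                return "end"
        else:
            self.silence_run = 0  # speech resumed: a short pause is not an end
        return None
```

Requiring several consecutive quiet frames before emitting "end" keeps brief pauses inside a sentence from splitting it into separate segments.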

WebSocket streaming

Binary PCM audio in, JSON transcriptions out. Control messages for config, stop, and reset. Per-session processors.
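One way to picture the protocol: binary frames carry audio, text frames carry JSON control messages, and each connection owns its own processor. In this sketch the message shapes (a `type` field with `config`, `stop`, or `reset`), the `options` and `ack` fields, and the `SessionProcessor` class are hypothetical; the real server's schema may differ, and the WebSocket transport itself is left out.

```python
import json


class SessionProcessor:
    """Hypothetical per-connection state: config plus buffered PCM audio."""

    def __init__(self):
        self.config = {"language": None}
        self.buffer = bytearray()
        self.stopped = False

    def handle(self, message):
        """Dispatch one WebSocket message; return a JSON reply string or None."""
        if isinstance(message, (bytes, bytearray)):
            if not self.stopped:
                self.buffer.extend(message)  # binary frame: raw PCM audio
            return None
        try:
            msg = json.loads(message)        # text frame: JSON control message
        except json.JSONDecodeError:
            return json.dumps({"type": "error", "error": "invalid JSON"})
        kind = msg.get("type") if isinstance(msg, dict) else None
        if kind == "config":
            self.config.update(msg.get("options", {}))
        elif kind == "stop":
            self.stopped = True              # ignore audio until reset
        elif kind == "reset":
            self.buffer.clear()
            self.stopped = False
        else:
            return json.dumps({"type": "error", "error": "unknown message"})
        return json.dumps({"type": "ack", "for": kind})
```

A server would create one `SessionProcessor` per connection and pass every incoming message to `handle`, so one client's buffer and settings never leak into another's.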

100% local

All models run on your hardware. Audio never leaves your network. STT and TTS models are cached on first run.

Talk to your AI — and hear it talk back