Voice & Language Intelligence
A complete voice pipeline — speech-to-text, text-to-speech, and natural language understanding. Talk to your AI from the dashboard or mobile app. Hear responses spoken back.
Speech-to-Text Pipeline
Real-time transcription with voice activity detection
Voice activity detection
Silero VAD detects when you start and stop speaking. Speech boundaries are tracked with configurable silence thresholds and pre-speech padding.
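For illustration, here is what segment detection with configurable silence thresholds and padding looks like with the upstream silero-vad package; the parameter values shown are examples, not this project's actual defaults.

```python
# pip install silero-vad
from silero_vad import load_silero_vad, read_audio, get_speech_timestamps

model = load_silero_vad()
wav = read_audio("mic_capture.wav")  # 16 kHz mono

speech_segments = get_speech_timestamps(
    wav,
    model,
    threshold=0.5,                # speech-probability cutoff
    min_silence_duration_ms=500,  # silence required before a segment is closed
    speech_pad_ms=200,            # padding added before/after each segment
    return_seconds=True,
)
print(speech_segments)  # e.g. [{'start': 0.8, 'end': 2.4}, ...]
```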
Audio streaming
Raw 16-bit PCM audio streams over WebSocket. An AudioWorklet captures microphone input, converts samples, and computes RMS audio levels for waveform visualization.
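On the receiving side, the handler is roughly shaped like the following sketch; FastAPI, the endpoint path, and the per-frame level reply are assumptions for illustration.

```python
import numpy as np
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

@app.websocket("/ws/audio")  # hypothetical endpoint path
async def audio_ws(ws: WebSocket):
    await ws.accept()
    try:
        while True:
            frame = await ws.receive_bytes()               # raw 16-bit PCM frame
            samples = np.frombuffer(frame, dtype=np.int16)
            floats = samples.astype(np.float32) / 32768.0  # normalize to [-1, 1)
            rms = float(np.sqrt(np.mean(floats**2)))       # level for the waveform UI
            await ws.send_json({"type": "level", "rms": rms})
    except WebSocketDisconnect:
        pass
```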
Whisper transcription
faster-whisper transcribes speech segments with language detection and confidence scores. Runs in a thread pool so audio reception is never blocked.
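A minimal sketch of that pattern with the faster-whisper library; the model size and compute type mirror the defaults mentioned under Technical Details, while the helper names are ours.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

from faster_whisper import WhisperModel

model = WhisperModel("medium.en", device="cpu", compute_type="int8")
executor = ThreadPoolExecutor(max_workers=1)

def transcribe(audio):
    """Blocking Whisper inference on a float32 16 kHz numpy array."""
    segments, info = model.transcribe(audio, beam_size=5)
    text = " ".join(seg.text.strip() for seg in segments)
    return text, info.language, info.language_probability

async def transcribe_async(audio):
    # Run blocking inference off the event loop so WebSocket frames keep flowing.
    return await asyncio.get_running_loop().run_in_executor(executor, transcribe, audio)
```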
NLP enrichment
Final transcriptions pass through spaCy for intent classification, sentiment analysis, named entity recognition, and keyword extraction. Commands are parsed into structured actions.
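As a rough sketch of the command-parsing step with spaCy; the action schema below is invented for illustration.

```python
# python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def parse_command(text: str) -> dict:
    """Map an imperative utterance onto a structured action (illustrative schema)."""
    doc = nlp(text)
    root = next(tok for tok in doc if tok.dep_ == "ROOT")  # main verb of the command
    return {
        "action": root.lemma_,  # e.g. "turn"
        "arguments": [t.text for t in root.children if t.dep_ in ("dobj", "prt", "dative")],
    }

print(parse_command("Turn on the living room lights"))
# e.g. {'action': 'turn', 'arguments': ['on', 'lights']}
```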
Text-to-Speech Engines
Multiple synthesis backends with a unified API
Piper
Fast ONNX inference for offline TTS. Low-latency, lightweight models with espeak-ng phonemization. Great for quick responses.
Coqui XTTS-v2
High-quality neural TTS with voice cloning. Clone any voice from a 6+ second audio sample. Multi-speaker, multi-language.
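Using the upstream Coqui library directly, cloning looks like this; how this project wires XTTS in may differ.

```python
# pip install TTS
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="Hello! This is a cloned voice speaking.",
    speaker_wav="reference.wav",  # a 6+ second sample of the target voice
    language="en",
    file_path="cloned.wav",
)
```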
Kokoro
82M-parameter neural TTS with 28 built-in voices (American + British English). Compact model with natural-sounding output.
Qwen3 TTS (GPU)
GPU-accelerated multilingual synthesis via proxy. 9 voices, voice design mode, and voice cloning. Runs on DGX Spark or similar.
Engines are loaded lazily — unavailable engines are silently skipped. OpenAI-compatible /v1/audio/speech endpoint included.
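Because the endpoint is OpenAI-compatible, the official openai client can target it directly; the base URL, engine name, and voice id below are placeholders.

```python
from openai import OpenAI

# Point the official client at the local server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

with client.audio.speech.with_streaming_response.create(
    model="piper",                # engine name stands in for an OpenAI model id
    voice="en_US-lessac-medium",  # placeholder voice id
    input="Hello from the local TTS server.",
) as response:
    response.stream_to_file("hello.wav")
```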
Natural Language Processing
Understand what the user means, not just what they said
Intent Classification
Classifies utterances as questions, commands, statements, greetings, or acknowledgments using dependency parsing and heuristics.
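A simplified version of such a heuristic; the real classifier's rules are more involved.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def classify_intent(text: str) -> str:
    doc = nlp(text)
    first = doc[0]
    if text.rstrip().endswith("?") or first.tag_ in ("WDT", "WP", "WRB"):
        return "question"
    if first.lower_ in {"hi", "hello", "hey"}:
        return "greeting"
    if first.lower_ in {"ok", "okay", "sure", "yes", "yeah"}:
        return "acknowledgment"
    root = next(tok for tok in doc if tok.dep_ == "ROOT")
    # A bare-infinitive root with no subject reads as an imperative command.
    if root.tag_ == "VB" and not any(t.dep_ == "nsubj" for t in root.children):
        return "command"
    return "statement"
```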
Sentiment Analysis
TextBlob-powered polarity scoring from -1 to +1. Conversation-level sentiment trend tracking across turns.
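For example, per-turn polarity plus a running conversation average with TextBlob; the tracker class is illustrative.

```python
from textblob import TextBlob

class SentimentTracker:
    """Track per-turn polarity and a running conversation average (sketch)."""

    def __init__(self):
        self.history: list[float] = []

    def add_turn(self, text: str) -> float:
        polarity = TextBlob(text).sentiment.polarity  # -1.0 (negative) to +1.0 (positive)
        self.history.append(polarity)
        return polarity

    @property
    def trend(self) -> float:
        return sum(self.history) / len(self.history) if self.history else 0.0
```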
Entity Recognition
spaCy NER extracts people, organizations, locations, products, and more. Keywords extracted via POS tagging.
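Roughly, with spaCy's small English model; the keyword rule shown is a common POS-based heuristic, not necessarily the exact one used here.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def enrich(text: str) -> dict:
    doc = nlp(text)
    return {
        "entities": [(ent.text, ent.label_) for ent in doc.ents],
        # Content words (nouns/proper nouns/adjectives) as keywords, stopwords dropped.
        "keywords": [t.lemma_ for t in doc
                     if t.pos_ in ("NOUN", "PROPN", "ADJ") and not t.is_stop],
    }

print(enrich("Book a table at Luigi's in Portland for Friday"))
```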
Dashboard Voice UI
Voice-first and voice-assisted interfaces
Chat Page — Voice Input
- Mic button next to send — tap to record, tap to stop
- Interim transcription displayed as you speak
- Speaker button on assistant messages for TTS playback
Voice Chat Page — Hands-Free
- Large mic button with waveform visualization
- NLP badges: intent, sentiment, entities, keywords
- Auto-sends transcriptions to the LLM, auto-speaks responses
- Voice settings: engine, voice selection, auto-TTS toggle
Technical Details
faster-whisper
CTranslate2-optimized Whisper inference. Configurable model size (medium.en default), int8 quantization, CPU or CUDA.
Async WebSocket Architecture
Concurrent receiver + processor tasks. Audio is queued and batched so Whisper inference never blocks frame reception.
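The two-task shape looks roughly like this; the batching threshold and function names are illustrative, and `transcribe_async` is the executor-backed helper sketched earlier.

```python
import asyncio

async def receiver(ws, queue: asyncio.Queue):
    """Drain WebSocket frames into a queue so reception never stalls."""
    while True:
        frame = await ws.receive_bytes()
        await queue.put(frame)

async def processor(queue: asyncio.Queue, transcribe_async):
    """Batch queued audio and hand it to Whisper off the hot path."""
    buffer = bytearray()
    while True:
        buffer += await queue.get()
        # Drain whatever else arrived while we were waiting.
        while not queue.empty():
            buffer += queue.get_nowait()
        if len(buffer) >= 32000:  # ~1 s of 16 kHz int16 audio
            await transcribe_async(bytes(buffer))
            buffer.clear()

async def handle(ws, transcribe_async):
    queue: asyncio.Queue = asyncio.Queue()
    await asyncio.gather(receiver(ws, queue), processor(queue, transcribe_async))
```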
Pluggable Engine System
TTS engines register lazily with graceful fallback. Add new engines by implementing the base class. OpenAI API compatible.
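A sketch of what such a registry might look like; the actual base class and registration API in this project may differ.

```python
from abc import ABC, abstractmethod

class TTSEngine(ABC):
    """Hypothetical base class; the real interface may differ."""

    name: str = ""

    @abstractmethod
    def synthesize(self, text: str, voice: str) -> bytes:
        """Return encoded audio (e.g. WAV bytes) for the given text."""

_ENGINES: dict[str, type[TTSEngine]] = {}

def register(cls: type[TTSEngine]) -> type[TTSEngine]:
    """Class decorator: record the engine without importing its heavy deps."""
    _ENGINES[cls.name] = cls
    return cls

def get_engine(name: str) -> TTSEngine | None:
    """Instantiate on first use; skip gracefully if dependencies are missing."""
    try:
        return _ENGINES[name]()  # engines import their backends inside __init__
    except (KeyError, ImportError):
        return None
```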
100% Local
STT and NLP run entirely on your hardware. Audio never leaves your network. TTS can run locally or proxy to a GPU server.