Iter

Voice & Language Intelligence

A complete voice pipeline — speech-to-text, text-to-speech, and natural language understanding. Talk to your AI from the dashboard or mobile app. Hear responses spoken back.

Voice chat with waveform visualization and NLP analysis

Speech-to-Text Pipeline

Real-time transcription with voice activity detection

1. Voice activity detection

Silero VAD detects when you start and stop speaking. Speech boundaries are tracked with configurable silence thresholds and pre-speech padding.
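Silero VAD itself is a neural model, but the boundary-tracking described above can be illustrated with a simplified, energy-based sketch. The function name, thresholds, and frame-index representation here are hypothetical, not the actual implementation:

```python
from collections import deque

def segment_speech(frames, threshold=0.02, silence_frames=10, pad_frames=3):
    """Split a stream of per-frame energy values into speech segments.

    A frame counts as speech when its energy exceeds `threshold`; a segment
    ends after `silence_frames` consecutive quiet frames (the configurable
    silence threshold). Up to `pad_frames` of audio before the first loud
    frame are prepended (pre-speech padding). Returns (start, end) index
    pairs. A toy stand-in for a neural VAD such as Silero.
    """
    segments, in_speech, quiet, start = [], False, 0, 0
    pad = deque(maxlen=pad_frames)            # indices of recent quiet frames
    for i, energy in enumerate(frames):
        if energy > threshold:
            if not in_speech:
                start = pad[0] if pad else i  # include pre-speech padding
                in_speech = True
            quiet = 0
        elif in_speech:
            quiet += 1
            if quiet >= silence_frames:       # silence threshold reached
                segments.append((start, i - quiet + 1))
                in_speech = False
                pad.clear()
        if energy <= threshold and not in_speech:
            pad.append(i)
    if in_speech:                             # stream ended mid-speech
        segments.append((start, len(frames)))
    return segments
```

With 5 quiet frames, 10 loud frames, then silence, this yields a single segment starting 3 frames early, courtesy of the padding.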

2. Audio streaming

Raw 16-bit PCM audio streams over WebSocket. An AudioWorklet captures microphone input, converts samples, and computes RMS audio levels for waveform visualization.
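On the receiving side, the level computation is straightforward. A minimal sketch (function name assumed, not from the codebase) of decoding little-endian 16-bit PCM and computing the RMS level used for the waveform display:

```python
import math
import struct

def pcm16_rms(raw: bytes) -> float:
    """Decode little-endian 16-bit PCM and return the RMS level in [0, 1]."""
    n = len(raw) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", raw[: 2 * n])
    # Normalize samples to [-1, 1) before computing root-mean-square.
    acc = sum((s / 32768.0) ** 2 for s in samples)
    return math.sqrt(acc / n)
```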

3. Whisper transcription

faster-whisper transcribes speech segments with language detection and confidence scores. Runs in a thread pool so audio reception is never blocked.
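The "thread pool so audio reception is never blocked" pattern can be sketched with asyncio. The `transcribe` function below is a placeholder where a real handler would call faster-whisper's blocking `WhisperModel.transcribe()`; the task and queue names are illustrative:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def transcribe(audio: bytes) -> str:
    # Stand-in for the blocking faster-whisper call.
    return f"{len(audio)} bytes transcribed"

async def stt_worker(queue, results, pool):
    """Pull speech segments off the queue and transcribe them in the pool,
    so the coroutine receiving audio frames is never blocked."""
    loop = asyncio.get_running_loop()
    while True:
        segment = await queue.get()
        if segment is None:                  # sentinel: stream closed
            break
        text = await loop.run_in_executor(pool, transcribe, segment)
        results.append(text)

async def main():
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        worker = asyncio.create_task(stt_worker(queue, results, pool))
        # The receiver coroutine just enqueues segments and returns instantly.
        for segment in (b"\x00" * 320, b"\x00" * 640):  # fake PCM segments
            await queue.put(segment)
        await queue.put(None)
        await worker
    return results
```

Because `run_in_executor` yields control back to the event loop, incoming WebSocket frames keep being read while inference runs in a worker thread.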

4. NLP enrichment

Final transcriptions pass through spaCy for intent classification, sentiment analysis, named entity recognition, and keyword extraction. Commands are parsed into structured actions.

Text-to-Speech Engines

Multiple synthesis backends with a unified API

Piper

Fast ONNX inference for offline TTS. Low-latency, lightweight models with espeak-ng phonemization. Great for quick responses.

Coqui XTTS-v2

High-quality neural TTS with voice cloning. Clone any voice from a 6+ second audio sample. Multi-speaker, multi-language.

Kokoro

82M parameter neural TTS with 28 built-in voices (American + British English). Compact model with natural-sounding output.

Qwen3 TTS (GPU)

GPU-accelerated multilingual synthesis via proxy. 9 voices, voice design mode, and voice cloning. Runs on DGX Spark or similar.

Engines are loaded lazily — unavailable engines are silently skipped. OpenAI-compatible /v1/audio/speech endpoint included.
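The lazy-loading-with-silent-skip behavior can be sketched as a small registry. Everything here (names, decorator, error type) is a hypothetical illustration of the pattern, not the actual engine API:

```python
class TTSEngineError(RuntimeError):
    """Raised by a factory when its backend is unavailable."""

ENGINE_FACTORIES = {}

def register_engine(name):
    """Decorator that records a factory without instantiating the engine."""
    def wrap(factory):
        ENGINE_FACTORIES[name] = factory
        return factory
    return wrap

def load_available_engines():
    """Instantiate registered engines; silently skip ones that fail."""
    engines = {}
    for name, factory in ENGINE_FACTORIES.items():
        try:
            engines[name] = factory()   # lazy: only called when loading
        except TTSEngineError:
            continue                    # unavailable engine, skip it
    return engines

@register_engine("piper")
def make_piper():
    return "piper-engine"               # placeholder for a real engine object

@register_engine("xtts")
def make_xtts():
    raise TTSEngineError("GPU backend not present")  # simulated failure
```

Callers see only the engines whose dependencies actually loaded, so a missing GPU or model file never breaks the service.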

Natural Language Processing

Understand what the user means, not just what they said

Intent Classification

Classifies utterances as questions, commands, statements, greetings, or acknowledgments using dependency parsing and heuristics.
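As a flavor of what such heuristics look like, here is a simplified, lexicon-based sketch (word lists and rules invented for illustration; the real classifier uses spaCy dependency parses):

```python
WH_WORDS = {"what", "why", "how", "when", "where", "who", "which"}
GREETINGS = {"hi", "hello", "hey", "good morning", "good evening"}
ACKS = {"ok", "okay", "thanks", "thank you", "got it", "sure"}
IMPERATIVE_VERBS = {"open", "play", "stop", "set", "turn", "show", "send"}

def classify_intent(utterance: str) -> str:
    """Toy intent classifier: question / command / statement /
    greeting / acknowledgment, by surface cues only."""
    text = utterance.strip().lower().rstrip(".!")
    words = text.split()
    first = words[0] if words else ""
    if text in GREETINGS or first in GREETINGS:
        return "greeting"
    if text in ACKS:
        return "acknowledgment"
    if utterance.strip().endswith("?") or first in WH_WORDS:
        return "question"
    if first in IMPERATIVE_VERBS:   # sentence-initial verb ≈ imperative
        return "command"
    return "statement"
```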

Sentiment Analysis

TextBlob-powered polarity scoring from -1 to +1. Conversation-level sentiment trend tracking across turns.
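TextBlob yields one polarity score in [-1, +1] per utterance; one simple way to define the conversation-level trend is a running mean over turns. A minimal sketch under that assumption (the actual aggregation may differ):

```python
def sentiment_trend(polarities):
    """Given per-turn polarity scores in [-1, 1] (as TextBlob's
    sentiment.polarity would produce), return the running
    conversation-level average after each turn."""
    trend, total = [], 0.0
    for i, p in enumerate(polarities, start=1):
        total += p
        trend.append(round(total / i, 3))
    return trend
```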

Entity Recognition

spaCy NER extracts people, organizations, locations, products, and more. Keywords extracted via POS tagging.

Dashboard Voice UI

Voice-first and voice-assisted interfaces

Chat Page — Voice Input

  • Mic button next to send — tap to record, tap to stop
  • Interim transcription displayed as you speak
  • Speaker button on assistant messages for TTS playback

Voice Chat Page — Hands-Free

  • Large mic button with waveform visualization
  • NLP badges: intent, sentiment, entities, keywords
  • Auto-sends transcriptions to LLM, auto-speaks responses
  • Voice settings: engine, voice selection, auto-TTS toggle

Technical details

faster-whisper

CTranslate2-optimized Whisper inference. Configurable model size (medium.en default), int8 quantization, CPU or CUDA.

Async WebSocket Architecture

Concurrent receiver + processor tasks. Audio is queued and batched so Whisper inference never blocks frame reception.

Pluggable Engine System

TTS engines register lazily with graceful fallback. Add new engines by implementing the base class. OpenAI API compatible.

100% Local

STT and NLP run entirely on your hardware. Audio never leaves your network. TTS can run locally or proxy to a GPU server.