# Voice Agent

A real-time voice agent running entirely inside a Durable Object. Talk to an AI assistant that can answer questions, set spoken reminders, and check the weather — with streaming responses, interruption support, and conversation memory across sessions.

Uses Workers AI for all models — zero external API keys required:

- **STT**: Deepgram Nova 3 (`@cf/deepgram/nova-3`)
- **TTS**: Deepgram Aura (`@cf/deepgram/aura-1`)
- **VAD**: Pipecat Smart Turn v2 (`@cf/pipecat-ai/smart-turn-v2`)
- **LLM**: Kimi K2.5 (`@cf/moonshotai/kimi-k2.5`)

## Run it

```bash
npm install
npm run dev
```

No API keys needed — all AI models run via the Workers AI binding.

## How it works

```
Browser                              Durable Object (VoiceAgent)
┌──────────┐  binary WS frames       ┌──────────────────────────┐
│ Mic PCM  │ ────────────────────►   │ Audio Buffer             │
│ (16kHz)  │                         │   ↓                      │
│          │  JSON: end_of_speech    │ VAD (smart-turn-v2)      │
│          │ ────────────────────►   │   ↓                      │
│          │                         │ STT (nova-3)             │
│          │  JSON: transcript       │   ↓                      │
│          │ ◄────────────────────   │ LLM (kimi-k2.5)          │
│          │  binary: MP3 audio      │   ↓ (sentence chunking)  │
│ Speaker  │ ◄────────────────────   │ TTS (aura-1, streaming)  │
└──────────┘                         └──────────────────────────┘
          single WebSocket connection
```

1. Browser captures mic audio via AudioWorklet, downsamples to 16kHz mono PCM
2. PCM streams to the Agent over the existing WebSocket connection (binary frames)
3. Client-side silence detection (500ms) triggers end-of-speech
4. Server-side VAD (smart-turn-v2) confirms the user finished speaking
5. Agent runs the voice pipeline: STT → LLM (with tools) → streaming TTS
6. TTS audio streams back per-sentence as MP3 while the LLM is still generating
7. Browser decodes and plays audio; user can interrupt at any time

## Features

- **Streaming TTS** — LLM output is split into sentences and synthesized concurrently, so the user hears the first sentence while the rest is still being generated.
- **Interruption handling** — speak over the agent to cut it off mid-sentence. The client detects sustained speech during playback and aborts the server pipeline.
- **Server-side VAD** — `smart-turn-v2` validates end-of-speech after client silence detection, reducing false triggers on mid-sentence pauses.
- **Conversation persistence** — all messages are stored in SQLite and survive restarts. The agent remembers previous conversations.
- **Agent tools** — the LLM can call `get_current_time`, `set_reminder`, and `get_weather` during conversation.
- **Proactive scheduling** — reminders set via voice fire on schedule and are spoken to connected clients (or saved to history if disconnected).
- **`useVoiceAgent` hook** — the client uses the `agents/voice-react` hook, which encapsulates all audio infrastructure in ~10 lines of setup.
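To illustrate step 1 of the pipeline, here is a minimal sketch of converting AudioWorklet output (Float32 samples, e.g. at 48kHz) into 16kHz mono 16-bit PCM suitable for a binary WebSocket frame. The function name and the nearest-sample decimation strategy are illustrative assumptions, not the repo's actual implementation:

```typescript
// Illustrative sketch (not the repo's code): convert Float32 mic samples
// to 16kHz signed 16-bit PCM by nearest-sample decimation. A production
// resampler would low-pass filter before decimating to avoid aliasing.
function toPCM16k(input: Float32Array, inputRate: number): Int16Array {
  const ratio = inputRate / 16000;
  const outLength = Math.floor(input.length / ratio);
  const out = new Int16Array(outLength);
  for (let i = 0; i < outLength; i++) {
    // Clamp to [-1, 1] before scaling to the int16 range.
    const sample = Math.max(-1, Math.min(1, input[Math.floor(i * ratio)]));
    out[i] = Math.round(sample * 32767);
  }
  return out;
}
```

The resulting `Int16Array` buffer can be sent directly over the WebSocket as a binary frame.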
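The "sentence chunking" step between the LLM and TTS can be sketched as an async generator that accumulates streamed text deltas and yields each complete sentence as soon as it appears, so synthesis starts before the full reply exists. This is a simplified sketch under assumed names (`sentences`, naive punctuation splitting), not the repo's actual code:

```typescript
// Illustrative sketch (not the repo's code): turn a stream of LLM text
// deltas into a stream of complete sentences for per-sentence TTS.
// Splitting on [.!?] + whitespace is deliberately naive; abbreviations
// like "e.g." would need smarter handling in practice.
async function* sentences(
  deltas: AsyncIterable<string>
): AsyncGenerator<string> {
  let buffer = "";
  for await (const delta of deltas) {
    buffer += delta;
    let match: RegExpMatchArray | null;
    // Emit every complete sentence currently sitting in the buffer.
    while ((match = buffer.match(/^([\s\S]*?[.!?])\s+/)) !== null) {
      yield match[1].trim();
      buffer = buffer.slice(match[0].length);
    }
  }
  if (buffer.trim()) yield buffer.trim(); // flush trailing partial sentence
}
```

Each yielded sentence would be handed to the TTS model while later deltas are still arriving from the LLM.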
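The interruption handling described above can be sketched with an `AbortController`: each turn's pipeline runs under a signal, and a barge-in message from the client aborts the in-flight work. The `Pipeline` class and its method names here are hypothetical, shown only to illustrate the pattern:

```typescript
// Illustrative sketch (not the repo's code): run each voice turn under
// an AbortController so a client barge-in can cut off TTS mid-sentence.
class Pipeline {
  private controller: AbortController | null = null;

  async run(work: (signal: AbortSignal) => Promise<void>): Promise<void> {
    this.controller?.abort(); // cancel any previous turn first
    const controller = new AbortController();
    this.controller = controller;
    try {
      await work(controller.signal);
    } catch (err) {
      // Swallow errors caused by our own abort; rethrow real failures.
      if (!controller.signal.aborted) throw err;
    }
  }

  // Called when the client reports sustained speech during playback.
  interrupt(): void {
    this.controller?.abort();
  }
}
```

Pipeline stages (LLM streaming, TTS synthesis) would check `signal.aborted` between chunks, or pass the signal to `fetch`-style APIs that accept one.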