# Voice Agent

A real-time voice agent running entirely inside a Durable Object. Talk to an AI assistant that can answer questions, set spoken reminders, and check the weather — with streaming responses, interruption support, and conversation memory across sessions.

Uses Workers AI for all models — zero external API keys required:

- **STT**: Deepgram Nova 3 (`@cf/deepgram/nova-3`)
- **TTS**: Deepgram Aura (`@cf/deepgram/aura-1`)
- **VAD**: Pipecat Smart Turn v2 (`@cf/pipecat-ai/smart-turn-v2`)
- **LLM**: Kimi K2.5 (`@cf/moonshotai/kimi-k2.5`)

## Run it

```bash
npm install
npm run dev
```

No API keys needed — all AI models run via the Workers AI binding.

## How it works

```
Browser                          Durable Object (VoiceAgent)
┌──────────┐   binary WS frames   ┌──────────────────────────┐
│ Mic PCM  │ ────────────────────► │ Audio Buffer             │
│ (16kHz)  │                       │   ↓                      │
│          │   JSON: end_of_speech │ VAD (smart-turn-v2)      │
│          │ ────────────────────► │   ↓                      │
│          │                       │ STT (nova-3)             │
│          │   JSON: transcript    │   ↓                      │
│          │ ◄──────────────────── │ LLM (kimi-k2.5)          │
│          │   binary: MP3 audio   │   ↓ (sentence chunking)  │
│ Speaker  │ ◄──────────────────── │ TTS (aura-1, streaming)  │
└──────────┘                       └──────────────────────────┘
              single WebSocket connection
```

1. Browser captures mic audio via AudioWorklet, downsamples to 16kHz mono PCM
2. PCM streams to the Agent over the existing WebSocket connection (binary frames)
3. Client-side silence detection (500ms) triggers end-of-speech
4. Server-side VAD (smart-turn-v2) confirms the user finished speaking
5. Agent runs the voice pipeline: STT → LLM (with tools) → streaming TTS
6. TTS audio streams back per-sentence as MP3 while the LLM is still generating
7. Browser decodes and plays audio; user can interrupt at any time
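
Step 3 can be illustrated with a small energy-based detector. This is only a sketch under stated assumptions, not the project's actual client code: `SilenceDetector`, `rms`, and the `RMS_THRESHOLD` value are hypothetical, and only the 16 kHz sample rate and the 500 ms window come from the description above.

```typescript
// Hypothetical helper: none of these names come from the project itself.
const SAMPLE_RATE = 16000;   // 16 kHz mono PCM, as described above
const SILENCE_MS = 500;      // silence window, as described above
const RMS_THRESHOLD = 0.01;  // assumed energy floor; tune per microphone

/** Root-mean-square energy of one frame of float samples in [-1, 1]. */
function rms(frame: Float32Array): number {
  if (frame.length === 0) return 0;
  let sum = 0;
  for (let i = 0; i < frame.length; i++) {
    sum += frame[i] * frame[i];
  }
  return Math.sqrt(sum / frame.length);
}

class SilenceDetector {
  private silentSamples = 0;
  private heardSpeech = false;

  /**
   * Feed one frame of audio. Returns true exactly once when speech has
   * been followed by at least SILENCE_MS of continuous silence.
   */
  push(frame: Float32Array): boolean {
    if (rms(frame) >= RMS_THRESHOLD) {
      // Loud frame: the user is speaking, so restart the silence timer.
      this.heardSpeech = true;
      this.silentSamples = 0;
      return false;
    }
    // Ignore silence that comes before any speech.
    if (!this.heardSpeech) return false;

    this.silentSamples += frame.length;
    const silentMs = (this.silentSamples / SAMPLE_RATE) * 1000;
    if (silentMs >= SILENCE_MS) {
      // Reset so the next utterance starts from a clean state.
      this.heardSpeech = false;
      this.silentSamples = 0;
      return true;
    }
    return false;
  }
}
```

When `push` returns `true`, a client along these lines would send the `end_of_speech` JSON message and let the server-side VAD make the final call.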

## Features

- **Streaming TTS** — LLM output is split into sentences and synthesized concurrently, so the user hears the first sentence while the rest is still being generated.
- **Interruption handling** — speak over the agent to cut it off mid-sentence. The client detects sustained speech during playback and aborts the server pipeline.
- **Server-side VAD** — `smart-turn-v2` validates end-of-speech after client silence detection, reducing false triggers on mid-sentence pauses.
- **Conversation persistence** — all messages are stored in SQLite and survive restarts. The agent remembers previous conversations.
- **Agent tools** — the LLM can call `get_current_time`, `set_reminder`, and `get_weather` during conversation.
- **Proactive scheduling** — reminders set via voice fire on schedule and are spoken to connected clients (or saved to history if disconnected).
- **`useVoiceAgent` hook** — the client uses the `agents/voice-react` hook, which encapsulates all audio infrastructure in ~10 lines of setup.
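
The sentence splitting behind streaming TTS can be sketched as a small buffer that collects LLM text deltas and emits each sentence as soon as it is complete. This is an illustration only: `SentenceChunker` is a hypothetical name, and the boundary rule (`.`, `!`, or `?` followed by whitespace) is an assumption, not necessarily what the agent does.

```typescript
// Hypothetical sketch: SentenceChunker is not a name from the project.
class SentenceChunker {
  private buffer = "";

  /** Append streamed text and return any sentences completed so far. */
  push(delta: string): string[] {
    this.buffer += delta;
    const sentences: string[] = [];
    // Assumed rule: a sentence ends at ., ! or ? followed by whitespace.
    const boundary = /[.!?]+\s+/g;
    let start = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.buffer)) !== null) {
      const end = match.index + match[0].length;
      const sentence = this.buffer.slice(start, end).trim();
      if (sentence.length > 0) sentences.push(sentence);
      start = end;
    }
    // Keep the unfinished tail for the next delta.
    this.buffer = this.buffer.slice(start);
    return sentences;
  }

  /** Return whatever text is left once the stream has ended. */
  flush(): string[] {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest.length > 0 ? [rest] : [];
  }
}
```

Each sentence returned by `push` could be handed to TTS immediately while later deltas keep arriving. A rule this simple also splits on abbreviations such as "Dr. Smith", so a real pipeline may need something more careful.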