# Voice Agents

Build real-time voice agents with speech-to-text, text-to-speech, and conversation persistence. Audio streams over WebSocket — no SFU or meeting infrastructure required.

## Overview

`@cloudflare/voice` provides two server-side mixins and matching React hooks:

| Export           | Import                     | Purpose                                      |
| ---------------- | -------------------------- | -------------------------------------------- |
| `withVoice`      | `@cloudflare/voice`        | Full voice agent: STT, LLM, TTS, persistence |
| `withVoiceInput` | `@cloudflare/voice`        | STT-only: transcription without response     |
| `useVoiceAgent`  | `@cloudflare/voice/react`  | React hook for `withVoice` agents            |
| `useVoiceInput`  | `@cloudflare/voice/react`  | React hook for `withVoiceInput` agents       |
| `VoiceClient`    | `@cloudflare/voice/client` | Framework-agnostic client                    |

Because agents are built on Cloudflare Durable Objects, you get:

- **Real-time audio** — mic audio streams as binary WebSocket frames, TTS audio streams back
- **Automatic conversation persistence** — messages stored in SQLite, survive restarts
- **Streaming TTS** — LLM tokens are sentence-chunked and synthesized concurrently
- **Interruption handling** — user speech during playback cancels the current response
- **Voice activity detection** — optional server-side VAD confirms end-of-turn
- **Streaming STT** — optional real-time transcription with interim results
- **Pipeline hooks** — intercept and transform audio/text at every stage

> **Experimental.** This API is under active development and will break between releases. Pin your version.
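The client streams mic audio as 16 kHz mono 16-bit PCM frames (see "How It Works" below). For orientation only, converting Web Audio's `Float32Array` samples into that frame format can be sketched as follows; `floatTo16BitPCM` is an illustrative helper, not part of the package:

```typescript
// Illustrative helper (not part of @cloudflare/voice): convert Web Audio's
// Float32 samples in [-1, 1] into 16-bit signed PCM, the frame format the
// client streams over the WebSocket.
function floatTo16BitPCM(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1], then scale to the int16 range.
    const s = Math.max(-1, Math.min(1, samples[i]));
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}
```

In a browser, an `AudioWorklet` or `ScriptProcessorNode` hands you `Float32Array` chunks; the hooks below do this conversion for you.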
## Quick Start

### Install

```sh
npm install @cloudflare/voice agents
```

### Server

```typescript
import { Agent } from "agents";
import {
  withVoice,
  WorkersAISTT,
  WorkersAITTS,
  WorkersAIVAD,
  type VoiceTurnContext
} from "@cloudflare/voice";

const VoiceAgent = withVoice(Agent);

export class MyAgent extends VoiceAgent {
  stt = new WorkersAISTT(this.env.AI);
  tts = new WorkersAITTS(this.env.AI);
  vad = new WorkersAIVAD(this.env.AI);

  async onTurn(transcript: string, context: VoiceTurnContext) {
    // Return a string for single-shot TTS
    return "Hello! I heard you say: " + transcript;
  }
}
```

### Client (React)

```tsx
import { useVoiceAgent } from "@cloudflare/voice/react";

function VoiceUI() {
  const {
    status,
    transcript,
    interimTranscript,
    audioLevel,
    isMuted,
    startCall,
    endCall,
    toggleMute
  } = useVoiceAgent({ agent: "MyAgent" });

  return (
    <div>
      <p>Status: {status}</p>
      <button onClick={startCall}>Start</button>
      <button onClick={endCall}>End</button>
      <button onClick={toggleMute}>{isMuted ? "Unmute" : "Mute"}</button>
      {interimTranscript && (
        <p>
          <em>{interimTranscript}</em>
        </p>
      )}
      {transcript.map((msg, i) => (
        <p key={i}>
          {msg.role}: {msg.text}
        </p>
      ))}
    </div>
  );
}
```

### Wrangler Config

```jsonc
// wrangler.jsonc
{
  "ai": { "binding": "AI" },
  "durable_objects": {
    "bindings": [{ "name": "MyAgent", "class_name": "MyAgent" }]
  },
  "migrations": [{ "tag": "v1", "new_sqlite_classes": ["MyAgent"] }]
}
```

## How It Works

```
Browser                              Durable Object (withVoice)
┌──────────┐  binary PCM (16kHz)   ┌──────────────────────────┐
│ Mic      │ ────────────────────► │ Audio buffer             │
│          │                       │   ↓                      │
│          │ JSON: end_of_speech   │ VAD (optional)           │
│          │ ────────────────────► │   ↓                      │
│          │                       │ STT                      │
│          │  JSON: transcript     │   ↓                      │
│          │ ◄──────────────────── │ onTurn() → your LLM code │
│          │  binary: audio        │   ↓ (sentence chunking)  │
│ Speaker  │ ◄──────────────────── │ TTS                      │
└──────────┘                       └──────────────────────────┘
```

1. The client captures mic audio and sends it as binary WebSocket frames (16kHz mono 16-bit PCM)
2. Client-side silence detection sends `end_of_speech` after 500ms of silence
3. Server-side VAD (if configured) confirms end-of-turn
4. STT transcribes the audio (batch or streaming)
5. Your `onTurn()` method runs — typically an LLM call
6. The response is sentence-chunked and synthesized via TTS
7. Audio streams back to the client for playback

## Server API: `withVoice`

`withVoice(Agent)` adds the full voice pipeline to an Agent class.

### Providers

Set providers as class properties. Class field initializers run after `super()`, so `this.env` is available.

| Property       | Type                   | Required | Description                             |
| -------------- | ---------------------- | -------- | --------------------------------------- |
| `stt`          | `STTProvider`          | Yes\*    | Batch speech-to-text                    |
| `tts`          | `TTSProvider`          | Yes      | Text-to-speech                          |
| `vad`          | `VADProvider`          | No       | Voice activity detection                |
| `streamingStt` | `StreamingSTTProvider` | No       | Streaming STT (replaces `stt` when set) |

\*Not required if `streamingStt` is set.
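The provider interfaces themselves are not documented here. As a rough sketch only — the single `transcribe` method below is an assumption, not the real `STTProvider` contract — a custom batch STT provider might look like:

```typescript
// Hypothetical sketch: the real STTProvider interface is not shown in this
// document, so this shape (one async transcribe method) is an assumption.
interface STTProviderSketch {
  transcribe(audio: Uint8Array): Promise<string>;
}

// A stub provider that "transcribes" by reporting how much audio it received.
// A real provider would call an external STT service here instead.
class EchoLengthSTT implements STTProviderSketch {
  async transcribe(audio: Uint8Array): Promise<string> {
    const seconds = audio.byteLength / (16000 * 2); // 16kHz mono 16-bit PCM
    return `[${seconds.toFixed(1)}s of audio]`;
  }
}
```

The built-in Workers AI providers shown next are the supported path; a custom provider only makes sense if you need a different STT backend.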
```typescript
import { withVoice, WorkersAISTT, WorkersAITTS, WorkersAIVAD } from "@cloudflare/voice";

const VoiceAgent = withVoice(Agent);

export class MyAgent extends VoiceAgent {
  stt = new WorkersAISTT(this.env.AI);
  tts = new WorkersAITTS(this.env.AI);
  vad = new WorkersAIVAD(this.env.AI);
}
```

### `onTurn(transcript, context)`

**Required.** Called when the user finishes speaking and the transcript is ready. Return a `string` for a single-shot response, or an `AsyncIterable` or `ReadableStream` for a streaming response:

**Simple response:**

```typescript
async onTurn(transcript: string, context: VoiceTurnContext) {
  return "You said: " + transcript;
}
```

**Streaming response (recommended for LLMs):**

```typescript
import { streamText } from "ai";
import { createWorkersAI } from "workers-ai-provider";

async onTurn(transcript: string, context: VoiceTurnContext) {
  const workersai = createWorkersAI({ binding: this.env.AI });
  const result = streamText({
    model: workersai("@cf/moonshotai/kimi-k2.5"),
    system: "You are a helpful voice assistant. Keep responses concise.",
    messages: [
      ...context.messages.map(m => ({
        role: m.role as "user" | "assistant",
        content: m.content
      })),
      { role: "user", content: transcript }
    ],
    abortSignal: context.signal
  });
  return result.textStream;
}
```

The `context` object provides:

| Field        | Type                                       | Description                        |
| ------------ | ------------------------------------------ | ---------------------------------- |
| `connection` | `Connection`                               | The WebSocket connection           |
| `messages`   | `Array<{ role: string; content: string }>` | Conversation history from SQLite   |
| `signal`     | `AbortSignal`                              | Aborted on interrupt or disconnect |

### Lifecycle Hooks

| Method                        | Description                                 |
| ----------------------------- | ------------------------------------------- |
| `beforeCallStart(connection)` | Return `false` to reject the call           |
| `onCallStart(connection)`     | Called after a call is accepted             |
| `onCallEnd(connection)`       | Called when a call ends                     |
| `onInterrupt(connection)`     | Called when user interrupts during playback |

### Pipeline Hooks

Intercept and transform data at each pipeline stage. Return `null` to skip the current utterance.

| Method                                     | Receives          | Can skip? |
| ------------------------------------------ | ----------------- | --------- |
| `beforeTranscribe(audio, connection)`      | Raw PCM after VAD | Yes       |
| `afterTranscribe(transcript, connection)`  | STT text          | Yes       |
| `beforeSynthesize(text, connection)`       | Text before TTS   | Yes       |
| `afterSynthesize(audio, text, connection)` | Audio after TTS   | Yes       |

```typescript
export class MyAgent extends VoiceAgent {
  // Filter out short/noise transcripts
  afterTranscribe(transcript: string, connection: Connection) {
    if (transcript.length < 3) return null; // skip
    return transcript;
  }

  // Add SSML or modify text before TTS
  beforeSynthesize(text: string, connection: Connection) {
    return text.replace(/\bAI\b/g, "A.I."); // improve pronunciation
  }
}
```

### Convenience Methods

| Method                     | Description                                  |
| -------------------------- | -------------------------------------------- |
| `speak(connection, text)`  | Synthesize and send audio to one connection  |
| `speakAll(text)`           | Synthesize and send audio to all connections |
| `forceEndCall(connection)` | Programmatically end a call                  |
| `saveMessage(role, text)`  | Persist a message to conversation history    |
| `getConversationHistory()` | Retrieve conversation history from SQLite    |

### Configuration Options

Pass options to `withVoice()` as the second argument:

```typescript
const VoiceAgent = withVoice(Agent, {
  historyLimit: 20,       // Max messages loaded for context (default: 20)
  audioFormat: "mp3",     // Audio format sent to client (default: "mp3")
  maxMessageCount: 1000,  // Max messages in SQLite (default: 1000)
  minAudioBytes: 16000,   // Min audio to process, 0.5s (default: 16000)
  vadThreshold: 0.5,      // VAD probability threshold (default: 0.5)
  vadPushbackSeconds: 2,  // Audio pushed back on VAD reject (default: 2)
  vadRetryMs: 3000        // Retry delay after VAD reject (default: 3000)
});
```

## Server API: `withVoiceInput`

`withVoiceInput(Agent)` adds STT-only voice input — no TTS, no LLM, no response generation.
Use this for dictation, search-by-voice, or any UI where you need speech-to-text without a conversational agent.

```typescript
import { Agent, type Connection } from "agents";
import { withVoiceInput, WorkersAIFluxSTT } from "@cloudflare/voice";

const InputAgent = withVoiceInput(Agent);

export class DictationAgent extends InputAgent {
  streamingStt = new WorkersAIFluxSTT(this.env.AI);

  onTranscript(text: string, connection: Connection) {
    console.log("User said:", text);
    // Save to storage, trigger a search, forward to another service, etc.
  }
}
```

### `onTranscript(text, connection)`

Called after each utterance is transcribed. Override this to process the transcript.

### Hooks

`withVoiceInput` supports the same lifecycle and STT pipeline hooks as `withVoice`:

- `beforeCallStart(connection)` — return `false` to reject
- `onCallStart(connection)`, `onCallEnd(connection)`, `onInterrupt(connection)`
- `beforeTranscribe(audio, connection)`, `afterTranscribe(transcript, connection)`

It does **not** have TTS hooks (`beforeSynthesize`, `afterSynthesize`) or `onTurn`.

## Client API: React Hooks

### `useVoiceAgent`

Wraps `VoiceClient` for `withVoice` agents. Manages connection, mic capture, playback, silence detection, and interrupt detection.
```tsx
import { useVoiceAgent } from "@cloudflare/voice/react";

const {
  status,            // "idle" | "listening" | "thinking" | "speaking"
  transcript,        // TranscriptMessage[] — conversation history
  interimTranscript, // string | null — real-time partial transcript
  metrics,           // VoicePipelineMetrics | null
  audioLevel,        // number (0–1) — current mic RMS level
  isMuted,           // boolean
  connected,         // boolean — WebSocket connected
  error,             // string | null
  startCall,         // () => Promise<void>
  endCall,           // () => void
  toggleMute,        // () => void
  sendText,          // (text: string) => void — bypass STT
  sendJSON,          // (data: Record<string, unknown>) => void
  lastCustomMessage  // unknown — last non-voice message from server
} = useVoiceAgent({
  agent: "MyAgent",            // Required: Durable Object class name
  name: "default",             // Instance name (default: "default")
  host: window.location.host   // Host to connect to
});
```

#### Tuning Options

| Option               | Type     | Default | Description                                      |
| -------------------- | -------- | ------- | ------------------------------------------------ |
| `silenceThreshold`   | `number` | `0.04`  | RMS below this is silence                        |
| `silenceDurationMs`  | `number` | `500`   | Silence duration before `end_of_speech` (ms)     |
| `interruptThreshold` | `number` | `0.05`  | RMS to detect speech during playback             |
| `interruptChunks`    | `number` | `2`     | Consecutive high-RMS chunks to trigger interrupt |

Changing tuning options triggers a client reconnect (the connection key includes them).

### `useVoiceInput`

Lightweight hook for dictation / voice-to-text. Accumulates user transcripts into a single string.
```tsx
import { useVoiceInput } from "@cloudflare/voice/react";

function Dictation() {
  const {
    transcript,        // string — accumulated text from all utterances
    interimTranscript, // string | null — current partial transcript
    isListening,       // boolean
    audioLevel,        // number (0–1)
    isMuted,           // boolean
    error,             // string | null
    start,             // () => Promise<void>
    stop,              // () => void
    toggleMute,        // () => void
    clear              // () => void — clear accumulated transcript
  } = useVoiceInput({ agent: "DictationAgent" });

  return (