# Voice Agents
Build real-time voice agents with speech-to-text, text-to-speech, and conversation persistence. Audio streams over WebSocket — no SFU or meeting infrastructure required.
## Overview
`@cloudflare/voice` provides two server-side mixins and matching React hooks:
| Export | Import | Purpose |
| ---------------- | -------------------------- | -------------------------------------------- |
| `withVoice` | `@cloudflare/voice` | Full voice agent: STT, LLM, TTS, persistence |
| `withVoiceInput` | `@cloudflare/voice` | STT-only: transcription without response |
| `useVoiceAgent` | `@cloudflare/voice/react` | React hook for `withVoice` agents |
| `useVoiceInput` | `@cloudflare/voice/react` | React hook for `withVoiceInput` agents |
| `VoiceClient` | `@cloudflare/voice/client` | Framework-agnostic client |
Because agents run on Cloudflare Durable Objects, you get:
- **Real-time audio** — mic audio streams as binary WebSocket frames, TTS audio streams back
- **Automatic conversation persistence** — messages stored in SQLite, survive restarts
- **Streaming TTS** — LLM tokens are sentence-chunked and synthesized concurrently
- **Interruption handling** — user speech during playback cancels the current response
- **Voice activity detection** — optional server-side VAD confirms end-of-turn
- **Streaming STT** — optional real-time transcription with interim results
- **Pipeline hooks** — intercept and transform audio/text at every stage
> **Experimental.** This API is under active development and will break between releases. Pin your version.
## Quick Start
### Install
```sh
npm install @cloudflare/voice agents
```
### Server
```typescript
import { Agent } from "agents";
import {
withVoice,
WorkersAISTT,
WorkersAITTS,
WorkersAIVAD,
type VoiceTurnContext
} from "@cloudflare/voice";
const VoiceAgent = withVoice(Agent);
export class MyAgent extends VoiceAgent<Env> {
stt = new WorkersAISTT(this.env.AI);
tts = new WorkersAITTS(this.env.AI);
vad = new WorkersAIVAD(this.env.AI);
async onTurn(transcript: string, context: VoiceTurnContext) {
// Return a string for single-shot TTS
return "Hello! I heard you say: " + transcript;
}
}
```
### Client (React)
```tsx
import { useVoiceAgent } from "@cloudflare/voice/react";
function VoiceUI() {
const {
status,
transcript,
interimTranscript,
audioLevel,
isMuted,
startCall,
endCall,
toggleMute
} = useVoiceAgent({ agent: "MyAgent" });
return (
<div>
<p>Status: {status}</p>
<button onClick={status === "idle" ? startCall : endCall}>
{status === "idle" ? "Start Call" : "End Call"}
</button>
<button onClick={toggleMute}>{isMuted ? "Unmute" : "Mute"}</button>
{interimTranscript && (
<p>
<em>{interimTranscript}</em>
</p>
)}
{transcript.map((msg, i) => (
<p key={i}>
<strong>{msg.role}:</strong> {msg.text}
</p>
))}
</div>
);
}
```
### Wrangler Config
```jsonc
// wrangler.jsonc
{
"ai": { "binding": "AI" },
"durable_objects": {
"bindings": [{ "name": "MyAgent", "class_name": "MyAgent" }]
},
"migrations": [{ "tag": "v1", "new_sqlite_classes": ["MyAgent"] }]
}
```
## How It Works
```
Browser Durable Object (withVoice)
┌──────────┐ binary PCM (16kHz) ┌──────────────────────────┐
│ Mic │ ──────────────────────► │ Audio buffer │
│ │ │ ↓ │
│ │ JSON: end_of_speech │ VAD (optional) │
│ │ ──────────────────────► │ ↓ │
│ │ │ STT │
│ │ JSON: transcript │ ↓ │
│ │ ◄────────────────────── │ onTurn() → your LLM code │
│ │ binary: audio │ ↓ (sentence chunking) │
│ Speaker │ ◄────────────────────── │ TTS │
└──────────┘ └──────────────────────────┘
```
1. The client captures mic audio and sends it as binary WebSocket frames (16kHz mono 16-bit PCM)
2. Client-side silence detection sends `end_of_speech` after 500ms of silence
3. Server-side VAD (if configured) confirms end-of-turn
4. STT transcribes the audio (batch or streaming)
5. Your `onTurn()` method runs — typically an LLM call
6. The response is sentence-chunked and synthesized via TTS
7. Audio streams back to the client for playback
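Step 1 assumes the client encodes mic audio as 16-bit PCM frames. The built-in AudioWorklet capture handles this for you, but as an illustrative sketch, the conversion from Web Audio `Float32Array` samples looks like this:

```typescript
// Convert Float32 mic samples (range -1..1) from the Web Audio API into
// 16-bit little-endian PCM, the binary frame format the server expects.
function floatTo16BitPCM(samples: Float32Array): ArrayBuffer {
  const buffer = new ArrayBuffer(samples.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1], then scale to the signed 16-bit range
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}

// Each chunk is sent as a single binary WebSocket frame:
// ws.send(floatTo16BitPCM(chunk));
```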
## Server API: `withVoice`
`withVoice(Agent)` adds the full voice pipeline to an Agent class.
### Providers
Set providers as class properties. Class field initializers run after `super()`, so `this.env` is available.
| Property | Type | Required | Description |
| -------------- | ---------------------- | -------- | --------------------------------------- |
| `stt` | `STTProvider` | Yes\* | Batch speech-to-text |
| `tts` | `TTSProvider` | Yes | Text-to-speech |
| `vad` | `VADProvider` | No | Voice activity detection |
| `streamingStt` | `StreamingSTTProvider` | No | Streaming STT (replaces `stt` when set) |
\*Not required if `streamingStt` is set.
```typescript
import {
withVoice,
WorkersAISTT,
WorkersAITTS,
WorkersAIVAD
} from "@cloudflare/voice";
const VoiceAgent = withVoice(Agent);
export class MyAgent extends VoiceAgent<Env> {
stt = new WorkersAISTT(this.env.AI);
tts = new WorkersAITTS(this.env.AI);
vad = new WorkersAIVAD(this.env.AI);
}
```
### `onTurn(transcript, context)`
**Required.** Called when the user finishes speaking and the transcript is ready.
Return a `string`, `AsyncIterable<string>`, or `ReadableStream` for streaming responses:
**Simple response:**
```typescript
async onTurn(transcript: string, context: VoiceTurnContext) {
return "You said: " + transcript;
}
```
**Streaming response (recommended for LLM):**
```typescript
import { streamText, convertToModelMessages } from "ai";
import { createWorkersAI } from "workers-ai-provider";
async onTurn(transcript: string, context: VoiceTurnContext) {
const workersai = createWorkersAI({ binding: this.env.AI });
const result = streamText({
model: workersai("@cf/moonshotai/kimi-k2.5"),
system: "You are a helpful voice assistant. Keep responses concise.",
messages: [
...context.messages.map(m => ({
role: m.role as "user" | "assistant",
content: m.content
})),
{ role: "user", content: transcript }
],
abortSignal: context.signal
});
return result.textStream;
}
```
The `context` object provides:
| Field | Type | Description |
| ------------ | ------------------------------------------ | ---------------------------------- |
| `connection` | `Connection` | The WebSocket connection |
| `messages` | `Array<{ role: string; content: string }>` | Conversation history from SQLite |
| `signal` | `AbortSignal` | Aborted on interrupt or disconnect |
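When `onTurn` returns a stream, the pipeline buffers tokens and flushes them to TTS at sentence boundaries (the "sentence chunking" step in the diagram above). An illustrative sketch of that chunking, not the library's actual splitter:

```typescript
// Split an incoming LLM token stream into sentence-sized chunks so TTS can
// start synthesizing before the full response has finished generating.
// Illustrative only: flushes on ".", "!", or "?" followed by whitespace.
async function* sentenceChunks(
  tokens: AsyncIterable<string>
): AsyncGenerator<string> {
  let buffer = "";
  for await (const token of tokens) {
    buffer += token;
    let match: RegExpMatchArray | null;
    // Flush every complete sentence currently sitting in the buffer
    while ((match = buffer.match(/[.!?]\s/)) && match.index !== undefined) {
      yield buffer.slice(0, match.index + 1).trim();
      buffer = buffer.slice(match.index + 1);
    }
  }
  if (buffer.trim()) yield buffer.trim(); // flush the trailing fragment
}
```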
### Lifecycle Hooks
| Method | Description |
| ----------------------------- | ------------------------------------------- |
| `beforeCallStart(connection)` | Return `false` to reject the call |
| `onCallStart(connection)` | Called after a call is accepted |
| `onCallEnd(connection)` | Called when a call ends |
| `onInterrupt(connection)` | Called when user interrupts during playback |
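A sketch of these hooks in use, greeting the caller on accept via the `speak` convenience method (covered below):

```typescript
import { Agent, type Connection } from "agents";
import { withVoice, WorkersAISTT, WorkersAITTS } from "@cloudflare/voice";

const VoiceAgent = withVoice(Agent);

export class MyAgent extends VoiceAgent<Env> {
  stt = new WorkersAISTT(this.env.AI);
  tts = new WorkersAITTS(this.env.AI);

  async onCallStart(connection: Connection) {
    // Greet the caller as soon as the call is accepted
    await this.speak(connection, "Hi! How can I help you today?");
  }

  onInterrupt(connection: Connection) {
    // The user spoke over TTS playback; the pipeline has already
    // cancelled the in-flight response
    console.log(`Connection ${connection.id} interrupted playback`);
  }

  async onTurn(transcript: string) {
    return "You said: " + transcript;
  }
}
```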
### Pipeline Hooks
Intercept and transform data at each pipeline stage. Return `null` to skip the current utterance.
| Method | Receives | Can skip? |
| ------------------------------------------ | ----------------- | --------- |
| `beforeTranscribe(audio, connection)` | Raw PCM after VAD | Yes |
| `afterTranscribe(transcript, connection)` | STT text | Yes |
| `beforeSynthesize(text, connection)` | Text before TTS | Yes |
| `afterSynthesize(audio, text, connection)` | Audio after TTS | Yes |
```typescript
export class MyAgent extends VoiceAgent<Env> {
// Filter out short/noise transcripts
afterTranscribe(transcript: string, connection: Connection) {
if (transcript.length < 3) return null; // skip
return transcript;
}
// Add SSML or modify text before TTS
beforeSynthesize(text: string, connection: Connection) {
return text.replace(/\bAI\b/g, "A.I."); // improve pronunciation
}
}
```
### Convenience Methods
| Method | Description |
| -------------------------- | -------------------------------------------- |
| `speak(connection, text)` | Synthesize and send audio to one connection |
| `speakAll(text)` | Synthesize and send audio to all connections |
| `forceEndCall(connection)` | Programmatically end a call |
| `saveMessage(role, text)` | Persist a message to conversation history |
| `getConversationHistory()` | Retrieve conversation history from SQLite |
### Configuration Options
Pass options to `withVoice()` as the second argument:
```typescript
const VoiceAgent = withVoice(Agent, {
historyLimit: 20, // Max messages loaded for context (default: 20)
audioFormat: "mp3", // Audio format sent to client (default: "mp3")
maxMessageCount: 1000, // Max messages in SQLite (default: 1000)
minAudioBytes: 16000, // Min audio to process, 0.5s (default: 16000)
vadThreshold: 0.5, // VAD probability threshold (default: 0.5)
vadPushbackSeconds: 2, // Audio pushed back on VAD reject (default: 2)
vadRetryMs: 3000 // Retry delay after VAD reject (default: 3000)
});
```
## Server API: `withVoiceInput`
`withVoiceInput(Agent)` adds STT-only voice input — no TTS, no LLM, no response generation. Use this for dictation, search-by-voice, or any UI where you need speech-to-text without a conversational agent.
```typescript
import { Agent } from "agents";
import { withVoiceInput, WorkersAIFluxSTT } from "@cloudflare/voice";
const InputAgent = withVoiceInput(Agent);
export class DictationAgent extends InputAgent<Env> {
streamingStt = new WorkersAIFluxSTT(this.env.AI);
onTranscript(text: string, connection: Connection) {
console.log("User said:", text);
// Save to storage, trigger a search, forward to another service, etc.
}
}
```
### `onTranscript(text, connection)`
Called after each utterance is transcribed. Override this to process the transcript.
### Hooks
`withVoiceInput` supports the same lifecycle and STT pipeline hooks as `withVoice`:
- `beforeCallStart(connection)` — return `false` to reject
- `onCallStart(connection)`, `onCallEnd(connection)`, `onInterrupt(connection)`
- `beforeTranscribe(audio, connection)`, `afterTranscribe(transcript, connection)`
It does **not** have TTS hooks (`beforeSynthesize`, `afterSynthesize`) or `onTurn`.
## Client API: React Hooks
### `useVoiceAgent`
Wraps `VoiceClient` for `withVoice` agents. Manages connection, mic capture, playback, silence detection, and interrupt detection.
```tsx
import { useVoiceAgent } from "@cloudflare/voice/react";
const {
status, // "idle" | "listening" | "thinking" | "speaking"
transcript, // TranscriptMessage[] — conversation history
interimTranscript, // string | null — real-time partial transcript
metrics, // VoicePipelineMetrics | null
audioLevel, // number (0–1) — current mic RMS level
isMuted, // boolean
connected, // boolean — WebSocket connected
error, // string | null
startCall, // () => Promise<void>
endCall, // () => void
toggleMute, // () => void
sendText, // (text: string) => void — bypass STT
sendJSON, // (data: Record<string, unknown>) => void
lastCustomMessage // unknown — last non-voice message from server
} = useVoiceAgent({
agent: "MyAgent", // Required: Durable Object class name
name: "default", // Instance name (default: "default")
host: window.location.host // Host to connect to
});
```
#### Tuning Options
| Option | Type | Default | Description |
| -------------------- | -------- | ------- | ------------------------------------------------ |
| `silenceThreshold` | `number` | `0.04` | RMS below this is silence |
| `silenceDurationMs` | `number` | `500` | Silence duration before `end_of_speech` (ms) |
| `interruptThreshold` | `number` | `0.05` | RMS to detect speech during playback |
| `interruptChunks` | `number` | `2` | Consecutive high-RMS chunks to trigger interrupt |
Changing tuning options triggers a client reconnect (the connection key includes them).
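An illustrative sketch of the RMS logic these options control (not the hook's actual implementation): audio below `silenceThreshold` for `silenceDurationMs` triggers `end_of_speech`.

```typescript
// Root-mean-square level of one audio chunk (samples in -1..1)
function rms(samples: Float32Array): number {
  let sum = 0;
  for (const s of samples) sum += s * s;
  return Math.sqrt(sum / samples.length);
}

// Stateful detector: returns true once when RMS has stayed below
// `threshold` for at least `durationMs` of wall-clock time.
function makeSilenceDetector(threshold = 0.04, durationMs = 500) {
  let silentSince: number | null = null;
  return (chunk: Float32Array, nowMs: number): boolean => {
    if (rms(chunk) >= threshold) {
      silentSince = null; // speech resets the silence timer
      return false;
    }
    if (silentSince === null) silentSince = nowMs;
    if (nowMs - silentSince >= durationMs) {
      silentSince = null; // re-arm for the next utterance
      return true; // fire end_of_speech
    }
    return false;
  };
}
```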
### `useVoiceInput`
Lightweight hook for dictation / voice-to-text. Accumulates user transcripts into a single string.
```tsx
import { useVoiceInput } from "@cloudflare/voice/react";
function Dictation() {
const {
transcript, // string — accumulated text from all utterances
interimTranscript, // string | null — current partial transcript
isListening, // boolean
audioLevel, // number (0–1)
isMuted, // boolean
error, // string | null
start, // () => Promise<void>
stop, // () => void
toggleMute, // () => void
clear // () => void — clear accumulated transcript
} = useVoiceInput({ agent: "DictationAgent" });
return (
<div>
<textarea
value={transcript + (interimTranscript ? " " + interimTranscript : "")}
readOnly
/>
<button onClick={isListening ? stop : start}>
{isListening ? "Stop" : "Dictate"}
</button>
</div>
);
}
```
## Client API: `VoiceClient`
Framework-agnostic client for environments without React.
```typescript
import { VoiceClient } from "@cloudflare/voice/client";
const client = new VoiceClient({ agent: "MyAgent" });
client.addEventListener("statuschange", (status) => {
console.log("Status:", status);
});
client.addEventListener("transcriptchange", (messages) => {
console.log("Transcript:", messages);
});
client.addEventListener("error", (err) => {
console.error("Error:", err);
});
client.connect();
await client.startCall();
// Later:
client.endCall();
client.disconnect();
```
### Events
| Event | Data Type | Description |
| ------------------- | ---------------------- | ------------------------------------- |
| `statuschange` | `VoiceStatus` | Pipeline state changed |
| `transcriptchange` | `TranscriptMessage[]` | Transcript updated |
| `interimtranscript` | `string \| null` | Interim transcript from streaming STT |
| `metricschange` | `VoicePipelineMetrics` | Pipeline timing metrics |
| `audiolevelchange` | `number` | Mic audio level (0–1) |
| `connectionchange` | `boolean` | WebSocket connected/disconnected |
| `mutechange` | `boolean` | Mute state changed |
| `error` | `string \| null` | Error occurred |
| `custommessage` | `unknown` | Non-voice message from server |
### Advanced Options
| Option | Type | Description |
| ----------------- | ------------------ | ----------------------------------------------------- |
| `transport` | `VoiceTransport` | Custom transport (default: WebSocket via PartySocket) |
| `audioInput` | `VoiceAudioInput` | Custom mic capture (default: built-in AudioWorklet) |
| `preferredFormat` | `VoiceAudioFormat` | Hint for server audio format (advisory only) |
## Providers
### Built-in (Workers AI)
No API keys required — use your Workers AI binding:
| Class | Type | Default Model |
| ------------------ | ------------- | ------------------------------ |
| `WorkersAISTT` | Batch STT | `@cf/deepgram/nova-3` |
| `WorkersAIFluxSTT` | Streaming STT | `@cf/deepgram/nova-3` |
| `WorkersAITTS` | TTS | `@cf/deepgram/aura-1` |
| `WorkersAIVAD` | VAD | `@cf/pipecat-ai/smart-turn-v2` |
```typescript
import {
WorkersAISTT,
WorkersAITTS,
WorkersAIVAD,
WorkersAIFluxSTT
} from "@cloudflare/voice";
// Default options
stt = new WorkersAISTT(this.env.AI);
tts = new WorkersAITTS(this.env.AI);
vad = new WorkersAIVAD(this.env.AI);
// Custom options
stt = new WorkersAISTT(this.env.AI, {
model: "@cf/deepgram/nova-3",
language: "en"
});
tts = new WorkersAITTS(this.env.AI, {
model: "@cf/deepgram/aura-1",
speaker: "asteria"
});
```
### Third-Party Providers
| Package | Class | Description |
| ------------------------------ | ---------------------- | ----------------------- |
| `@cloudflare/voice-deepgram` | `DeepgramStreamingSTT` | Real-time streaming STT |
| `@cloudflare/voice-elevenlabs` | `ElevenLabsTTS` | High-quality TTS |
| `@cloudflare/voice-twilio` | Twilio adapter | Telephony (phone calls) |
**ElevenLabs TTS:**
```typescript
import { ElevenLabsTTS } from "@cloudflare/voice-elevenlabs";
export class MyAgent extends VoiceAgent<Env> {
stt = new WorkersAISTT(this.env.AI);
tts = new ElevenLabsTTS({
apiKey: this.env.ELEVENLABS_API_KEY,
voiceId: "21m00Tcm4TlvDq8ikWAM"
});
}
```
**Deepgram Streaming STT:**
```typescript
import { DeepgramStreamingSTT } from "@cloudflare/voice-deepgram";
export class MyAgent extends VoiceAgent<Env> {
streamingStt = new DeepgramStreamingSTT({
apiKey: this.env.DEEPGRAM_API_KEY
});
tts = new WorkersAITTS(this.env.AI);
}
```
### Custom Providers
Any object satisfying the provider interface works:
```typescript
export class MyAgent extends VoiceAgent<Env> {
stt = {
transcribe: async (audio: ArrayBuffer, signal?: AbortSignal) => {
const resp = await fetch("https://my-stt.example.com/v1/transcribe", {
method: "POST",
body: audio,
signal
});
return ((await resp.json()) as { text: string }).text;
}
};
tts = {
synthesize: async (text: string, signal?: AbortSignal) => {
const resp = await fetch("https://my-tts.example.com/v1/synthesize", {
method: "POST",
body: JSON.stringify({ text }),
headers: { "Content-Type": "application/json" },
signal
});
return resp.arrayBuffer();
}
};
}
```
## Streaming STT
Streaming STT transcribes audio in real time as the user speaks, eliminating the latency of batch transcription. When a streaming STT provider is set, the pipeline creates a per-utterance session that receives audio chunks incrementally.
The client receives `transcript_interim` messages with partial results as the user speaks. By the time the user stops, the transcript is already (nearly) ready — `session.finish()` typically takes ~50ms.
```typescript
export class MyAgent extends VoiceAgent<Env> {
// Streaming STT replaces batch stt when set
streamingStt = new DeepgramStreamingSTT({
apiKey: this.env.DEEPGRAM_API_KEY
});
tts = new WorkersAITTS(this.env.AI);
async onTurn(transcript: string, context: VoiceTurnContext) {
// transcript is the final, stable text
return "You said: " + transcript;
}
}
```
The client displays interim transcripts automatically:
```tsx
const { interimTranscript, transcript } = useVoiceAgent({ agent: "MyAgent" });
// interimTranscript updates in real time as the user speaks
// transcript contains finalized messages
```
Some streaming STT providers (like Deepgram) support **provider-driven end-of-turn**: the provider detects when the user has finished speaking and triggers the LLM pipeline immediately, bypassing client-side silence detection. This further reduces latency.
## Text Messages
`withVoice` agents can also receive text messages, bypassing STT entirely. This is useful for chat-style input alongside voice.
**Client:**
```tsx
const { sendText } = useVoiceAgent({ agent: "MyAgent" });
// Send text — goes straight to onTurn() without STT
sendText("What is the weather like today?");
```
Text messages work both during and outside of active calls. During a call, the response is spoken aloud via TTS. Outside a call, the response is sent as text-only transcript messages.
## Custom Messages
Send and receive application-level JSON messages alongside voice protocol messages. Non-voice messages pass through to your `onMessage` handler on the server and emit `custommessage` events on the client.
**Server:**
```typescript
export class MyAgent extends VoiceAgent<Env> {
  onMessage(connection: Connection, message: WSMessage) {
    if (typeof message !== "string") return; // ignore binary audio frames
    const data = JSON.parse(message);
    if (data.type === "kick_speaker") {
      this.forceEndCall(connection);
    }
  }
}
```
**Client:**
```tsx
const { sendJSON, lastCustomMessage } = useVoiceAgent({ agent: "MyAgent" });
// Send custom JSON
sendJSON({ type: "kick_speaker" });
// Receive custom messages
useEffect(() => {
if (lastCustomMessage) {
console.log("Custom message:", lastCustomMessage);
}
}, [lastCustomMessage]);
```
## Single-Speaker Enforcement
Use `beforeCallStart` to restrict who can start a call. This example enforces single-speaker — only one connection can be the active speaker at a time:
```typescript
export class MyAgent extends VoiceAgent<Env> {
#speakerId: string | null = null;
beforeCallStart(connection: Connection) {
if (this.#speakerId !== null) {
return false; // reject — someone else is speaking
}
this.#speakerId = connection.id;
return true;
}
onCallEnd(connection: Connection) {
if (this.#speakerId === connection.id) {
this.#speakerId = null;
}
}
}
```
## Telephony (Twilio)
Connect phone calls to your voice agent using the Twilio adapter:
```sh
npm install @cloudflare/voice-twilio
```
The adapter bridges Twilio Media Streams to your VoiceAgent:
```
Phone → Twilio → WebSocket → TwilioAdapter → WebSocket → VoiceAgent
```
**Important:** `WorkersAITTS` returns MP3, which cannot be decoded to PCM in the Workers runtime. When using the Twilio adapter, use a TTS provider that outputs raw PCM (for example, ElevenLabs with `outputFormat: "pcm_16000"`).
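For example, a phone-capable agent using the `outputFormat` value from the note above (a configuration sketch; see the ElevenLabs provider section for the base options):

```typescript
import { Agent } from "agents";
import { withVoice, WorkersAISTT } from "@cloudflare/voice";
import { ElevenLabsTTS } from "@cloudflare/voice-elevenlabs";

const VoiceAgent = withVoice(Agent);

export class PhoneAgent extends VoiceAgent<Env> {
  stt = new WorkersAISTT(this.env.AI);
  // Raw 16 kHz PCM: the Twilio bridge can transcode this directly,
  // with no MP3 decode required in the Workers runtime
  tts = new ElevenLabsTTS({
    apiKey: this.env.ELEVENLABS_API_KEY,
    voiceId: "21m00Tcm4TlvDq8ikWAM",
    outputFormat: "pcm_16000"
  });

  async onTurn(transcript: string) {
    return "You said: " + transcript;
  }
}
```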
## Pipeline Metrics
`withVoice` agents emit timing metrics after each turn:
```tsx
const { metrics } = useVoiceAgent({ agent: "MyAgent" });
// metrics: {
// vad_ms: 45, // VAD check time
// stt_ms: 120, // STT transcription time
// llm_ms: 850, // LLM response time
// tts_ms: 200, // Cumulative TTS synthesis time
// first_audio_ms: 950, // Time to first audio byte
// total_ms: 1200 // Total pipeline time
// }
```
## Conversation History
`withVoice` automatically persists conversation messages to SQLite. Access history in your `onTurn` via `context.messages`, or directly:
```typescript
// Get history (most recent N messages)
const history = this.getConversationHistory(20);
// Manually save a message
this.saveMessage("assistant", "Welcome! How can I help?");
```
History survives Durable Object restarts, hibernation, and client reconnections.
## Examples
- [`examples/voice-agent`](https://github.com/cloudflare/agents/tree/main/examples/voice-agent) — full voice agent with Workers AI
- [`examples/voice-input`](https://github.com/cloudflare/agents/tree/main/examples/voice-input) — voice input (dictation) example
## Related
- [Agent Class](./agent-class.md) — understanding the base Agent class
- [Chat Agents](./chat-agents.md) — text-based AI chat agents
- [State Management](./state.md) — managing agent state