# @cloudflare/voice

Voice pipeline for [Cloudflare Agents](https://github.com/cloudflare/agents) -- STT, TTS, VAD, streaming, and real-time audio over WebSocket.

> **Experimental.** This API is under active development and will break between releases. Pin your version and expect to rewrite when upgrading.

## Install

```bash
npm install @cloudflare/voice
```

## Exports

| Export path                | What it provides                                                                                        |
| -------------------------- | ------------------------------------------------------------------------------------------------------- |
| `@cloudflare/voice`        | Server-side mixins (`withVoice`, `withVoiceInput`), provider types, Workers AI providers, SFU utilities |
| `@cloudflare/voice/react`  | React hooks (`useVoiceAgent`, `useVoiceInput`)                                                          |
| `@cloudflare/voice/client` | Framework-agnostic `VoiceClient` class                                                                  |

## Server: full voice agent (`withVoice`)

Adds the complete voice pipeline: audio buffering, VAD, STT, LLM turn handling, streaming TTS, interruption, and conversation persistence.

```typescript
import { Agent } from "agents";
import { withVoice, type VoiceTurnContext } from "@cloudflare/voice";

const VoiceAgent = withVoice(Agent);

export class MyAgent extends VoiceAgent {
  async onTurn(transcript: string, context: VoiceTurnContext) {
    // Return a string or AsyncIterable (for streaming TTS)
    return "Hello! I heard you say: " + transcript;
  }
}
```

### Provider properties

| Property       | Type                   | Required | Description                                |
| -------------- | ---------------------- | -------- | ------------------------------------------ |
| `stt`          | `STTProvider`          | No       | Batch speech-to-text (default: Workers AI) |
| `tts`          | `TTSProvider`          | Yes      | Text-to-speech (default: Workers AI)       |
| `vad`          | `VADProvider`          | No       | Voice activity detection                   |
| `streamingStt` | `StreamingSTTProvider` | No       | Streaming STT for real-time transcripts    |

### Lifecycle hooks

| Method                           | Description                                                                        |
| -------------------------------- | ---------------------------------------------------------------------------------- |
| `onTurn(transcript, context)`    | **Required.** Handle a user utterance. Return `string` or `AsyncIterable<string>`. |
| `onCallStart(connection)`        | Called when a voice call begins.                                                   |
| `onCallEnd(connection)`          | Called when a voice call ends.                                                     |
| `onInterrupt(connection)`        | Called when the user interrupts playback.                                          |
| `beforeCallStart(connection)`    | Return `false` to reject a call.                                                   |
| `onMessage(connection, message)` | Handle non-voice WebSocket messages (voice protocol is intercepted automatically). |

### Pipeline hooks

| Method                                     | Description                                          |
| ------------------------------------------ | ---------------------------------------------------- |
| `beforeTranscribe(audio, connection)`      | Process audio before STT. Return `null` to skip.     |
| `afterTranscribe(transcript, connection)`  | Process transcript after STT. Return `null` to skip. |
| `beforeSynthesize(text, connection)`       | Process text before TTS. Return `null` to skip.      |
| `afterSynthesize(audio, text, connection)` | Process audio after TTS. Return `null` to skip.      |
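As an illustrative sketch of the pipeline hooks, here is how reply text might be cleaned up before it reaches the TTS provider. `stripForSpeech` is a hypothetical helper, not part of `@cloudflare/voice`; the `beforeSynthesize` wiring is shown in comments using the hook signature from the table above.

```typescript
// Hypothetical helper (not part of @cloudflare/voice): strip markdown
// punctuation that sounds wrong when read aloud by a TTS voice.
function stripForSpeech(text: string): string {
  return text
    .replace(/[*_`#]/g, "") // drop emphasis/heading markers
    .replace(/\[([^\]]*)\]\([^)]*\)/g, "$1") // keep link text, drop the URL
    .trim();
}

// Inside a withVoice agent class it could be wired up roughly like:
//
//   async beforeSynthesize(text: string, connection: Connection) {
//     return stripForSpeech(text); // return null to skip synthesis for this turn
//   }
```

Because hooks can return `null`, the same override is also a natural place to suppress speech entirely, e.g. for replies that are only meant for the text transcript.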
### Convenience methods

- `speak(connection, text)` -- synthesize and send audio to one connection
- `speakAll(text)` -- synthesize and send audio to all connections
- `forceEndCall(connection)` -- programmatically end a call
- `saveMessage(role, content)` -- persist a message to conversation history
- `getConversationHistory()` -- retrieve conversation history from SQLite

## Server: voice input only (`withVoiceInput`)

STT-only mixin -- no TTS, no LLM. Use when you only need speech-to-text (e.g., dictation, transcription).

```typescript
import { Server, type Connection } from "partyserver";
import { withVoiceInput, WorkersAIFluxSTT } from "@cloudflare/voice";

const InputServer = withVoiceInput(Server);

export class VoiceInputAgent extends InputServer {
  streamingStt = new WorkersAIFluxSTT(this.env.AI);

  onTranscript(text: string, connection: Connection) {
    console.log("User said:", text);
  }
}
```

## Client: React

```tsx
import { useVoiceAgent } from "@cloudflare/voice/react";

function App() {
  const {
    status,            // "idle" | "listening" | "thinking" | "speaking"
    transcript,        // TranscriptMessage[]
    interimTranscript, // string | null (real-time partial transcript)
    metrics,           // VoicePipelineMetrics | null
    audioLevel,        // number (0-1)
    isMuted,           // boolean
    connected,         // boolean
    error,             // string | null
    startCall,         // () => Promise<void>
    endCall,           // () => void
    toggleMute,        // () => void
    sendText,          // (text: string) => void
    sendJSON           // (data: Record<string, unknown>) => void
  } = useVoiceAgent({ agent: "my-agent" });

  return <div>Status: {status}</div>;
}
```

For voice input only:

```tsx
import { useVoiceInput } from "@cloudflare/voice/react";

const { transcript, interimTranscript, isListening, start, stop, clear } =
  useVoiceInput({ agent: "VoiceInputAgent" });
```

## Client: vanilla JavaScript

```typescript
import { VoiceClient } from "@cloudflare/voice/client";

const client = new VoiceClient({ agent: "my-agent" });
client.addEventListener("statuschange", () => console.log(client.status));

client.connect();
await client.startCall();
```

## Workers AI providers (built-in)

All default providers use Workers AI bindings -- no API keys required:

| Class              | Type          | Workers AI model                  |
| ------------------ | ------------- | --------------------------------- |
| `WorkersAISTT`     | Batch STT     | `@cf/deepgram/nova-3`             |
| `WorkersAIFluxSTT` | Streaming STT | `@cf/deepgram/nova-3` (WebSocket) |
| `WorkersAITTS`     | TTS           | `@cf/deepgram/aura-1`             |
| `WorkersAIVAD`     | VAD           | `@cf/pipecat-ai/smart-turn-v2`    |

## Third-party providers

| Package                        | What it provides                         |
| ------------------------------ | ---------------------------------------- |
| `@cloudflare/voice-deepgram`   | Streaming STT (Deepgram Nova)            |
| `@cloudflare/voice-elevenlabs` | TTS (ElevenLabs)                         |
| `@cloudflare/voice-twilio`     | Telephony adapter (Twilio Media Streams) |

## Related

- [`examples/voice-agent`](../../examples/voice-agent) -- full voice agent example
- [`examples/voice-input`](../../examples/voice-input) -- voice input (dictation) example
- [`experimental/voice.md`](../../experimental/voice.md) -- detailed API reference and protocol docs
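As a closing sketch of the streaming path mentioned under `onTurn`: returning an `AsyncIterable<string>` instead of a plain string lets the pipeline start TTS on early chunks before the full reply exists. The sentence-splitting helper below is illustrative, not part of `@cloudflare/voice`, and the `onTurn` wiring is shown in comments.

```typescript
// Illustrative helper: split a reply into sentence-sized chunks so that
// streaming TTS can begin speaking before the whole reply is available.
async function* sentenceChunks(text: string): AsyncGenerator<string> {
  // Greedily take runs of non-terminator characters plus one terminator.
  for (const sentence of text.match(/[^.!?]+[.!?]?/g) ?? []) {
    yield sentence.trim();
  }
}

// In a withVoice agent, onTurn can return the iterable directly, roughly:
//
//   async onTurn(transcript: string, context: VoiceTurnContext) {
//     return sentenceChunks(`I heard: ${transcript}. Anything else?`);
//   }
```

In a real agent the chunks would more likely come from an LLM's token stream than from a pre-built string; the point is only that any `AsyncIterable<string>` works as an `onTurn` return value.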