README.md - deathbyknowledge/agents

branch:
README.md
7289 bytesRaw
# @cloudflare/voice

Voice pipeline for [Cloudflare Agents](https://github.com/cloudflare/agents) -- STT, TTS, VAD, streaming, and real-time audio over WebSocket.

> **Experimental.** This API is under active development and will break between releases. Pin your version and expect to rewrite when upgrading.

## Install

```bash
npm install @cloudflare/voice
```

## Exports

| Export path                | What it provides                                                                                        |
| -------------------------- | ------------------------------------------------------------------------------------------------------- |
| `@cloudflare/voice`        | Server-side mixins (`withVoice`, `withVoiceInput`), provider types, Workers AI providers, SFU utilities |
| `@cloudflare/voice/react`  | React hooks (`useVoiceAgent`, `useVoiceInput`)                                                          |
| `@cloudflare/voice/client` | Framework-agnostic `VoiceClient` class                                                                  |

## Server: full voice agent (`withVoice`)

Adds the complete voice pipeline: audio buffering, VAD, STT, LLM turn handling, streaming TTS, interruption, and conversation persistence.

```typescript
import { Agent } from "agents";
import { withVoice, type VoiceTurnContext } from "@cloudflare/voice";

const VoiceAgent = withVoice(Agent);

export class MyAgent extends VoiceAgent<Env> {
  async onTurn(transcript: string, context: VoiceTurnContext) {
    // Return a string or AsyncIterable<string> (for streaming TTS)
    return "Hello! I heard you say: " + transcript;
  }
}
```

### Provider properties

| Property       | Type                   | Required | Description                                |
| -------------- | ---------------------- | -------- | ------------------------------------------ |
| `stt`          | `STTProvider`          | No       | Batch speech-to-text (default: Workers AI) |
| `tts`          | `TTSProvider`          | Yes      | Text-to-speech (default: Workers AI)       |
| `vad`          | `VADProvider`          | No       | Voice activity detection                   |
| `streamingStt` | `StreamingSTTProvider` | No       | Streaming STT for real-time transcripts    |

### Lifecycle hooks

| Method                           | Description                                                                        |
| -------------------------------- | ---------------------------------------------------------------------------------- |
| `onTurn(transcript, context)`    | **Required.** Handle a user utterance. Return `string` or `AsyncIterable<string>`. |
| `onCallStart(connection)`        | Called when a voice call begins.                                                   |
| `onCallEnd(connection)`          | Called when a voice call ends.                                                     |
| `onInterrupt(connection)`        | Called when user interrupts playback.                                              |
| `beforeCallStart(connection)`    | Return `false` to reject a call.                                                   |
| `onMessage(connection, message)` | Handle non-voice WebSocket messages (voice protocol is intercepted automatically). |

### Pipeline hooks

| Method                                     | Description                                          |
| ------------------------------------------ | ---------------------------------------------------- |
| `beforeTranscribe(audio, connection)`      | Process audio before STT. Return `null` to skip.     |
| `afterTranscribe(transcript, connection)`  | Process transcript after STT. Return `null` to skip. |
| `beforeSynthesize(text, connection)`       | Process text before TTS. Return `null` to skip.      |
| `afterSynthesize(audio, text, connection)` | Process audio after TTS. Return `null` to skip.      |

### Convenience methods

- `speak(connection, text)` -- synthesize and send audio to one connection
- `speakAll(text)` -- synthesize and send audio to all connections
- `forceEndCall(connection)` -- programmatically end a call
- `saveMessage(role, content)` -- persist a message to conversation history
- `getConversationHistory()` -- retrieve conversation history from SQLite

## Server: voice input only (`withVoiceInput`)

STT-only mixin -- no TTS, no LLM. Use when you only need speech-to-text (e.g., dictation, transcription).

```typescript
import { Server } from "partyserver";
import { withVoiceInput, WorkersAIFluxSTT } from "@cloudflare/voice";

const InputServer = withVoiceInput(Server);

export class VoiceInputAgent extends InputServer<Env> {
  streamingStt = new WorkersAIFluxSTT(this.env.AI);

  onTranscript(text: string, connection: Connection) {
    console.log("User said:", text);
  }
}
```

## Client: React

```tsx
import { useVoiceAgent } from "@cloudflare/voice/react";

function App() {
  const {
    status, // "idle" | "listening" | "thinking" | "speaking"
    transcript, // TranscriptMessage[]
    interimTranscript, // string | null (real-time partial transcript)
    metrics, // VoicePipelineMetrics | null
    audioLevel, // number (0-1)
    isMuted, // boolean
    connected, // boolean
    error, // string | null
    startCall, // () => Promise<void>
    endCall, // () => void
    toggleMute, // () => void
    sendText, // (text: string) => void
    sendJSON // (data: Record<string, unknown>) => void
  } = useVoiceAgent({ agent: "my-agent" });

  return <div>Status: {status}</div>;
}
```

For voice input only:

```tsx
import { useVoiceInput } from "@cloudflare/voice/react";

const { transcript, interimTranscript, isListening, start, stop, clear } =
  useVoiceInput({ agent: "VoiceInputAgent" });
```

## Client: vanilla JavaScript

```typescript
import { VoiceClient } from "@cloudflare/voice/client";

const client = new VoiceClient({ agent: "my-agent" });

client.addEventListener("statuschange", () => console.log(client.status));
client.connect();
await client.startCall();
```

## Workers AI providers (built-in)

All default providers use Workers AI bindings -- no API keys required:

| Class              | Type          | Workers AI model                  |
| ------------------ | ------------- | --------------------------------- |
| `WorkersAISTT`     | Batch STT     | `@cf/deepgram/nova-3`             |
| `WorkersAIFluxSTT` | Streaming STT | `@cf/deepgram/nova-3` (WebSocket) |
| `WorkersAITTS`     | TTS           | `@cf/deepgram/aura-1`             |
| `WorkersAIVAD`     | VAD           | `@cf/pipecat-ai/smart-turn-v2`    |

## Third-party providers

| Package                        | What it provides                         |
| ------------------------------ | ---------------------------------------- |
| `@cloudflare/voice-deepgram`   | Streaming STT (Deepgram Nova)            |
| `@cloudflare/voice-elevenlabs` | TTS (ElevenLabs)                         |
| `@cloudflare/voice-twilio`     | Telephony adapter (Twilio Media Streams) |

## Related

- [`examples/voice-agent`](../../examples/voice-agent) -- full voice agent example
- [`examples/voice-input`](../../examples/voice-input) -- voice input (dictation) example
- [`experimental/voice.md`](../../experimental/voice.md) -- detailed API reference and protocol docs