# @cloudflare/voice-deepgram
Deepgram streaming speech-to-text provider for the [Cloudflare Agents](https://github.com/cloudflare/agents) voice pipeline.
Uses Deepgram's real-time WebSocket API to transcribe audio incrementally as it arrives, emitting interim and final results while the user is still speaking. This removes STT latency from the critical path — by the time the user stops speaking, the transcript is already (nearly) ready.
## Install
```bash
npm install @cloudflare/voice-deepgram
```
## Usage
Set `streamingStt` on your voice agent:
```typescript
import { Agent } from "agents";
import { withVoice, type VoiceTurnContext } from "@cloudflare/voice";
import { DeepgramStreamingSTT } from "@cloudflare/voice-deepgram";
const VoiceAgent = withVoice(Agent);
export class MyAgent extends VoiceAgent<Env> {
  streamingStt = new DeepgramStreamingSTT({
    apiKey: this.env.DEEPGRAM_API_KEY
  });

  async onTurn(transcript: string, context: VoiceTurnContext) {
    // your LLM logic — transcript arrives with near-zero STT latency
  }
}
```
The client receives `transcript_interim` messages in real time, which can be displayed as the user speaks. The `useVoiceAgent` React hook exposes this as `interimTranscript`.
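A typical way to render this on the client is to show the interim text provisionally beneath the last finalized transcript. The sketch below is dependency-free; the `VoiceAgentState` shape is an assumption modeled on the `interimTranscript` field mentioned above, and `renderTranscript` is a hypothetical helper, not part of the library:

```typescript
// Hypothetical shape of what `useVoiceAgent` exposes; only
// `interimTranscript` is named in this README, the rest is illustrative.
interface VoiceAgentState {
  transcript: string;        // last finalized transcript
  interimTranscript: string; // live, unstable text while the user speaks
}

// Render sketch: append the interim text provisionally while it is unstable.
function renderTranscript(state: VoiceAgentState): string {
  return state.interimTranscript
    ? `${state.transcript}\n(listening: ${state.interimTranscript})`
    : state.transcript;
}

console.log(renderTranscript({ transcript: "Hello.", interimTranscript: "how are" }));
// "Hello.\n(listening: how are)"
```

Because interim segments are unstable, each new interim result should replace the previous one rather than be appended.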
## Options
| Option | Default | Description |
| ------------- | ------------ | -------------------------------------------------------------- |
| `apiKey` | (required) | Deepgram API key |
| `model` | `"nova-3"` | Deepgram model. Nova-3 is the latest and most accurate. |
| `language` | `"en"` | Language code (e.g. `"en"`, `"es"`, `"fr"`) |
| `smartFormat` | `true` | Enable smart formatting (numbers, dates, currency) |
| `punctuate` | `true` | Enable automatic punctuation |
| `fillerWords` | `false` | Include filler words (um, uh) in transcripts |
| `encoding` | `"linear16"` | Audio encoding. Must match the voice pipeline (16-bit PCM). |
| `sampleRate` | `16000` | Sample rate in Hz. Must match the voice pipeline (16kHz). |
| `channels`    | `1`          | Number of audio channels. Must match the voice pipeline (mono). |
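The table above can be expressed as an options object. The interface below is a sketch of the constructor options inferred from the table (the actual exported type may differ); the example overrides only the fields that deviate from defaults:

```typescript
// Sketch of the constructor options, mirroring the table above.
// Defaults noted in comments are those documented in this README.
interface DeepgramOptions {
  apiKey: string;        // required
  model?: string;        // default "nova-3"
  language?: string;     // default "en"
  smartFormat?: boolean; // default true
  punctuate?: boolean;   // default true
  fillerWords?: boolean; // default false
  encoding?: string;     // default "linear16" (must match the pipeline)
  sampleRate?: number;   // default 16000 (must match the pipeline)
  channels?: number;     // default 1 (must match the pipeline)
}

// Example: Spanish transcription that keeps filler words.
const options: DeepgramOptions = {
  apiKey: "<DEEPGRAM_API_KEY>", // use an env binding in practice
  language: "es",
  fillerWords: true,
};
```

Leave `encoding`, `sampleRate`, and `channels` at their defaults unless you have changed the voice pipeline's audio format to match.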
## How it works
1. When the user starts speaking, a WebSocket session is opened to Deepgram
2. Audio chunks are forwarded to Deepgram in real time via `feed()`
3. Deepgram sends back interim (unstable) and final (stable) transcript segments
4. These are relayed to the client as `transcript_interim` messages
5. When the user stops speaking, `finish()` sends a `CloseStream` message and returns the full transcript
6. The transcript is passed to `onTurn()` with near-zero additional STT latency
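The `feed()`/`finish()` lifecycle above can be sketched as a self-contained mock. No real WebSocket or Deepgram API is involved; the segment handling (interim results replacing each other, final results accumulating) is illustrative:

```typescript
type Segment = { text: string; isFinal: boolean };

// Mock of the streaming session lifecycle. Real recognition happens
// server-side at Deepgram; here each "chunk" is already-decoded text.
class MockStreamingSession {
  private finals: string[] = [];
  private interim = "";

  // Step 2: audio chunks are forwarded as they arrive. Interim segments
  // are unstable and replace each other; final segments accumulate.
  feed(chunkText: string, isFinal: boolean): Segment {
    if (isFinal) {
      this.finals.push(chunkText);
      this.interim = "";
    } else {
      this.interim = chunkText;
    }
    // Step 4: each segment would be relayed as a `transcript_interim` message.
    return { text: chunkText, isFinal };
  }

  // Step 5: close the stream and assemble the full transcript.
  finish(): string {
    const parts = [...this.finals];
    if (this.interim) parts.push(this.interim);
    return parts.join(" ");
  }
}

const session = new MockStreamingSession();
session.feed("hello", false);
session.feed("hello there", true);
session.feed("how are", false);
session.feed("how are you", true);
console.log(session.finish()); // "hello there how are you"
```

The key property is that by the time `finish()` is called, all the recognition work has already happened, which is why the transcript reaches `onTurn()` with near-zero additional latency.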
## Without a Deepgram key
If you do not have a Deepgram API key, the default voice agent uses Workers AI STT (batch mode) with no external API key required. Streaming STT is opt-in.