branch:
retries.md
13552 bytesRaw
# Retries
Retry failed operations with exponential backoff and jitter. The Agents SDK provides built-in retry support for scheduled tasks, queued tasks, and a general-purpose `this.retry()` method for your own code.
## Overview
Transient failures are common when calling external APIs, interacting with other services, or running background tasks. The retry system handles these automatically:
- **Exponential backoff** — each retry waits longer than the last
- **Jitter** — randomized delays prevent thundering herd problems
- **Configurable** — tune attempts, delays, and caps per call site
- **Built-in** — schedule, queue, and workflow operations retry automatically
## Quick Start
Use `this.retry()` to retry any async operation:
```typescript
import { Agent } from "agents";
export class MyAgent extends Agent {
async fetchWithRetry(url: string) {
const response = await this.retry(async () => {
const res = await fetch(url);
if (!res.ok) throw new Error(`HTTP ${res.status}`);
return res.json();
});
return response;
}
}
```
By default, `this.retry()` makes up to 3 attempts with jittered exponential backoff.
## `this.retry()`
The `retry()` method is available on every `Agent` instance. It retries the provided function on any thrown error by default.
```typescript
async retry<T>(
fn: (attempt: number) => Promise<T>,
options?: RetryOptions & {
shouldRetry?: (err: unknown, nextAttempt: number) => boolean;
}
): Promise<T>
```
**Parameters:**
- `fn` — the async function to retry. Receives the current attempt number (1-indexed).
- `options` — optional retry configuration (see [RetryOptions](#retryoptions) below). Options are validated eagerly — invalid values throw immediately.
- `options.shouldRetry` — optional predicate called with the thrown error and the next attempt number. Return `false` to stop retrying immediately. If not provided, all errors are retried.
**Returns:** the result of `fn` on success.
**Throws:** the last error if all attempts fail or `shouldRetry` returns `false`.
### Examples
**Basic retry:**
```typescript
const data = await this.retry(() => fetch("https://api.example.com/data"));
```
**Custom retry options:**
```typescript
const data = await this.retry(
async () => {
const res = await fetch("https://slow-api.example.com/data");
if (!res.ok) throw new Error(`HTTP ${res.status}`);
return res.json();
},
{
maxAttempts: 5,
baseDelayMs: 500,
maxDelayMs: 10000
}
);
```
**Using the attempt number:**
```typescript
const result = await this.retry(async (attempt) => {
console.log(`Attempt ${attempt}...`);
return await this.callExternalService();
});
```
**Selective retry with `shouldRetry`:**
Use `shouldRetry` to stop retrying on specific errors. The predicate receives both the error and the next attempt number:
```typescript
const data = await this.retry(
async () => {
const res = await fetch("https://api.example.com/data");
if (!res.ok) throw new HttpError(res.status, await res.text());
return res.json();
},
{
maxAttempts: 5,
shouldRetry: (err, nextAttempt) => {
// Don't retry 4xx client errors — our request is wrong
if (err instanceof HttpError && err.status >= 400 && err.status < 500) {
return false;
}
return true; // retry everything else (5xx, network errors, etc.)
}
}
);
```
## Retries in Schedules
Pass retry options when creating a schedule:
```typescript
// Retry up to 5 times if the callback fails
await this.schedule(
60,
"processTask",
{ taskId: "123" },
{
retry: { maxAttempts: 5 }
}
);
// Retry with custom backoff
await this.schedule(
new Date("2026-03-01T09:00:00Z"),
"sendReport",
{},
{
retry: {
maxAttempts: 3,
baseDelayMs: 1000,
maxDelayMs: 30000
}
}
);
// Cron with retries
await this.schedule(
"0 8 * * *",
"dailyDigest",
{},
{
retry: { maxAttempts: 3 }
}
);
// Interval with retries
await this.scheduleEvery(
30,
"poll",
{ source: "api" },
{
retry: { maxAttempts: 5, baseDelayMs: 200 }
}
);
```
If the callback throws, it is retried according to the retry options. If all attempts fail, the error is logged and routed through `onError()`. The schedule is still removed (for one-time schedules) or rescheduled (for cron/interval) regardless of success or failure.
## Retries in Queues
Pass retry options when adding a task to the queue:
```typescript
await this.queue(
"sendEmail",
{ to: "user@example.com" },
{
retry: { maxAttempts: 5 }
}
);
await this.queue("processWebhook", webhookData, {
retry: {
maxAttempts: 3,
baseDelayMs: 500,
maxDelayMs: 5000
}
});
```
If the callback throws, it is retried before the task is dequeued. After all attempts are exhausted, the task is dequeued and the error is logged.
## Validation
Retry options are validated eagerly when you call `this.retry()`, `queue()`, `schedule()`, or `scheduleEvery()`. Invalid options throw immediately instead of failing later at execution time:
```typescript
// Throws immediately: "retry.maxAttempts must be >= 1"
await this.queue("sendEmail", data, {
retry: { maxAttempts: 0 }
});
// Throws immediately: "retry.baseDelayMs must be > 0"
await this.schedule(
60,
"process",
{},
{
retry: { baseDelayMs: -100 }
}
);
// Throws immediately: "retry.maxAttempts must be an integer"
await this.retry(() => fetch(url), { maxAttempts: 2.5 });
// Throws immediately: "retry.baseDelayMs must be <= retry.maxDelayMs"
// because baseDelayMs: 5000 exceeds the default maxDelayMs: 3000
await this.queue("sendEmail", data, {
retry: { baseDelayMs: 5000 }
});
```
Validation resolves partial options against class-level or built-in defaults before checking cross-field constraints. This means `{ baseDelayMs: 5000 }` is caught immediately when the resolved `maxDelayMs` is 3000, rather than failing later at execution time.
## Default Behavior
Even without explicit retry options, scheduled and queued callbacks are retried with sensible defaults:
| Setting | Default |
| ------------- | ------- |
| `maxAttempts` | 3 |
| `baseDelayMs` | 100 |
| `maxDelayMs` | 3000 |
These defaults apply to `this.retry()`, `queue()`, `schedule()`, and `scheduleEvery()`. Per-call-site options override them.
### Class-Level Defaults
Override the defaults for your entire agent via `static options`:
```typescript
class MyAgent extends Agent {
static options = {
retry: { maxAttempts: 5, baseDelayMs: 200, maxDelayMs: 5000 }
};
}
```
You only need to specify the fields you want to change — unset fields fall back to the built-in defaults:
```typescript
class MyAgent extends Agent {
// Only override maxAttempts; baseDelayMs (100) and maxDelayMs (3000) stay default
static options = {
retry: { maxAttempts: 10 }
};
}
```
Class-level defaults are used as fallbacks when a call site does not specify retry options. Per-call-site options always take priority:
```typescript
// Uses class-level defaults (10 attempts)
await this.retry(() => fetch(url));
// Overrides to 2 attempts for this specific call
await this.retry(() => fetch(url), { maxAttempts: 2 });
```
To disable retries for a specific task, set `maxAttempts: 1`:
```typescript
await this.schedule(
60,
"oneShot",
{},
{
retry: { maxAttempts: 1 }
}
);
```
## RetryOptions
```typescript
interface RetryOptions {
/** Maximum number of attempts (including the first). Must be an integer >= 1. Default: 3 */
maxAttempts?: number;
/** Base delay in milliseconds for exponential backoff. Must be > 0 and <= maxDelayMs. Default: 100 */
baseDelayMs?: number;
/** Maximum delay cap in milliseconds. Must be > 0. Default: 3000 */
maxDelayMs?: number;
}
```
The delay between retries uses **full jitter exponential backoff**:
```
delay = random(0, min(2^attempt * baseDelayMs, maxDelayMs))
```
This means early retries are fast (often under 200ms), and later retries back off to avoid overwhelming a failing service. The randomization (jitter) prevents multiple agents from retrying at the exact same moment.
## How It Works
### Backoff Strategy
The retry system uses the "Full Jitter" strategy from the [AWS Architecture Blog](https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/). Given 3 attempts with default settings:
| Attempt | Upper Bound | Actual Delay |
| ------- | ----------------------------- | ---------------- |
| 1 | min(2^1 \* 100, 3000) = 200ms | random(0, 200ms) |
| 2 | min(2^2 \* 100, 3000) = 400ms | random(0, 400ms) |
| 3 | (no retry — final attempt) | — |
With `maxAttempts: 5` and `baseDelayMs: 500`:
| Attempt | Upper Bound | Actual Delay |
| ------- | ----------------------------- | ----------------- |
| 1 | min(2 \* 500, 3000) = 1000ms | random(0, 1000ms) |
| 2 | min(4 \* 500, 3000) = 2000ms | random(0, 2000ms) |
| 3 | min(8 \* 500, 3000) = 3000ms | random(0, 3000ms) |
| 4 | min(16 \* 500, 3000) = 3000ms | random(0, 3000ms) |
| 5 | (no retry — final attempt) | — |
### MCP Server Retries
When adding an MCP server, you can configure retry options for connection and reconnection attempts:
```typescript
await this.addMcpServer("github", "https://mcp.github.com", {
retry: { maxAttempts: 5, baseDelayMs: 1000, maxDelayMs: 10000 }
});
```
These options are persisted and used when:
- Restoring server connections after hibernation
- Establishing connections after OAuth completion
Default: 3 attempts, 500ms base delay, 5s max delay.
### Internal Retries
The SDK also uses retries internally for platform operations:
- **Workflow operations** (`terminateWorkflow`, `pauseWorkflow`, `resumeWorkflow`, `restartWorkflow`, `sendEventToWorkflow`) — retried with Durable Object-aware error detection. Transient DO errors are retried; overloaded errors are not.
These internal retries use hardcoded defaults and are not configurable.
## Patterns
### Retry with Logging
```typescript
class MyAgent extends Agent {
async resilientTask(payload: { url: string }) {
try {
const result = await this.retry(
async (attempt) => {
if (attempt > 1) {
console.log(`Retrying ${payload.url} (attempt ${attempt})...`);
}
const res = await fetch(payload.url);
if (!res.ok) throw new Error(`HTTP ${res.status}`);
return res.json();
},
{ maxAttempts: 5 }
);
console.log("Success:", result);
} catch (e) {
console.error("All retries failed:", e);
}
}
}
```
### Retry with Fallback
```typescript
class MyAgent extends Agent {
async fetchData() {
try {
return await this.retry(
() => fetch("https://primary-api.example.com/data"),
{ maxAttempts: 3, baseDelayMs: 200 }
);
} catch {
// Primary failed, try fallback
return await this.retry(
() => fetch("https://fallback-api.example.com/data"),
{ maxAttempts: 2 }
);
}
}
}
```
### Combining Retries with Scheduling
For operations that might take a long time to recover (minutes or hours), combine `this.retry()` for immediate retries with `this.schedule()` for delayed retries:
```typescript
class MyAgent extends Agent {
async syncData(payload: { source: string; attempt?: number }) {
const attempt = payload.attempt ?? 1;
try {
// Immediate retries for transient failures (seconds)
await this.retry(() => this.fetchAndProcess(payload.source), {
maxAttempts: 3,
baseDelayMs: 1000
});
} catch (e) {
if (attempt >= 5) {
console.error("Giving up after 5 scheduled attempts");
return;
}
// Schedule a retry in 5 minutes for longer outages
const delaySeconds = 300 * attempt;
await this.schedule(delaySeconds, "syncData", {
source: payload.source,
attempt: attempt + 1
});
console.log(`Scheduled retry ${attempt + 1} in ${delaySeconds}s`);
}
}
}
```
## Limitations
- **No dead-letter queue.** If a queued or scheduled task fails all retry attempts, it is removed. Implement your own persistence if you need to track failed tasks.
- **Retry delays block the agent.** During the backoff delay, the Durable Object is awake but idle. For short delays (under 3 seconds) this is fine. For longer recovery times, use `this.schedule()` instead.
- **Queue retries are head-of-line blocking.** Queue items are processed sequentially. If one item is being retried with long delays, it blocks all subsequent items. If you need independent retry behavior, use `this.retry()` inside the callback rather than per-task retry options on `queue()`.
- **No circuit breaker.** The retry system does not track failure rates across calls. If a service is persistently down, each task will exhaust its retry budget independently.
- **`shouldRetry` is only available on `this.retry()`.** The `shouldRetry` predicate cannot be used with `schedule()` or `queue()` because functions cannot be serialized to the database. For scheduled/queued tasks, handle non-retryable errors inside the callback itself.
## Related
- [Scheduling](./scheduling.md) — schedule tasks for future execution
- [Queue](./queue.md) — background task queue
- [Workflows](./workflows.md) — durable multi-step processing with automatic retries