15 changes: 8 additions & 7 deletions webhooks/delivery.mdx
Original file line number Diff line number Diff line change
@@ -10,7 +10,7 @@ You know what arrives ([Events](/webhooks/events)) and how to prove it's real ([
- **Strong retry behaviour.** Up to 6 attempts per event by default, with exponential backoff plus jitter on `5xx`, `408`, `429`, network errors, and worker-side timeouts. The vast majority of deliveries land on attempt 1; the retries are there for the occasional bad minute on your side.
- **Fast acknowledgement.** Any `2xx` ends it — the worker stops as soon as your server says ok.
- **Fast permanent failure.** Other `4xx` codes (`400`/`401`/`404`/etc.) are treated as fatal — we don't waste your retry budget when the request will never succeed.
- **Bounded budget.** 30-second per-attempt timeout, with up to ~39 seconds of backoff sleeps between attempts (jittered). If your server is still down after the final attempt, the event is logged and the worker moves on — there is no dead-letter queue today.
- **Bounded budget.** 15-second per-attempt timeout, with up to ~39 seconds of backoff sleeps between attempts (jittered). If your server is still down after the final attempt, the event is logged and the worker moves on — there is no dead-letter queue today.
- **At-least-once delivery.** A retry after your server timed out can re-deliver an event you already processed — always dedupe in your handler (see [Be idempotent](#be-idempotent) below).
- **URL guard, fail-closed.** Before every attempt the worker validates the target URL: it must be `https://`, must resolve to a public address, and must not redirect. A URL that fails the check is dropped immediately — fatal, no retry — see [Where we won't deliver](#where-we-wont-deliver) below.
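
As a rough illustration of the contract above, the status handling can be sketched as a small classifier. This is not the worker's actual code: `Outcome` and `classifyStatus` are hypothetical names, and codes the list does not mention (`1xx`, `3xx`) are returned as `unspecified` rather than guessed.

```typescript
// Hypothetical sketch of the status-code rules listed above.
type Outcome = "delivered" | "retriable" | "fatal" | "unspecified";

function classifyStatus(status: number): Outcome {
  // Any 2xx ends the delivery.
  if (status >= 200 && status < 300) return "delivered";
  // 408 and 429 are the two 4xx codes that are retried.
  if (status === 408 || status === 429) return "retriable";
  // Every 5xx is retried with backoff.
  if (status >= 500 && status < 600) return "retriable";
  // All other 4xx codes are treated as permanent failures.
  if (status >= 400 && status < 500) return "fatal";
  // Assumption: the list above does not say how other codes are handled.
  return "unspecified";
}
```

Network errors and worker-side timeouts are also retriable, but they carry no status code, so they fall outside this function.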

@@ -46,7 +46,7 @@ sequenceDiagram
Note over W: ✓ delivered after retry
```

The backoff *sleeps* sum to ~26.2 seconds in the average case (200ms + 1s + 5s + 10s + 10s) and ~39.3 seconds in the worst case (jitter ceiling). Wall-clock time also includes per-attempt network time, bounded by the 30-second per-attempt timeout: a healthy delivery finishes in milliseconds, while a worst case where every attempt hangs to the timeout can run up to ~3.5 minutes before the worker gives up. It stops as soon as it gets a 2xx or determines further retries are pointless.
The backoff *sleeps* sum to ~26.2 seconds in the average case (200ms + 1s + 5s + 10s + 10s) and ~39.3 seconds in the worst case (jitter ceiling). Wall-clock time also includes per-attempt network time, bounded by the 15-second per-attempt timeout: a healthy delivery finishes in milliseconds, while a worst case where every attempt hangs to the timeout can run up to ~2 minutes before the worker gives up. It stops as soon as it gets a 2xx or determines further retries are pointless.
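
The arithmetic above can be reproduced in a few lines. This sketch assumes the defaults quoted in this document (200ms initial delay, 5× growth, 10-second cap, ±50% jitter, 6 attempts, 15-second per-attempt timeout); the constant and function names are illustrative, not the worker's real configuration keys.

```typescript
// Illustrative constants; the values are the defaults quoted in this document.
const INITIAL_DELAY_MS = 200;
const GROWTH_FACTOR = 5;
const DELAY_CAP_MS = 10000;
const JITTER_FRACTION = 0.5;
const MAX_ATTEMPTS = 6;
const PER_ATTEMPT_TIMEOUT_MS = 15000;

// Un-jittered delay for retry index i (i = 0 is the sleep before attempt 2),
// clamped to the cap before any jitter is applied.
function baseDelayMs(retryIndex: number): number {
  const raw = INITIAL_DELAY_MS * Math.pow(GROWTH_FACTOR, retryIndex);
  return Math.min(raw, DELAY_CAP_MS);
}

// Six attempts means five sleeps: 200, 1000, 5000, 10000, 10000.
const baseDelays: number[] = [];
for (let i = 0; i < MAX_ATTEMPTS - 1; i++) {
  baseDelays.push(baseDelayMs(i));
}

const sum = (xs: number[]): number => xs.reduce((a, b) => a + b, 0);

// Average case: jitter is centred on the base delay.
const averageSleepMs = sum(baseDelays); // 26200
// Worst case: every sleep sits at the +50% jitter ceiling.
const worstSleepMs = sum(baseDelays.map((d) => d * (1 + JITTER_FRACTION))); // 39300
// Worst wall-clock: every attempt hangs until the per-attempt timeout.
const worstWallClockMs = MAX_ATTEMPTS * PER_ATTEMPT_TIMEOUT_MS + worstSleepMs; // 129300
```

With these defaults the worst case works out to about 129 seconds, which is the "~2 minutes" figure quoted above.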

## Retry policy

@@ -61,7 +61,7 @@ Retries follow an exponential-backoff schedule with ±50% jitter applied to ever
| 5 | 10 seconds after attempt 4 ends (clamped from a formula value of 25s by the per-attempt cap) | `[5s, 15s)` |
| 6 | 10 seconds after attempt 5 ends (clamped from a formula value of 125s by the per-attempt cap) | `[5s, 15s)` |

Per-attempt timeout: **30 seconds**. Treat it as a hard ceiling, not a target — acknowledge in well under a second and push slow work off the response path (see [Acknowledge fast](#acknowledge-fast-process-asynchronously) below).
Per-attempt timeout: **15 seconds**. Treat it as a hard ceiling, not a target — acknowledge in well under a second and push slow work off the response path (see [Acknowledge fast](#acknowledge-fast-process-asynchronously) below).

After attempt 6 fails, the event is logged and dropped. There is no persistent queue and no dead-letter destination — both are out of scope for v1.

@@ -79,6 +79,7 @@ The retry schedule is operator-configurable. The Photon team can adjust these kn

| Knob | Default | Effect |
| --- | --- | --- |
| Per-attempt timeout | 15 seconds | How long a single attempt waits for your endpoint to respond before aborting and scheduling a retry. A shorter deployed value means a slow endpoint times out (and gets retried) sooner. |
| Initial delay | 200ms | The `i = 0` term — delay before the first retry. |
| Growth factor | 5× | Multiplier applied per retry index (`200ms → 1s → 5s → ...`). |
| Per-attempt cap | 10 seconds | Ceiling applied to every computed delay before jitter, so the curve can't run away. |
@@ -87,7 +88,7 @@ The retry schedule is operator-configurable. The Photon team can adjust these kn
These are *internal* env vars on the spectrum-webhook worker — customers can't set them per-webhook today. If you have a use case that needs different retry behaviour (more retries, longer ceiling), reach out and we'll discuss tuning the deployment-wide defaults or adding a per-project override. Open an issue on the [docs repo](https://github.com/photon-hq/docs) or message us in the [Discord](https://discord.gg/4c3VJzDfNA).

<Tip>
If you're seeing duplicates after long handler waits — say, attempt 1 takes 28 seconds and succeeds on your side, but our retry layer doesn't see the response in time — that's the per-attempt timeout, not the retry schedule. Tighten your handler (acknowledge first, process later) before asking us to widen our budget.
If you're seeing duplicates after long handler waits — say, attempt 1 takes 20 seconds and succeeds on your side, but our retry layer doesn't see the response in time — that's the per-attempt timeout, not the retry schedule. Tighten your handler (acknowledge first, process later) before asking us to widen our budget.
</Tip>

## What your status codes mean to us
@@ -102,7 +103,7 @@ If you're seeing duplicates after long handler waits — say, attempt 1 takes 28
| Any other `4xx` (e.g. `400`, `401`, `403`, `404`, `422`) | Fatal | Don't retry. The assumption is that the request will never succeed (auth bug, schema mismatch, missing route). |
| Connection refused / TCP reset (after the URL guard passes) | Retriable | Wait, retry. |
| Hostname doesn't resolve (DNS failure) | Fatal | Caught by the URL guard *before* the request — fail-closed, no retry. |
| Per-attempt timeout (>30s) | Retriable | Wait, retry. |
| Per-attempt timeout (>15s) | Retriable | Wait, retry. |

<Tip>
**Return `4xx` deliberately.** Returning `400` or `401` from a real bug (e.g. signature verification failure) is correct — it tells us "stop retrying, this request will never work." Returning `500` for the same bug wastes our retry budget and your CPU cycles.
@@ -139,7 +140,7 @@ app.post('/spectrum-webhook', async (c) => {
});
```

If your handler takes >30 seconds, the worker will time out the connection, mark it retriable, and `POST` again. Now you'll process the same event twice.
If your handler takes >15 seconds, the worker will time out the connection, mark it retriable, and `POST` again. Now you'll process the same event twice.

### Be idempotent

@@ -173,7 +174,7 @@ Returning `503` on overload is fine — we'll back off and retry. But it eats in
| --- | --- |
| Endpoint returns `2xx` on first try | Best case. One delivery, one process. |
| Endpoint returns `503`, recovers within ~30s | Retried, eventually delivered. One process (assuming no `2xx` on the failed attempt). |
| Endpoint times out after 30s, then succeeds | Retried, eventually delivered. **Possibly processed twice** — your handler ran during the timeout and again on retry. Dedupe required. |
| Endpoint times out after 15s, then succeeds | Retried, eventually delivered. **Possibly processed twice** — your handler ran during the timeout and again on retry. Dedupe required. |
| Endpoint returns `400` (signature bug, etc.) | Dropped immediately, no retry. Event lost. Logged on our side. |
| Webhook URL is `http://` (not HTTPS) | Dropped immediately by the URL guard, no retry. Every event lost until you re-register an `https://` URL. |
| Webhook URL resolves to a private/internal IP | Dropped immediately, no retry (SSRF guard). Logged. |
4 changes: 4 additions & 0 deletions webhooks/managing-webhooks.mdx
@@ -205,6 +205,10 @@ The delete is logical — the row is soft-deleted with a `deletedAt` timestamp o

## Rotating the signing secret

<Note>
Your signing secret is **stable for the life of the registration**. Restarting your app, your relay, or the Spectrum worker never rotates it — the only things that change a secret are an explicit delete + re-register (below) or registering a brand-new webhook. If you find yourself capturing a new secret on every restart, you're deleting and re-creating the webhook when you don't need to.
</Note>

There is no dedicated rotation endpoint. To rotate, **delete and re-register**:

```sh
14 changes: 10 additions & 4 deletions webhooks/troubleshooting.mdx
@@ -110,12 +110,18 @@ All of these drop the event as **fatal** — no retry. There's no update endpoin

## "I receive duplicates"

This is expected behavior under at-least-once delivery. The two scenarios that cause it:
Duplicates come in two flavors, with different fixes. Scenarios 1 and 2 are retry-driven and solved by deduping; scenario 3 is a registration problem that deduping **can't** fix.

**Retry-driven (at-least-once delivery).** Expected under our delivery contract:

1. **Your handler succeeded but timed out before responding.** We retried, you processed twice.
2. **Your handler returned `5xx` after partially processing.** We retried, you re-ran the partial work.

### Fix
**Registration-driven.** Not a retry at all:

3. **More than one webhook is registered and more than one acts on the event.** Every registered URL receives *every* event (see [Multiple webhooks per project](/webhooks/managing-webhooks#multiple-webhooks-per-project)) — including **stale registrations you forgot to delete** after a URL change. If two endpoints both act (e.g. both reply), every message is doubled at the source. Deduping won't help here: two independent backends don't share a dedupe store, so each processes its own copy exactly once and the user still sees two. The fix is to keep one canonical webhook and [delete the rest](/webhooks/managing-webhooks#delete-a-webhook). If your URL changes on every restart or deploy (ngrok, preview environments), delete the old registration each time you add the new one — see ["ngrok URL keeps changing"](#ngrok-url-keeps-changing).

### Fix (scenarios 1 and 2)

Dedupe at the top of your handler using `X-Spectrum-Webhook-Id` plus `payload.message.id` as a composite key:
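
A minimal sketch of that composite-key dedupe, assuming an in-memory `Map` as a stand-in for whatever shared store (Redis, a database table) your handlers actually use. `SeenStore`, `dedupeKey`, and `DEDUPE_TTL_MS` are illustrative names, not part of any Spectrum SDK.

```typescript
// Illustrative TTL: 24 hours, the low end of the 24-48 hour range.
const DEDUPE_TTL_MS = 24 * 60 * 60 * 1000;

// Composite key: the X-Spectrum-Webhook-Id header value plus payload.message.id.
function dedupeKey(webhookId: string, messageId: string): string {
  return `${webhookId}:${messageId}`;
}

// Assumption: a single-process, in-memory store. It does not survive restarts
// and is not shared between instances, so treat it as a demo only.
class SeenStore {
  // Maps each key to the timestamp (ms) at which it expires.
  private seen = new Map<string, number>();

  // Returns true the first time a key is seen within the TTL window,
  // false for a duplicate. Expired entries are overwritten on reuse.
  firstTime(key: string, now: number = Date.now()): boolean {
    const expiresAt = this.seen.get(key);
    if (expiresAt !== undefined && expiresAt > now) {
      return false;
    }
    this.seen.set(key, now + DEDUPE_TTL_MS);
    return true;
  }
}
```

In a handler, you would build the key from the request header and payload, call `firstTime`, and return a `2xx` immediately without reprocessing when it reports a duplicate.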

@@ -130,7 +136,7 @@ A 24-48 hour TTL is plenty — our retry budget is bounded to a few minutes at m

## "Deliveries time out"

If you're seeing your endpoint logged as "took >30s," it triggers a retry on our side and a likely duplicate processing on yours.
If you're seeing your endpoint logged as "took >15s," each such attempt triggers a retry on our side and likely duplicate processing on yours.

### Diagnosis

@@ -145,7 +151,7 @@ app.post('/spectrum-webhook', async (c) => {
});
```

Anything network-dependent in the request path can blow past 30s.
Anything network-dependent in the request path can blow past 15s.
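
One way to find the slow step is to time each awaited call in the handler and flag anything that exceeds your acknowledgement budget. The sketch below is only an illustration: `timed`, `slowSteps`, and `ACK_BUDGET_MS` are hypothetical helpers, and the one-second budget is an assumption drawn from the guidance to acknowledge in well under a second.

```typescript
// Assumed budget for the whole response path, in milliseconds.
const ACK_BUDGET_MS = 1000;

interface StepTiming {
  label: string;
  ms: number;
}

// Runs one awaited step and records how long it took, even if it throws.
async function timed<T>(
  label: string,
  timings: StepTiming[],
  step: () => Promise<T>
): Promise<T> {
  const start = Date.now();
  try {
    return await step();
  } finally {
    timings.push({ label, ms: Date.now() - start });
  }
}

// Returns the steps that individually used up the whole budget or more.
function slowSteps(timings: StepTiming[], budgetMs: number): StepTiming[] {
  return timings.filter((t) => t.ms >= budgetMs);
}
```

Wrapping each outbound call (database, LLM, third-party API) in `timed` and logging `slowSteps(timings, ACK_BUDGET_MS)` at the end of the handler shows which call belongs off the response path.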

### Fix
