diff --git a/webhooks/delivery.mdx b/webhooks/delivery.mdx
index d15ef2e..d11e4e6 100644
--- a/webhooks/delivery.mdx
+++ b/webhooks/delivery.mdx
@@ -10,7 +10,7 @@ You know what arrives ([Events](/webhooks/events)) and how to prove it's real ([
 - **Strong retry behaviour.** Up to 6 attempts per event by default, with exponential backoff plus jitter on `5xx`, `408`, `429`, network errors, and worker-side timeouts. The vast majority of deliveries land on attempt 1; the retries are there for the occasional bad minute on your side.
 - **Fast acknowledgement.** Any `2xx` ends it — the worker stops as soon as your server says ok.
 - **Fast permanent failure.** Other `4xx` codes (`400`/`401`/`404`/etc.) are treated as fatal — we don't waste your retry budget when the request will never succeed.
-- **Bounded budget.** 30-second per-attempt timeout, with up to ~39 seconds of backoff sleeps between attempts (jittered). If your server is still down after the final attempt, the event is logged and the worker moves on — there is no dead-letter queue today.
+- **Bounded budget.** 15-second per-attempt timeout, with up to ~39 seconds of backoff sleeps between attempts (jittered). If your server is still down after the final attempt, the event is logged and the worker moves on — there is no dead-letter queue today.
 - **At-least-once delivery.** A retry after your server timed out can re-deliver an event you already processed — always dedupe in your handler (see [Be idempotent](#be-idempotent) below).
 - **URL guard, fail-closed.** Before every attempt the worker validates the target URL: it must be `https://`, must resolve to a public address, and must not redirect. A URL that fails the check is dropped immediately — fatal, no retry — see [Where we won't deliver](#where-we-wont-deliver) below.
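The contract in the bullets above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the worker's actual code: the function and constant names are hypothetical, the numeric defaults are the ones quoted in this change (6 attempts, 200ms initial delay, 5× growth, 10-second delay cap, ±50% jitter, 15-second per-attempt timeout), and uniform jitter is an assumption. Status codes the text does not mention (`1xx`, `3xx`) are reported as `unspecified` rather than guessed.

```typescript
// Hypothetical sketch of the documented retry contract; not the worker's real code.
type Outcome = "delivered" | "retriable" | "fatal" | "unspecified";

// Map an HTTP status to the outcome described in the bullets above.
function classifyStatus(status: number): Outcome {
  if (status >= 200 && status <= 299) return "delivered"; // any 2xx ends delivery
  if (status === 408 || status === 429) return "retriable"; // the two retriable 4xx codes
  if (status >= 500 && status <= 599) return "retriable"; // any 5xx
  if (status >= 400 && status <= 499) return "fatal"; // every other 4xx
  return "unspecified"; // 1xx/3xx: not covered by the text above
}

// Defaults as quoted in the docs being changed.
const MAX_ATTEMPTS = 6;
const INITIAL_DELAY_MS = 200;
const GROWTH_FACTOR = 5;
const DELAY_CAP_MS = 10_000;
const ATTEMPT_TIMEOUT_MS = 15_000;
const JITTER = 0.5; // ±50%

// Delay before retry number `retryIndex` (0-based), clamped before jitter.
function baseDelayMs(retryIndex: number): number {
  return Math.min(INITIAL_DELAY_MS * GROWTH_FACTOR ** retryIndex, DELAY_CAP_MS);
}

// Assumption: jitter is uniform over [0.5x, 1.5x) of the clamped delay.
function jitteredDelayMs(
  retryIndex: number,
  random: () => number = Math.random,
): number {
  return baseDelayMs(retryIndex) * (1 - JITTER + 2 * JITTER * random());
}

// Six attempts means five sleeps: 200, 1000, 5000, 10000, 10000 ms.
const baseDelays = Array.from({ length: MAX_ATTEMPTS - 1 }, (_, i) => baseDelayMs(i));
const averageSleepMs = baseDelays.reduce((total, d) => total + d, 0); // 26_200
const worstSleepMs = averageSleepMs * (1 + JITTER); // 39_300 (jitter ceiling)

// Worst case: every attempt hangs until the per-attempt timeout.
const worstWallClockMs = MAX_ATTEMPTS * ATTEMPT_TIMEOUT_MS + worstSleepMs; // 129_300
```

Under these assumptions the worst case is 6 × 15s of timeouts plus 39.3s of sleeps, about 129 seconds, which matches the "~2 minutes" figure; with the old 30-second timeout the same sum is about 219 seconds.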
@@ -46,7 +46,7 @@ sequenceDiagram
 Note over W: ✓ delivered after retry
 ```

-The backoff *sleeps* sum to ~26.2 seconds in the average case (200ms + 1s + 5s + 10s + 10s) and ~39.3 seconds in the worst case (jitter ceiling). Wall-clock time also includes per-attempt network time, bounded by the 30-second per-attempt timeout: a healthy delivery finishes in milliseconds, while a worst case where every attempt hangs to the timeout can run up to ~3.5 minutes before the worker gives up. It stops as soon as it gets a 2xx or determines further retries are pointless.
+The backoff *sleeps* sum to ~26.2 seconds in the average case (200ms + 1s + 5s + 10s + 10s) and ~39.3 seconds in the worst case (jitter ceiling). Wall-clock time also includes per-attempt network time, bounded by the 15-second per-attempt timeout: a healthy delivery finishes in milliseconds, while a worst case where every attempt hangs to the timeout can run up to ~2 minutes before the worker gives up. It stops as soon as it gets a 2xx or determines further retries are pointless.

 ## Retry policy

@@ -61,7 +61,7 @@ Retries follow an exponential-backoff schedule with ±50% jitter applied to ever
 | 5 | 10 seconds after attempt 4 ends (clamped from a formula value of 25s by the per-attempt cap) | `[5s, 15s)` |
 | 6 | 10 seconds after attempt 5 ends (clamped from a formula value of 125s by the per-attempt cap) | `[5s, 15s)` |

-Per-attempt timeout: **30 seconds**. Treat it as a hard ceiling, not a target — acknowledge in well under a second and push slow work off the response path (see [Acknowledge fast](#acknowledge-fast-process-asynchronously) below).
+Per-attempt timeout: **15 seconds**. Treat it as a hard ceiling, not a target — acknowledge in well under a second and push slow work off the response path (see [Acknowledge fast](#acknowledge-fast-process-asynchronously) below).

 After attempt 6 fails, the event is logged and dropped. There is no persistent queue and no dead-letter destination — both are out of scope for v1.

@@ -79,6 +79,7 @@ The retry schedule is operator-configurable. The Photon team can adjust these kn

 | Knob | Default | Effect |
 | --- | --- | --- |
+| Per-attempt timeout | 15 seconds | How long a single attempt waits for your endpoint to respond before aborting and scheduling a retry. A shorter deployed value means a slow endpoint times out (and gets retried) sooner. |
 | Initial delay | 200ms | The `i = 0` term — delay before the first retry. |
 | Growth factor | 5× | Multiplier applied per retry index (`200ms → 1s → 5s → ...`). |
 | Per-attempt cap | 10 seconds | Ceiling applied to every computed delay before jitter, so the curve can't run away. |
@@ -87,7 +88,7 @@ The retry schedule is operator-configurable. The Photon team can adjust these kn

 These are *internal* env vars on the spectrum-webhook worker — customers can't set them per-webhook today. If you have a use case that needs different retry behaviour (more retries, longer ceiling), reach out and we'll discuss tuning the deployment-wide defaults or adding a per-project override. Open an issue on the [docs repo](https://github.com/photon-hq/docs) or message us in the [Discord](https://discord.gg/4c3VJzDfNA).

-If you're seeing duplicates after long handler waits — say, attempt 1 takes 28 seconds and succeeds on your side, but our retry layer doesn't see the response in time — that's the per-attempt timeout, not the retry schedule. Tighten your handler (acknowledge first, process later) before asking us to widen our budget.
+If you're seeing duplicates after long handler waits — say, attempt 1 takes 20 seconds and succeeds on your side, but our retry layer doesn't see the response in time — that's the per-attempt timeout, not the retry schedule. Tighten your handler (acknowledge first, process later) before asking us to widen our budget.

 ## What your status codes mean to us

@@ -102,7 +103,7 @@ If you're seeing duplicates after long handler waits — say, attempt 1 takes 28
 | Any other `4xx` (e.g. `400`, `401`, `403`, `404`, `422`) | Fatal | Don't retry. The assumption is that the request will never succeed (auth bug, schema mismatch, missing route). |
 | Connection refused / TCP reset (after the URL guard passes) | Retriable | Wait, retry. |
 | Hostname doesn't resolve (DNS failure) | Fatal | Caught by the URL guard *before* the request — fail-closed, no retry. |
-| Per-attempt timeout (>30s) | Retriable | Wait, retry. |
+| Per-attempt timeout (>15s) | Retriable | Wait, retry. |

 **Return `4xx` deliberately.** Returning `400` or `401` from a real bug (e.g. signature verification failure) is correct — it tells us "stop retrying, this request will never work." Returning `500` for the same bug wastes our retry budget and your CPU cycles.

@@ -139,7 +140,7 @@ app.post('/spectrum-webhook', async (c) => {
 });
 ```

-If your handler takes >30 seconds, the worker will time out the connection, mark it retriable, and `POST` again. Now you'll process the same event twice.
+If your handler takes >15 seconds, the worker will time out the connection, mark it retriable, and `POST` again. Now you'll process the same event twice.

 ### Be idempotent

@@ -173,7 +174,7 @@ Returning `503` on overload is fine — we'll back off and retry. But it eats in
 | --- | --- |
 | Endpoint returns `2xx` on first try | Best case. One delivery, one process. |
 | Endpoint returns `503`, recovers within ~30s | Retried, eventually delivered. One process (assuming no `2xx` on the failed attempt). |
-| Endpoint times out after 30s, then succeeds | Retried, eventually delivered. **Possibly processed twice** — your handler ran during the timeout and again on retry. Dedupe required. |
+| Endpoint times out after 15s, then succeeds | Retried, eventually delivered. **Possibly processed twice** — your handler ran during the timeout and again on retry. Dedupe required. |
 | Endpoint returns `400` (signature bug, etc.) | Dropped immediately, no retry. Event lost. Logged on our side. |
 | Webhook URL is `http://` (not HTTPS) | Dropped immediately by the URL guard, no retry. Every event lost until you re-register an `https://` URL. |
 | Webhook URL resolves to a private/internal IP | Dropped immediately, no retry (SSRF guard). Logged. |
diff --git a/webhooks/managing-webhooks.mdx b/webhooks/managing-webhooks.mdx
index ed3ac22..ebcb17d 100644
--- a/webhooks/managing-webhooks.mdx
+++ b/webhooks/managing-webhooks.mdx
@@ -205,6 +205,10 @@ The delete is logical — the row is soft-deleted with a `deletedAt` timestamp o

 ## Rotating the signing secret

+
+Your signing secret is **stable for the life of the registration**. Restarting your app, your relay, or the Spectrum worker never rotates it — the only things that change a secret are an explicit delete + re-register (below) or registering a brand-new webhook. If you find yourself capturing a new secret on every restart, you're deleting and re-creating the webhook when you don't need to.
+
+
 There is no dedicated rotation endpoint. To rotate, **delete and re-register**:

 ```sh
diff --git a/webhooks/troubleshooting.mdx b/webhooks/troubleshooting.mdx
index c028865..6797d9b 100644
--- a/webhooks/troubleshooting.mdx
+++ b/webhooks/troubleshooting.mdx
@@ -110,12 +110,18 @@ All of these drop the event as **fatal** — no retry. There's no update endpoin

 ## "I receive duplicates"

-This is expected behavior under at-least-once delivery. The two scenarios that cause it:
+Two flavors, with different fixes. Scenarios 1 and 2 are retry-driven and solved by deduping; scenario 3 is a registration problem that deduping **can't** fix.
+
+**Retry-driven (at-least-once delivery).** Expected under our delivery contract:

 1. **Your handler succeeded but timed out before responding.** We retried, you processed twice.
 2. **Your handler returned `5xx` after partially processing.** We retried, you re-ran the partial work.

-### Fix
+**Registration-driven.** Not a retry at all:
+
+3. **More than one webhook is registered and more than one acts on the event.** Every registered URL receives *every* event (see [Multiple webhooks per project](/webhooks/managing-webhooks#multiple-webhooks-per-project)) — including **stale registrations you forgot to delete** after a URL change. If two endpoints both act (e.g. both reply), every message is doubled at the source. Deduping won't help here: two independent backends don't share a dedupe store, so each processes its own copy exactly once and the user still sees two. The fix is to keep one canonical webhook and [delete the rest](/webhooks/managing-webhooks#delete-a-webhook). If your URL changes on every restart or deploy (ngrok, preview environments), delete the old registration each time you add the new one — see ["ngrok URL keeps changing"](#ngrok-url-keeps-changing).
+
+### Fix (scenarios 1 and 2)

 Dedupe at the top of your handler using `X-Spectrum-Webhook-Id` plus `payload.message.id` as a composite key:

@@ -130,7 +136,7 @@ A 24-48 hour TTL is plenty — our retry budget is bounded to a few minutes at m

 ## "Deliveries time out"

-If you're seeing your endpoint logged as "took >30s," it triggers a retry on our side and a likely duplicate processing on yours.
+If you're seeing your endpoint logged as "took >15s," it triggers a retry on our side and a likely duplicate processing on yours.

 ### Diagnosis

@@ -145,7 +151,7 @@ app.post('/spectrum-webhook', async (c) => {
 });
 ```

-Anything network-dependent in the request path can blow past 30s.
+Anything network-dependent in the request path can blow past 15s.

 ### Fix
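One way to stay well under the 15-second ceiling is to do only constant-time work before responding and run everything slow afterwards. The sketch below is framework-free and illustrative only: `WebhookEvent`, `acknowledge`, `processEvent`, and `drain` are hypothetical names, and the in-memory array stands in for whatever durable queue a real deployment would use (an in-memory queue loses events if the process crashes).

```typescript
// Hypothetical event shape: just the two fields used as the dedupe key elsewhere in these docs.
interface WebhookEvent {
  webhookId: string; // value of the X-Spectrum-Webhook-Id header
  messageId: string; // payload.message.id
}

const pending: WebhookEvent[] = []; // stand-in for a durable queue (assumption)
const processed: string[] = []; // record of completed work, for illustration

// Called on the response path: enqueue and return the status code to send.
// Nothing here touches the network, so the reply goes out in milliseconds.
function acknowledge(event: WebhookEvent): number {
  pending.push(event);
  return 200;
}

// Stand-in for slow, network-dependent work (LLM call, database write, and so on).
async function processEvent(event: WebhookEvent): Promise<void> {
  await new Promise<void>((resolve) => setTimeout(resolve, 10));
  processed.push(`${event.webhookId}:${event.messageId}`);
}

// Runs off the response path, for example from a background loop or worker.
async function drain(): Promise<void> {
  while (pending.length > 0) {
    const next = pending.shift()!;
    await processEvent(next);
  }
}
```

In an HTTP handler, the call to `acknowledge` would be the only thing between receiving the request and sending the `2xx`; `drain` runs separately, so a slow dependency can no longer push the response past the timeout.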