From eeb0fcc8292a2f5b2d2cc86ae919eed569db314b Mon Sep 17 00:00:00 2001 From: Yan Xue Date: Fri, 12 Jun 2026 14:07:25 -0700 Subject: [PATCH 1/3] docs(webhooks): clarify duplicate causes, secret stability, timeout tunability MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - delivery: document the per-attempt request timeout as a tunable knob (30s default) so the contract reflects that a deployment can run it shorter, rather than asserting a flat 30s the worker can override - troubleshooting: add the multi-registration cause to "I receive duplicates" — every registered URL gets every event, so extra/stale webhooks double output and dedupe can't fix it across independent backends; keep one canonical URL - managing-webhooks: note the signing secret is stable for the life of the registration; app/relay/worker restarts never rotate it Co-Authored-By: Claude Fable 5 --- webhooks/delivery.mdx | 1 + webhooks/managing-webhooks.mdx | 4 ++++ webhooks/troubleshooting.mdx | 10 ++++++++-- 3 files changed, 13 insertions(+), 2 deletions(-) diff --git a/webhooks/delivery.mdx b/webhooks/delivery.mdx index d15ef2e..6a919df 100644 --- a/webhooks/delivery.mdx +++ b/webhooks/delivery.mdx @@ -79,6 +79,7 @@ The retry schedule is operator-configurable. The Photon team can adjust these kn | Knob | Default | Effect | | --- | --- | --- | +| Per-attempt timeout | 30 seconds | How long a single attempt waits for your endpoint to respond before aborting and scheduling a retry — the `>30s` ceiling from the section above. A shorter deployed value means a slow endpoint times out (and gets retried) sooner. | | Initial delay | 200ms | The `i = 0` term — delay before the first retry. | | Growth factor | 5× | Multiplier applied per retry index (`200ms → 1s → 5s → ...`). | | Per-attempt cap | 10 seconds | Ceiling applied to every computed delay before jitter, so the curve can't run away. 
| diff --git a/webhooks/managing-webhooks.mdx b/webhooks/managing-webhooks.mdx index ed3ac22..ebcb17d 100644 --- a/webhooks/managing-webhooks.mdx +++ b/webhooks/managing-webhooks.mdx @@ -205,6 +205,10 @@ The delete is logical — the row is soft-deleted with a `deletedAt` timestamp o ## Rotating the signing secret + +Your signing secret is **stable for the life of the registration**. Restarting your app, your relay, or the Spectrum worker never rotates it — the only things that change a secret are an explicit delete and re-register (below) or registering a brand-new webhook. If you find yourself capturing a new secret on every restart, you're deleting and re-creating the webhook when you don't need to. + + There is no dedicated rotation endpoint. To rotate, **delete and re-register**: ```sh diff --git a/webhooks/troubleshooting.mdx b/webhooks/troubleshooting.mdx index c028865..532ef4c 100644 --- a/webhooks/troubleshooting.mdx +++ b/webhooks/troubleshooting.mdx @@ -110,12 +110,18 @@ All of these drop the event as **fatal** — no retry. There's no update endpoin ## "I receive duplicates" -This is expected behavior under at-least-once delivery. The two scenarios that cause it: +Duplicates come in two flavors, with different fixes. Scenarios 1 and 2 are retry-driven and solved by deduping; scenario 3 is a registration problem that deduping **can't** fix. + +**Retry-driven (at-least-once delivery).** Expected under our delivery contract: 1. **Your handler succeeded but timed out before responding.** We retried, you processed twice. 2. **Your handler returned `5xx` after partially processing.** We retried, you re-ran the partial work. -### Fix +**Registration-driven.** Not a retry at all: + +3. **More than one webhook is registered and more than one acts on the event.** Every registered URL receives *every* event (see [Multiple webhooks per project](/webhooks/managing-webhooks#multiple-webhooks-per-project)) — including **stale registrations you forgot to delete** after a URL change. 
If two endpoints both act (e.g. both reply), every message is doubled at the source. Deduping won't help here: two independent backends don't share a dedupe store, so each processes its own copy exactly once and the user still sees two replies. The fix is to keep one canonical webhook and [delete the rest](/webhooks/managing-webhooks#delete-a-webhook). If your URL changes on every restart or deploy (ngrok, preview environments), delete the old registration each time you add the new one — see ["ngrok URL keeps changing"](#ngrok-url-keeps-changing). + +### Fix (scenarios 1 and 2) Dedupe at the top of your handler using `X-Spectrum-Webhook-Id` plus `payload.message.id` as a composite key: From 28eb4a59294c9a5f3e0a255b97ca342cacca6d9e Mon Sep 17 00:00:00 2001 From: Yan Xue Date: Fri, 12 Jun 2026 15:18:24 -0700 Subject: [PATCH 2/3] docs(webhooks): drop confusing ">30s above" cross-ref in timeout row The row's description is self-contained; the ">30s" notation actually appears below this table (status-code section), not above, and line 64 says "30 seconds" rather than ">30s". Addresses PR review feedback. Co-Authored-By: Claude Fable 5 --- webhooks/delivery.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/webhooks/delivery.mdx b/webhooks/delivery.mdx index 6a919df..7230c83 100644 --- a/webhooks/delivery.mdx +++ b/webhooks/delivery.mdx @@ -79,7 +79,7 @@ The retry schedule is operator-configurable. The Photon team can adjust these kn | Knob | Default | Effect | | --- | --- | --- | -| Per-attempt timeout | 30 seconds | How long a single attempt waits for your endpoint to respond before aborting and scheduling a retry — the `>30s` ceiling from the section above. A shorter deployed value means a slow endpoint times out (and gets retried) sooner. | +| Per-attempt timeout | 30 seconds | How long a single attempt waits for your endpoint to respond before aborting and scheduling a retry. 
A shorter deployed value means a slow endpoint times out (and gets retried) sooner. | | Initial delay | 200ms | The `i = 0` term — delay before the first retry. | | Growth factor | 5× | Multiplier applied per retry index (`200ms → 1s → 5s → ...`). | | Per-attempt cap | 10 seconds | Ceiling applied to every computed delay before jitter, so the curve can't run away. | From 62a3aa794c6e07153400929ce4895e77ca0324f4 Mon Sep 17 00:00:00 2001 From: Yan Xue Date: Fri, 12 Jun 2026 16:36:02 -0700 Subject: [PATCH 3/3] docs(webhooks): align documented per-attempt timeout to 15s (prod value) Prod runs DELIVERY_TIMEOUT_MS=15000; the docs asserted 30s throughout. Swept every per-attempt-timeout reference to 15s and recomputed the hang-to-timeout worst case (6x15s + ~39s backoff ~= ~2 min, was ~3.5). Left the backoff/retry-window figures (~26-39s; the "~30s budget/window" rows) unchanged -- those are the retry-sleep schedule, independent of the per-attempt request timeout. Co-Authored-By: Claude Fable 5 --- webhooks/delivery.mdx | 16 ++++++++-------- webhooks/troubleshooting.mdx | 4 ++-- 2 files changed, 10 insertions(+), 10 deletions(-) diff --git a/webhooks/delivery.mdx b/webhooks/delivery.mdx index 7230c83..d11e4e6 100644 --- a/webhooks/delivery.mdx +++ b/webhooks/delivery.mdx @@ -10,7 +10,7 @@ You know what arrives ([Events](/webhooks/events)) and how to prove it's real ([ - **Strong retry behaviour.** Up to 6 attempts per event by default, with exponential backoff plus jitter on `5xx`, `408`, `429`, network errors, and worker-side timeouts. The vast majority of deliveries land on attempt 1; the retries are there for the occasional bad minute on your side. - **Fast acknowledgement.** Any `2xx` ends it — the worker stops as soon as your server says ok. - **Fast permanent failure.** Other `4xx` codes (`400`/`401`/`404`/etc.) are treated as fatal — we don't waste your retry budget when the request will never succeed. 
-- **Bounded budget.** 30-second per-attempt timeout, with up to ~39 seconds of backoff sleeps between attempts (jittered). If your server is still down after the final attempt, the event is logged and the worker moves on — there is no dead-letter queue today. +- **Bounded budget.** 15-second per-attempt timeout, with up to ~39 seconds of backoff sleeps between attempts (jittered). If your server is still down after the final attempt, the event is logged and the worker moves on — there is no dead-letter queue today. - **At-least-once delivery.** A retry after your server timed out can re-deliver an event you already processed — always dedupe in your handler (see [Be idempotent](#be-idempotent) below). - **URL guard, fail-closed.** Before every attempt the worker validates the target URL: it must be `https://`, must resolve to a public address, and must not redirect. A URL that fails the check is dropped immediately — fatal, no retry — see [Where we won't deliver](#where-we-wont-deliver) below. @@ -46,7 +46,7 @@ sequenceDiagram Note over W: ✓ delivered after retry ``` -The backoff *sleeps* sum to ~26.2 seconds in the average case (200ms + 1s + 5s + 10s + 10s) and ~39.3 seconds in the worst case (jitter ceiling). Wall-clock time also includes per-attempt network time, bounded by the 30-second per-attempt timeout: a healthy delivery finishes in milliseconds, while a worst case where every attempt hangs to the timeout can run up to ~3.5 minutes before the worker gives up. It stops as soon as it gets a 2xx or determines further retries are pointless. +The backoff *sleeps* sum to ~26.2 seconds in the average case (200ms + 1s + 5s + 10s + 10s) and ~39.3 seconds in the worst case (jitter ceiling). Wall-clock time also includes per-attempt network time, bounded by the 15-second per-attempt timeout: a healthy delivery finishes in milliseconds, while a worst case where every attempt hangs to the timeout can run up to ~2 minutes before the worker gives up. 
It stops as soon as it gets a 2xx or determines further retries are pointless. ## Retry policy @@ -61,7 +61,7 @@ Retries follow an exponential-backoff schedule with ±50% jitter applied to ever | 5 | 10 seconds after attempt 4 ends (clamped from a formula value of 25s by the per-attempt cap) | `[5s, 15s)` | | 6 | 10 seconds after attempt 5 ends (clamped from a formula value of 125s by the per-attempt cap) | `[5s, 15s)` | -Per-attempt timeout: **30 seconds**. Treat it as a hard ceiling, not a target — acknowledge in well under a second and push slow work off the response path (see [Acknowledge fast](#acknowledge-fast-process-asynchronously) below). +Per-attempt timeout: **15 seconds**. Treat it as a hard ceiling, not a target — acknowledge in well under a second and push slow work off the response path (see [Acknowledge fast](#acknowledge-fast-process-asynchronously) below). After attempt 6 fails, the event is logged and dropped. There is no persistent queue and no dead-letter destination — both are out of scope for v1. @@ -79,7 +79,7 @@ The retry schedule is operator-configurable. The Photon team can adjust these kn | Knob | Default | Effect | | --- | --- | --- | -| Per-attempt timeout | 30 seconds | How long a single attempt waits for your endpoint to respond before aborting and scheduling a retry. A shorter deployed value means a slow endpoint times out (and gets retried) sooner. | +| Per-attempt timeout | 15 seconds | How long a single attempt waits for your endpoint to respond before aborting and scheduling a retry. A shorter deployed value means a slow endpoint times out (and gets retried) sooner. | | Initial delay | 200ms | The `i = 0` term — delay before the first retry. | | Growth factor | 5× | Multiplier applied per retry index (`200ms → 1s → 5s → ...`). | | Per-attempt cap | 10 seconds | Ceiling applied to every computed delay before jitter, so the curve can't run away. | @@ -88,7 +88,7 @@ The retry schedule is operator-configurable. 
The Photon team can adjust these kn These are *internal* env vars on the spectrum-webhook worker — customers can't set them per-webhook today. If you have a use case that needs different retry behaviour (more retries, longer ceiling), reach out and we'll discuss tuning the deployment-wide defaults or adding a per-project override. Open an issue on the [docs repo](https://github.com/photon-hq/docs) or message us in the [Discord](https://discord.gg/4c3VJzDfNA). -If you're seeing duplicates after long handler waits — say, attempt 1 takes 28 seconds and succeeds on your side, but our retry layer doesn't see the response in time — that's the per-attempt timeout, not the retry schedule. Tighten your handler (acknowledge first, process later) before asking us to widen our budget. +If you're seeing duplicates after long handler waits — say, attempt 1 takes 20 seconds and succeeds on your side, but our retry layer doesn't see the response in time — that's the per-attempt timeout, not the retry schedule. Tighten your handler (acknowledge first, process later) before asking us to widen our budget. ## What your status codes mean to us @@ -103,7 +103,7 @@ If you're seeing duplicates after long handler waits — say, attempt 1 takes 28 | Any other `4xx` (e.g. `400`, `401`, `403`, `404`, `422`) | Fatal | Don't retry. The assumption is that the request will never succeed (auth bug, schema mismatch, missing route). | | Connection refused / TCP reset (after the URL guard passes) | Retriable | Wait, retry. | | Hostname doesn't resolve (DNS failure) | Fatal | Caught by the URL guard *before* the request — fail-closed, no retry. | -| Per-attempt timeout (>30s) | Retriable | Wait, retry. | +| Per-attempt timeout (>15s) | Retriable | Wait, retry. | **Return `4xx` deliberately.** Returning `400` or `401` from a real bug (e.g. signature verification failure) is correct — it tells us "stop retrying, this request will never work." 
Returning `500` for the same bug wastes our retry budget and your CPU cycles. @@ -140,7 +140,7 @@ app.post('/spectrum-webhook', async (c) => { }); ``` -If your handler takes >30 seconds, the worker will time out the connection, mark it retriable, and `POST` again. Now you'll process the same event twice. +If your handler takes >15 seconds, the worker will time out the connection, mark it retriable, and `POST` again. Now you'll process the same event twice. ### Be idempotent @@ -174,7 +174,7 @@ Returning `503` on overload is fine — we'll back off and retry. But it eats in | --- | --- | | Endpoint returns `2xx` on first try | Best case. One delivery, one process. | | Endpoint returns `503`, recovers within ~30s | Retried, eventually delivered. One process (assuming no `2xx` on the failed attempt). | -| Endpoint times out after 30s, then succeeds | Retried, eventually delivered. **Possibly processed twice** — your handler ran during the timeout and again on retry. Dedupe required. | +| Endpoint times out after 15s, then succeeds | Retried, eventually delivered. **Possibly processed twice** — your handler ran during the timeout and again on retry. Dedupe required. | | Endpoint returns `400` (signature bug, etc.) | Dropped immediately, no retry. Event lost. Logged on our side. | | Webhook URL is `http://` (not HTTPS) | Dropped immediately by the URL guard, no retry. Every event lost until you re-register an `https://` URL. | | Webhook URL resolves to a private/internal IP | Dropped immediately, no retry (SSRF guard). Logged. | diff --git a/webhooks/troubleshooting.mdx b/webhooks/troubleshooting.mdx index 532ef4c..6797d9b 100644 --- a/webhooks/troubleshooting.mdx +++ b/webhooks/troubleshooting.mdx @@ -136,7 +136,7 @@ A 24-48 hour TTL is plenty — our retry budget is bounded to a few minutes at m ## "Deliveries time out" -If you're seeing your endpoint logged as "took >30s," it triggers a retry on our side and a likely duplicate processing on yours. 
+If your endpoint is being logged as "took >15s," each such attempt triggers a retry on our side and, most likely, duplicate processing on yours. ### Diagnosis @@ -151,7 +151,7 @@ app.post('/spectrum-webhook', async (c) => { }); ``` -Anything network-dependent in the request path can blow past 30s. +Anything network-dependent in the request path can blow past 15s. ### Fix