From eeb0fcc8292a2f5b2d2cc86ae919eed569db314b Mon Sep 17 00:00:00 2001
From: Yan Xue <y9xue@uwaterloo.ca>
Date: Fri, 12 Jun 2026 14:07:25 -0700
Subject: [PATCH 1/3] docs(webhooks): clarify duplicate causes, secret
 stability, timeout tunability
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- delivery: document the per-attempt request timeout as a tunable knob
  (30s default) so the contract reflects that a deployment can run it
  shorter, rather than asserting a flat 30s the worker can override
- troubleshooting: add the multi-registration cause to "I receive
  duplicates" — every registered URL gets every event, so extra/stale
  webhooks double output and dedupe can't fix it across independent
  backends; keep one canonical URL
- managing-webhooks: note the signing secret is stable for the life of
  the registration; app/relay/worker restarts never rotate it

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 webhooks/delivery.mdx          |  1 +
 webhooks/managing-webhooks.mdx |  4 ++++
 webhooks/troubleshooting.mdx   | 10 ++++++++--
 3 files changed, 13 insertions(+), 2 deletions(-)
diff --git a/webhooks/delivery.mdx b/webhooks/delivery.mdx
index d15ef2e..6a919df 100644
--- a/webhooks/delivery.mdx
+++ b/webhooks/delivery.mdx
@@ -79,6 +79,7 @@ The retry schedule is operator-configurable. The Photon team can adjust these kn
 
 | Knob | Default | Effect |
 | --- | --- | --- |
+| Per-attempt timeout | 30 seconds | How long a single attempt waits for your endpoint to respond before aborting and scheduling a retry — the `>30s` ceiling from the section above. A shorter deployed value means a slow endpoint times out (and gets retried) sooner. |
 | Initial delay | 200ms | The `i = 0` term — delay before the first retry. |
 | Growth factor | 5× | Multiplier applied per retry index (`200ms → 1s → 5s → ...`). |
 | Per-attempt cap | 10 seconds | Ceiling applied to every computed delay before jitter, so the curve can't run away. |
diff --git a/webhooks/managing-webhooks.mdx b/webhooks/managing-webhooks.mdx
index ed3ac22..ebcb17d 100644
--- a/webhooks/managing-webhooks.mdx
+++ b/webhooks/managing-webhooks.mdx
@@ -205,6 +205,10 @@ The delete is logical — the row is soft-deleted with a `deletedAt` timestamp o
 
 ## Rotating the signing secret
 
+<Note>
+Your signing secret is **stable for the life of the registration**. Restarting your app, your relay, or the Spectrum worker never rotates it — the only things that change a secret are an explicit delete + re-register (below) or registering a brand-new webhook. If you find yourself capturing a new secret on every restart, you're deleting and re-creating the webhook when you don't need to.
+</Note>
+
 There is no dedicated rotation endpoint. To rotate, **delete and re-register**:
 
 ```sh
diff --git a/webhooks/troubleshooting.mdx b/webhooks/troubleshooting.mdx
index c028865..532ef4c 100644
--- a/webhooks/troubleshooting.mdx
+++ b/webhooks/troubleshooting.mdx
@@ -110,12 +110,18 @@ All of these drop the event as **fatal** — no retry. There's no update endpoin
 
 ## "I receive duplicates"
 
-This is expected behavior under at-least-once delivery. The two scenarios that cause it:
+Two flavors, with different fixes. Scenarios 1 and 2 are retry-driven and solved by deduping; scenario 3 is a registration problem that deduping **can't** fix.
+
+**Retry-driven (at-least-once delivery).** Expected under our delivery contract:
 
 1. **Your handler succeeded but timed out before responding.** We retried, you processed twice.
 2. **Your handler returned `5xx` after partially processing.** We retried, you re-ran the partial work.
 
-### Fix
+**Registration-driven.** Not a retry at all:
+
+3. **More than one webhook is registered and more than one acts on the event.** Every registered URL receives *every* event (see [Multiple webhooks per project](/webhooks/managing-webhooks#multiple-webhooks-per-project)) — including **stale registrations you forgot to delete** after a URL change. If two endpoints both act (e.g. both reply), every message is doubled at the source. Deduping won't help here: two independent backends don't share a dedupe store, so each processes its own copy exactly once and the user still sees two. The fix is to keep one canonical webhook and [delete the rest](/webhooks/managing-webhooks#delete-a-webhook). If your URL changes on every restart or deploy (ngrok, preview environments), delete the old registration each time you add the new one — see ["ngrok URL keeps changing"](#ngrok-url-keeps-changing).
+
+### Fix (scenarios 1 and 2)
 
 Dedupe at the top of your handler using `X-Spectrum-Webhook-Id` plus `payload.message.id` as a composite key:
 

From 28eb4a59294c9a5f3e0a255b97ca342cacca6d9e Mon Sep 17 00:00:00 2001
From: Yan Xue <y9xue@uwaterloo.ca>
Date: Fri, 12 Jun 2026 15:18:24 -0700
Subject: [PATCH 2/3] docs(webhooks): drop confusing ">30s above" cross-ref in
 timeout row

The row's description is self-contained; the ">30s" notation actually
appears below this table (status-code section), not above, and line 64
says "30 seconds" rather than ">30s". Addresses PR review feedback.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 webhooks/delivery.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/webhooks/delivery.mdx b/webhooks/delivery.mdx
index 6a919df..7230c83 100644
--- a/webhooks/delivery.mdx
+++ b/webhooks/delivery.mdx
@@ -79,7 +79,7 @@ The retry schedule is operator-configurable. The Photon team can adjust these kn
 
 | Knob | Default | Effect |
 | --- | --- | --- |
-| Per-attempt timeout | 30 seconds | How long a single attempt waits for your endpoint to respond before aborting and scheduling a retry — the `>30s` ceiling from the section above. A shorter deployed value means a slow endpoint times out (and gets retried) sooner. |
+| Per-attempt timeout | 30 seconds | How long a single attempt waits for your endpoint to respond before aborting and scheduling a retry. A shorter deployed value means a slow endpoint times out (and gets retried) sooner. |
 | Initial delay | 200ms | The `i = 0` term — delay before the first retry. |
 | Growth factor | 5× | Multiplier applied per retry index (`200ms → 1s → 5s → ...`). |
 | Per-attempt cap | 10 seconds | Ceiling applied to every computed delay before jitter, so the curve can't run away. |

From 62a3aa794c6e07153400929ce4895e77ca0324f4 Mon Sep 17 00:00:00 2001
From: Yan Xue <y9xue@uwaterloo.ca>
Date: Fri, 12 Jun 2026 16:36:02 -0700
Subject: [PATCH 3/3] docs(webhooks): align documented per-attempt timeout to
 15s (prod value)

Prod runs DELIVERY_TIMEOUT_MS=15000; the docs asserted 30s throughout.
Swept every per-attempt-timeout reference to 15s and recomputed the
hang-to-timeout worst case (6 x 15s + ~39s backoff ~= 2 min; was ~3.5 min).
Left the backoff/retry-window figures (~26-39s; the "~30s budget/window"
rows) unchanged -- those are the retry-sleep schedule, independent of
the per-attempt request timeout.
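
Sanity check for the figures above, as a rough Python sketch. The
constants are the documented defaults (6 attempts, 200ms initial delay,
5x growth, 10s delay cap, +/-50% jitter); the function names are made up
for illustration and are not the worker's code.

```python
# Illustrative only: mirrors the documented defaults, not the worker source.
INITIAL_DELAY_S = 0.2   # "Initial delay" knob
GROWTH = 5              # "Growth factor" knob
DELAY_CAP_S = 10.0      # "Per-attempt cap" knob, applied before jitter
MAX_ATTEMPTS = 6        # attempts per event
JITTER = 0.5            # +/-50% jitter on every sleep


def backoff_sleeps(attempts=MAX_ATTEMPTS):
    """Un-jittered sleep before each retry (one fewer than attempts)."""
    return [min(INITIAL_DELAY_S * GROWTH**i, DELAY_CAP_S)
            for i in range(attempts - 1)]


def worst_case_wall_clock(timeout_s, attempts=MAX_ATTEMPTS):
    """Upper bound in seconds: every attempt hangs to the timeout and
    every sleep lands at the jitter ceiling."""
    worst_sleeps = sum(backoff_sleeps(attempts)) * (1 + JITTER)
    return attempts * timeout_s + worst_sleeps
```

Under those assumptions worst_case_wall_clock(15) comes to roughly 129s
(about 2 min), and worst_case_wall_clock(30) to roughly 219s.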

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---
 webhooks/delivery.mdx        | 16 ++++++++--------
 webhooks/troubleshooting.mdx |  4 ++--
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/webhooks/delivery.mdx b/webhooks/delivery.mdx
index 7230c83..d11e4e6 100644
--- a/webhooks/delivery.mdx
+++ b/webhooks/delivery.mdx
@@ -10,7 +10,7 @@ You know what arrives ([Events](/webhooks/events)) and how to prove it's real ([
 - **Strong retry behaviour.** Up to 6 attempts per event by default, with exponential backoff plus jitter on `5xx`, `408`, `429`, network errors, and worker-side timeouts. The vast majority of deliveries land on attempt 1; the retries are there for the occasional bad minute on your side.
 - **Fast acknowledgement.** Any `2xx` ends it — the worker stops as soon as your server says ok.
 - **Fast permanent failure.** Other `4xx` codes (`400`/`401`/`404`/etc.) are treated as fatal — we don't waste your retry budget when the request will never succeed.
-- **Bounded budget.** 30-second per-attempt timeout, with up to ~39 seconds of backoff sleeps between attempts (jittered). If your server is still down after the final attempt, the event is logged and the worker moves on — there is no dead-letter queue today.
+- **Bounded budget.** 15-second per-attempt timeout, with up to ~39 seconds of backoff sleeps between attempts (jittered). If your server is still down after the final attempt, the event is logged and the worker moves on — there is no dead-letter queue today.
 - **At-least-once delivery.** A retry after your server timed out can re-deliver an event you already processed — always dedupe in your handler (see [Be idempotent](#be-idempotent) below).
 - **URL guard, fail-closed.** Before every attempt the worker validates the target URL: it must be `https://`, must resolve to a public address, and must not redirect. A URL that fails the check is dropped immediately — fatal, no retry — see [Where we won't deliver](#where-we-wont-deliver) below.
 
@@ -46,7 +46,7 @@ sequenceDiagram
   Note over W: ✓ delivered after retry
 ```
 
-The backoff *sleeps* sum to ~26.2 seconds in the average case (200ms + 1s + 5s + 10s + 10s) and ~39.3 seconds in the worst case (jitter ceiling). Wall-clock time also includes per-attempt network time, bounded by the 30-second per-attempt timeout: a healthy delivery finishes in milliseconds, while a worst case where every attempt hangs to the timeout can run up to ~3.5 minutes before the worker gives up. It stops as soon as it gets a 2xx or determines further retries are pointless.
+The backoff *sleeps* sum to ~26.2 seconds in the average case (200ms + 1s + 5s + 10s + 10s) and ~39.3 seconds in the worst case (jitter ceiling). Wall-clock time also includes per-attempt network time, bounded by the 15-second per-attempt timeout: a healthy delivery finishes in milliseconds, while a worst case where every attempt hangs to the timeout can run up to ~2 minutes (6 attempts × 15s, plus up to ~39s of backoff) before the worker gives up. It stops as soon as it gets a 2xx or determines further retries are pointless.
 
 ## Retry policy
 
@@ -61,7 +61,7 @@ Retries follow an exponential-backoff schedule with ±50% jitter applied to ever
 | 5 | 10 seconds after attempt 4 ends (clamped from a formula value of 25s by the per-attempt cap) | `[5s, 15s)` |
 | 6 | 10 seconds after attempt 5 ends (clamped from a formula value of 125s by the per-attempt cap) | `[5s, 15s)` |
 
-Per-attempt timeout: **30 seconds**. Treat it as a hard ceiling, not a target — acknowledge in well under a second and push slow work off the response path (see [Acknowledge fast](#acknowledge-fast-process-asynchronously) below).
+Per-attempt timeout: **15 seconds**. Treat it as a hard ceiling, not a target — acknowledge in well under a second and push slow work off the response path (see [Acknowledge fast](#acknowledge-fast-process-asynchronously) below).
 
 After attempt 6 fails, the event is logged and dropped. There is no persistent queue and no dead-letter destination — both are out of scope for v1.
 
@@ -79,7 +79,7 @@ The retry schedule is operator-configurable. The Photon team can adjust these kn
 
 | Knob | Default | Effect |
 | --- | --- | --- |
-| Per-attempt timeout | 30 seconds | How long a single attempt waits for your endpoint to respond before aborting and scheduling a retry. A shorter deployed value means a slow endpoint times out (and gets retried) sooner. |
+| Per-attempt timeout | 15 seconds | How long a single attempt waits for your endpoint to respond before aborting and scheduling a retry. A shorter deployed value means a slow endpoint times out (and gets retried) sooner. |
 | Initial delay | 200ms | The `i = 0` term — delay before the first retry. |
 | Growth factor | 5× | Multiplier applied per retry index (`200ms → 1s → 5s → ...`). |
 | Per-attempt cap | 10 seconds | Ceiling applied to every computed delay before jitter, so the curve can't run away. |
@@ -88,7 +88,7 @@ The retry schedule is operator-configurable. The Photon team can adjust these kn
 These are *internal* env vars on the spectrum-webhook worker — customers can't set them per-webhook today. If you have a use case that needs different retry behaviour (more retries, longer ceiling), reach out and we'll discuss tuning the deployment-wide defaults or adding a per-project override. Open an issue on the [docs repo](https://github.com/photon-hq/docs) or message us in the [Discord](https://discord.gg/4c3VJzDfNA).
 
 <Tip>
-If you're seeing duplicates after long handler waits — say, attempt 1 takes 28 seconds and succeeds on your side, but our retry layer doesn't see the response in time — that's the per-attempt timeout, not the retry schedule. Tighten your handler (acknowledge first, process later) before asking us to widen our budget.
+If you're seeing duplicates after long handler waits — say, attempt 1 takes 20 seconds and succeeds on your side, but our retry layer doesn't see the response in time — that's the per-attempt timeout, not the retry schedule. Tighten your handler (acknowledge first, process later) before asking us to widen our budget.
 </Tip>
 
 ## What your status codes mean to us
@@ -103,7 +103,7 @@ If you're seeing duplicates after long handler waits — say, attempt 1 takes 28
 | Any other `4xx` (e.g. `400`, `401`, `403`, `404`, `422`) | Fatal | Don't retry. The assumption is that the request will never succeed (auth bug, schema mismatch, missing route). |
 | Connection refused / TCP reset (after the URL guard passes) | Retriable | Wait, retry. |
 | Hostname doesn't resolve (DNS failure) | Fatal | Caught by the URL guard *before* the request — fail-closed, no retry. |
-| Per-attempt timeout (>30s) | Retriable | Wait, retry. |
+| Per-attempt timeout (>15s) | Retriable | Wait, retry. |
 
 <Tip>
 **Return `4xx` deliberately.** Returning `400` or `401` from a real bug (e.g. signature verification failure) is correct — it tells us "stop retrying, this request will never work." Returning `500` for the same bug wastes our retry budget and your CPU cycles.
@@ -140,7 +140,7 @@ app.post('/spectrum-webhook', async (c) => {
 });
 ```
 
-If your handler takes >30 seconds, the worker will time out the connection, mark it retriable, and `POST` again. Now you'll process the same event twice.
+If your handler takes >15 seconds, the worker will time out the connection, mark it retriable, and `POST` again. Now you'll process the same event twice.
 
 ### Be idempotent
 
@@ -174,7 +174,7 @@ Returning `503` on overload is fine — we'll back off and retry. But it eats in
 | --- | --- |
 | Endpoint returns `2xx` on first try | Best case. One delivery, one process. |
 | Endpoint returns `503`, recovers within ~30s | Retried, eventually delivered. One process (assuming no `2xx` on the failed attempt). |
-| Endpoint times out after 30s, then succeeds | Retried, eventually delivered. **Possibly processed twice** — your handler ran during the timeout and again on retry. Dedupe required. |
+| Endpoint times out after 15s, then succeeds | Retried, eventually delivered. **Possibly processed twice** — your handler ran during the timeout and again on retry. Dedupe required. |
 | Endpoint returns `400` (signature bug, etc.) | Dropped immediately, no retry. Event lost. Logged on our side. |
 | Webhook URL is `http://` (not HTTPS) | Dropped immediately by the URL guard, no retry. Every event lost until you re-register an `https://` URL. |
 | Webhook URL resolves to a private/internal IP | Dropped immediately, no retry (SSRF guard). Logged. |
diff --git a/webhooks/troubleshooting.mdx b/webhooks/troubleshooting.mdx
index 532ef4c..6797d9b 100644
--- a/webhooks/troubleshooting.mdx
+++ b/webhooks/troubleshooting.mdx
@@ -136,7 +136,7 @@ A 24-48 hour TTL is plenty — our retry budget is bounded to a few minutes at m
 
 ## "Deliveries time out"
 
-If you're seeing your endpoint logged as "took >30s," it triggers a retry on our side and a likely duplicate processing on yours.
+If your endpoint is logged as "took >15s," the attempt hit the per-attempt timeout. That triggers a retry on our side and, most likely, duplicate processing on yours.
 
 ### Diagnosis
 
@@ -151,7 +151,7 @@ app.post('/spectrum-webhook', async (c) => {
 });
 ```
 
-Anything network-dependent in the request path can blow past 30s.
+Anything network-dependent in the request path can blow past 15s.
 
 ### Fix