OAuth monitor gives up on transient failures, leaving workloads dead
Bug description
When an OAuth token-refresh attempt returns an error that
isTransientNetworkError classifies as transient — 5xx, 429, or 4xx
without an RFC 6749 error code, per the rule established by #5170 —
pkg/auth/monitored_token_source.go runs an in-loop short retry (5
attempts with exponential backoff, ~1–2 minutes at defaults, bounded
by TOOLHIVE_TOKEN_REFRESH_MAX_TRIES and
TOOLHIVE_TOKEN_REFRESH_MAX_ELAPSED_TIME). If the error persists past
that window, the monitor marks the workload unauthenticated and
exits its goroutine. No further refresh is ever attempted, even
after the underlying condition clears. The workload stays
unauthenticated until manual intervention (thv restart, thv rm followed
by thv run, or similar).
The transient classification is correct: the response shape doesn't
carry a definitive "denied" verdict from the OAuth server (no RFC 6749
error code), so ToolHive can't conclude the credentials are bad. The
gap is at the next layer up — the in-loop retry window is too short
to cover realistic recovery time scales for the conditions that
produce these errors (see Additional context). The monitor should
keep trying on a longer cadence, not give up after ~2 minutes.
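The classification rule described above can be sketched as follows. This is a minimal illustration and not ToolHive's isTransientNetworkError: the function name classifyRefreshFailure, the Classification type, and the choice to treat any non-empty JSON "error" member as a definitive verdict are all assumptions made for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Classification is a hypothetical verdict type used only in this sketch.
type Classification int

const (
	Permanent Classification = iota
	Transient
)

// classifyRefreshFailure is an illustrative stand-in for the rule in the
// text: 5xx, 429, and 4xx without an RFC 6749 error code are transient.
func classifyRefreshFailure(status int, body []byte) Classification {
	switch {
	case status >= 500 && status <= 599, status == 429:
		return Transient
	case status >= 400 && status <= 499:
		// RFC 6749 section 5.2 error responses are JSON objects that
		// carry an "error" member such as "invalid_grant".
		var payload struct {
			Error string `json:"error"`
		}
		if err := json.Unmarshal(body, &payload); err == nil && payload.Error != "" {
			return Permanent // the OAuth server gave a definitive verdict
		}
		return Transient // e.g. a WAF or CDN answering with 403 and HTML
	default:
		// Other statuses are not covered by the rule above; this sketch
		// simply treats them as non-transient.
		return Permanent
	}
}

func main() {
	waf := classifyRefreshFailure(403, []byte("<!DOCTYPE html>..."))
	denied := classifyRefreshFailure(400, []byte(`{"error":"invalid_grant"}`))
	fmt.Println(waf == Transient, denied == Permanent)
}
```

The 403+HTML case is the one that matters for this bug: the body does not parse as an OAuth error response, so nothing in it says the credentials are bad.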
Steps to reproduce
Reproduction requires either (a) a real OAuth endpoint behind a network
control point where traffic can be selectively dropped, or (b) the
naturally occurring real-world trigger described in Additional context
(e.g., a client-side VPN disconnect, after which requests reach an
IP-allowlisting WAF or CDN from a non-allowlisted address).
- Run an OAuth-backed remote workload with thv run --remote-url ... and
  let it complete the initial OAuth flow.
- Wait for the cached access token to expire (typically 1 hour) so the
  monitor will attempt a refresh.
- Just before refresh time, block traffic to the token endpoint (pfctl on
  macOS / iptables on Linux) for several minutes, long enough for the
  short retry to exhaust all 5 attempts (see Additional context for the
  default backoff schedule; ~3 minutes is comfortable).
- Observe that the workload transitions to unauthenticated. Restore the
  network and observe that ToolHive does NOT attempt to recover, even
  after waiting an arbitrarily long time.
Note that both conditions for the bug must be met: (a) the failure
classifies as transient (5xx, 429, or 4xx without an RFC 6749 error
code; see #5170), and (b) the failure persists past the short-retry
window. Permanent OAuth failures (invalid_grant, invalid_client)
correctly stop the monitor and are not affected.
Expected behavior
When a transient token-refresh failure exceeds the short-retry window,
the background monitor should keep attempting refresh on a longer
cadence until either the upstream recovers (→ running) or a
configurable ceiling is reached, at which point the workload is finally
marked unauthenticated. Workloads should not be permanently broken by
transient failures that resolve within a reasonable ceiling.
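One possible shape for the expected behavior is a second, slower retry tier that runs only after the short retry has been exhausted. The sketch below is not a proposed patch: retryOnLongCadence, errCeilingReached, errTransient, simulateOutage, and the millisecond-scale interval and ceiling are all hypothetical and chosen so the example finishes quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errCeilingReached is a hypothetical sentinel for "gave up at the ceiling".
var errCeilingReached = errors.New("transient failure outlasted the retry ceiling")

// errTransient stands in for any error the classifier treats as transient.
var errTransient = errors.New("403 Forbidden with HTML body")

// retryOnLongCadence keeps calling refresh once per interval until it
// succeeds, fails with a non-transient error, or the ceiling has elapsed.
func retryOnLongCadence(ctx context.Context, refresh func() error,
	isTransient func(error) bool, interval, ceiling time.Duration) error {
	deadline := time.Now().Add(ceiling)
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
		err := refresh()
		switch {
		case err == nil:
			return nil // upstream recovered; the caller can mark the workload running
		case !isTransient(err):
			return err // definitive failure; the caller marks it unauthenticated
		case time.Now().After(deadline):
			return errCeilingReached // only now give up on a transient error
		}
	}
}

// simulateOutage fails `failures` times with a transient error, then recovers.
func simulateOutage(failures int) (calls int, err error) {
	refresh := func() error {
		calls++
		if calls <= failures {
			return errTransient
		}
		return nil
	}
	isTransient := func(e error) bool { return errors.Is(e, errTransient) }
	err = retryOnLongCadence(context.Background(), refresh, isTransient,
		5*time.Millisecond, 250*time.Millisecond)
	return calls, err
}

func main() {
	fmt.Println(simulateOutage(3))    // outage ends before the ceiling
	fmt.Println(simulateOutage(1000)) // outage outlasts the ceiling
}
```

In a real fix the interval would presumably be minutes and the ceiling hours, and both would be configurable; the sketch only shows that "unauthenticated" becomes the outcome of reaching the ceiling rather than of exhausting the short retry.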
Actual behavior
After the short-retry window exhausts on a still-transient error:
- The retry exhaustion branch in Token() (in the if err != nil block
  following refresher.Refresh(...) in
  pkg/auth/monitored_token_source.go) calls markAsUnauthenticated.
  markAsUnauthenticated writes WorkloadStatusUnauthenticated and closes
  the stopMonitoring channel.
- The monitor goroutine exits via the monitorLoop's select on
  stopMonitoring.
- No further refresh is ever attempted by this workload's monitor.
- The workload remains unauthenticated indefinitely.
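The sequence above can be modeled in a few lines. This is a simplified, hypothetical model and not the code in pkg/auth/monitored_token_source.go: the monitor struct, the use of sync.Once, the unbuffered tick channel, and runModel are assumptions, and markAsUnauthenticated is called from outside the loop here purely to keep the example short.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// monitor is a stripped-down stand-in for the background token monitor.
type monitor struct {
	stopMonitoring chan struct{}
	stopOnce       sync.Once
	attempts       int // written only by monitorLoop
}

// markAsUnauthenticated models the single-shot exit: closing the channel
// is irreversible, so nothing can restart the loop afterwards.
func (m *monitor) markAsUnauthenticated() {
	m.stopOnce.Do(func() { close(m.stopMonitoring) })
}

// monitorLoop counts one refresh attempt per tick until stopMonitoring closes.
func (m *monitor) monitorLoop(tick <-chan time.Time, done chan<- struct{}) {
	defer close(done)
	for {
		select {
		case <-m.stopMonitoring:
			return
		case <-tick:
			m.attempts++
		}
	}
}

// runModel returns the attempts made before the stop and the number of
// ticks that were accepted after it.
func runModel() (attempts int, lateAttempts int) {
	m := &monitor{stopMonitoring: make(chan struct{})}
	tick := make(chan time.Time) // unbuffered: a send succeeds only while the loop is alive
	done := make(chan struct{})
	go m.monitorLoop(tick, done)

	tick <- time.Now()        // one refresh attempt while the loop is running
	m.markAsUnauthenticated() // short retry exhausted on a transient error
	<-done                    // the goroutine has exited

	// "Network restored": a later tick has no reader left to act on it.
	select {
	case tick <- time.Now():
		lateAttempts++
	case <-time.After(50 * time.Millisecond):
	}
	return m.attempts, lateAttempts
}

func main() {
	fmt.Println(runModel())
}
```

Once the channel is closed, the model has no path back to a running loop, which mirrors why the workload stays unauthenticated even after the upstream recovers.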
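Environment (if relevant)
- [...] independent).
- [...] main (post-#5170, "Retry OAuth token refresh on infrastructure
  4xx", and post-#5044, "Wire authserver DCR resolver and add structured
  logs"). The affected code paths in
  pkg/auth/monitored_token_source.go::Token and ::onTick have had this
  shape since the short-retry layer was introduced in #4281, "Retry
  transient network errors in background token monitor".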
Additional context
Real-world trigger: the canonical scenario is a client-side
network-context change — disconnecting from a corporate VPN, putting
the laptop to sleep on one network and resuming on another, etc.
Token-refresh requests that previously traversed an IP-allowlisted
path now reach the OAuth server from a residential IP, where a WAF or
CDN consistently returns 403+HTML until the trusted path is restored.
The block isn't intermittent from the WAF's perspective — it's a
stable response to a different network origin — but from the
workload's perspective it's a transient failure window that resolves
on its own when the user reconnects.
In one such environment the bug surfaced every 1–3 days; each time the
underlying network state reverted on its own (e.g., morning VPN
reconnect), but the workload stayed unauthenticated until manual
recovery.
Production error shape (real recurrence):
oauth2: cannot fetch token: 403 Forbidden
Response: <!DOCTYPE html>...
This is a 4xx-without-RFC-6749-error-code response — correctly
classified as transient by isTransientNetworkError after #5170. The
short retry exhausts via the 5-try cap (typically within ~1–2 minutes
given the default backoff: 10s initial interval, 1.5 multiplier, ±50%
randomization, 5 max tries), well before the 5-minute
MAX_ELAPSED_TIME cap. The monitor exits at that point.
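The "~1–2 minutes" figure can be checked with a little arithmetic on the defaults quoted above. The sketch assumes that 5 tries means 4 waits between attempts, that each wait is drawn from the nominal interval ±50%, and that request latency is ignored; backoffWindow is a hypothetical helper, not part of ToolHive or of any backoff library.

```go
package main

import (
	"fmt"
	"time"
)

// backoffWindow returns the shortest, nominal, and longest total sleep
// across the waits between `tries` attempts of an exponential backoff.
func backoffWindow(initial time.Duration, multiplier, randomization float64,
	tries int) (lo, mid, hi time.Duration) {
	interval := float64(initial)
	for i := 0; i < tries-1; i++ { // N tries have N-1 waits between them
		lo += time.Duration(interval * (1 - randomization))
		mid += time.Duration(interval)
		hi += time.Duration(interval * (1 + randomization))
		interval *= multiplier
	}
	return lo, mid, hi
}

func main() {
	// Defaults from the text: 10s initial interval, 1.5 multiplier,
	// ±50% randomization, 5 max tries.
	lo, mid, hi := backoffWindow(10*time.Second, 1.5, 0.5, 5)
	fmt.Println(lo, mid, hi)
}
```

Under these assumptions the nominal waits are 10s, 15s, 22.5s, and 33.75s, for a total of 81.25s, with a range of roughly 41s to 122s. That is consistent with the short retry exhausting within about 1–2 minutes once request time is added, and well short of the 5-minute elapsed-time cap.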
Affected code:
- pkg/auth/monitored_token_source.go::Token: the refresher.Refresh
  exhaustion branch (the if err != nil block right after the
  refresher.Refresh call).
- pkg/auth/monitored_token_source.go::onTick: calls Token(), so it
  inherits the same exit-on-exhaustion behavior. The monitor's only
  response to a transient refresh error that outlasts the short retry
  is to mark the workload unauthenticated and stop.
- pkg/auth/monitored_token_source.go::markAsUnauthenticated: the
  single-shot exit point both call into.
Related PRs:
- [...] precondition for this bug to manifest predictably; the short
  retry now correctly retries WAF-shaped responses, then exhausts via
  the 5-try cap (~1–2 minutes default).
- [...] Refined the short-retry layer.
- [...] upstream/clientID constructor context (merged). A fix here must
  continue to gate the DCR remediation Warn correctly (only on permanent
  errors, not on transient-ceiling give-ups).