Feat/Provider Fallback Chain — Design Document (#2574)#2581
Thanks @idling11 for writing up the provider fallback-chain design. This was not something to squeeze into the corrective v0.8.52 publish, but it is still a useful planning artifact for #2574 and belongs in the v0.8.53+ review sweep. Sorry it sat under mostly bot comments during the release cleanup.
Summary
Add an automatic provider fallback chain so that when the active provider
returns a non-recoverable error (429, selected 5xx, connection timeout),
CodeWhale switches to the next configured provider without interrupting
the user's workflow.
Motivation
Currently, users must manually run `/provider` to switch when their
primary provider fails. This is especially disruptive during long-running
agentic tasks. A fallback chain keeps the agent working without user
intervention.
Design
Configuration
- `fallback`: ordered list of provider names to try
- `active`: the primary provider (existing `provider` key, renamed for clarity)
Fallback triggers
Sequence
Transcript / UI
- `NVIDIA NIM unavailable — switched to DeepSeek`
- `[provider: nvidia-nim → deepseek]`
- `/provider` command shows current chain position: `deepseek (fallback #1)`
- The original (`active`) provider is remembered so the user can `/provider reset` to go back
Capability awareness
Before switching, the engine checks that the fallback provider supports
the current turn's needs:
If no fallback provider meets capabilities, the error is surfaced directly.
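The capability check can be pictured with a small sketch. The capability fields (`tool_calls`, `vision`, `max_context_tokens`), the `TurnNeeds` type, and the `pick_fallback` helper are all hypothetical names chosen for illustration; the design does not fix the actual capability set here.

```rust
// Hypothetical capability model: these fields are illustrative assumptions,
// not the capability list from the design.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Capabilities {
    tool_calls: bool,
    vision: bool,
    max_context_tokens: u32,
}

// What the current turn requires (also hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct TurnNeeds {
    tool_calls: bool,
    vision: bool,
    context_tokens: u32,
}

impl Capabilities {
    /// True when every requirement of the turn is covered by this provider.
    fn satisfies(&self, needs: &TurnNeeds) -> bool {
        (self.tool_calls || !needs.tool_calls)
            && (self.vision || !needs.vision)
            && self.max_context_tokens >= needs.context_tokens
    }
}

/// Returns the first provider in chain order that satisfies the turn's needs.
/// `None` means no fallback qualifies and the original error should be surfaced.
fn pick_fallback<'a>(
    chain: &'a [(&'a str, Capabilities)],
    needs: &TurnNeeds,
) -> Option<&'a str> {
    chain
        .iter()
        .find(|(_, caps)| caps.satisfies(needs))
        .map(|(name, _)| *name)
}
```

Walking the chain in order and skipping providers that do not qualify keeps the configured priority intact while avoiding a switch that would fail on the very next request.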
Retry integration
Existing `[retry]` settings apply per-provider before fallback triggers.
A provider gets `max_retries` attempts with `retry_delay` between them.
Only after retry exhaustion does fallback move to the next provider.
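A minimal sketch of how retry and fallback might compose is shown here. The `ProviderError` enum, the `RetrySettings` struct, and the signature of `try_with_fallback` are assumptions made for illustration. So are the concrete 5xx codes treated as eligible (502, 503, 504), since the design only says "selected 5xx", and the choice to surface a non-eligible error immediately without retrying.

```rust
use std::time::Duration;

// Hypothetical error type standing in for whatever the client returns.
#[derive(Debug, Clone, PartialEq)]
enum ProviderError {
    Http(u16),
    Timeout,
    Other(String),
}

/// Eligible errors per the summary: 429, selected 5xx, connection timeout.
/// Which 5xx codes are "selected" is an assumption in this sketch.
fn is_fallback_eligible(err: &ProviderError) -> bool {
    match err {
        ProviderError::Http(429) => true,
        ProviderError::Http(code) => matches!(*code, 502 | 503 | 504),
        ProviderError::Timeout => true,
        ProviderError::Other(_) => false,
    }
}

// Mirrors the `max_retries` and `retry_delay` settings named in the text.
struct RetrySettings {
    max_retries: u32,
    retry_delay: Duration,
}

/// Tries each provider in order. Each provider gets `max_retries` attempts
/// (at least one, an assumption) before the chain advances.
/// On success, returns the name of the provider that answered plus its reply.
fn try_with_fallback<F>(
    chain: &[&str],
    retry: &RetrySettings,
    mut call: F,
) -> Result<(String, String), ProviderError>
where
    F: FnMut(&str) -> Result<String, ProviderError>,
{
    let attempts = retry.max_retries.max(1);
    let mut last_err = ProviderError::Other("empty provider chain".to_string());
    for &provider in chain {
        for attempt in 0..attempts {
            match call(provider) {
                Ok(reply) => return Ok((provider.to_string(), reply)),
                // Non-eligible errors are surfaced directly (assumption).
                Err(err) if !is_fallback_eligible(&err) => return Err(err),
                Err(err) => last_err = err,
            }
            // Wait between attempts, but not after the last one.
            if attempt + 1 < attempts {
                std::thread::sleep(retry.retry_delay);
            }
        }
    }
    // Every provider exhausted its retries: report the most recent error.
    Err(last_err)
}
```

Passing the request as a closure keeps the sketch independent of the real client; in the actual engine the capability check would presumably filter `chain` before this loop runs.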
Config schema validation
On startup, validate:
- Each `fallback` entry is a known provider
- No duplicate entries
- No entry is the same as the active provider
Implementation Plan (3 Draft PRs)
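Before the phase breakdown, the startup checks from the Config schema validation section (known provider, no duplicates, not same as active, as listed for `Config::validate()` in Phase 1) might look roughly like the following. The function name `validate_fallback`, the `FallbackConfigError` enum, and passing the known providers as a slice are assumptions for illustration, not the existing `config.rs` API.

```rust
use std::collections::HashSet;

// Hypothetical error type; the real validation may report errors differently.
#[derive(Debug, PartialEq)]
enum FallbackConfigError {
    UnknownProvider(String),
    Duplicate(String),
    SameAsActive(String),
}

/// Checks the three rules: every entry is a known provider, no entry repeats,
/// and no entry equals the active (primary) provider.
fn validate_fallback(
    active: &str,
    fallback: &[String],
    known: &[&str],
) -> Result<(), FallbackConfigError> {
    let mut seen = HashSet::new();
    for name in fallback {
        if !known.contains(&name.as_str()) {
            return Err(FallbackConfigError::UnknownProvider(name.clone()));
        }
        if name.as_str() == active {
            return Err(FallbackConfigError::SameAsActive(name.clone()));
        }
        // `insert` returns false when the name was already present.
        if !seen.insert(name.as_str()) {
            return Err(FallbackConfigError::Duplicate(name.clone()));
        }
    }
    Ok(())
}
```

An empty `fallback` list passes trivially, which fits the backward-compatibility goal of configurations that never set the key.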
Phase 1: Config schema + validation
Branch: `feat/provider-fallback-chain-phase1`
Files: `crates/tui/src/config.rs`
- Add `fallback: Option<Vec<String>>` field to `ProvidersConfig`
- `#[serde(default)]` for backward compatibility
- `Config::validate()`: known provider, no duplicates, not same as active
- `fallback` merge logic in `merge_provider_config()`
Phase 2: Engine fallback logic
Branch: `feat/provider-fallback-chain-phase2`
Files: `crates/tui/src/client.rs`, `crates/tui/src/core/engine/turn_loop.rs`
- `ActiveProviderTracker` to remember original provider and current position
- `is_fallback_eligible(error) -> bool`
- `try_with_fallback()` in `client.rs`: iterate fallback chain on eligible errors
- `/provider reset`
- `ProviderFallback { from, to, reason }`
Phase 3: UI feedback
Branch: `feat/provider-fallback-chain-phase3`
Files: `crates/tui/src/tui/ui.rs`, `crates/tui/src/commands/provider.rs`
- `/provider` shows fallback position and chain
- `/provider reset` to return to primary provider
Rejected alternatives
Open questions
→ Reset each launch (avoids silently staying on fallback forever)
Should `/compact` reset to primary provider?
→ No: compaction changes context, not provider
→ Yes, same turn can span providers as long as capabilities match
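To make these answers concrete, here is one possible shape for the `ActiveProviderTracker` named in Phase 2. It is a sketch under assumptions: the method names (`new`, `current`, `advance`, `reset`, `describe`) and the field types are hypothetical, and only the `ProviderFallback { from, to, reason }` field names and the `deepseek (fallback #1)` display format come from the design. Because the state lives only in memory, a new launch starts back on the primary provider.

```rust
// Event recorded on a switch; field names follow the Phase 2 item,
// field types are an assumption.
#[derive(Debug, Clone, PartialEq)]
struct ProviderFallback {
    from: String,
    to: String,
    reason: String,
}

// Hypothetical tracker: `chain[0]` is the primary (`active`) provider,
// followed by the `fallback` entries in order.
struct ActiveProviderTracker {
    chain: Vec<String>,
    position: usize,
}

impl ActiveProviderTracker {
    fn new(active: &str, fallback: &[&str]) -> Self {
        let mut chain = vec![active.to_string()];
        chain.extend(fallback.iter().map(|name| name.to_string()));
        Self { chain, position: 0 }
    }

    /// Provider currently in use.
    fn current(&self) -> &str {
        &self.chain[self.position]
    }

    /// Moves to the next provider; `None` when the chain is exhausted.
    fn advance(&mut self, reason: &str) -> Option<ProviderFallback> {
        let next = self.position + 1;
        if next >= self.chain.len() {
            return None;
        }
        let event = ProviderFallback {
            from: self.chain[self.position].clone(),
            to: self.chain[next].clone(),
            reason: reason.to_string(),
        };
        self.position = next;
        Some(event)
    }

    /// What `/provider reset` would do: return to the primary provider.
    fn reset(&mut self) {
        self.position = 0;
    }

    /// Chain position as `/provider` might display it.
    fn describe(&self) -> String {
        if self.position == 0 {
            self.current().to_string()
        } else {
            format!("{} (fallback #{})", self.current(), self.position)
        }
    }
}
```

Keeping the position separate from the chain means a mid-turn switch only changes `position`, so the same turn can continue on the next provider without rebuilding any configuration.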