Improve bulkhead resilience with retry and larger queue #1463
Merged
amitjoshi438 merged 8 commits into main on Feb 11, 2026
Based on telemetry analysis showing 74K bulkhead errors from 32 power users with sites containing 5,000-15,000 files:

Changes:
- Increase MAX_CONCURRENT_REQUEST_QUEUE_COUNT from 1000 to 6000 (covers P90)
- Add retry policy with exponential backoff (2 attempts, 2-8s delay)
- Add telemetry for retry events (WEB_EXTENSION_REQUEST_RETRY)
- Add graceful 404 handling for webfiles (WEB_EXTENSION_WEBFILE_NOT_FOUND)

This hybrid approach:
- Queue increase covers 90% of affected sessions without retry
- Retry handles remaining 10% (largest sites) + network failures
- No impact on small site users (queue is lazy, retry only on rejection)
- 404 handling prevents errors for deleted webfiles (49K errors)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Pull request overview
This PR improves resilience for handling high-volume Power Pages sites that have been experiencing bulkhead queue overflow errors. Based on telemetry showing 74K errors from 32 users with large sites (5,000-15,000 files), the changes increase queue capacity and add retry logic to handle transient failures.
Changes:
- Increased bulkhead queue size from 1,000 to 6,000 to accommodate P90 request volumes
- Added retry policy with exponential backoff (2 attempts, 2-8s delay) using cockatiel library
- Added graceful 404 handling for deleted/moved webfiles with telemetry tracking
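The queue-size change is easier to reason about with a concrete picture of what a bulkhead does. The following is a minimal, self-contained sketch and not the cockatiel implementation used by this PR: `SimpleBulkhead`, its constructor parameters, and the locally defined `BulkheadRejectedError` are illustrative names. It runs at most `limit` tasks at once, parks up to `queueSize` further tasks, and rejects anything beyond that.

```typescript
// Thrown when both the running slots and the waiting queue are full.
class BulkheadRejectedError extends Error {
    constructor() {
        super("Bulkhead capacity exceeded");
        this.name = "BulkheadRejectedError";
        // Keeps `instanceof` working when compiled to older JS targets.
        Object.setPrototypeOf(this, BulkheadRejectedError.prototype);
    }
}

class SimpleBulkhead {
    private active = 0;
    private readonly waiting: Array<() => void> = [];
    private readonly limit: number;
    private readonly queueSize: number;

    constructor(limit: number, queueSize: number) {
        this.limit = limit;
        this.queueSize = queueSize;
    }

    async execute<T>(task: () => Promise<T>): Promise<T> {
        if (this.active < this.limit) {
            this.active++;
        } else if (this.waiting.length < this.queueSize) {
            // Park until a finishing task hands over its slot.
            await new Promise<void>((resolve) => this.waiting.push(resolve));
        } else {
            throw new BulkheadRejectedError();
        }
        try {
            return await task();
        } finally {
            const next = this.waiting.shift();
            if (next) {
                next(); // slot is transferred, so `active` is unchanged
            } else {
                this.active--;
            }
        }
    }
}
```

In this sketch the waiting list grows only as requests arrive, so a larger `queueSize` costs nothing for sessions that never fill it. A site that issues more requests than `limit + queueSize` at once still sees rejections, which is the case the retry policy and later commits deal with.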
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| src/web/client/common/constants.ts | Increased MAX_CONCURRENT_REQUEST_QUEUE_COUNT to 6000 and added retry policy constants |
| src/web/client/dal/concurrencyHandler.ts | Implemented retry policy wrapped with bulkhead and added retry telemetry |
| src/web/client/dal/remoteFetchProvider.ts | Added 404 error handling for webfiles to gracefully handle deleted/moved files |
| src/common/OneDSLoggerTelemetry/web/client/webExtensionTelemetryEvents.ts | Added WEB_EXTENSION_REQUEST_RETRY and WEB_EXTENSION_WEBFILE_NOT_FOUND telemetry events |
- Exclude BulkheadRejectedError from retry policy to prevent wasteful retries when the queue is full (the queue won't drain during backoff)
- Add JSDoc comment to clarify RETRY_MAX_ATTEMPTS means total attempts
- Add integration tests for ConcurrencyHandler retry behavior
- Add integration test for webfile 404 handling with telemetry

Co-Authored-By: Claude Opus 4.5 <[email protected]>
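The retry rule described in this commit can be sketched without any library. The code below is an assumption-laden illustration, not the cockatiel policy in concurrencyHandler.ts: `withRetry`, `backoffDelay`, `RetryOptions`, and the option names are hypothetical, and `BulkheadRejectedError` is redefined locally so the block stands alone. It treats `maxAttempts` as total attempts (first call included), doubles the delay from 2 s up to an 8 s cap, and rethrows bulkhead rejections immediately.

```typescript
// Local stand-in for the library's rejection error (illustrative only).
class BulkheadRejectedError extends Error {
    constructor() {
        super("Bulkhead capacity exceeded");
        this.name = "BulkheadRejectedError";
        Object.setPrototypeOf(this, BulkheadRejectedError.prototype);
    }
}

interface RetryOptions {
    maxAttempts: number;    // total attempts, including the first call
    initialDelayMs: number; // delay before the second attempt
    maxDelayMs: number;     // upper bound for any single delay
    sleep?: (ms: number) => Promise<void>; // injectable so tests need not wait
}

// Exponential backoff: initial, 2x, 4x, ... capped at maxDelayMs.
function backoffDelay(failedAttempt: number, initialDelayMs: number, maxDelayMs: number): number {
    return Math.min(maxDelayMs, initialDelayMs * Math.pow(2, failedAttempt - 1));
}

const defaultSleep = (ms: number): Promise<void> =>
    new Promise<void>((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(
    task: () => Promise<T>,
    options: RetryOptions,
    onRetry?: (failedAttempt: number, delayMs: number) => void
): Promise<T> {
    const sleep = options.sleep ?? defaultSleep;
    for (let attempt = 1; ; attempt++) {
        try {
            return await task();
        } catch (error) {
            // A full queue will not drain during backoff, so do not retry it.
            const retryable = !(error instanceof BulkheadRejectedError);
            if (!retryable || attempt >= options.maxAttempts) {
                throw error;
            }
            const delayMs = backoffDelay(attempt, options.initialDelayMs, options.maxDelayMs);
            if (onRetry) {
                onRetry(attempt, delayMs); // hook for retry telemetry
            }
            await sleep(delayMs);
        }
    }
}
```

With `maxAttempts: 2`, `initialDelayMs: 2000`, and `maxDelayMs: 8000`, a failing request is retried once after 2 s; the 8 s cap only matters if the attempt count is raised. The `onRetry` hook is where an event such as WEB_EXTENSION_REQUEST_RETRY could be emitted.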
- Fix concurrencyHandler tests: correct error message assertion, remove tests that trigger retry delays
- Fix remoteFetchProvider tests: stub fetch before auth calls to avoid retry delays during authentication
- Fix WebExtensionContext test: stub concurrencyHandler.handleRequest instead of fetch to bypass retry logic

The retry policy (2-8 second backoff) was causing tests to exceed the 2000ms timeout when fetch calls failed and were retried.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
…es and improve error handling in the web client.

- Translated error message for switching environments in Turkish, Simplified Chinese, Traditional Chinese, Czech, German, Spanish, French, Italian, Japanese, Korean, Brazilian Portuguese, and Russian.
- Updated error handling in WebExtensionContext, remoteFetchProvider, remoteSaveProvider, etagHandlerService, graphClientService, and added a new utility for structured HTTP error handling.
- Introduced createHttpResponseError and isHttpResponseError functions to standardize error responses and improve telemetry reporting.
- Enhanced test cases to validate new error handling logic and ensure proper telemetry is sent for HTTP errors.
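The commit names createHttpResponseError and isHttpResponseError but this page does not show their signatures. As a rough sketch of what such a pair might look like, the block below assumes an error object carrying `status`, `statusText`, and `url`; the `HttpResponseError` interface and the `ResponseLike` input shape are guesses made for illustration and may differ from the merged code.

```typescript
// Assumed shape: a plain Error enriched with HTTP details.
interface HttpResponseError extends Error {
    status: number;
    statusText: string;
    url: string;
}

// Minimal subset of a fetch Response that the helper needs (assumption).
interface ResponseLike {
    status: number;
    statusText: string;
    url: string;
}

function createHttpResponseError(response: ResponseLike): HttpResponseError {
    const error = new Error(`HTTP ${response.status} ${response.statusText}`) as HttpResponseError;
    error.name = "HttpResponseError";
    error.status = response.status;
    error.statusText = response.statusText;
    error.url = response.url;
    return error;
}

// Type guard so catch blocks can branch on HTTP status without casting.
function isHttpResponseError(error: unknown): error is HttpResponseError {
    return (
        error instanceof Error &&
        error.name === "HttpResponseError" &&
        typeof (error as Partial<HttpResponseError>).status === "number"
    );
}
```

A caller could then write `if (isHttpResponseError(e) && e.status === 404) { ... }` in a catch block and pass `e.status` to telemetry, instead of parsing status codes out of message strings.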
Previously SERVERLOGICS silently ignored all non-ok responses without telemetry, while WEBFILES only handled 404s with logging. Consolidate into a single map-driven block that gracefully handles 404s for both entity types with appropriate telemetry events. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
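A map-driven 404 handler of the kind this commit describes might look roughly like the following. This is a sketch under assumptions: the string keys stand in for whatever entity identifiers the repo uses, `handleNonOkResponse` and `TelemetrySink` are hypothetical names, and the SERVERLOGICS event name is a placeholder because the page only names WEB_EXTENSION_WEBFILE_NOT_FOUND.

```typescript
type TelemetrySink = (eventName: string, properties: { [key: string]: string }) => void;

// Entity type -> telemetry event to emit when the resource returns 404.
const NOT_FOUND_EVENT_BY_ENTITY: { [entity: string]: string | undefined } = {
    WEBFILES: "WEB_EXTENSION_WEBFILE_NOT_FOUND",
    // Placeholder name: the real SERVERLOGICS event is not shown in this PR page.
    SERVERLOGICS: "HYPOTHETICAL_SERVERLOGIC_NOT_FOUND",
};

// Returns true when the response was handled gracefully (caller skips the item),
// false when the caller should surface the failure as an error.
function handleNonOkResponse(entityName: string, status: number, sendTelemetry: TelemetrySink): boolean {
    const eventName = NOT_FOUND_EVENT_BY_ENTITY[entityName];
    if (status === 404 && eventName !== undefined) {
        sendTelemetry(eventName, { entityName, status: String(status) });
        return true;
    }
    return false;
}
```

Keeping the entity-to-event mapping in one table means supporting another entity type is a one-line change, and any status other than 404 still falls through to the normal error path instead of being silently dropped.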
priyanshu92 approved these changes on Feb 11, 2026
Summary
Based on telemetry analysis showing 74K bulkhead errors from 32 power users with sites containing 5,000-15,000 files, this PR improves the concurrency handler resilience.
Telemetry findings (30 days, sessions hitting bulkhead limits):
Changes
- Increase MAX_CONCURRENT_REQUEST_QUEUE_COUNT from 1000 to 6000 (covers P90)
- Add telemetry for retry events (WEB_EXTENSION_REQUEST_RETRY)
- Add graceful 404 handling for webfiles (WEB_EXTENSION_WEBFILE_NOT_FOUND) - addresses 49K errors

Benefits
Test plan
🤖 Generated with Claude Code