Conversation

iMicknl (Owner) commented Jan 25, 2026

This pull request improves the reliability and robustness of the pyoverkiz/client.py client by enhancing its retry logic for various network and API error scenarios. The changes focus on increasing retry attempts and handling additional failure cases to reduce transient errors during client operations.

Enhanced retry logic and error handling:

  • Increased the maximum retry attempts for authentication and connection-related errors in the refresh_listener function from 2 to 5, and added a new retry policy for connection failures (timeouts and connector errors) with up to 10 retries.
  • Added a new retry policy for handling ExecutionQueueFullException with up to 5 retries.

Application of new retry policies to client methods:

  • Applied the new connection failure retry policy to the fetch_events method to improve resilience against network issues.
  • Applied the new execution queue full retry policy to the _execute_action_group_direct method to better handle API-side execution queue limitations.

iMicknl requested a review from tetienne as a code owner January 25, 2026 22:41
Copilot AI review requested due to automatic review settings January 25, 2026 22:41
Copilot AI (Contributor) left a comment

Pull request overview

This PR aims to improve client robustness by expanding the retry/backoff handling for transient network failures and API-side capacity limits.

Changes:

  • Increased authentication-related retry attempts (max tries from 2 to 5).
  • Added new retry policies for connection failures and ExecutionQueueFullException.
  • Applied the new retry policies to fetch_events and _execute_action_group_direct.


Comment on lines 81 to +92
  retry_on_auth_error = backoff.on_exception(
      backoff.expo,
      (NotAuthenticatedException, ServerDisconnectedError, ClientConnectorError),
-     max_tries=2,
+     max_tries=5,
      on_backoff=relogin,
  )

+ retry_on_connection_failure = backoff.on_exception(
+     backoff.expo,
+     (TimeoutError, ClientConnectorError),
+     max_tries=10,
+ )
Copilot AI commented Jan 25, 2026

ClientConnectorError is included in both retry_on_auth_error and the new retry_on_connection_failure. When both decorators are applied (e.g., fetch_events), retries become nested/multiplicative (up to 5 * 10 attempts) and relogin may run for pure connection failures. Consider making the exception sets disjoint (e.g., remove ClientConnectorError from retry_on_auth_error and handle it only via retry_on_connection_failure) so connection outages don't trigger re-auth and total retry budget is predictable.
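
A minimal sketch of that disjoint split, reusing the names already present in pyoverkiz/client.py; this illustrates the suggestion and is not the merged code:

retry_on_auth_error = backoff.on_exception(
    backoff.expo,
    (NotAuthenticatedException, ServerDisconnectedError),  # ClientConnectorError removed
    max_tries=5,
    on_backoff=relogin,
)

retry_on_connection_failure = backoff.on_exception(
    backoff.expo,
    (TimeoutError, ClientConnectorError),  # the only handler for connection outages
    max_tries=10,
)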

A collaborator replied:
Indeed

Comment on lines 426 to 430
@retry_on_concurrent_requests
@retry_on_auth_error
@retry_on_listener_error
@retry_on_connection_failure
async def fetch_events(self) -> list[Event]:
Copilot AI commented Jan 25, 2026

fetch_events is a polling-style method, but with the current decorator stack a persistent ClientConnectorError/timeout can result in very long blocking retry sequences (nested retries across retry_on_connection_failure + retry_on_auth_error). After making exception sets disjoint, consider bounding retries for this method (e.g., max_time or lower max_tries) so callers can recover/continue their polling loop promptly when the network is down.
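
For illustration only, a bounded variant of the new connection-failure policy; max_time is a standard backoff keyword, but the 60-second budget is an assumed value, not something settled in this PR:

retry_on_connection_failure = backoff.on_exception(
    backoff.expo,
    (TimeoutError, ClientConnectorError),
    max_tries=10,
    max_time=60,  # stop retrying after ~60s so the caller's polling loop regains control
)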

Comment on lines 475 to 477
@retry_on_too_many_executions
@retry_on_execution_queue_full
@retry_on_auth_error
Copilot AI commented Jan 25, 2026

New retry behavior for ExecutionQueueFullException is introduced here, but there doesn't appear to be unit coverage asserting that _execute_action_group_direct retries (and eventually succeeds/fails) when the API returns EXEC_QUEUE_FULL. Adding a test that forces __post to raise ExecutionQueueFullException a few times (and stubbing backoff sleep to keep the test fast) would prevent regressions and validate the retry budget.
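
A rough sketch of such a test; the make_client helper, the pytest-asyncio marker, the import path, and the name-mangled _OverkizClient__post patch target are assumptions about the repository's test setup rather than existing fixtures:

import asyncio
from unittest.mock import AsyncMock

import pytest

from pyoverkiz.exceptions import ExecutionQueueFullException


@pytest.mark.asyncio
async def test_execute_retries_on_exec_queue_full(monkeypatch):
    # Keep the test fast: backoff waits via asyncio.sleep between attempts.
    monkeypatch.setattr(asyncio, "sleep", AsyncMock())

    client = make_client()  # hypothetical helper that builds an OverkizClient

    # Fail twice with EXEC_QUEUE_FULL, then return an execution id.
    post = AsyncMock(
        side_effect=[
            ExecutionQueueFullException("EXEC_QUEUE_FULL"),
            ExecutionQueueFullException("EXEC_QUEUE_FULL"),
            {"execId": "exec-id"},
        ]
    )
    monkeypatch.setattr(client, "_OverkizClient__post", post)

    await client._execute_action_group_direct(...)  # arguments elided; assumed signature

    assert post.await_count == 3  # two retries, then success, within the budget of 5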

Copilot AI (Contributor) left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

iMicknl closed this Jan 26, 2026