Skip to content

fix(celery): soft-limit the display-power and telemetry pokes#3063

Merged
vpetersson merged 3 commits into
masterfrom
tmp-sentry
Jun 11, 2026
Merged

fix(celery): soft-limit the display-power and telemetry pokes#3063
vpetersson merged 3 commits into
masterfrom
tmp-sentry

Conversation

@vpetersson

Copy link
Copy Markdown
Contributor

Issues Fixed

Sentry ANTHIAS-A (Hard time limit (30s) exceeded), ANTHIAS-9 (TimeLimitExceeded), ANTHIAS-B (ForkPoolWorker … signal 9 (SIGKILL)) — one incident fanned into three groups.

Description

These three group by the worker-SIGKILL signature, not by task, so #3017 didn't close them: it only soft-limited the asset-probe tasks. Two periodic tasks still ran under a bare time_limit=30 and kept feeding the group — the latest occurrence was send_telemetry_task, whose requests(timeout=5) doesn't cover a getaddrinfo stall, and get_display_power, whose CEC subprocess can overrun.

This gives both the same treatment #3017 gave the probe:

  • soft_time_limit=30 / time_limit=60 on get_display_power and send_telemetry_task
  • SoftTimeLimitExceeded is caught inside each task, logged, and the tick is skipped (next beat retries) instead of running into the hard limit
  • the hard limit stays as the backstop for a call wedged in C code where the soft signal can't be delivered
  • regression tests for the limits and the soft-timeout skip

Note: a getaddrinfo truly stuck in glibc may still hit the (now 60s) hard backstop, but the common slow-network case is now caught cleanly.

Checklist

  • I have performed a self-review of my own code.
  • New and existing unit tests pass locally and on CI with my changes.
  • I have done an end-to-end test for Raspberry Pi devices.
  • I have tested my changes for x86 devices.
  • I added a documentation for the changes I have made (when necessary).

🤖 Generated with Claude Code

- ANTHIAS-A/9/B group by the worker-SIGKILL signature, not the task,
  so they survived #3017 — which only soft-limited the asset probe
- get_display_power and send_telemetry_task still ran under a bare
  time_limit=30; a wedged CEC query or a getaddrinfo stall (requests'
  timeout doesn't cover DNS) tripped the hard limit and SIGKILLed the
  pool child
- give both the #3017 treatment: soft_time_limit raises inside the
  task so it logs and skips the tick; the hard limit stays as the
  C-code backstop
- add regression tests for the limits and the soft-timeout skip

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@vpetersson vpetersson requested a review from a team as a code owner June 11, 2026 09:26
@vpetersson vpetersson self-assigned this Jun 11, 2026
@vpetersson vpetersson requested a review from Copilot June 11, 2026 09:27

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens two lightweight periodic Celery tasks (get_display_power, send_telemetry_task) against SIGKILLs caused by hitting a 30s hard time limit, by introducing soft+hard time limits with headroom and explicitly catching SoftTimeLimitExceeded to skip the tick cleanly.

Changes:

  • Introduces PERIODIC_POKE_SOFT_TIME_LIMIT_S=30 and PERIODIC_POKE_TIME_LIMIT_S=60 and applies them to the two periodic “poke” tasks.
  • Catches SoftTimeLimitExceeded inside each task and logs a warning instead of allowing the task to run into the hard limit.
  • Adds regression tests asserting task limits and verifying soft-timeout behavior results in a successful (non-failing) tick.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
src/anthias_server/celery_tasks.py Adds periodic-poke soft/hard limits and catches SoftTimeLimitExceeded in get_display_power and send_telemetry_task.
tests/test_celery_tasks.py Adds tests for periodic poke time limits and “soft timeout skips tick” behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/anthias_server/celery_tasks.py Outdated
- a soft-limit signal between SET and EXPIRE could leave display_power
  without a TTL (stale value that never expires)
- use a single SET with ex= so the write and TTL are atomic

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment thread src/anthias_server/celery_tasks.py
- send_telemetry_task now runs under a soft time limit, so a
  SoftTimeLimitExceeded landing between the cooldown SET and EXPIRE
  would leave telemetry-cooldown without a TTL — silencing telemetry
  permanently (same class as the display_power fix in f5e9466)
- collapse the SET + EXPIRE into one SET … ex= so the value and TTL
  are written atomically
- update test_telemetry to assert the atomic SET and accept ex= in
  the fake redis client

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@sonarqubecloud

Copy link
Copy Markdown

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

@vpetersson vpetersson merged commit f67aa1e into master Jun 11, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants