fix(celery): soft-limit the display-power and telemetry pokes#3063
Merged
Conversation
- ANTHIAS-A/9/B group by the worker-SIGKILL signature, not the task, so they survived #3017 — which only soft-limited the asset probe - get_display_power and send_telemetry_task still ran under a bare time_limit=30; a wedged CEC query or a getaddrinfo stall (requests' timeout doesn't cover DNS) tripped the hard limit and SIGKILLed the pool child - give both the #3017 treatment: soft_time_limit raises inside the task so it logs and skips the tick; the hard limit stays as the C-code backstop - add regression tests for the limits and the soft-timeout skip Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR hardens two lightweight periodic Celery tasks (get_display_power, send_telemetry_task) against SIGKILLs caused by hitting a 30s hard time limit, by introducing soft+hard time limits with headroom and explicitly catching SoftTimeLimitExceeded to skip the tick cleanly.
Changes:
- Introduces
PERIODIC_POKE_SOFT_TIME_LIMIT_S=30andPERIODIC_POKE_TIME_LIMIT_S=60and applies them to the two periodic “poke” tasks. - Catches
SoftTimeLimitExceededinside each task and logs a warning instead of allowing the task to run into the hard limit. - Adds regression tests asserting task limits and verifying soft-timeout behavior results in a successful (non-failing) tick.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
src/anthias_server/celery_tasks.py |
Adds periodic-poke soft/hard limits and catches SoftTimeLimitExceeded in get_display_power and send_telemetry_task. |
tests/test_celery_tasks.py |
Adds tests for periodic poke time limits and “soft timeout skips tick” behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- a soft-limit signal between SET and EXPIRE could leave display_power without a TTL (stale value that never expires) - use a single SET with ex= so the write and TTL are atomic Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- send_telemetry_task now runs under a soft time limit, so a SoftTimeLimitExceeded landing between the cooldown SET and EXPIRE would leave telemetry-cooldown without a TTL — silencing telemetry permanently (same class as the display_power fix in f5e9466) - collapse the SET + EXPIRE into one SET … ex= so the value and TTL are written atomically - update test_telemetry to assert the atomic SET and accept ex= in the fake redis client Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.



Issues Fixed
Sentry ANTHIAS-A (
Hard time limit (30s) exceeded), ANTHIAS-9 (TimeLimitExceeded), ANTHIAS-B (ForkPoolWorker … signal 9 (SIGKILL)) — one incident fanned into three groups.Description
These three group by the worker-SIGKILL signature, not by task, so #3017 didn't close them: it only soft-limited the asset-probe tasks. Two periodic tasks still ran under a bare
time_limit=30and kept feeding the group — the latest occurrence wassend_telemetry_task, whoserequests(timeout=5)doesn't cover agetaddrinfostall, andget_display_power, whose CEC subprocess can overrun.This gives both the same treatment #3017 gave the probe:
soft_time_limit=30/time_limit=60onget_display_powerandsend_telemetry_taskSoftTimeLimitExceededis caught inside each task, logged, and the tick is skipped (next beat retries) instead of running into the hard limitNote: a
getaddrinfotruly stuck in glibc may still hit the (now 60s) hard backstop, but the common slow-network case is now caught cleanly.Checklist
🤖 Generated with Claude Code