
Cap OTLP exporter retry deadline to prevent shutdown SIGKILLs#223

Open
NumericalAdvantage wants to merge 1 commit into main from fix-otel-shutdown-timeout

Conversation

@NumericalAdvantage
Collaborator

Summary

  • Set OTEL_EXPORTER_OTLP_TIMEOUT=2 in docker-compose.override.yml.example
  • Prevents init/web/worker containers from being SIGKILLed (exit 137) on redeploy

Background

While auditing crashed containers on the openradx host, I found several exit-137 RADIS containers (radis_prod_init, etc.) that all ended with the same OpenTelemetry exporter traceback in their final log lines:

requests.exceptions.ConnectionError: HTTPConnectionPool(host='otel-collector.local', port=4318):
  Max retries exceeded with url: /v1/logs

When otel-collector is shut down before the apps during a stack redeploy, the OTel batch processors call force_flush() on shutdown, and the underlying OTLPLogExporter retries with exponential backoff up to its default 10-second deadline. The worker therefore can't reach its own SIGTERM handler in time: Docker's default 10s stop_grace_period expires and the container is SIGKILLed.
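As a toy model of this interaction (not the SDK's actual retry code, which also applies jitter), an exponential-backoff loop bounded by a deadline behaves like this:

```python
import itertools

def retry_schedule(deadline_s: float, base: float = 1.0, factor: float = 2.0) -> list[float]:
    """Sleep intervals an exponential-backoff loop would use before
    giving up once cumulative waiting would exceed `deadline_s` seconds."""
    waits: list[float] = []
    elapsed = 0.0
    for attempt in itertools.count():
        wait = base * factor ** attempt
        if elapsed + wait > deadline_s:
            break
        waits.append(wait)
        elapsed += wait
    return waits

# Default 10 s deadline: the retries alone sleep 1 + 2 + 4 = 7 s before
# giving up, on top of the connect attempts -- easily enough to blow
# past a 10 s stop_grace_period.
print(retry_schedule(10))  # [1.0, 2.0, 4.0]
# 2 s cap: a single 1 s retry, then the flush gives up.
print(retry_schedule(2))   # [1.0]
```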

The 2s cap bounds the retry deadline at opentelemetry/exporter/otlp/proto/http/_log_exporter/__init__.py:187 so the flush gives up well within the grace period.
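Concretely, the override template ends up with a fragment along these lines (the service name and surrounding layout here are illustrative, not necessarily the file's exact structure):

```yaml
# docker-compose.override.yml.example (sketch)
services:
  web:
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector.local:4318
      # Cap the exporter's retry deadline so a flush against an
      # unreachable collector gives up well inside Docker's 10s
      # stop_grace_period (the Python SDK reads this value in seconds).
      OTEL_EXPORTER_OTLP_TIMEOUT: "2"
```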

The matching change for ADIT is openradx/adit#344.

Test plan

  • After the next swarm redeploy, confirm docker ps -a --filter status=exited shows no exit-137 containers for radis_* services
  • Logs of a deliberately-killed app no longer end with the OTLP Max retries exceeded traceback
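The first bullet can be scripted as a small check; the helper below just scans docker ps style output piped to it (the radis_ name prefix is taken from the services above, and the helper name is mine):

```shell
#!/bin/sh
# Count radis_* containers that exited with 137 (SIGKILL) in
# "name<TAB>status" lines read from stdin, e.g. the output of:
#   docker ps -a --filter status=exited --format '{{.Names}}\t{{.Status}}'
count_sigkilled() {
  grep '^radis_' | grep -c 'Exited (137)'
}
```

After the fix, piping the real docker ps output through count_sigkilled should print 0.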

🤖 Generated with Claude Code

…k shutdown

Set OTEL_EXPORTER_OTLP_TIMEOUT=2 in the observability override template.

The default OTLP HTTP exporter timeout is 10s with exponential backoff. When
the otel-collector was shut down before the apps during a swarm redeploy,
the OTel batch processors blocked on flush() trying to reach an unreachable
collector for the full 10s — exceeding Docker's 10s stop_grace_period and
causing init/web/worker containers to be SIGKILLed (exit 137) instead of
shutting down cleanly.

Capping the deadline at 2s lets the exporter abort retries quickly so the
worker can finish its own SIGTERM handler within the grace period.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented May 7, 2026

Warning

Rate limit exceeded

@NumericalAdvantage has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 59 minutes and 36 seconds before requesting another review.


⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d2b08c24-3c22-4a63-933c-3d281b6ee3f6

📥 Commits

Reviewing files that changed from the base of the PR and between b01f94c and b726801.

📒 Files selected for processing (1)
  • docker-compose.override.yml.example



@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request updates the docker-compose.override.yml.example file to set a 2-second timeout for the OTLP HTTP exporter. This adjustment prevents application processes from being forcefully terminated (SIGKILL) during redeploys when the OpenTelemetry collector is shut down before the applications. The review feedback suggests making this timeout value overridable via an environment variable to maintain consistency with other configurations and allow for easier tuning.

# before the apps, the OTel batch processors blocked on flush() trying
# to reach an unreachable collector, causing apps to be SIGKILLed
# (exit 137) instead of shutting down cleanly.
OTEL_EXPORTER_OTLP_TIMEOUT: "2"


medium

To maintain consistency with other environment variables in this file (such as OTEL_EXPORTER_OTLP_ENDPOINT) and allow for environment-specific tuning without modifying the template, consider making this value overridable via an environment variable. Additionally, note that while the OpenTelemetry specification defines this value in milliseconds, the current OpenTelemetry Python SDK interprets it as seconds. The value 2 is correct for the current SDK to achieve a 2-second timeout, but this discrepancy is worth noting for future SDK updates or if switching to non-Python components.

    OTEL_EXPORTER_OTLP_TIMEOUT: ${OTEL_EXPORTER_OTLP_TIMEOUT:-2}
