Cap OTLP exporter retry deadline to prevent shutdown SIGKILLs #223
NumericalAdvantage wants to merge 1 commit into main
…k shutdown

Set OTEL_EXPORTER_OTLP_TIMEOUT=2 in the observability override template.

The default OTLP HTTP exporter timeout is 10s, and failed exports are retried with exponential backoff until that deadline. When the otel-collector was shut down before the apps during a swarm redeploy, the OTel batch processors blocked on flush() for the full 10s trying to reach the unreachable collector. That used up Docker's entire 10s stop_grace_period, so the init/web/worker containers were SIGKILLed (exit 137) instead of shutting down cleanly.

Capping the deadline at 2s lets the exporter abort retries quickly so the worker can finish its own SIGTERM handler within the grace period.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code Review
This pull request updates the docker-compose.override.yml.example file to set a 2-second timeout for the OTLP HTTP exporter. This adjustment prevents application processes from being forcefully terminated (SIGKILL) during redeploys when the OpenTelemetry collector is shut down before the applications. The review feedback suggests making this timeout value overridable via an environment variable to maintain consistency with other configurations and allow for easier tuning.
# before the apps, the OTel batch processors blocked on flush() trying
# to reach an unreachable collector, causing apps to be SIGKILLed
# (exit 137) instead of shutting down cleanly.
OTEL_EXPORTER_OTLP_TIMEOUT: "2"
To maintain consistency with other environment variables in this file (such as OTEL_EXPORTER_OTLP_ENDPOINT) and allow for environment-specific tuning without modifying the template, consider making this value overridable via an environment variable. Additionally, note that while the OpenTelemetry specification defines this value in milliseconds, the current OpenTelemetry Python SDK interprets it as seconds. The value 2 is correct for the current SDK to achieve a 2-second timeout, but this discrepancy is worth noting for future SDK updates or if switching to non-Python components.
OTEL_EXPORTER_OTLP_TIMEOUT: ${OTEL_EXPORTER_OTLP_TIMEOUT:-2}
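A minimal Python sketch of what the suggested `${OTEL_EXPORTER_OTLP_TIMEOUT:-2}` substitution resolves to, and of the seconds-versus-milliseconds gap mentioned above. The helper names `resolve_timeout_seconds` and `to_spec_milliseconds` are hypothetical and exist only for this illustration; they are not part of Compose or the OpenTelemetry SDK.

```python
import os


def resolve_timeout_seconds(env=None, default="2"):
    """Mimic Compose's ${OTEL_EXPORTER_OTLP_TIMEOUT:-2} interpolation.

    Hypothetical helper: the ':-' form falls back to the default when
    the variable is unset OR set to an empty string.
    """
    env = os.environ if env is None else env
    raw = env.get("OTEL_EXPORTER_OTLP_TIMEOUT") or default
    return float(raw)


def to_spec_milliseconds(seconds):
    """Convert a seconds value to the milliseconds the spec describes."""
    return int(seconds * 1000)


# Unset and empty both yield the template default of 2 seconds.
print(resolve_timeout_seconds({}))
print(resolve_timeout_seconds({"OTEL_EXPORTER_OTLP_TIMEOUT": ""}))
# A host-level override wins over the default.
print(resolve_timeout_seconds({"OTEL_EXPORTER_OTLP_TIMEOUT": "5"}))
# The same 2-second intent expressed for a component that reads milliseconds.
print(to_spec_milliseconds(2.0))
```

If a non-Python component that follows the specification's unit were added to the stack, the same variable would need the millisecond value rather than `2`.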
Summary
Set OTEL_EXPORTER_OTLP_TIMEOUT=2 in docker-compose.override.yml.example
Background
While auditing crashed containers on the openradx host, several exit-137 RADIS containers (radis_prod_init, etc.) all ended with the same OpenTelemetry exporter traceback in their final log lines.
When otel-collector is shut down before the apps during a stack redeploy, the OTel batch processors call force_flush() on shutdown and the underlying OTLPLogExporter retries with exponential backoff up to its default 10-second deadline, so the worker can't reach its own SIGTERM handler in time. Docker's default 10s stop_grace_period expires and the container is SIGKILLed.
The 2s cap bounds the retry deadline at opentelemetry/exporter/otlp/proto/http/_log_exporter/__init__.py:187 so the flush gives up well within the grace period.
The matching change for ADIT is openradx/adit#344.
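The timing argument can be sketched with a toy model of a deadline-bounded retry loop. This is not the SDK's actual retry code: the 0.5s cost per failed attempt, the 1s initial backoff that doubles each round, and the function name `time_blocked` are all assumptions chosen for illustration.

```python
def time_blocked(deadline_s, attempt_cost_s=0.5, initial_backoff_s=1.0):
    """Simulated seconds a flush blocks before the exporter gives up.

    Toy model (assumed, not the OpenTelemetry implementation): every
    export attempt fails after attempt_cost_s, then the exporter sleeps
    for a backoff that doubles each round, never past the deadline.
    """
    elapsed = 0.0
    backoff = initial_backoff_s
    while True:
        # One failed export attempt, cut short if the deadline arrives first.
        elapsed += min(attempt_cost_s, deadline_s - elapsed)
        if elapsed >= deadline_s:
            return elapsed
        # Give up once the next backoff sleep would overrun the deadline.
        if backoff > deadline_s - elapsed:
            return elapsed
        elapsed += backoff
        backoff *= 2


GRACE_PERIOD_S = 10.0  # Docker's default stop_grace_period

for deadline in (10.0, 2.0):
    blocked = time_blocked(deadline)
    remaining = GRACE_PERIOD_S - blocked
    print(f"deadline={deadline}s blocked={blocked}s left={remaining}s")
```

Under these assumptions the 10s deadline blocks for 9s and leaves 1s for the rest of shutdown, while the 2s cap blocks for 2s and leaves 8s. The exact figures depend on the assumed attempt cost, but the blocked time never exceeds the deadline, which is what the cap relies on.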
Test plan
docker ps -a --filter status=exited for radis_* services
Max retries exceeded traceback
🤖 Generated with Claude Code
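One way to automate the first check is to parse the exited-container listing for SIGKILL exits. The sketch below assumes the listing was produced with `docker ps -a --filter status=exited --format '{{.Names}}\t{{.Status}}'`. The function name `sigkilled_containers` is hypothetical and the sample rows are made up for illustration, not real host output.

```python
import re


def sigkilled_containers(ps_output, prefix="radis_"):
    """Return names of containers with the prefix that exited with 137.

    Expects one 'name<TAB>status' row per line, where status looks like
    'Exited (137) 2 hours ago'.
    """
    killed = []
    for line in ps_output.splitlines():
        name, _, status = line.partition("\t")
        match = re.match(r"Exited \((\d+)\)", status)
        if name.startswith(prefix) and match and match.group(1) == "137":
            killed.append(name)
    return killed


# Made-up sample rows; only radis_prod_init appears in the PR text.
SAMPLE = (
    "radis_prod_init\tExited (137) 2 hours ago\n"
    "radis_prod_web\tExited (0) 2 hours ago\n"
    "other_service\tExited (137) 3 hours ago\n"
)

print(sigkilled_containers(SAMPLE))
```

An empty list after a redeploy would suggest that no radis_* container was SIGKILLed during shutdown.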