
Mimir integration #42

Merged
gsanchietti merged 62 commits into main from mimir-integration
Apr 21, 2026

@gsanchietti gsanchietti commented Feb 20, 2026

📋 Description

This pull request adds Alertmanager integration based on Grafana Mimir, backend APIs for alert configuration and inspection, resolved-alert history persistence, automatic LinkFailed monitoring, Telegram notification support, and system-level silence actions for active alerts.

Backend API (/api/alerts)

  • GET /api/alerts/config — retrieve the current alerting configuration from Mimir as structured JSON or redacted YAML
  • POST /api/alerts/config — apply a new alerting configuration
  • DELETE /api/alerts/config — replace the tenant configuration with a blackhole-only config while keeping the built-in history webhook active
  • GET /api/alerts — list active alerts with optional filters (state, severity, system_key)
  • GET /api/alerts/totals — return active alert counters plus resolved-history totals
  • GET /api/alerts/trend — return resolved-alert trend data for the selected period
  • GET /api/systems/:id/alerts — list active alerts for a single system
  • POST /api/systems/:id/alerts/silences — create a silence for a single active system alert
  • GET /api/systems/:id/alerts/history — return paginated resolved-alert history for a single system
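As a rough illustration of the filtered listing endpoint above, a client could assemble the request URL like this. It is a minimal sketch: the `alertFilter` struct, the `buildAlertsURL` helper, the host name, and the filter values are assumptions for illustration; only the path and the `state`, `severity`, and `system_key` parameter names come from this description.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// alertFilter holds the optional filters listed for GET /api/alerts.
// The struct is hypothetical; only the parameter names come from the PR.
type alertFilter struct {
	State     string
	Severity  string
	SystemKey string
}

// buildAlertsURL appends /api/alerts to base and encodes the non-empty filters.
func buildAlertsURL(base string, f alertFilter) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = strings.TrimRight(u.Path, "/") + "/api/alerts"

	q := url.Values{}
	if f.State != "" {
		q.Set("state", f.State)
	}
	if f.Severity != "" {
		q.Set("severity", f.Severity)
	}
	if f.SystemKey != "" {
		q.Set("system_key", f.SystemKey)
	}
	u.RawQuery = q.Encode() // keys are emitted in sorted order
	return u.String(), nil
}

func main() {
	// Host and filter values are placeholders.
	s, err := buildAlertsURL("https://my.example.com", alertFilter{State: "active", Severity: "critical"})
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```

Where a handler requires it, the `organization_id` query parameter mentioned below would be added the same way.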

Alerting configuration

  • AlertingConfig supports global settings, per-severity overrides, per-system overrides, and Telegram integration
  • SMTP settings are injected server-side
  • Telegram bot token is stored encrypted and managed per-organization
  • The built-in history webhook is always included in the generated Alertmanager config
  • Email templates are available in English and Italian
  • Telegram message templates support HTML formatting
  • Backend access to alerting configuration and active-alert APIs is scoped through the authenticated user plus the organization_id query parameter where required by the current handlers

Collect service

  • POST /api/alert_history receives Alertmanager webhooks and stores resolved alerts in PostgreSQL
  • Bearer-token authentication is enforced through ALERTING_HISTORY_WEBHOOK_SECRET
  • POST /api/services/mimir/alertmanager/api/v2/alerts proxies authenticated systems to Alertmanager with X-Scope-OrgID derived server-side
  • When a system posts alerts through the collect proxy, labels.system_key is always overwritten with the authenticated system value
  • Additional system and organization context labels are injected when missing
  • POST /api/services/mimir/alertmanager/api/v2/silences proxies authenticated systems to Alertmanager with tenant scoping enforced by the server

Frontend

  • The system detail active-alerts card exposes a silence action for users with manage:systems
  • The silence flow uses a small confirmation modal with an optional comment and refreshes the active-alerts card after success
  • Alert history is displayed with pagination and filtering

LinkFailed monitoring

  • The heartbeat monitor checks every 60 seconds
  • Systems move to inactive after exceeding HEARTBEAT_TIMEOUT_MINUTES
  • A LinkFailed alert is automatically posted when a system fails to communicate
  • The alert is resolved automatically when the system becomes active again
  • Race conditions between heartbeat ingestion and system status updates are handled defensively

Notification channels

  • Email notifications (SMTP) with configurable per-severity recipients
  • Telegram notifications with HTML-formatted messages
  • Webhook notifications with customizable payloads
  • Per-organization configuration and per-severity overrides

Tooling and docs

  • services/mimir/scripts/alerting_config.py manages alerting config and alert queries through the MY API
  • services/mimir/scripts/alert.py fires, resolves, silences, and lists alerts through the collect proxy
  • OpenAPI, database schema, migrations, tests, and docs cover the new alerting surface
  • Telegram bot configuration guide added to alerting documentation

🧪 Validation

  • cd backend && make pre-commit
  • cd collect && make pre-commit
  • cd frontend && npm run pre-commit

Related issue

Implements requirements from #72 (Alarm Management - Alertmanager Integration)

@github-actions
Contributor

🔗 Redirect URIs Added to Logto

The following redirect URIs have been automatically added to the Logto application configuration:

Redirect URIs:

  • https://my-frontend-qa-pr-42.onrender.com/login-redirect
  • https://my-proxy-qa-pr-42.onrender.com/login-redirect

Post-logout redirect URIs:

  • https://my-frontend-qa-pr-42.onrender.com/login
  • https://my-proxy-qa-pr-42.onrender.com/login

These will be automatically removed when the PR is closed or merged.

@github-actions
Contributor

github-actions Bot commented Feb 20, 2026

🤖 My API structural change detected

Preview documentation

Structural change details

Added (20)

  • DELETE /alerts/config
  • DELETE /services/mimir/alertmanager/api/v2/silences/{silence_id}
  • DELETE /systems/{id}/alerts/silences/{silence_id}
  • GET /alerts
  • GET /alerts/config
  • GET /alerts/totals
  • GET /alerts/trend
  • GET /services/mimir/alertmanager/api/v2/alerts
  • GET /services/mimir/alertmanager/api/v2/silences
  • GET /services/mimir/alertmanager/api/v2/silences/{silence_id}
  • GET /systems/{id}/alerts
  • GET /systems/{id}/alerts/history
  • GET /systems/{id}/alerts/silences
  • GET /systems/{id}/alerts/silences/{silence_id}
  • POST /alert_history
  • POST /alerts/config
  • POST /services/mimir/alertmanager/api/v2/alerts
  • POST /services/mimir/alertmanager/api/v2/silences
  • POST /systems/{id}/alerts/silences
  • PUT /systems/{id}/alerts/silences/{silence_id}
Powered by Bump.sh

@edospadoni edospadoni deployed to mimir-integration - my-mimir-qa PR #42 February 20, 2026 11:00 — with Render Active
@edospadoni edospadoni deployed to mimir-integration - my-mimir-qa PR #42 February 24, 2026 16:13 — with Render Active
@edospadoni edospadoni temporarily deployed to mimir-integration - my-collect-qa PR #42 February 24, 2026 16:13 — with Render Destroyed
@edospadoni edospadoni deployed to mimir-integration - my-mimir-qa PR #42 February 24, 2026 16:15 — with Render Active
@edospadoni edospadoni requested a deployment to mimir-integration - my-mimir-qa PR #42 February 24, 2026 16:38 — with Render In progress
@edospadoni edospadoni deployed to mimir-integration - my-mimir-qa PR #42 February 24, 2026 16:40 — with Render Active
@edospadoni edospadoni had a problem deploying to mimir-integration - my-mimir-qa PR #42 February 25, 2026 06:59 — with Render Failure
@gsanchietti
Member Author

update deploy

@github-actions
Contributor

🚀 Build triggers updated!

All .render-build-trigger files have been automatically updated to ensure fresh deployments of all services in the PR preview environment.

gsanchietti and others added 27 commits April 16, 2026 11:44
backend:
- add GetSystemAlertSilences handler (lists active/pending silences per system)
- add GetSystemAlertSilence handler (get single silence with ownership check)
- add UpdateSystemAlertSilence handler (PUT - preserves matchers, updates endsAt/comment)
- add end_at field to CreateSystemAlertSilence (takes precedence over duration_minutes)
- add GetSilences() to alerting client
- register GET/PUT silence routes under /systems/:id/alerts/silences[/:silence_id]
- update AlertmanagerSilence model with Status, UpdatedAt fields
- add UpdateSystemAlertSilenceRequest, AlertmanagerSilenceStatus models
- fix buildSystemAlertSilenceRequest parameter order (now before endsAt)
- fix duplicate response keys in openapi.yaml
- document new endpoints in openapi.yaml including AlertmanagerSilence schema

frontend:
- add AlertmanagerSilence, AlertmanagerMatcher, AlertmanagerSilenceStatus types
- add getSystemAlertSilences() API function
- update createSystemAlertSilence() to accept optional endAt param
- add SYSTEM_ALERT_SILENCES_KEY constant
- replace hardcoded 60-min notice in SilenceSystemAlertModal with datetime-local picker
- add SystemAlertSilencesCard component (lists silences with delete action)
- insert SystemAlertSilencesCard in SystemAlertHistoryPanel between active alerts and history
- add i18n keys (en/it): silences card, silence_end_at, status labels, delete notifications

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- rename alertname from HostDown to LinkFailed in heartbeat monitor
- rename internal functions: fireHostDownAlert -> fireLinkFailedAlert,
  resolveHostDownAlert -> resolveLinkFailedAlert,
  postHostDownAlert -> postLinkFailedAlert
- update summary/description annotations to focus on missed heartbeat
  communication rather than host being down, avoiding alarm fatigue
- update all log messages, comments, and variable names consistently
- update backend and collect test fixtures from HostDown to LinkFailed
- update AGENTS.md references

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ProvisionDefaultConfig always sets MailEnabled and WebhookEnabled to false
regardless of whether a default email is present. The email address is still
stored in MailAddresses so it appears pre-filled in the UI, but the user must
explicitly enable notifications after creation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
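The provisioning behavior described in this commit can be sketched as below. The struct is a trimmed, hypothetical stand-in for the real config type; only the `MailEnabled`, `WebhookEnabled`, and `MailAddresses` field names come from the commit message, and `provisionDefaultConfig` is an illustrative helper rather than the actual function.

```go
package main

import "fmt"

// alertingConfig is a reduced, hypothetical view of the alerting configuration.
type alertingConfig struct {
	MailEnabled    bool
	WebhookEnabled bool
	MailAddresses  []string
}

// provisionDefaultConfig keeps every channel disabled and only pre-fills
// the address list when a default email is known.
func provisionDefaultConfig(defaultEmail string) alertingConfig {
	cfg := alertingConfig{MailEnabled: false, WebhookEnabled: false}
	if defaultEmail != "" {
		cfg.MailAddresses = []string{defaultEmail}
	}
	return cfg
}

func main() {
	fmt.Printf("%+v\n", provisionDefaultConfig("admin@example.com"))
	fmt.Printf("%+v\n", provisionDefaultConfig(""))
}
```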
… 4h default

- wrap alert history table in NeCard to match active alerts and silences cards
- gray out suppressed (silenced) alert rows in the active alerts table with opacity-50
- remove action button for suppressed alerts; silence management is in the silences card
- add alertname column to silences card, derived from the alertname matcher value
- change default silence end time from 1 hour to 4 hours

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add telegram_enabled / telegram_receivers to global settings table
- Add telegram_enabled / telegram_receivers to per-severity and per-system
  override fields
- Update example JSON to include Telegram receiver
- Add 'Telegram notifications' section with 3-step setup guide:
  Step 1: create bot via @BotFather
  Step 2: obtain chat_id via getUpdates (private and group/channel)
  Step 3: configure JSON with telegram_enabled + telegram_receivers
- Document 4096-char message limit caveat
- Applied in both English and Italian

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
that appear in alert names, system keys, and annotations. Any
unescaped reserved character causes Alertmanager to fail the
notification with a 400 Bad Request from Telegram.

HTML mode is simpler and reliable: only <, >, & require escaping,
which Alertmanager handles automatically. Switch parse_mode to HTML
and rewrite the telegram_en/it templates to use <b> and <code> tags
instead of MarkdownV2 *bold* and `code` syntax.

Fixes: ts=... err="telegram: Bad Request: can't parse entities:
Character '-' is reserved and must be escaped"

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
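For illustration only, this is what HTML-mode escaping amounts to when done by hand. Per the commit message, Alertmanager already does this for template output, so the snippet only shows why a value containing `-` no longer breaks the message. The `formatAlertLine` helper and the sample values are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// telegramHTMLEscaper escapes the three characters that need it in HTML parse mode.
// strings.Replacer does not rescan its own output, so "&" is not double-escaped.
var telegramHTMLEscaper = strings.NewReplacer("&", "&amp;", "<", "&lt;", ">", "&gt;")

// formatAlertLine renders a bold alert name and a monospace system key.
func formatAlertLine(alertName, systemKey string) string {
	return "<b>" + telegramHTMLEscaper.Replace(alertName) + "</b> on <code>" +
		telegramHTMLEscaper.Replace(systemKey) + "</code>"
}

func main() {
	// The hyphens pass through untouched; only <, > and & are rewritten.
	fmt.Println(formatAlertLine("Disk<90% & rising", "sys-1234-abcd"))
}
```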
…tting

- Use italics for severity labels (Critical, Warning, Info)
- Add server emoji (🖥️) before system name
- Separate severity info with pipe character for visual clarity
- Use header/body/detail structure with blank lines between sections
- Bolder alert name emphasis with visual hierarchy
- Cleaner, more professional appearance while respecting 4096 char limit

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace the diff-sync approach (list Alertmanager → fire missing → resolve
orphaned) with a TTL-refresh model: every sync cycle, post alerts for ALL
DB-inactive systems with a short TTL (3× sync interval = 15 min).
Alertmanager deduplicates existing alerts. When a system becomes active,
posting stops and the alert auto-expires.

This eliminates the fire→resolve→fire oscillation caused by ~1700 systems
whose heartbeats arrive near the timeout boundary, toggling their DB
status every minute while the monitor snapshots every 5 minutes.

HeartbeatMonitor: log individual system_key on every active↔inactive
transition at debug level for troubleshooting.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…led monitor

- Revert heartbeat_monitor.go to original (no per-system RETURNING logging)
- Keep TTL-refresh model: post firing with 15min TTL for all inactive systems
  each sync cycle to prevent flapping
- Re-add listAlerts: after posting refreshes, query Alertmanager for managed
  LinkFailed alerts and explicitly resolve any whose system is no longer inactive
  (rather than waiting up to 1h for Alertmanager to auto-resolve stale alerts)
- Log each refresh and each resolve at debug level with system_key
- Log summary at info level: alerts_refreshed + alerts_resolved counts

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
resolve_timeout: 1h in Alertmanager does not apply here — our alerts carry
an explicit EndsAt (TTL = 15 min), which takes precedence. Active systems
recover within one TTL window without any explicit resolve call, so the
listAlerts round-trip and cloneStringMap helper are dead weight.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a magnifier icon (info circle) button next to each alert in the
system detail page, active alerts section. When clicked, the button opens
a tooltip showing the full alert details including labels, annotations,
fingerprint, and generator URL. The tooltip is styled with a dark
background and breaks long lines for readability.

- Add faCircleInfo icon import for the action button
- Add formatAlertDetails() function to format alert properties
- Modify summary cell to include the detail button and tooltip
- Add NeTooltip component import
- Add translations for view_alert_details in en/it locales

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Collaborator

@andre8244 andre8244 left a comment


IIUC, pagination is missing in GET /api/alerts and GET /api/systems/:id/alerts.

Dead code removal:
- collect/alerting/mimir.go: drop unused ListAlerts plus its private
  alertAPIResponse/alertAPIStatus types, listAlertsBodySize constant
  and orphaned net/url import
- collect/methods/mimir.go: drop injectLabels and
  processAnnotationTemplates wrappers that only forwarded to the
  shared alerting helpers; callers already use the shared helpers
  directly via collectalerting.EnrichAlertPayload

Simplifications (no behavior change):
- backend/methods/alerting.go: use slices.Contains for template
  language validation, replace repeated type-asserting severity
  counters with typed locals, drop unreachable dedupe map in
  buildSystemAlertSilenceRequest (map iteration already yields
  unique keys and the system_key append is guarded)
- collect/alerting/mimir.go: drop else-after-return in PostAlerts

Tests:
- collect/methods/mimir_test.go: call collectalerting.InjectLabels
  and collectalerting.ProcessAnnotationTemplates directly so test
  coverage follows the deleted wrappers
- collect/cron/linkfailed_monitor_test.go: align test with current
  linkFailedAlertTTL constant (the multiplier was changed from 3x
  to 2x in ccfe65a but the test still hard-coded 3x)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Contributor

🗑️ Redirect URIs Removed from Logto

The following redirect URIs have been automatically removed from the Logto application configuration:

Redirect URIs:

  • https://my-proxy-qa-pr-42.onrender.com/login-redirect

Post-logout redirect URIs:

  • https://my-proxy-qa-pr-42.onrender.com/login

Cleanup completed for PR #42.


3 participants