feat: add metrics API with Postgres and ClickHouse support#734

Merged
alexluong merged 42 commits into main from metrics
Mar 13, 2026
alexluong (Collaborator) commented Mar 8, 2026

implements #210

vercel bot commented Mar 8, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Actions | Updated (UTC)
outpost-docs | Ready | Preview, Comment | Mar 13, 2026 3:54pm
outpost-website | Ready | Preview, Comment | Mar 13, 2026 3:54pm

alexluong and others added 18 commits March 11, 2026 01:54
Split LogStore into Records + Metrics sub-interfaces. LogStore is now
the combined interface so all existing consumers are unaffected.

Typed responses per endpoint (EventMetricsResponse, AttemptMetricsResponse)
with all fields as pointers. Stub implementations for CH, PG, and mem
drivers return errNotImplemented.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
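
A minimal sketch of the interface split described in the commit above. Only the names LogStore, Records, Metrics, EventMetricsResponse, QueryEventMetrics, and errNotImplemented come from the commit messages; the method signatures, the `InsertEvent` method, the `Count` field, and the `stubStore` type are assumptions made for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// errNotImplemented is what a stub driver returns until it has a real query.
var errNotImplemented = errors.New("not implemented")

// EventMetricsResponse uses pointer fields so that a measure that was not
// requested (nil) can be told apart from a measured zero.
// The Count field is a hypothetical example.
type EventMetricsResponse struct {
	Count *int64
}

// Records stands in for the existing record methods (hypothetical method).
type Records interface {
	InsertEvent(ctx context.Context, id string) error
}

// Metrics holds the new aggregate queries (hypothetical signature).
type Metrics interface {
	QueryEventMetrics(ctx context.Context) (*EventMetricsResponse, error)
}

// LogStore embeds both sub-interfaces, so code that already depends on
// LogStore keeps compiling unchanged.
type LogStore interface {
	Records
	Metrics
}

// stubStore is a hypothetical driver that has not implemented metrics yet.
type stubStore struct{}

func (stubStore) InsertEvent(ctx context.Context, id string) error { return nil }

func (stubStore) QueryEventMetrics(ctx context.Context) (*EventMetricsResponse, error) {
	return nil, errNotImplemented
}

// queryViaLogStore calls the metrics method through the combined interface.
func queryViaLogStore(store LogStore) error {
	_, err := store.QueryEventMetrics(context.Background())
	return err
}

func main() {
	err := queryViaLogStore(stubStore{})
	fmt.Println(errors.Is(err, errNotImplemented))
}
```
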
Add a comprehensive metrics test dataset spanning January 2000 with
300 tenant-1 events (50 sparse across 5 days + 250 dense bell-curve
on Jan 15) and 5 tenant-2 events for isolation. All dimension cycling
produces round numbers (100/topic, 150/dest, 180 success, 120 failed,
error_rate=0.4, etc). Covers all granularities (1m, 1h, 1d, 1w, 1M),
dimensions, filters, and measures for both event and attempt metrics.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
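
One way the round numbers in the dataset commit above could come out of index cycling, sketched under assumptions: the three topic names, two destination names, and the "3 of every 5 succeed" rule are hypothetical choices that happen to reproduce 100/topic, 150/dest, 180 success, and 120 failed over 300 records. The real dataset may assign dimensions differently.

```go
package main

import "fmt"

// Event is a hypothetical, trimmed-down test record.
type Event struct {
	Topic       string
	Destination string
	Success     bool
}

// buildEvents assigns every dimension by cycling the record index, so each
// value receives an equal share whenever n is a multiple of the cycle length.
func buildEvents(n int) []Event {
	topics := []string{"topic-a", "topic-b", "topic-c"} // hypothetical names
	dests := []string{"dest-1", "dest-2"}               // hypothetical names
	events := make([]Event, 0, n)
	for i := 0; i < n; i++ {
		events = append(events, Event{
			Topic:       topics[i%len(topics)],
			Destination: dests[i%len(dests)],
			Success:     i%5 < 3, // 3 of every 5 records succeed
		})
	}
	return events
}

// summarize counts records per topic, per destination, and failures.
func summarize(events []Event) (perTopic, perDest map[string]int, failed int) {
	perTopic = map[string]int{}
	perDest = map[string]int{}
	for _, e := range events {
		perTopic[e.Topic]++
		perDest[e.Destination]++
		if !e.Success {
			failed++
		}
	}
	return perTopic, perDest, failed
}

func main() {
	events := buildEvents(300)
	perTopic, perDest, failed := summarize(events)
	errorRate := float64(failed) / float64(len(events))
	fmt.Println(perTopic, perDest, failed, errorRate)
}
```
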
Dynamic SQL builder for event and attempt metrics with parameterized
queries, time bucketing (date_bin/date_trunc), dimension grouping,
conditional aggregates (FILTER (WHERE ...)), 30s query timeout fallback,
and row limit enforcement with truncation flag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
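
A rough sketch of the query-builder idea from the Postgres commit above: values travel as bind parameters, while anything interpolated into the SQL text (bucket unit, dimension columns) must come from an allowlist. The table name `attempts`, the columns `created_at`, `status`, `tenant_id`, `topic`, `destination_id`, the shape of `TimeRange`, and the function name are all assumptions. It covers only single-unit buckets via date_trunc; multi-unit buckets would need date_bin.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// TimeRange is a hypothetical half-open interval [Start, End).
type TimeRange struct {
	Start, End time.Time
}

// truncUnits maps a granularity unit to a date_trunc field name.
var truncUnits = map[string]string{
	"m": "minute", "h": "hour", "d": "day", "w": "week", "M": "month",
}

// allowedDims lists the columns that may appear in GROUP BY (hypothetical).
var allowedDims = map[string]bool{"topic": true, "destination_id": true}

// buildAttemptQuery returns the SQL text and its bind arguments.
func buildAttemptQuery(unit string, dims []string, tenantID string, tr TimeRange, rowLimit int) (string, []any, error) {
	pgUnit, ok := truncUnits[unit]
	if !ok {
		return "", nil, fmt.Errorf("unsupported granularity unit %q", unit)
	}
	selectCols := []string{fmt.Sprintf("date_trunc('%s', created_at) AS bucket", pgUnit)}
	groupCols := []string{"bucket"}
	for _, d := range dims {
		if !allowedDims[d] {
			return "", nil, fmt.Errorf("unsupported dimension %q", d)
		}
		selectCols = append(selectCols, d)
		groupCols = append(groupCols, d)
	}
	// FILTER (WHERE ...) computes a conditional aggregate in the same pass.
	selectCols = append(selectCols,
		"count(*) AS count",
		"count(*) FILTER (WHERE status = 'success') AS successful_count",
	)
	query := "SELECT " + strings.Join(selectCols, ", ") +
		" FROM attempts WHERE tenant_id = $1 AND created_at >= $2 AND created_at < $3" +
		" GROUP BY " + strings.Join(groupCols, ", ") +
		" ORDER BY bucket LIMIT $4"
	return query, []any{tenantID, tr.Start, tr.End, rowLimit}, nil
}

func main() {
	tr := TimeRange{
		Start: time.Date(2000, 1, 15, 0, 0, 0, 0, time.UTC),
		End:   time.Date(2000, 1, 16, 0, 0, 0, 0, time.UTC),
	}
	query, args, err := buildAttemptQuery("h", []string{"topic"}, "tenant-1", tr, 1000)
	fmt.Println(query, len(args), err)
}
```
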
Naive implementation using uniqExact/uniqExactIf for dedup-safe
aggregation over ReplacingMergeTree without FINAL. Includes 30s
query timeout fallback and 100k row limit with truncation detection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
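
To illustrate the ClickHouse commit above: a ReplacingMergeTree table can hold duplicate rows until a background merge runs, so counting distinct IDs with uniqExact gives the same result whether or not duplicates are still present, without paying for FINAL. The sketch below assumes a table `attempts` with columns `attempt_id`, `status`, `tenant_id`, and `created_at`, and it assumes truncation is detected by fetching one row more than the limit. The commit does not say how detection works, so treat that part as one possible approach.

```go
package main

import "fmt"

// rowLimit mirrors the 100k row limit mentioned in the commit message.
const rowLimit = 100_000

// attemptCountsSQL counts distinct attempt IDs per hour (hypothetical schema).
// uniqExactIf is uniqExact with the -If combinator: it only considers rows
// where the condition holds.
const attemptCountsSQL = `
SELECT
    toStartOfHour(created_at) AS bucket,
    uniqExact(attempt_id) AS count,
    uniqExactIf(attempt_id, status = 'success') AS successful_count
FROM attempts
WHERE tenant_id = ? AND created_at >= ? AND created_at < ?
GROUP BY bucket
ORDER BY bucket
LIMIT ?`

// applyRowLimit expects rows fetched with LIMIT limit+1. If the extra row is
// present, it is dropped and the result is reported as truncated.
func applyRowLimit[T any](rows []T, limit int) ([]T, bool) {
	if len(rows) > limit {
		return rows[:limit], true
	}
	return rows, false
}

func main() {
	fmt.Println(attemptCountsSQL)
	fmt.Println("LIMIT argument:", rowLimit+1)

	rows := []string{"a", "b", "c"}
	kept, truncated := applyRowLimit(rows, 2)
	fmt.Println(len(kept), truncated)
}
```
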
All three drivers (mem, pg, ch) now implement the metrics interface.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wire up QueryEventMetrics and QueryAttemptMetrics as GET /api/v1/metrics/events
and GET /api/v1/metrics/attempts with query param parsing, allowlist validation,
JWT tenant scoping, and response transformation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
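
A small sketch of the allowlist validation step mentioned in the handler commit above, assuming measures arrive as a repeated `measures` query parameter. The parameter name and the function name are hypothetical; `count` and `rate` are the event measures named elsewhere in this PR. JWT tenant scoping and response transformation are left out.

```go
package main

import (
	"fmt"
	"net/url"
)

// allowedEventMeasures is the allowlist for the events endpoint.
var allowedEventMeasures = map[string]bool{"count": true, "rate": true}

// parseMeasures extracts the requested measures from a raw query string and
// rejects anything that is not on the allowlist.
func parseMeasures(rawQuery string) ([]string, error) {
	values, err := url.ParseQuery(rawQuery)
	if err != nil {
		return nil, fmt.Errorf("invalid query string: %w", err)
	}
	measures := values["measures"] // hypothetical parameter name
	if len(measures) == 0 {
		return nil, fmt.Errorf("at least one measure is required")
	}
	for _, m := range measures {
		if !allowedEventMeasures[m] {
			return nil, fmt.Errorf("unsupported measure %q", m)
		}
	}
	return measures, nil
}

func main() {
	measures, err := parseMeasures("measures=count&measures=rate")
	fmt.Println(measures, err)

	_, err = parseMeasures("measures=password")
	fmt.Println(err)
}
```
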
…re, add metrics OpenAPI spec

memlogstore QueryEventMetrics was missing tenant_id dimension causing
empty results when grouping by tenant. QueryAttemptMetrics was missing
both tenant_id and attempt_number dimensions. Also adds MetricsResponse
schemas and /metrics/events, /metrics/attempts paths to the OpenAPI spec.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Split metrics drivertest into DataCorrectness (existing value assertions)
and Characteristics (structural contract tests for dense bucket filling,
ordering, alignment, deterministic count, zero measures, no-data ranges,
and dimension × time filling). Shared dataset setup, single provisioning.

Characteristics tests will fail until bucket filling is implemented.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Enables new(expr) syntax for cleaner pointer initialization.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sparse time-series responses caused dashboard charts to render fat bars
instead of slim bars across all time slots. This adds a shared bucket
filling layer (internal/logstore/bucket/) called by all 3 backends
after query, producing dense responses with zero-filled gaps.

- Extract TruncateTime into shared bucket package
- FillEventBuckets / FillAttemptBuckets with dimension-aware filling
- Update drivertest assertions for dense bucket counts
- All characteristics tests now pass

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
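
The bucket filling described in the commit above can be sketched roughly as follows. This is not the code in internal/logstore/bucket/; the `Point` type and `fillBuckets` function are hypothetical, and the sketch handles only fixed-duration steps with a single measure and no dimensions. Calendar-based steps such as weeks or months, and dimension × time filling, would need more work.

```go
package main

import (
	"fmt"
	"time"
)

// Point is a hypothetical single-measure time bucket.
type Point struct {
	Bucket time.Time
	Count  int64
}

// fillBuckets returns one point per step in [start, end), using a zero count
// for every bucket the query did not return.
func fillBuckets(sparse []Point, start, end time.Time, step time.Duration) []Point {
	if step <= 0 {
		return nil // guard against a loop that never advances
	}
	// Key by Unix seconds so lookups do not depend on time.Time internals
	// such as location pointers.
	byBucket := make(map[int64]int64, len(sparse))
	for _, p := range sparse {
		byBucket[p.Bucket.Truncate(step).Unix()] = p.Count
	}
	var dense []Point
	for t := start.Truncate(step); t.Before(end); t = t.Add(step) {
		dense = append(dense, Point{Bucket: t, Count: byBucket[t.Unix()]})
	}
	return dense
}

// demo fills a 4-hour window that has data in only two of its hours.
func demo() []Point {
	start := time.Date(2000, 1, 15, 0, 0, 0, 0, time.UTC)
	sparse := []Point{
		{Bucket: start, Count: 3},
		{Bucket: start.Add(2 * time.Hour), Count: 5},
	}
	return fillBuckets(sparse, start, start.Add(4*time.Hour), time.Hour)
}

func main() {
	for _, p := range demo() {
		fmt.Println(p.Bucket.Format(time.RFC3339), p.Count)
	}
}
```
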
The API allowlist accepted these filters but all three query builders
(pg, ch, mem) silently ignored them, causing filters like
attempt_number=0 to return unfiltered results. Add WHERE clauses in
all drivers and conformance tests to prevent regression.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ateRange to TimeRange in Go code

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Merge main (PR #732) and update metrics handlers to use
ParseArrayQueryParam and resolveTenantIDsFilter. Update OpenAPI
spec filters to oneOf string/array schema with bracket notation.
Update test query strings to indexed bracket format.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add per-second throughput rate measures computed as count / bucket_duration_seconds.
Events endpoint gets `rate`, attempts gets `rate`, `successful_rate`, `failed_rate`.

Rate computation lives in shared driver/rate.go, called by each driver after
bucket filling. Dependency measures are auto-enriched (e.g. requesting `rate`
without `count` internally adds `count` for SQL but omits it from API response).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
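
A sketch of the two ideas in the rate commit above: the rate is count divided by the bucket duration in seconds, and a requested rate pulls in the count it depends on. The function names and the `rateDeps` map are hypothetical and do not reflect the contents of driver/rate.go. Only the `rate` → `count` dependency is taken from the commit message; the other rate measures would presumably follow the same pattern.

```go
package main

import "fmt"

// rateDeps maps a rate measure to the count measure it is derived from.
var rateDeps = map[string]string{"rate": "count"}

// enrich returns the measures to send to SQL: everything requested, plus any
// missing dependency of a requested rate measure.
func enrich(requested []string) []string {
	seen := map[string]bool{}
	var out []string
	for _, m := range requested {
		if !seen[m] {
			seen[m] = true
			out = append(out, m)
		}
	}
	for _, m := range requested {
		if dep, ok := rateDeps[m]; ok && !seen[dep] {
			seen[dep] = true
			out = append(out, dep)
		}
	}
	return out
}

// perSecond converts a bucket count into a per-second throughput rate.
func perSecond(count int64, bucketSeconds float64) float64 {
	if bucketSeconds <= 0 {
		return 0
	}
	return float64(count) / bucketSeconds
}

func main() {
	fmt.Println(enrich([]string{"rate"}))
	fmt.Println(perSecond(120, 60))
}
```

The caller would still need to remember which measures were originally requested, so that an auto-added `count` can be dropped from the API response.
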
alexluong and others added 9 commits March 11, 2026 16:54
6-chart metrics dashboard: delivery volume (stacked bar), error rate
(line, 0-100%), retries (multi-line), avg attempt number, status code
breakdown, and topic breakdown. Shared timeframe selector (1h/24h/7d/30d),
all charts use /metrics/attempts endpoint. Includes dataviz CSS vars for
info (blue) and warning (orange) themes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…hart

Restructure the metrics grid to 3 rows: events/deliveries, error
breakdown (3-col), and retry pressure. Add new "Events / count" chart
using attempt_number=0 filter. Support title/subtitle pattern with
muted subtitles and add filters param to useMetrics hook.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add sparkline + event count to each destination row using 4h granularity
with attempt_number=0 filter. Includes Sparkline component with stacked
success/failed bars, empty-bar rendering, and granularity override for
useMetrics hook.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move Loading import to top of MetricsChart.tsx. Use arrays/objects
directly in useMetrics instead of serializing to strings and re-splitting.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- seed_metrics.sh: generates realistic event→attempt chains with
  configurable error rates, retry chains, and time distribution
- qa_metrics.sh: 11 named scenarios (healthy, failing, spike, empty,
  single, all-fail, all-success, recent, many-topics, many-codes,
  retry-heavy) with verification checklists
- README documenting usage and scenarios

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
alexluong (Collaborator, Author) commented Mar 11, 2026

Added seed & QA scripts in scripts/metrics/:

  • seed_metrics.sh — generates event→attempt chains with configurable error rates, retries, time distribution
  • qa_metrics.sh — 11 scenarios (healthy, failing, spike, empty, single, all-fail, all-success, recent, many-topics, many-codes, retry-heavy)

Ran full manual QA against Postgres. All 11 scenarios passing across all timeframes (1h/24h/7d/30d). Dashboard renders correctly: error rate Y-axis shows percentages, stacked bars work, breakdowns sort correctly, empty states render, single-event edge case works, attempt_number=0 filter correctly separates event count from delivery count.

Add manual=false filter alongside attempt_number=0 to prevent manual
retries (which also start at attempt_number=0) from inflating event
counts in the destination metrics chart and destinations list sparkline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Manual retries start a new chain at attempt_number=0, inflating
first_attempt_count. Add AND NOT manual to CH, PG, and memlogstore
queries. Add FIXME for test dataset which assigns manual and
attempt_number independently.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
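
The counting rule from the two commits above, sketched in memory: an attempt counts toward the first-attempt total only if it carries the first attempt number and is not a manual retry. The `Attempt` type and function name are hypothetical. The first attempt number is a parameter because it was 0 at the time of these commits and a later commit in this PR moves to 1-based numbering.

```go
package main

import "fmt"

// Attempt is a hypothetical, trimmed-down delivery attempt record.
type Attempt struct {
	AttemptNumber int
	Manual        bool
}

// firstAttemptCount counts automatic first attempts. Manual retries restart
// the chain at the first attempt number, so they are excluded explicitly.
func firstAttemptCount(attempts []Attempt, firstNumber int) int {
	n := 0
	for _, a := range attempts {
		if a.AttemptNumber == firstNumber && !a.Manual {
			n++
		}
	}
	return n
}

func main() {
	attempts := []Attempt{
		{AttemptNumber: 0},               // event A, first delivery
		{AttemptNumber: 1},               // event A, automatic retry
		{AttemptNumber: 0, Manual: true}, // event A, manual retry (new chain)
		{AttemptNumber: 0},               // event B, first delivery
	}
	fmt.Println(firstAttemptCount(attempts, 0))
}
```
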
alexluong and others added 2 commits March 13, 2026 22:40
…anularity, filters

Migrate bench suite to current metrics API (TimeRange, tenant via filters)
and add bench cases for rate measures, multi-value granularities (2d/w/M),
new dimensions (code, attempt_number), and new filters (code, manual,
attempt_number, multi-filter).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
alexluong and others added 2 commits March 13, 2026 22:48
PR #740 changed attempt_number from 0-based to 1-based. Update all
metrics query logic, test data, seed scripts, and bench seeds to match.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
alexluong merged commit 67ccfea into main on Mar 13, 2026
4 checks passed
alexluong deleted the metrics branch on March 13, 2026 16:14