
fix: Change attempt_number to 1-based indexing#740

Merged
alexluong merged 2 commits into main from attempt-number
Mar 13, 2026

Conversation


alexluong (Collaborator) commented Mar 10, 2026

Summary

attempt_number was 0-based (first attempt = 0, retries = 1, 2, 3...). This changes it to 1-based (first attempt = 1, retries = 2, 3, 4...) to align with the OpenAPI spec which already documents 1-based indexing.

Side effects

No data migration is included. This means:

  • Historical data stays as-is. Existing attempts in Postgres keep 0-based attempt_number values, while new attempts are 1-based, so queries spanning old and new data will see mixed values. We are OK with this.
  • In-flight tasks with Attempt=0 that hit retry after deployment will pass through a max(attempt-1, 0) guard to prevent panics/infinite loops in backoff calculation.
  • Old in-flight tasks get one extra retry, since Attempt=0 still satisfies the attempt <= retryMaxLimit check. This only affects tasks in flight during deployment.

Optional: data migration for users who care about consistency

Users who want clean 1-based data across the board can run the following after upgrading. We encourage taking a timestamp of when the upgrade happens and scoping the update to records created before that point:

UPDATE attempts
SET attempt_number = attempt_number + 1
WHERE created_at < '<upgrade-timestamp>';

This does not cover in-flight tasks at the time of upgrade. Users who care about full data correctness can determine the appropriate handling for their setup.

alexluong and others added 2 commits March 10, 2026 22:32
First attempt now starts at 1 (retries: 2, 3, 4...) to align with the
OpenAPI spec which already documents 1-based indexing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

vercel bot commented Mar 10, 2026

The latest updates on your projects.

Project          Deployment  Actions           Updated (UTC)
outpost-docs     Ready       Preview, Comment  Mar 10, 2026 3:46pm
outpost-website  Ready       Preview, Comment  Mar 10, 2026 3:46pm


alexluong merged commit 6775114 into main Mar 13, 2026
5 checks passed
alexluong deleted the attempt-number branch March 13, 2026 15:40
alexluong added a commit that referenced this pull request Mar 13, 2026
PR #740 changed attempt_number from 0-based to 1-based. Update all
metrics query logic, test data, seed scripts, and bench seeds to match.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
alexluong added a commit that referenced this pull request Mar 13, 2026
* feat: add Metrics interface to logstore driver

Split LogStore into Records + Metrics sub-interfaces. LogStore is now
the combined interface so all existing consumers are unaffected.

Typed responses per endpoint (EventMetricsResponse, AttemptMetricsResponse)
with all fields as pointers. Stub implementations for CH, PG, and mem
drivers return errNotImplemented.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: add metrics conformance tests with 300-row dataset

Add a comprehensive metrics test dataset spanning January 2000 with
300 tenant-1 events (50 sparse across 5 days + 250 dense bell-curve
on Jan 15) and 5 tenant-2 events for isolation. All dimension cycling
produces round numbers (100/topic, 150/dest, 180 success, 120 failed,
error_rate=0.4, etc). Covers all granularities (1m, 1h, 1d, 1w, 1M),
dimensions, filters, and measures for both event and attempt metrics.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: implement Postgres logstore metrics queries

Dynamic SQL builder for event and attempt metrics with parameterized
queries, time bucketing (date_bin/date_trunc), dimension grouping,
conditional aggregates (FILTER WHERE), 30s query timeout fallback,
and row limit enforcement with truncation flag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: implement ClickHouse logstore metrics queries

Naive implementation using uniqExact/uniqExactIf for dedup-safe
aggregation over ReplacingMergeTree without FINAL. Includes 30s
query timeout fallback and 100k row limit with truncation detection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: remove unimplemented check from metrics conformance tests

All three drivers (mem, pg, ch) now implement the metrics interface.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add metrics API endpoints for events and attempts

Wire up QueryEventMetrics and QueryAttemptMetrics as GET /api/v1/metrics/events
and GET /api/v1/metrics/attempts with query param parsing, allowlist validation,
JWT tenant scoping, and response transformation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* style: gofmt metrics handlers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: remove duplicate derefIntFromIntPtr helper

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add missing tenant_id and attempt_number dimensions to memlogstore, add metrics OpenAPI spec

memlogstore QueryEventMetrics was missing tenant_id dimension causing
empty results when grouping by tenant. QueryAttemptMetrics was missing
both tenant_id and attempt_number dimensions. Also adds MetricsResponse
schemas and /metrics/events, /metrics/attempts paths to the OpenAPI spec.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: add metrics time-series characteristics conformance suite

Split metrics drivertest into DataCorrectness (existing value assertions)
and Characteristics (structural contract tests for dense bucket filling,
ordering, alignment, deterministic count, zero measures, no-data ranges,
and dimension × time filling). Shared dataset setup, single provisioning.

Characteristics tests will fail until bucket filling is implemented.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: bump Go from 1.24 to 1.26

Enables new(expr) syntax for cleaner pointer initialization.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add post-query bucket filling for metrics time series

Sparse time-series responses caused dashboard charts to render fat bars
instead of slim bars across all time slots. This adds a shared bucket
filling layer (internal/logstore/bucket/) called by all 3 backends
after query, producing dense responses with zero-filled gaps.

- Extract TruncateTime into shared bucket package
- FillEventBuckets / FillAttemptBuckets with dimension-aware filling
- Update drivertest assertions for dense bucket counts
- All characteristics tests now pass

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: wire missing attempt metric filters (code, manual, attempt_number)

The API allowlist accepted these filters but all three query builders
(pg, ch, mem) silently ignored them, causing filters like
attempt_number=0 to return unfiltered results. Add WHERE clauses in
all drivers and conformance tests to prevent regression.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: rename date_range to time in metrics API query params and DateRange to TimeRange in Go code

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: remove eligible_for_retry from event metrics dimensions and filters

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: adopt bracket notation array params for metrics API

Merge main (PR #732) and update metrics handlers to use
ParseArrayQueryParam and resolveTenantIDsFilter. Update OpenAPI
spec filters to oneOf string/array schema with bracket notation.
Update test query strings to indexed bracket format.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add rate throughput measures to metrics API

Add per-second throughput rate measures computed as count / bucket_duration_seconds.
Events endpoint gets `rate`, attempts gets `rate`, `successful_rate`, `failed_rate`.

Rate computation lives in shared driver/rate.go, called by each driver after
bucket filling. Dependency measures are auto-enriched (e.g. requesting `rate`
without `count` internally adds `count` for SQL but omits it from API response).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: add rate measure coverage to drivertest

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add ClickHouse SETTINGS constraints to metrics queries

Adds server-side resource limits (max_execution_time, max_rows_to_group_by)
to CH metrics queries and surfaces overflow errors as 400 instead of 500.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: validate time range in metrics queries to prevent div-by-zero in rate computation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add missing tenant_id and attempt_number dimensions to CH and PG metrics queries

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: remove dead MetricsMetadata.Granularity field

No driver ever sets this field — the handler already has the request
granularity. Removes the dead if-branch in buildAPIMetricsResponse.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: move tenant scoping from TenantID field to filters[tenant_id]

Replace MetricsRequest.TenantID with filters[tenant_id] across all 3
drivers so tenant scoping uses the same filter mechanism as other
dimensions. Handler now validates/injects filters[tenant_id] for JWT
callers instead of using resolveTenantIDsFilter.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add defensive guards and validation to metrics time bucketing

- Add forward-progress guard in GenerateTimeBuckets to prevent infinite
  loops if AdvanceTime doesn't advance
- Cap bucket count at 100k to prevent OOM; returns ErrTooManyBuckets
  wrapped as ErrResourceLimit (→ 400 "query too broad")
- Validate granularity value ranges per spec (s:1-60, m:1-60, h:1-24,
  d:1-31, w:1-4, M:1-12)
- Document Granularity alignment behavior for calendar units (d/w/M)
- Update GenerateTimeBuckets and Fill*Buckets signatures to return error
- Update all driver callers (ch, pg, mem) to propagate bucket errors
- Add drivertest characteristics for too-many-buckets → ErrResourceLimit
- Add handler tests for query-too-broad and granularity-out-of-range

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: add failing test for 2d granularity data loss

2d granularity silently drops events on non-slot days. SQL groups by
individual day but bucket fill generates every-2-day slots, so data
on days that don't align with a slot boundary is orphaned.

Expected 300, actual 280 (Jan 22 and Jan 28 events lost).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use epoch-anchored alignment for multi-value d/w/M granularities

TruncateTime, CH SQL, and PG SQL all ignored g.Value for d/w/M units,
always truncating to the natural period (day/week/month). This caused
FillEventBuckets to silently drop data on non-slot days when Value > 1
(e.g. 2d granularity lost ~7% of events).

Fix: epoch-anchored N-interval alignment for all three backends:
- d: anchor to 1970-01-01, align to N-day boundaries
- w: anchor to 1970-01-04 (Sunday), align to N*7-day boundaries
- M: align month number to nearest multiple of N from Jan 1970
- Value=1 retains existing behavior (no functional change)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(portal): add metrics dashboard to destination overview

6-chart metrics dashboard: delivery volume (stacked bar), error rate
(line, 0-100%), retries (multi-line), avg attempt number, status code
breakdown, and topic breakdown. Shared timeframe selector (1h/24h/7d/30d),
all charts use /metrics/attempts endpoint. Includes dataviz CSS vars for
info (blue) and warning (orange) themes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(portal): replace ResponsiveContainer to fix -1px dimension warnings

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(portal): redesign metrics dashboard layout and add event count chart

Restructure the metrics grid to 3 rows: events/deliveries, error
breakdown (3-col), and retry pressure. Add new "Events / count" chart
using attempt_number=0 filter. Support title/subtitle pattern with
muted subtitles and add filters param to useMetrics hook.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(portal): add events 24h sparkline column to destinations list

Add sparkline + event count to each destination row using 4h granularity
with attempt_number=0 filter. Includes Sparkline component with stacked
success/failed bars, empty-bar rendering, and granularity override for
useMetrics hook.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(portal): clean up import order and simplify useMetrics URL building

Move Loading import to top of MetricsChart.tsx. Use arrays/objects
directly in useMetrics instead of serializing to strings and re-splitting.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: Fix metrics table histogram display

* chore: Destination page metrics improvements

* chore: Update time range picker styling

* fix: Fix metric breakdown height and dimension ellipsis

* chore: add metrics seed & QA scripts

- seed_metrics.sh: generates realistic event→attempt chains with
  configurable error rates, retry chains, and time distribution
- qa_metrics.sh: 11 named scenarios (healthy, failing, spike, empty,
  single, all-fail, all-success, recent, many-topics, many-codes,
  retry-heavy) with verification checklists
- README documenting usage and scenarios

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(portal): exclude manual retries from event count metrics

Add manual=false filter alongside attempt_number=0 to prevent manual
retries (which also start at attempt_number=0) from inflating event
counts in the destination metrics chart and destinations list sparkline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: exclude manual retries from first_attempt_count measure

Manual retries start a new chain at attempt_number=0, inflating
first_attempt_count. Add AND NOT manual to CH, PG, and memlogstore
queries. Add FIXME for test dataset which assigns manual and
attempt_number independently.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* bench: add metrics benchmarking suite for ClickHouse and PostgreSQL

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* bench: update metrics bench for new API and add coverage for rate, granularity, filters

Migrate bench suite to current metrics API (TimeRange, tenant via filters)
and add bench cases for rate measures, multi-value granularities (2d/w/M),
new dimensions (code, attempt_number), and new filters (code, manual,
attempt_number, multi-filter).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: align metrics with 1-based attempt_number indexing

PR #740 changed attempt_number from 0-based to 1-based. Update all
metrics query logic, test data, seed scripts, and bench seeds to match.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Alexandre Bouchard <alexbouchardd@gmail.com>
alexluong added a commit that referenced this pull request Mar 13, 2026
Missed in #740 — portal still filtered for attempt_number=0.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
alexluong added a commit that referenced this pull request Mar 15, 2026
* feat: batch destination metrics requests in portal list view

Collapse N per-row metrics API calls into a single batched request
using dimensions[]=destination_id grouping.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: update attempt_number filter to 1-based indexing in portal

Missed in #740 — portal still filtered for attempt_number=0.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>