fix: Change attempt_number to 1-based indexing #740
Merged
Conversation
First attempt now starts at 1 (retries: 2, 3, 4...) to align with the OpenAPI spec which already documents 1-based indexing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
alexbouchardd
approved these changes
Mar 10, 2026
alexluong
added a commit
that referenced
this pull request
Mar 13, 2026
PR #740 changed attempt_number from 0-based to 1-based. Update all metrics query logic, test data, seed scripts, and bench seeds to match. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
alexluong
added a commit
that referenced
this pull request
Mar 13, 2026
* feat: add Metrics interface to logstore driver
  Split LogStore into Records + Metrics sub-interfaces. LogStore is now the combined interface so all existing consumers are unaffected. Typed responses per endpoint (EventMetricsResponse, AttemptMetricsResponse) with all fields as pointers. Stub implementations for CH, PG, and mem drivers return errNotImplemented.
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* test: add metrics conformance tests with 300-row dataset
  Add a comprehensive metrics test dataset spanning January 2000 with 300 tenant-1 events (50 sparse across 5 days + 250 dense bell-curve on Jan 15) and 5 tenant-2 events for isolation. All dimension cycling produces round numbers (100/topic, 150/dest, 180 success, 120 failed, error_rate=0.4, etc). Covers all granularities (1m, 1h, 1d, 1w, 1M), dimensions, filters, and measures for both event and attempt metrics.
* feat: implement Postgres logstore metrics queries
  Dynamic SQL builder for event and attempt metrics with parameterized queries, time bucketing (date_bin/date_trunc), dimension grouping, conditional aggregates (FILTER WHERE), 30s query timeout fallback, and row limit enforcement with a truncation flag.
* feat: implement ClickHouse logstore metrics queries
  Naive implementation using uniqExact/uniqExactIf for dedup-safe aggregation over ReplacingMergeTree without FINAL. Includes 30s query timeout fallback and 100k row limit with truncation detection.
* refactor: remove unimplemented check from metrics conformance tests
  All three drivers (mem, pg, ch) now implement the metrics interface.
* feat: add metrics API endpoints for events and attempts
  Wire up QueryEventMetrics and QueryAttemptMetrics as GET /api/v1/metrics/events and GET /api/v1/metrics/attempts with query param parsing, allowlist validation, JWT tenant scoping, and response transformation.
* style: gofmt metrics handlers
* refactor: remove duplicate derefIntFromIntPtr helper
* fix: add missing tenant_id and attempt_number dimensions to memlogstore, add metrics OpenAPI spec
  memlogstore QueryEventMetrics was missing the tenant_id dimension, causing empty results when grouping by tenant. QueryAttemptMetrics was missing both tenant_id and attempt_number dimensions. Also adds MetricsResponse schemas and /metrics/events, /metrics/attempts paths to the OpenAPI spec.
* test: add metrics time-series characteristics conformance suite
  Split metrics drivertest into DataCorrectness (existing value assertions) and Characteristics (structural contract tests for dense bucket filling, ordering, alignment, deterministic count, zero measures, no-data ranges, and dimension × time filling). Shared dataset setup, single provisioning. Characteristics tests will fail until bucket filling is implemented.
* chore: bump Go from 1.24 to 1.26
  Enables new(expr) syntax for cleaner pointer initialization.
* feat: add post-query bucket filling for metrics time series
  Sparse time-series responses caused dashboard charts to render fat bars instead of slim bars across all time slots. This adds a shared bucket filling layer (internal/logstore/bucket/) called by all 3 backends after query, producing dense responses with zero-filled gaps.
  - Extract TruncateTime into shared bucket package
  - FillEventBuckets / FillAttemptBuckets with dimension-aware filling
  - Update drivertest assertions for dense bucket counts
  - All characteristics tests now pass
* fix: wire missing attempt metric filters (code, manual, attempt_number)
  The API allowlist accepted these filters but all three query builders (pg, ch, mem) silently ignored them, causing filters like attempt_number=0 to return unfiltered results. Add WHERE clauses in all drivers and conformance tests to prevent regression.
* refactor: rename date_range to time in metrics API query params and DateRange to TimeRange in Go code
* fix: remove eligible_for_retry from event metrics dimensions and filters
* refactor: adopt bracket notation array params for metrics API
  Merge main (PR #732) and update metrics handlers to use ParseArrayQueryParam and resolveTenantIDsFilter. Update OpenAPI spec filters to a oneOf string/array schema with bracket notation. Update test query strings to indexed bracket format.
* feat: add rate throughput measures to metrics API
  Add per-second throughput rate measures computed as count / bucket_duration_seconds. The events endpoint gets `rate`; attempts gets `rate`, `successful_rate`, `failed_rate`. Rate computation lives in shared driver/rate.go, called by each driver after bucket filling. Dependency measures are auto-enriched (e.g. requesting `rate` without `count` internally adds `count` for SQL but omits it from the API response).
* test: add rate measure coverage to drivertest
* feat: add ClickHouse SETTINGS constraints to metrics queries
  Adds server-side resource limits (max_execution_time, max_rows_to_group_by) to CH metrics queries and surfaces overflow errors as 400 instead of 500.
* fix: validate time range in metrics queries to prevent div-by-zero in rate computation
* fix: add missing tenant_id and attempt_number dimensions to CH and PG metrics queries
* refactor: remove dead MetricsMetadata.Granularity field
  No driver ever sets this field; the handler already has the request granularity. Removes the dead if-branch in buildAPIMetricsResponse.
* refactor: move tenant scoping from TenantID field to filters[tenant_id]
  Replace MetricsRequest.TenantID with filters[tenant_id] across all 3 drivers so tenant scoping uses the same filter mechanism as other dimensions. The handler now validates/injects filters[tenant_id] for JWT callers instead of using resolveTenantIDsFilter.
* fix: add defensive guards and validation to metrics time bucketing
  - Add forward-progress guard in GenerateTimeBuckets to prevent infinite loops if AdvanceTime doesn't advance
  - Cap bucket count at 100k to prevent OOM; returns ErrTooManyBuckets wrapped as ErrResourceLimit (→ 400 "query too broad")
  - Validate granularity value ranges per spec (s:1-60, m:1-60, h:1-24, d:1-31, w:1-4, M:1-12)
  - Document Granularity alignment behavior for calendar units (d/w/M)
  - Update GenerateTimeBuckets and Fill*Buckets signatures to return error
  - Update all driver callers (ch, pg, mem) to propagate bucket errors
  - Add drivertest characteristics for too-many-buckets → ErrResourceLimit
  - Add handler tests for query-too-broad and granularity-out-of-range
* test: add failing test for 2d granularity data loss
  2d granularity silently drops events on non-slot days. SQL groups by individual day but bucket fill generates every-2-day slots, so data on days that don't align with a slot boundary is orphaned. Expected 300, actual 280 (Jan 22 and Jan 28 events lost).
* fix: use epoch-anchored alignment for multi-value d/w/M granularities
  TruncateTime, CH SQL, and PG SQL all ignored g.Value for d/w/M units, always truncating to the natural period (day/week/month). This caused FillEventBuckets to silently drop data on non-slot days when Value > 1 (e.g. 2d granularity lost ~7% of events).
  Fix: epoch-anchored N-interval alignment for all three backends:
  - d: anchor to 1970-01-01, align to N-day boundaries
  - w: anchor to 1970-01-04 (Sunday), align to N*7-day boundaries
  - M: align month number to nearest multiple of N from Jan 1970
  - Value=1 retains existing behavior (no functional change)
* feat(portal): add metrics dashboard to destination overview
  6-chart metrics dashboard: delivery volume (stacked bar), error rate (line, 0-100%), retries (multi-line), avg attempt number, status code breakdown, and topic breakdown. Shared timeframe selector (1h/24h/7d/30d); all charts use the /metrics/attempts endpoint. Includes dataviz CSS vars for info (blue) and warning (orange) themes.
* fix(portal): replace ResponsiveContainer to fix -1px dimension warnings
* feat(portal): redesign metrics dashboard layout and add event count chart
  Restructure the metrics grid to 3 rows: events/deliveries, error breakdown (3-col), and retry pressure. Add a new "Events / count" chart using the attempt_number=0 filter. Support a title/subtitle pattern with muted subtitles and add a filters param to the useMetrics hook.
* feat(portal): add events 24h sparkline column to destinations list
  Add a sparkline + event count to each destination row using 4h granularity with the attempt_number=0 filter. Includes a Sparkline component with stacked success/failed bars, empty-bar rendering, and a granularity override for the useMetrics hook.
* fix(portal): clean up import order and simplify useMetrics URL building
  Move the Loading import to the top of MetricsChart.tsx. Use arrays/objects directly in useMetrics instead of serializing to strings and re-splitting.
* fix: Fix metrics table histogram display
* chore: Destination page metrics improvements
* chore: Update time range picker styling
* fix: Fix metric breakdown height and dimension ellipsis
* chore: add metrics seed & QA scripts
  - seed_metrics.sh: generates realistic event→attempt chains with configurable error rates, retry chains, and time distribution
  - qa_metrics.sh: 11 named scenarios (healthy, failing, spike, empty, single, all-fail, all-success, recent, many-topics, many-codes, retry-heavy) with verification checklists
  - README documenting usage and scenarios
* fix(portal): exclude manual retries from event count metrics
  Add a manual=false filter alongside attempt_number=0 to prevent manual retries (which also start at attempt_number=0) from inflating event counts in the destination metrics chart and destinations list sparkline.
* fix: exclude manual retries from first_attempt_count measure
  Manual retries start a new chain at attempt_number=0, inflating first_attempt_count. Add AND NOT manual to CH, PG, and memlogstore queries. Add a FIXME for the test dataset, which assigns manual and attempt_number independently.
* bench: add metrics benchmarking suite for ClickHouse and PostgreSQL
* bench: update metrics bench for new API and add coverage for rate, granularity, filters
  Migrate the bench suite to the current metrics API (TimeRange, tenant via filters) and add bench cases for rate measures, multi-value granularities (2d/w/M), new dimensions (code, attempt_number), and new filters (code, manual, attempt_number, multi-filter).
* fix: align metrics with 1-based attempt_number indexing
  PR #740 changed attempt_number from 0-based to 1-based. Update all metrics query logic, test data, seed scripts, and bench seeds to match.

---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Alexandre Bouchard <alexbouchardd@gmail.com>
alexluong
added a commit
that referenced
this pull request
Mar 13, 2026
Missed in #740 — portal still filtered for attempt_number=0. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
alexluong
added a commit
that referenced
this pull request
Mar 15, 2026
* feat: batch destination metrics requests in portal list view
  Collapse N per-row metrics API calls into a single batched request using dimensions[]=destination_id grouping.
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: update attempt_number filter to 1-based indexing in portal
  Missed in #740 — portal still filtered for attempt_number=0.

---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Summary
attempt_number was 0-based (first attempt = 0, retries = 1, 2, 3...). This changes it to 1-based (first attempt = 1, retries = 2, 3, 4...) to align with the OpenAPI spec, which already documents 1-based indexing.

Side effects
No data migration is included. This means:
- Existing attempts keep their 0-based attempt_number values. New attempts will be 1-based. Users querying across old and new data will see mixed values. We are ok with this.
- In-flight attempts at attempt=0 that hit retry after deployment will pass through a max(attempt-1, 0) guard to prevent panics/infinite loops in the backoff calculation. 0 <= retryMaxLimit passes the retry check. Only affects tasks in-flight during deployment.

Optional: data migration for users who care about consistency
Users who want clean 1-based data across the board can run the following after upgrading. We encourage taking a timestamp of when the upgrade happens and scoping the update to records created before that point:
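One possible shape for that update, as a hedged sketch: it assumes a Postgres log store, a table named delivery_attempts, and a created_at column, all of which are illustrative names rather than the project's actual schema. Only attempt_number comes from the discussion above.

```sql
-- Illustrative sketch, not the project's shipped migration.
-- delivery_attempts and created_at are assumed names; adjust to your schema.
-- Replace the timestamp with the moment you deployed the upgrade.
UPDATE delivery_attempts
SET attempt_number = attempt_number + 1
WHERE created_at < '2026-03-10T00:00:00Z';
```

Scoping the WHERE clause to the upgrade timestamp avoids double-incrementing rows that were already written 1-based after the deploy.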
This does not cover in-flight tasks at the time of upgrade. Users who care about full data correctness can determine the appropriate handling for their setup.