Skip to content

fix: updated the sql query for dataset stats#1555

Open
parmesant wants to merge 4 commits intoparseablehq:mainfrom
parmesant:dataset-stats-update
Open

fix: updated the sql query for dataset stats#1555
parmesant wants to merge 4 commits intoparseablehq:mainfrom
parmesant:dataset-stats-update

Conversation

@parmesant
Copy link
Contributor

@parmesant parmesant commented Feb 25, 2026

Fixes #XXXX.

Description


This PR has:

  • been tested to ensure log ingestion and log query works.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added documentation for new or modified features or behaviors.

Summary by CodeRabbit

  • Style

    • Minor formatting tweaks to HTTP handlers and reduced log severity for a noisy warning.
  • Refactor

    • Reworked field statistics query generation to improve pagination and stabilize ordering/tiebreaks.
  • Bug Fix

    • Deserialization now tolerates a missing "protected" flag, preventing account-loading errors.
  • Permissions

    • Super-admin role granted broader access rights, expanding available actions.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 25, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Refactors SQL generation for field statistics into consolidated CTEs with per-field pagination and tie-breaker ordering; expands super-admin actions and adds a serde default to a User field; changes a log level from warn to trace; and applies minor formatting (trailing commas) to HTTP routing service registrations. No public API signatures changed.

Changes

Cohort / File(s) Summary
HTTP Routing Formatting
src/handlers/http/modal/query_server.rs, src/handlers/http/modal/server.rs
Added trailing commas to .service(...) calls inside web::scope builder chains — purely formatting.
Field Statistics SQL Refactor
src/storage/field_stats.rs
Rewrote build_stats_sql: replaced multi-CTE layout with ranked_values (adds rn and distinct_count via window), top_values (applies rn-based pagination), and field_totals; final SELECT joins top_values to field_totals and orders by field_name, distinct_value_count DESC. Pagination moved into CTEs; function signature unchanged.
RBAC change
src/rbac/role.rs
super_admin_perm_builder actions vector now includes Action::All alongside Action::SuperAdmin and sets resource_type to All.
User model tweak
src/rbac/user.rs
Added #[serde(default)] to pub protected: bool to allow missing field during deserialization.
Ingest utils log level
src/handlers/http/modal/utils/ingest_utils.rs
Logging changed from tracing::warn! to tracing::trace! when dataset field count exceeds threshold (severity lowered).

Sequence Diagram(s)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

for next release

Suggested reviewers

  • nitisht
  • nikhilsinhaparseable
  • praveen5959

Poem

🐇 A tiny comma hopped aboard,
CTEs aligned and pages stored,
Counts ranked high, ties sorted fair,
Permissions widened, gentle care,
I nibble code — a quiet reward.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description is entirely blank—containing only the template placeholders with no actual information filled in (goal, solutions, key changes, or testing details). Fill in the Description section with: the goal of the SQL query refactor, the rationale for restructuring the CTEs, key changes made to the query logic, and clarify which checklist items apply. Provide or remove the 'Fixes #XXXX' placeholder by adding a real issue number or removing the line.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'fix: updated the sql query for dataset stats' accurately reflects the main change: a SQL query update in src/storage/field_stats.rs with restructured CTEs and pagination logic.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs).
Share your feedback on Discord.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/storage/field_stats.rs (2)

637-649: ⚠️ Potential issue | 🔴 Critical

rv.field_name alias doesn't exist inside top_values — field filter will always fail at runtime.

The fields_filter string is interpolated directly into the body of top_values:

top_values AS (
  SELECT * FROM ranked_values
  WHERE rn > ... AND rn <= ...
  AND  rv.field_name IN (...)   -- ← no alias 'rv' here
)

rv is only introduced in the final SELECT ... FROM top_values tv JOIN field_totals ft ..., not inside top_values. Any request with a non-empty fields list will produce a SQL execution error.

🐛 Proposed fix
-            format!("AND  rv.field_name IN ({})", quoted_fields.join(", "))
+            format!("AND field_name IN ({})", quoted_fields.join(", "))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/storage/field_stats.rs` around lines 637 - 649, The generated SQL filter
uses the nonexistent alias rv inside the top_values CTE causing runtime errors;
update the fields_filter construction so it references the correct column name
as used in ranked_values/top_values (e.g., use "field_name" without the rv
prefix) or build the filter to apply against ranked_values (e.g., "field_name IN
(...)") so the WHERE inside top_values matches available columns; modify the
code that creates fields_filter (the block assigning fields_filter) to remove
the "rv." prefix when interpolating into the top_values CTE so the filter uses
the actual column name present in top_values/ranked_values.

650-650: ⚠️ Potential issue | 🟠 Major

dataset_name is escaped for double quotes but is used inside single-quoted SQL string literals — correct escaping is for single quotes.

Line 650:

let dataset_name = dataset_name.replace('"', "\"\"");

dataset_name is then interpolated at lines 677 and 700 as a SQL string literal:

WHERE dataset_name = '{dataset_name}'

Double-quote escaping does nothing to protect the single-quoted literal. A dataset name containing ' (e.g. "customer's_data") produces broken SQL WHERE dataset_name = 'customer's_data' — either a parse error or a potential injection point.

🐛 Proposed fix
-    let dataset_name = dataset_name.replace('"', "\"\"");
+    let dataset_name = dataset_name.replace('\'', "''");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/storage/field_stats.rs` at line 650, The code currently replaces double
quotes in dataset_name (let dataset_name = dataset_name.replace('"', "\"\""))
but the value is later embedded into single-quoted SQL literals (WHERE
dataset_name = '{dataset_name}'), so replace the unsafe approach by either using
a parameterized query/binding for dataset_name when building the SQL or, if
parameterization is not possible in this context, escape single quotes correctly
by replacing ' with '' (e.g., dataset_name.replace('\'', "''")); update the code
that constructs the SQL strings (the sites referencing dataset_name at the WHERE
clauses) to use the safe/escaped value or the parameter placeholder to eliminate
parse errors and injection risk.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@src/storage/field_stats.rs`:
- Around line 637-649: The generated SQL filter uses the nonexistent alias rv
inside the top_values CTE causing runtime errors; update the fields_filter
construction so it references the correct column name as used in
ranked_values/top_values (e.g., use "field_name" without the rv prefix) or build
the filter to apply against ranked_values (e.g., "field_name IN (...)") so the
WHERE inside top_values matches available columns; modify the code that creates
fields_filter (the block assigning fields_filter) to remove the "rv." prefix
when interpolating into the top_values CTE so the filter uses the actual column
name present in top_values/ranked_values.
- Line 650: The code currently replaces double quotes in dataset_name (let
dataset_name = dataset_name.replace('"', "\"\"")) but the value is later
embedded into single-quoted SQL literals (WHERE dataset_name =
'{dataset_name}'), so replace the unsafe approach by either using a
parameterized query/binding for dataset_name when building the SQL or, if
parameterization is not possible in this context, escape single quotes correctly
by replacing ' with '' (e.g., dataset_name.replace('\'', "''")); update the code
that constructs the SQL strings (the sites referencing dataset_name at the WHERE
clauses) to use the safe/escaped value or the parameter placeholder to eliminate
parse errors and injection risk.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8859147 and fe47d78.

📒 Files selected for processing (3)
  • src/handlers/http/modal/query_server.rs
  • src/handlers/http/modal/server.rs
  • src/storage/field_stats.rs

coderabbitai[bot]
coderabbitai bot previously approved these changes Feb 25, 2026
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/storage/field_stats.rs (1)

713-715: Make final ordering deterministic for equal counts.

The final sort only uses distinct_value_count DESC. Add distinct_value ASC as a tie-breaker to keep output stable across executions/pages.

Suggested fix
 ORDER BY
   tv.field_name,
-  tv.distinct_value_count DESC"
+  tv.distinct_value_count DESC,
+  tv.distinct_value ASC"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/storage/field_stats.rs` around lines 713 - 715, The ORDER BY clause in
the SQL built in src/storage/field_stats.rs currently sorts only by
tv.field_name and tv.distinct_value_count DESC which can yield non-deterministic
ordering when counts tie; update the ORDER BY to add tv.distinct_value ASC as a
tie-breaker (i.e., ORDER BY tv.field_name, tv.distinct_value_count DESC,
tv.distinct_value ASC) in the query string where it's defined so results are
stable across executions/pages.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/storage/field_stats.rs`:
- Around line 677-680: The injected fields predicate uses an undefined alias
"rv" (rv.field_name IN (...)) causing runtime SQL errors; update the code that
builds fields_filter to reference the correct column in this CTE scope (either
plain field_name or the CTE/table alias used here, e.g. field_stats.field_name)
instead of rv.field_name so the predicate resolves in the same SELECT/CTE where
it is injected.

---

Nitpick comments:
In `@src/storage/field_stats.rs`:
- Around line 713-715: The ORDER BY clause in the SQL built in
src/storage/field_stats.rs currently sorts only by tv.field_name and
tv.distinct_value_count DESC which can yield non-deterministic ordering when
counts tie; update the ORDER BY to add tv.distinct_value ASC as a tie-breaker
(i.e., ORDER BY tv.field_name, tv.distinct_value_count DESC, tv.distinct_value
ASC) in the query string where it's defined so results are stable across
executions/pages.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fe47d78 and 885a954.

📒 Files selected for processing (3)
  • src/handlers/http/modal/query_server.rs
  • src/handlers/http/modal/server.rs
  • src/storage/field_stats.rs
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/handlers/http/modal/server.rs
  • src/handlers/http/modal/query_server.rs

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
src/storage/field_stats.rs (2)

716-718: Add a tie-breaker to final ordering for stable output.

Rows with equal distinct_value_count can be returned in non-deterministic order. Adding tv.distinct_value ASC keeps results stable and consistent with ranking behavior.

Proposed diff
 ORDER BY
   tv.field_name,
-  tv.distinct_value_count DESC"
+  tv.distinct_value_count DESC,
+  tv.distinct_value ASC"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/storage/field_stats.rs` around lines 716 - 718, The final ORDER BY in the
SQL string currently sorts by tv.field_name and tv.distinct_value_count DESC but
lacks a tie-breaker; update the ORDER BY clause in the SQL (the string
containing "ORDER BY tv.field_name, tv.distinct_value_count DESC") to append a
stable tie-breaker "tv.distinct_value ASC" so rows with equal
distinct_value_count are returned deterministically; locate the SQL in
src/storage/field_stats.rs and modify that ORDER BY clause to include
tv.distinct_value ASC.

696-706: Filter field_totals by requested fields to avoid extra aggregation work.

When fields is provided, ranked_values is filtered but field_totals still aggregates all fields for the dataset. That adds avoidable scan/grouping cost.

Proposed diff
   field_totals AS (
     SELECT
       field_stats_field_name,
       SUM(field_stats_count) AS total_field_count
     FROM
       {DATASET_STATS_STREAM_NAME}
     WHERE
       dataset_name = '{dataset_name}'
+      {fields_filter}
     GROUP BY
       field_stats_field_name
   )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/storage/field_stats.rs` around lines 696 - 706, The field_totals CTE is
aggregating across all fields even when a specific set is requested; modify the
field_totals query (the CTE named field_totals that reads from
DATASET_STATS_STREAM_NAME) to apply the same filter used by ranked_values by
adding a WHERE clause limiting field_stats_field_name to the requested fields
(i.e., field_stats_field_name IN ({fields}) or the equivalent parameterized
condition) so aggregation/grouping only processes the requested fields; keep the
existing dataset_name = '{dataset_name}' predicate and ensure the
placeholder/parameter format matches how fields are passed in the surrounding
code.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/storage/field_stats.rs`:
- Around line 716-718: The final ORDER BY in the SQL string currently sorts by
tv.field_name and tv.distinct_value_count DESC but lacks a tie-breaker; update
the ORDER BY clause in the SQL (the string containing "ORDER BY tv.field_name,
tv.distinct_value_count DESC") to append a stable tie-breaker "tv.distinct_value
ASC" so rows with equal distinct_value_count are returned deterministically;
locate the SQL in src/storage/field_stats.rs and modify that ORDER BY clause to
include tv.distinct_value ASC.
- Around line 696-706: The field_totals CTE is aggregating across all fields
even when a specific set is requested; modify the field_totals query (the CTE
named field_totals that reads from DATASET_STATS_STREAM_NAME) to apply the same
filter used by ranked_values by adding a WHERE clause limiting
field_stats_field_name to the requested fields (i.e., field_stats_field_name IN
({fields}) or the equivalent parameterized condition) so aggregation/grouping
only processes the requested fields; keep the existing dataset_name =
'{dataset_name}' predicate and ensure the placeholder/parameter format matches
how fields are passed in the surrounding code.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 885a954 and 41112ca.

📒 Files selected for processing (1)
  • src/storage/field_stats.rs

coderabbitai[bot]
coderabbitai bot previously approved these changes Feb 25, 2026
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
src/storage/field_stats.rs (1)

643-646: Previously raised issue — confirmed addressed.

field_stats_field_name is correctly referenced in the fields filter (no undefined alias). This resolves the prior review comment.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/storage/field_stats.rs` around lines 643 - 646, The previous
undefined-alias bug in the fields filter has been resolved: ensure the SQL
fragment uses the correct column name field_stats_field_name (as shown in the
format! call) and keep the quoted_fields.join(", ") usage; no further code
changes are needed for the fields filter in the function generating this SQL
fragment.
🧹 Nitpick comments (1)
src/storage/field_stats.rs (1)

696-715: field_totals is not filtered by fields_filter — confirm this is intentional.

When a fields_filter is provided (e.g., field_stats_field_name IN ('a', 'b')), field_totals scans all fields for dataset_name. This is harmless because the JOIN ON tv.field_name = ft.field_stats_field_name restricts the final result to only the filtered fields, but it does cause an unnecessary full scan of all fields for field_totals. If you want to avoid the extra work, you can propagate {fields_filter} (replacing field_stats_field_name with the same column) into field_totals as well.

♻️ Proposed optimization to propagate fields_filter into field_totals
  field_totals AS (
    SELECT
      field_stats_field_name,
      SUM(field_stats_count) AS total_field_count
    FROM
      {DATASET_STATS_STREAM_NAME}
    WHERE
      dataset_name = '{dataset_name}'
+     {fields_filter}
    GROUP BY
      field_stats_field_name
  )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/storage/field_stats.rs` around lines 696 - 715, The CTE field_totals
currently sums field_stats_count for the dataset without applying the provided
fields_filter, causing an unnecessary scan; update the field_totals CTE
(referencing field_totals, field_stats_field_name, DATASET_STATS_STREAM_NAME,
dataset_name) to include the same {fields_filter} condition (matching
field_stats_field_name to the filtered column) so it only aggregates over the
requested fields, ensuring the JOIN with top_values remains correct.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/storage/field_stats.rs`:
- Around line 660-686: The CTE ranked_values uses ROW_NUMBER() OVER (... ORDER
BY SUM(field_stats_distinct_stats_count) ...) which DataFusion 51 doesn't allow
because ORDER BY in window functions must be row-level expressions; fix by
computing the SUM(field_stats_distinct_stats_count) in the GROUP BY step (e.g.,
produce a column like distinct_value_count in the aggregation) and then
reference that aggregated column in a separate windowing layer/CTE where
ROW_NUMBER() is applied (update the CTE chain so ranked_values emits field_name,
distinct_value, distinct_value_count, distinct_count and the subsequent CTE
applies ROW_NUMBER() OVER (PARTITION BY field_name ORDER BY distinct_value_count
DESC, distinct_value ASC)).

---

Duplicate comments:
In `@src/storage/field_stats.rs`:
- Around line 643-646: The previous undefined-alias bug in the fields filter has
been resolved: ensure the SQL fragment uses the correct column name
field_stats_field_name (as shown in the format! call) and keep the
quoted_fields.join(", ") usage; no further code changes are needed for the
fields filter in the function generating this SQL fragment.

---

Nitpick comments:
In `@src/storage/field_stats.rs`:
- Around line 696-715: The CTE field_totals currently sums field_stats_count for
the dataset without applying the provided fields_filter, causing an unnecessary
scan; update the field_totals CTE (referencing field_totals,
field_stats_field_name, DATASET_STATS_STREAM_NAME, dataset_name) to include the
same {fields_filter} condition (matching field_stats_field_name to the filtered
column) so it only aggregates over the requested fields, ensuring the JOIN with
top_values remains correct.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 41112ca and bc86d2f.

📒 Files selected for processing (3)
  • src/rbac/role.rs
  • src/rbac/user.rs
  • src/storage/field_stats.rs

@parmesant parmesant force-pushed the dataset-stats-update branch from bc86d2f to 4c13fdc Compare February 25, 2026 11:21
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/handlers/http/modal/utils/ingest_utils.rs (1)

270-276: ⚠️ Potential issue | 🟠 Major

Keep near-limit dataset field alert at warning level.

Line 270 now logs threshold breach at trace, which is usually disabled in production; this hides an important pre-failure signal before ingestion gets blocked at the limit.

Suggested fix
-        tracing::trace!(
+        tracing::warn!(
             "Dataset {0} has {1} fields, which exceeds the warning threshold of {2}. Ingestion will not be possible after reaching {3} fields. We recommend creating a new dataset.",
             stream_name,
             fields_count,
             dataset_fields_warn_threshold as usize,
             PARSEABLE.options.dataset_fields_allowed_limit
         );
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/handlers/http/modal/utils/ingest_utils.rs` around lines 270 - 276, Change
the logging call that reports a dataset approaching the field limit from trace
to warn so the near-limit alert is visible in production; update the
tracing::trace!(...) invocation that uses stream_name, fields_count,
dataset_fields_warn_threshold, and
PARSEABLE.options.dataset_fields_allowed_limit to tracing::warn!(...) while
keeping the existing message and interpolation intact.
♻️ Duplicate comments (1)
src/storage/field_stats.rs (1)

660-718: ⚠️ Potential issue | 🔴 Critical

Window ORDER BY still uses an aggregate expression and can fail at runtime.

Line 670 orders ROW_NUMBER() by SUM(field_stats_distinct_stats_count), which should be computed in a prior aggregation layer and then referenced as a row-level column.

Suggested fix
-WITH
-  ranked_values AS (
+WITH
+  aggregated_values AS (
     SELECT
       field_stats_field_name AS field_name,
       field_stats_distinct_stats_distinct_value AS distinct_value,
-      SUM(field_stats_distinct_stats_count) AS distinct_value_count,
+      SUM(field_stats_distinct_stats_count) AS distinct_value_count
+    FROM
+      {DATASET_STATS_STREAM_NAME}
+    WHERE
+      dataset_name = '{dataset_name}'
+      AND field_stats_distinct_stats_distinct_value IS NOT NULL
+      {fields_filter}
+    GROUP BY
+      field_stats_field_name,
+      field_stats_distinct_stats_distinct_value
+  ),
+  ranked_values AS (
+    SELECT
+      field_name,
+      distinct_value,
+      distinct_value_count,
       ROW_NUMBER() OVER (
         PARTITION BY
-          field_stats_field_name
+          field_name
         ORDER BY
-          SUM(field_stats_distinct_stats_count) DESC,
-          field_stats_distinct_stats_distinct_value ASC
+          distinct_value_count DESC,
+          distinct_value ASC
       ) AS rn,
       COUNT(*) OVER (
         PARTITION BY
-          field_stats_field_name
+          field_name
       ) AS distinct_count
-    FROM
-      {DATASET_STATS_STREAM_NAME}
-    WHERE
-      dataset_name = '{dataset_name}'
-      AND field_stats_distinct_stats_distinct_value IS NOT NULL
-      {fields_filter}
-    GROUP BY
-      field_stats_field_name,
-      field_stats_distinct_stats_distinct_value
+    FROM aggregated_values
   ),

Run this read-only check to confirm the risky pattern and dependency version:

#!/bin/bash
set -euo pipefail

echo "DataFusion dependency versions:"
rg -n --glob 'Cargo.toml' --glob 'Cargo.lock' '\bdatafusion\b'

echo
echo "Window ORDER BY with aggregate in src/storage/field_stats.rs:"
rg -nP -C3 'ROW_NUMBER\(\)\s+OVER[\s\S]*ORDER BY[\s\S]*SUM\(field_stats_distinct_stats_count\)' src/storage/field_stats.rs
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/storage/field_stats.rs` around lines 660 - 718, The window ORDER BY is
using an aggregate expression (SUM(field_stats_distinct_stats_count)) inside
ROW_NUMBER() which can fail; refactor the query in src/storage/field_stats.rs so
the aggregation is computed in a prior GROUP BY layer and given an alias (e.g.,
distinct_value_count) and then the ROW_NUMBER() OVER clause in the ranked_values
CTE orders by that alias (ORDER BY distinct_value_count DESC,
field_stats_distinct_stats_distinct_value ASC). Concretely, modify the CTE that
produces ranked_values (and its GROUP BY) to SELECT
SUM(field_stats_distinct_stats_count) AS distinct_value_count and then use
ROW_NUMBER() OVER (PARTITION BY field_stats_field_name ORDER BY
distinct_value_count DESC, field_stats_distinct_stats_distinct_value ASC);
ensure downstream references (top_values, joins to field_totals) use the new
distinct_value_count column name.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@src/handlers/http/modal/utils/ingest_utils.rs`:
- Around line 270-276: Change the logging call that reports a dataset
approaching the field limit from trace to warn so the near-limit alert is
visible in production; update the tracing::trace!(...) invocation that uses
stream_name, fields_count, dataset_fields_warn_threshold, and
PARSEABLE.options.dataset_fields_allowed_limit to tracing::warn!(...) while
keeping the existing message and interpolation intact.

---

Duplicate comments:
In `@src/storage/field_stats.rs`:
- Around line 660-718: The window ORDER BY is using an aggregate expression
(SUM(field_stats_distinct_stats_count)) inside ROW_NUMBER() which can fail;
refactor the query in src/storage/field_stats.rs so the aggregation is computed
in a prior GROUP BY layer and given an alias (e.g., distinct_value_count) and
then the ROW_NUMBER() OVER clause in the ranked_values CTE orders by that alias
(ORDER BY distinct_value_count DESC, field_stats_distinct_stats_distinct_value
ASC). Concretely, modify the CTE that produces ranked_values (and its GROUP BY)
to SELECT SUM(field_stats_distinct_stats_count) AS distinct_value_count and then
use ROW_NUMBER() OVER (PARTITION BY field_stats_field_name ORDER BY
distinct_value_count DESC, field_stats_distinct_stats_distinct_value ASC);
ensure downstream references (top_values, joins to field_totals) use the new
distinct_value_count column name.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bc86d2f and 4c13fdc.

📒 Files selected for processing (4)
  • src/handlers/http/modal/utils/ingest_utils.rs
  • src/rbac/role.rs
  • src/rbac/user.rs
  • src/storage/field_stats.rs
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/rbac/user.rs
  • src/rbac/role.rs

@nikhilsinhaparseable
Copy link
Contributor

@coderabbitai review

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 25, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants