fix: updated the sql query for dataset stats by parmesant · Pull Request #1555 · parseablehq/parseable

parmesant · 2026-02-25T08:42:56Z

Fixes #XXXX.

Description

This PR has:

been tested to ensure log ingestion and log query works.
added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
added documentation for new or modified features or behaviors.

Summary by CodeRabbit

Release Notes

Bug Fixes
- Improved backward compatibility for user records with missing fields.
Enhancements
- Optimized field statistics query performance.
- Expanded SuperAdmin role permissions.
Other
- Adjusted logging verbosity for diagnostics.

coderabbitai · 2026-02-25T08:43:15Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

Walkthrough

This PR refactors the field statistics SQL generation logic by reordering CTEs for improved pagination handling, adds the All action to the SuperAdmin role, enables backward-compatible deserialization of the User protected field, and adjusts a log level.

Changes

Cohort / File(s)	Summary
Field Statistics SQL Refactoring `src/storage/field_stats.rs`	Rewrites CTE structure from field_totals/ranked_values/field_distinct_counts to ranked_values/top_values/field_totals. Moves pagination logic into top_values CTE, restructures distinct value computation with window functions, and adjusts final SELECT columns and ordering. Field filter placeholder updated but public API remains unchanged.
RBAC Configuration Updates `src/rbac/role.rs`, `src/rbac/user.rs`	Expands SuperAdmin role actions to include Action::All with ParseableResourceType::All. Adds `#[serde(default)]` attribute to User struct protected field for backward-compatible deserialization when field is absent.
Logging Level Adjustment `src/handlers/http/modal/utils/ingest_utils.rs`	Changes dataset field count warning from `tracing::warn!` to `tracing::trace!` for more verbose logging output.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Possibly related PRs

feat: enable dataset stats collection and retrieval #1529: Both PRs directly modify field_stats.rs's build_stats_sql implementation, with this PR restructuring CTE ordering and pagination while the prior PR introduced the dataset-stats SQL builder.
fix: add support for list type data in field stats #1369: Both PRs modify field_stats.rs logic for computing distinct values and field statistics.
feat: add stats for each field #1340: Both PRs affect SQL generation and retrieval logic for per-field statistics.

Suggested reviewers

nitisht

Poem

🐰 SQL hops through CTEs with grace,
Top values ranked in the right place,
SuperAdmin gains All with style,
Serde defaults make users smile,
A trace of logs to light the way! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The PR description contains only the template with placeholders and no concrete information. The description lacks the goal, rationale, chosen solution, and key changes that should replace the template comments.	Fill in the Description section with the actual goal, chosen solution rationale, and key changes made. Link the relevant issue number, specify what was tested, and confirm checklist items if applicable.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The PR title 'fix: updated the sql query for dataset stats' accurately describes the main change in the changeset - specifically the SQL query rewrite in src/storage/field_stats.rs for computing dataset statistics.
Docstring Coverage	✅ Passed	Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

src/storage/field_stats.rs (2)
637-649: ⚠️ Potential issue | 🔴 Critical

rv.field_name alias doesn't exist inside top_values — field filter will always fail at runtime.

The fields_filter string is interpolated directly into the body of top_values:
top_values AS (
  SELECT * FROM ranked_values
  WHERE rn > ... AND rn <= ...
  AND  rv.field_name IN (...)   -- ← no alias 'rv' here
)
rv is only introduced in the final SELECT ... FROM top_values tv JOIN field_totals ft ..., not inside top_values. Any request with a non-empty fields list will produce a SQL execution error.
🐛 Proposed fix
-            format!("AND  rv.field_name IN ({})", quoted_fields.join(", "))
+            format!("AND field_name IN ({})", quoted_fields.join(", "))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/storage/field_stats.rs` around lines 637 - 649, The generated SQL filter
uses the nonexistent alias rv inside the top_values CTE causing runtime errors;
update the fields_filter construction so it references the correct column name
as used in ranked_values/top_values (e.g., use "field_name" without the rv
prefix) or build the filter to apply against ranked_values (e.g., "field_name IN
(...)") so the WHERE inside top_values matches available columns; modify the
code that creates fields_filter (the block assigning fields_filter) to remove
the "rv." prefix when interpolating into the top_values CTE so the filter uses
the actual column name present in top_values/ranked_values.
650-650: ⚠️ Potential issue | 🟠 Major

dataset_name is escaped for double quotes but is used inside single-quoted SQL string literals — correct escaping is for single quotes.

Line 650:
let dataset_name = dataset_name.replace('"', "\"\"");
dataset_name is then interpolated at lines 677 and 700 as a SQL string literal:
WHERE dataset_name = '{dataset_name}'
Double-quote escaping does nothing to protect the single-quoted literal. A dataset name containing ' (e.g. "customer's_data") produces broken SQL WHERE dataset_name = 'customer's_data' — either a parse error or a potential injection point.
🐛 Proposed fix
-    let dataset_name = dataset_name.replace('"', "\"\"");
+    let dataset_name = dataset_name.replace('\'', "''");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/storage/field_stats.rs` at line 650, The code currently replaces double
quotes in dataset_name (let dataset_name = dataset_name.replace('"', "\"\""))
but the value is later embedded into single-quoted SQL literals (WHERE
dataset_name = '{dataset_name}'), so replace the unsafe approach by either using
a parameterized query/binding for dataset_name when building the SQL or, if
parameterization is not possible in this context, escape single quotes correctly
by replacing ' with '' (e.g., dataset_name.replace('\'', "''")); update the code
that constructs the SQL strings (the sites referencing dataset_name at the WHERE
clauses) to use the safe/escaped value or the parameter placeholder to eliminate
parse errors and injection risk.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@src/storage/field_stats.rs`:
- Around line 637-649: The generated SQL filter uses the nonexistent alias rv
inside the top_values CTE causing runtime errors; update the fields_filter
construction so it references the correct column name as used in
ranked_values/top_values (e.g., use "field_name" without the rv prefix) or build
the filter to apply against ranked_values (e.g., "field_name IN (...)") so the
WHERE inside top_values matches available columns; modify the code that creates
fields_filter (the block assigning fields_filter) to remove the "rv." prefix
when interpolating into the top_values CTE so the filter uses the actual column
name present in top_values/ranked_values.
- Line 650: The code currently replaces double quotes in dataset_name (let
dataset_name = dataset_name.replace('"', "\"\"")) but the value is later
embedded into single-quoted SQL literals (WHERE dataset_name =
'{dataset_name}'), so replace the unsafe approach by either using a
parameterized query/binding for dataset_name when building the SQL or, if
parameterization is not possible in this context, escape single quotes correctly
by replacing ' with '' (e.g., dataset_name.replace('\'', "''")); update the code
that constructs the SQL strings (the sites referencing dataset_name at the WHERE
clauses) to use the safe/escaped value or the parameter placeholder to eliminate
parse errors and injection risk.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8859147 and fe47d78.

📒 Files selected for processing (3)

src/handlers/http/modal/query_server.rs
src/handlers/http/modal/server.rs
src/storage/field_stats.rs

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

src/storage/field_stats.rs (1)

713-715: Make final ordering deterministic for equal counts.

The final sort only uses distinct_value_count DESC. Add distinct_value ASC as a tie-breaker to keep output stable across executions/pages.

Suggested fix

 ORDER BY
   tv.field_name,
-  tv.distinct_value_count DESC"
+  tv.distinct_value_count DESC,
+  tv.distinct_value ASC"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/storage/field_stats.rs` around lines 713 - 715, The ORDER BY clause in
the SQL built in src/storage/field_stats.rs currently sorts only by
tv.field_name and tv.distinct_value_count DESC which can yield non-deterministic
ordering when counts tie; update the ORDER BY to add tv.distinct_value ASC as a
tie-breaker (i.e., ORDER BY tv.field_name, tv.distinct_value_count DESC,
tv.distinct_value ASC) in the query string where it's defined so results are
stable across executions/pages.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/storage/field_stats.rs`:
- Around line 677-680: The injected fields predicate uses an undefined alias
"rv" (rv.field_name IN (...)) causing runtime SQL errors; update the code that
builds fields_filter to reference the correct column in this CTE scope (either
plain field_name or the CTE/table alias used here, e.g. field_stats.field_name)
instead of rv.field_name so the predicate resolves in the same SELECT/CTE where
it is injected.

---

Nitpick comments:
In `@src/storage/field_stats.rs`:
- Around line 713-715: The ORDER BY clause in the SQL built in
src/storage/field_stats.rs currently sorts only by tv.field_name and
tv.distinct_value_count DESC which can yield non-deterministic ordering when
counts tie; update the ORDER BY to add tv.distinct_value ASC as a tie-breaker
(i.e., ORDER BY tv.field_name, tv.distinct_value_count DESC, tv.distinct_value
ASC) in the query string where it's defined so results are stable across
executions/pages.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fe47d78 and 885a954.

📒 Files selected for processing (3)

src/handlers/http/modal/query_server.rs
src/handlers/http/modal/server.rs
src/storage/field_stats.rs

🚧 Files skipped from review as they are similar to previous changes (2)

src/handlers/http/modal/server.rs
src/handlers/http/modal/query_server.rs

src/storage/field_stats.rs

coderabbitai

🧹 Nitpick comments (2)

src/storage/field_stats.rs (2)

716-718: Add a tie-breaker to final ordering for stable output.

Rows with equal distinct_value_count can be returned in non-deterministic order. Adding tv.distinct_value ASC keeps results stable and consistent with ranking behavior.

Proposed diff

 ORDER BY
   tv.field_name,
-  tv.distinct_value_count DESC"
+  tv.distinct_value_count DESC,
+  tv.distinct_value ASC"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/storage/field_stats.rs` around lines 716 - 718, The final ORDER BY in the
SQL string currently sorts by tv.field_name and tv.distinct_value_count DESC but
lacks a tie-breaker; update the ORDER BY clause in the SQL (the string
containing "ORDER BY tv.field_name, tv.distinct_value_count DESC") to append a
stable tie-breaker "tv.distinct_value ASC" so rows with equal
distinct_value_count are returned deterministically; locate the SQL in
src/storage/field_stats.rs and modify that ORDER BY clause to include
tv.distinct_value ASC.

696-706: Filter field_totals by requested fields to avoid extra aggregation work.

When fields is provided, ranked_values is filtered but field_totals still aggregates all fields for the dataset. That adds avoidable scan/grouping cost.

Proposed diff

   field_totals AS (
     SELECT
       field_stats_field_name,
       SUM(field_stats_count) AS total_field_count
     FROM
       {DATASET_STATS_STREAM_NAME}
     WHERE
       dataset_name = '{dataset_name}'
+      {fields_filter}
     GROUP BY
       field_stats_field_name
   )

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/storage/field_stats.rs` around lines 696 - 706, The field_totals CTE is
aggregating across all fields even when a specific set is requested; modify the
field_totals query (the CTE named field_totals that reads from
DATASET_STATS_STREAM_NAME) to apply the same filter used by ranked_values by
adding a WHERE clause limiting field_stats_field_name to the requested fields
(i.e., field_stats_field_name IN ({fields}) or the equivalent parameterized
condition) so aggregation/grouping only processes the requested fields; keep the
existing dataset_name = '{dataset_name}' predicate and ensure the
placeholder/parameter format matches how fields are passed in the surrounding
code.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/storage/field_stats.rs`:
- Around line 716-718: The final ORDER BY in the SQL string currently sorts by
tv.field_name and tv.distinct_value_count DESC but lacks a tie-breaker; update
the ORDER BY clause in the SQL (the string containing "ORDER BY tv.field_name,
tv.distinct_value_count DESC") to append a stable tie-breaker "tv.distinct_value
ASC" so rows with equal distinct_value_count are returned deterministically;
locate the SQL in src/storage/field_stats.rs and modify that ORDER BY clause to
include tv.distinct_value ASC.
- Around line 696-706: The field_totals CTE is aggregating across all fields
even when a specific set is requested; modify the field_totals query (the CTE
named field_totals that reads from DATASET_STATS_STREAM_NAME) to apply the same
filter used by ranked_values by adding a WHERE clause limiting
field_stats_field_name to the requested fields (i.e., field_stats_field_name IN
({fields}) or the equivalent parameterized condition) so aggregation/grouping
only processes the requested fields; keep the existing dataset_name =
'{dataset_name}' predicate and ensure the placeholder/parameter format matches
how fields are passed in the surrounding code.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 885a954 and 41112ca.

📒 Files selected for processing (1)

src/storage/field_stats.rs

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

src/storage/field_stats.rs (1)
643-646: Previously raised issue — confirmed addressed.

field_stats_field_name is correctly referenced in the fields filter (no undefined alias). This resolves the prior review comment.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/storage/field_stats.rs` around lines 643 - 646, The previous
undefined-alias bug in the fields filter has been resolved: ensure the SQL
fragment uses the correct column name field_stats_field_name (as shown in the
format! call) and keep the quoted_fields.join(", ") usage; no further code
changes are needed for the fields filter in the function generating this SQL
fragment.

🧹 Nitpick comments (1)

src/storage/field_stats.rs (1)
696-715: field_totals is not filtered by fields_filter — confirm this is intentional.

When a fields_filter is provided (e.g., field_stats_field_name IN ('a', 'b')), field_totals scans all fields for dataset_name. This is harmless because the JOIN ON tv.field_name = ft.field_stats_field_name restricts the final result to only the filtered fields, but it does cause an unnecessary full scan of all fields for field_totals. If you want to avoid the extra work, you can propagate {fields_filter} (replacing field_stats_field_name with the same column) into field_totals as well.
♻️ Proposed optimization to propagate fields_filter into field_totals
  field_totals AS (
    SELECT
      field_stats_field_name,
      SUM(field_stats_count) AS total_field_count
    FROM
      {DATASET_STATS_STREAM_NAME}
    WHERE
      dataset_name = '{dataset_name}'
+     {fields_filter}
    GROUP BY
      field_stats_field_name
  )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/storage/field_stats.rs` around lines 696 - 715, The CTE field_totals
currently sums field_stats_count for the dataset without applying the provided
fields_filter, causing an unnecessary scan; update the field_totals CTE
(referencing field_totals, field_stats_field_name, DATASET_STATS_STREAM_NAME,
dataset_name) to include the same {fields_filter} condition (matching
field_stats_field_name to the filtered column) so it only aggregates over the
requested fields, ensuring the JOIN with top_values remains correct.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/storage/field_stats.rs`:
- Around line 660-686: The CTE ranked_values uses ROW_NUMBER() OVER (... ORDER
BY SUM(field_stats_distinct_stats_count) ...) which DataFusion 51 doesn't allow
because ORDER BY in window functions must be row-level expressions; fix by
computing the SUM(field_stats_distinct_stats_count) in the GROUP BY step (e.g.,
produce a column like distinct_value_count in the aggregation) and then
reference that aggregated column in a separate windowing layer/CTE where
ROW_NUMBER() is applied (update the CTE chain so ranked_values emits field_name,
distinct_value, distinct_value_count, distinct_count and the subsequent CTE
applies ROW_NUMBER() OVER (PARTITION BY field_name ORDER BY distinct_value_count
DESC, distinct_value ASC)).

---

Duplicate comments:
In `@src/storage/field_stats.rs`:
- Around line 643-646: The previous undefined-alias bug in the fields filter has
been resolved: ensure the SQL fragment uses the correct column name
field_stats_field_name (as shown in the format! call) and keep the
quoted_fields.join(", ") usage; no further code changes are needed for the
fields filter in the function generating this SQL fragment.

---

Nitpick comments:
In `@src/storage/field_stats.rs`:
- Around line 696-715: The CTE field_totals currently sums field_stats_count for
the dataset without applying the provided fields_filter, causing an unnecessary
scan; update the field_totals CTE (referencing field_totals,
field_stats_field_name, DATASET_STATS_STREAM_NAME, dataset_name) to include the
same {fields_filter} condition (matching field_stats_field_name to the filtered
column) so it only aggregates over the requested fields, ensuring the JOIN with
top_values remains correct.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 41112ca and bc86d2f.

📒 Files selected for processing (3)

src/rbac/role.rs
src/rbac/user.rs
src/storage/field_stats.rs

src/storage/field_stats.rs

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/handlers/http/modal/utils/ingest_utils.rs (1)

270-276: ⚠️ Potential issue | 🟠 Major

Keep near-limit dataset field alert at warning level.

Line 270 now logs threshold breach at trace, which is usually disabled in production; this hides an important pre-failure signal before ingestion gets blocked at the limit.

Suggested fix

-        tracing::trace!(
+        tracing::warn!(
             "Dataset {0} has {1} fields, which exceeds the warning threshold of {2}. Ingestion will not be possible after reaching {3} fields. We recommend creating a new dataset.",
             stream_name,
             fields_count,
             dataset_fields_warn_threshold as usize,
             PARSEABLE.options.dataset_fields_allowed_limit
         );

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/handlers/http/modal/utils/ingest_utils.rs` around lines 270 - 276, Change
the logging call that reports a dataset approaching the field limit from trace
to warn so the near-limit alert is visible in production; update the
tracing::trace!(...) invocation that uses stream_name, fields_count,
dataset_fields_warn_threshold, and
PARSEABLE.options.dataset_fields_allowed_limit to tracing::warn!(...) while
keeping the existing message and interpolation intact.

♻️ Duplicate comments (1)

src/storage/field_stats.rs (1)

660-718: ⚠️ Potential issue | 🔴 Critical

Window ORDER BY still uses an aggregate expression and can fail at runtime.

Line 670 orders ROW_NUMBER() by SUM(field_stats_distinct_stats_count), which should be computed in a prior aggregation layer and then referenced as a row-level column.

Suggested fix

-WITH
-  ranked_values AS (
+WITH
+  aggregated_values AS (
     SELECT
       field_stats_field_name AS field_name,
       field_stats_distinct_stats_distinct_value AS distinct_value,
-      SUM(field_stats_distinct_stats_count) AS distinct_value_count,
+      SUM(field_stats_distinct_stats_count) AS distinct_value_count
+    FROM
+      {DATASET_STATS_STREAM_NAME}
+    WHERE
+      dataset_name = '{dataset_name}'
+      AND field_stats_distinct_stats_distinct_value IS NOT NULL
+      {fields_filter}
+    GROUP BY
+      field_stats_field_name,
+      field_stats_distinct_stats_distinct_value
+  ),
+  ranked_values AS (
+    SELECT
+      field_name,
+      distinct_value,
+      distinct_value_count,
       ROW_NUMBER() OVER (
         PARTITION BY
-          field_stats_field_name
+          field_name
         ORDER BY
-          SUM(field_stats_distinct_stats_count) DESC,
-          field_stats_distinct_stats_distinct_value ASC
+          distinct_value_count DESC,
+          distinct_value ASC
       ) AS rn,
       COUNT(*) OVER (
         PARTITION BY
-          field_stats_field_name
+          field_name
       ) AS distinct_count
-    FROM
-      {DATASET_STATS_STREAM_NAME}
-    WHERE
-      dataset_name = '{dataset_name}'
-      AND field_stats_distinct_stats_distinct_value IS NOT NULL
-      {fields_filter}
-    GROUP BY
-      field_stats_field_name,
-      field_stats_distinct_stats_distinct_value
+    FROM aggregated_values
   ),

Run this read-only check to confirm the risky pattern and dependency version:

#!/bin/bash
set -euo pipefail

echo "DataFusion dependency versions:"
rg -n --glob 'Cargo.toml' --glob 'Cargo.lock' '\bdatafusion\b'

echo
echo "Window ORDER BY with aggregate in src/storage/field_stats.rs:"
rg -nP -C3 'ROW_NUMBER\(\)\s+OVER[\s\S]*ORDER BY[\s\S]*SUM\(field_stats_distinct_stats_count\)' src/storage/field_stats.rs

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/storage/field_stats.rs` around lines 660 - 718, The window ORDER BY is
using an aggregate expression (SUM(field_stats_distinct_stats_count)) inside
ROW_NUMBER() which can fail; refactor the query in src/storage/field_stats.rs so
the aggregation is computed in a prior GROUP BY layer and given an alias (e.g.,
distinct_value_count) and then the ROW_NUMBER() OVER clause in the ranked_values
CTE orders by that alias (ORDER BY distinct_value_count DESC,
field_stats_distinct_stats_distinct_value ASC). Concretely, modify the CTE that
produces ranked_values (and its GROUP BY) to SELECT
SUM(field_stats_distinct_stats_count) AS distinct_value_count and then use
ROW_NUMBER() OVER (PARTITION BY field_stats_field_name ORDER BY
distinct_value_count DESC, field_stats_distinct_stats_distinct_value ASC);
ensure downstream references (top_values, joins to field_totals) use the new
distinct_value_count column name.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@src/handlers/http/modal/utils/ingest_utils.rs`:
- Around line 270-276: Change the logging call that reports a dataset
approaching the field limit from trace to warn so the near-limit alert is
visible in production; update the tracing::trace!(...) invocation that uses
stream_name, fields_count, dataset_fields_warn_threshold, and
PARSEABLE.options.dataset_fields_allowed_limit to tracing::warn!(...) while
keeping the existing message and interpolation intact.

---

Duplicate comments:
In `@src/storage/field_stats.rs`:
- Around line 660-718: The window ORDER BY is using an aggregate expression
(SUM(field_stats_distinct_stats_count)) inside ROW_NUMBER() which can fail;
refactor the query in src/storage/field_stats.rs so the aggregation is computed
in a prior GROUP BY layer and given an alias (e.g., distinct_value_count) and
then the ROW_NUMBER() OVER clause in the ranked_values CTE orders by that alias
(ORDER BY distinct_value_count DESC, field_stats_distinct_stats_distinct_value
ASC). Concretely, modify the CTE that produces ranked_values (and its GROUP BY)
to SELECT SUM(field_stats_distinct_stats_count) AS distinct_value_count and then
use ROW_NUMBER() OVER (PARTITION BY field_stats_field_name ORDER BY
distinct_value_count DESC, field_stats_distinct_stats_distinct_value ASC);
ensure downstream references (top_values, joins to field_totals) use the new
distinct_value_count column name.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bc86d2f and 4c13fdc.

📒 Files selected for processing (4)

src/handlers/http/modal/utils/ingest_utils.rs
src/rbac/role.rs
src/rbac/user.rs
src/storage/field_stats.rs

🚧 Files skipped from review as they are similar to previous changes (2)

src/rbac/user.rs
src/rbac/role.rs

nikhilsinhaparseable · 2026-02-25T15:25:03Z

@coderabbitai review

coderabbitai · 2026-02-25T15:25:11Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

nikhilsinhaparseable · 2026-02-26T06:19:47Z

@coderabbitai full review

coderabbitai · 2026-02-26T06:19:57Z

✅ Actions performed

Full review triggered.

coderabbitai

🧹 Nitpick comments (2)

src/storage/field_stats.rs (1)

716-718: Add a deterministic tie-breaker in final ordering.

At Lines 716-718, equal distinct_value_count rows can be returned in unstable order. Adding tv.distinct_value ASC keeps output deterministic.

Suggested change

 ORDER BY
   tv.field_name,
-  tv.distinct_value_count DESC"
+  tv.distinct_value_count DESC,
+  tv.distinct_value ASC"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/storage/field_stats.rs` around lines 716 - 718, The final SQL ORDER BY in
the query using alias tv currently sorts by tv.field_name and
tv.distinct_value_count DESC but lacks a deterministic tie-breaker; update the
ORDER BY clause (where tv is defined) to append a secondary sort on
tv.distinct_value ASC so rows with equal distinct_value_count are returned in a
stable, deterministic order.

src/handlers/http/modal/utils/ingest_utils.rs (1)

270-276: Keep the pre-limit dataset signal at a visible log level.

At Line 270, using trace! makes this threshold warning effectively invisible in most production setups, so operators lose early notice before ingestion starts failing at the hard limit.

Suggested change

-        tracing::trace!(
+        tracing::warn!(
             "Dataset {0} has {1} fields, which exceeds the warning threshold of {2}. Ingestion will not be possible after reaching {3} fields. We recommend creating a new dataset.",
             stream_name,
             fields_count,
             dataset_fields_warn_threshold as usize,
             PARSEABLE.options.dataset_fields_allowed_limit
         );

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/handlers/http/modal/utils/ingest_utils.rs` around lines 270 - 276, The
log about datasets exceeding the warning threshold is currently emitted with
tracing::trace!, which is too low for operators to see; change that
tracing::trace! invocation to tracing::warn! (keeping the same formatted message
and parameters: stream_name, fields_count, dataset_fields_warn_threshold as
usize, PARSEABLE.options.dataset_fields_allowed_limit) so the pre-limit dataset
signal is visible in production logging.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/handlers/http/modal/utils/ingest_utils.rs`:
- Around line 270-276: The log about datasets exceeding the warning threshold is
currently emitted with tracing::trace!, which is too low for operators to see;
change that tracing::trace! invocation to tracing::warn! (keeping the same
formatted message and parameters: stream_name, fields_count,
dataset_fields_warn_threshold as usize,
PARSEABLE.options.dataset_fields_allowed_limit) so the pre-limit dataset signal
is visible in production logging.

In `@src/storage/field_stats.rs`:
- Around line 716-718: The final SQL ORDER BY in the query using alias tv
currently sorts by tv.field_name and tv.distinct_value_count DESC but lacks a
deterministic tie-breaker; update the ORDER BY clause (where tv is defined) to
append a secondary sort on tv.distinct_value ASC so rows with equal
distinct_value_count are returned in a stable, deterministic order.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0ccf8da and afd33f8.

📒 Files selected for processing (4)

src/handlers/http/modal/utils/ingest_utils.rs
src/rbac/role.rs
src/rbac/user.rs
src/storage/field_stats.rs

coderabbitai bot reviewed Feb 25, 2026

View reviewed changes

coderabbitai bot previously approved these changes Feb 25, 2026

View reviewed changes

fix: updated the sql query for dataset stats

885a954

parmesant dismissed coderabbitai[bot]’s stale review via 885a954 February 25, 2026 09:17

parmesant force-pushed the dataset-stats-update branch from fe47d78 to 885a954 Compare February 25, 2026 09:17

coderabbitai bot requested changes Feb 25, 2026

View reviewed changes

src/storage/field_stats.rs Show resolved Hide resolved

coderabbitai bot reviewed Feb 25, 2026

View reviewed changes

coderabbitai bot previously approved these changes Feb 25, 2026

View reviewed changes

parmesant dismissed coderabbitai[bot]’s stale review via bc86d2f February 25, 2026 10:43

parmesant force-pushed the dataset-stats-update branch from 41112ca to bc86d2f Compare February 25, 2026 10:43

coderabbitai bot requested changes Feb 25, 2026

View reviewed changes

src/storage/field_stats.rs Show resolved Hide resolved

fix query error

4c13fdc

parmesant force-pushed the dataset-stats-update branch from bc86d2f to 4c13fdc Compare February 25, 2026 11:21

coderabbitai bot reviewed Feb 25, 2026

View reviewed changes

nikhilsinhaparseable added 2 commits February 25, 2026 18:20

Merge branch 'main' into dataset-stats-update

4790263

Merge branch 'main' into dataset-stats-update

afd33f8

nikhilsinhaparseable approved these changes Feb 25, 2026

View reviewed changes

coderabbitai bot reviewed Feb 26, 2026

View reviewed changes

coderabbitai bot approved these changes Feb 26, 2026

View reviewed changes

nikhilsinhaparseable merged commit f1278d2 into parseablehq:main Feb 26, 2026
12 checks passed

Uh oh!

Conversation

parmesant commented Feb 25, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai bot commented Feb 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

nikhilsinhaparseable commented Feb 25, 2026

Uh oh!

coderabbitai bot commented Feb 25, 2026

Uh oh!

nikhilsinhaparseable commented Feb 26, 2026

Uh oh!

coderabbitai bot commented Feb 26, 2026

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

parmesant commented Feb 25, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Feb 25, 2026 •

edited

Loading