add system defined tags and free form labels to datasets #1553
nikhilsinhaparseable wants to merge 6 commits into parseablehq:main from
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
Walkthrough
Replaces a single optional `dataset_tag` with system-defined `dataset_tags` and free-form `dataset_labels` on datasets.
Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant HTTP as HTTP API
    participant Parseable
    participant ObjectStore as ObjectStorage/Metastore
    participant Storage as PersistedStore
    Client->>HTTP: PUT /api/v1/datasets/{name} (tags/labels)
    HTTP->>Parseable: validate stream exists & request update
    Parseable->>ObjectStore: update_dataset_tags_and_labels_in_stream(name, tags, labels)
    ObjectStore->>Storage: put_stream_json(updated ObjectStoreFormat)
    Storage-->>ObjectStore: ack
    ObjectStore-->>Parseable: success
    Parseable-->>HTTP: updated tags/labels
    HTTP-->>Client: 200 OK / error
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
🚥 Pre-merge checks: 1 passed, 2 failed
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
🧹 Nitpick comments (1)
src/parseable/mod.rs (1)
480-489: Consider a builder pattern for stream creation parameters.
The `#[allow(clippy::too_many_arguments)]` annotation is a reasonable workaround, but with 7+ parameters, this function is becoming unwieldy. For a future improvement, consider introducing a `StreamCreationParams` struct or builder pattern to group these related parameters. This is a minor suggestion and doesn't block the current PR.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/parseable/mod.rs` around lines 480 - 489, The function create_stream_if_not_exists has many parameters; refactor by introducing a StreamCreationParams struct (or a builder StreamCreationParamsBuilder) to group stream_name, stream_type, custom_partition, log_source, telemetry_type, dataset_tags, and dataset_labels; update create_stream_if_not_exists signature to accept a single StreamCreationParams (or builder output) and adapt internal usage accordingly, add a conversion constructor or builder methods for easy construction where callers currently pass those seven+ args, and preserve existing behavior and validation inside the new struct or builder.
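The grouping this prompt describes can be sketched as a small params struct with builder-style setters. The field names come from the review comment itself; the concrete Parseable types (`StreamType`, `LogSourceEntry`, `TelemetryType`, `DatasetTag`) are stubbed with plain strings here, so this is an illustrative shape rather than the project's actual definitions.

```rust
// Hypothetical sketch of the StreamCreationParams grouping suggested in the
// review. Real field types (StreamType, TelemetryType, DatasetTag, ...) are
// replaced by String stand-ins so the example is self-contained.

#[derive(Debug, Clone, Default, PartialEq)]
pub struct StreamCreationParams {
    pub stream_name: String,
    pub stream_type: String,       // stand-in for StreamType
    pub custom_partition: Option<String>,
    pub log_source: Vec<String>,   // stand-in for Vec<LogSourceEntry>
    pub telemetry_type: String,    // stand-in for TelemetryType
    pub dataset_tags: Vec<String>, // stand-in for Vec<DatasetTag>
    pub dataset_labels: Vec<String>,
}

impl StreamCreationParams {
    pub fn new(stream_name: impl Into<String>) -> Self {
        Self { stream_name: stream_name.into(), ..Default::default() }
    }
    pub fn with_tags(mut self, tags: Vec<String>) -> Self {
        self.dataset_tags = tags;
        self
    }
    pub fn with_labels(mut self, labels: Vec<String>) -> Self {
        self.dataset_labels = labels;
        self
    }
}

fn main() {
    // Call sites build one value instead of passing seven positional args.
    let params = StreamCreationParams::new("app-logs")
        .with_tags(vec!["k8s-monitoring".to_string()])
        .with_labels(vec!["team-a".to_string()]);
    assert_eq!(params.stream_name, "app-logs");
    assert_eq!(params.dataset_tags.len(), 1);
    println!("{params:?}");
}
```

A struct like this also removes the need for `#[allow(clippy::too_many_arguments)]` on the functions that currently take the long parameter list.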
Force-pushed from 12ae8ee to 4fb9620.
Actionable comments posted: 2
🧹 Nitpick comments (1)
src/parseable/mod.rs (1)
480-527: Consider bundling stream creation parameters into a struct.
The function now accepts 8 parameters (including `self`), which triggers the `clippy::too_many_arguments` lint. While the `#[allow]` attribute suppresses it, this pattern repeats across `create_stream_if_not_exists`, `create_stream`, and `create_update_stream`.
A dedicated struct (e.g., `StreamCreationParams` or `StreamConfig`) would:
- Improve readability at call sites
- Make future parameter additions easier
- Provide a natural place for default values
💡 Example struct-based approach
```rust
pub struct StreamCreationParams {
    pub stream_type: StreamType,
    pub custom_partition: Option<String>,
    pub log_source: Vec<LogSourceEntry>,
    pub telemetry_type: TelemetryType,
    pub dataset_tags: Vec<DatasetTag>,
    pub dataset_labels: Vec<String>,
}

impl Default for StreamCreationParams {
    fn default() -> Self {
        Self {
            stream_type: StreamType::UserDefined,
            custom_partition: None,
            log_source: vec![],
            telemetry_type: TelemetryType::Logs,
            dataset_tags: vec![],
            dataset_labels: vec![],
        }
    }
}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/parseable/mod.rs` around lines 480 - 527, Create a StreamCreationParams (or StreamConfig) struct to bundle the multiple stream creation arguments and refactor create_stream_if_not_exists, create_stream, and create_update_stream to accept that struct instead of the long parameter list; update the call site in create_stream_if_not_exists to build a StreamCreationParams (using Default for sensible defaults) and pass it to create_stream, adjust create_stream signature to destructure or reference the struct fields (stream_type, custom_partition, log_source, telemetry_type, dataset_tags, dataset_labels), remove the #[allow(clippy::too_many_arguments)] on those functions, and add conversions/constructors where needed so existing call sites can migrate with minimal changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/handlers/http/datasets.rs`:
- Around line 163-169: The labels vector built in the PUT handler (where
new_labels is created from body.into_inner().labels) must trim each label and
filter out empty or whitespace-only strings before deduplicating; update the
pipeline that builds new_labels to map each label through .trim(), filter out
strings that are empty after trimming, then collect into a HashSet to dedupe and
back into a Vec (preserving whatever ordering you need). Ensure you modify the
code around the new_labels construction so labels are normalized (trimmed) and
empty/whitespace-only entries are removed prior to deduplication.
- Around line 31-186: Handlers get_correlated_datasets, get_datasets_by_tag,
put_dataset_tags and put_dataset_labels currently use PARSEABLE.streams/list,
PARSEABLE.get_stream and storage APIs without tenant context; update each
handler to resolve the tenant from the request (e.g., call
get_tenant_id_from_request or read the normalized tenant header injected by
middleware), then scope all stream listing and lookups to that tenant (filter
list results or call tenant-aware APIs) and ensure any
storage.update_dataset_tags_and_labels_in_stream and PARSEABLE.get_stream calls
include/are called with the resolved tenant or have the tenant header
overwritten server-side so clients cannot spoof it; ensure the same tenant
resolution is applied to both read (listing/get) and write
(put_dataset_tags/put_dataset_labels) flows.
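The tenant scoping this prompt calls for amounts to filtering every stream listing and lookup by a server-resolved tenant id. A generic sketch of that filter is below; `StreamInfo` and its `tenant` field are hypothetical stand-ins, not Parseable's actual stream type or API.

```rust
// Hypothetical sketch of tenant-scoped stream listing. StreamInfo is an
// illustrative stand-in for whatever stream metadata Parseable exposes.

#[derive(Debug, Clone)]
struct StreamInfo {
    name: String,
    tenant: String,
}

/// Keep only streams belonging to the resolved tenant, so read and write
/// handlers never touch another tenant's datasets.
fn scope_to_tenant(streams: Vec<StreamInfo>, tenant: &str) -> Vec<StreamInfo> {
    streams.into_iter().filter(|s| s.tenant == tenant).collect()
}

fn main() {
    let all = vec![
        StreamInfo { name: "a".into(), tenant: "t1".into() },
        StreamInfo { name: "b".into(), tenant: "t2".into() },
    ];
    // The tenant id must come from server-side resolution (e.g. a
    // middleware-normalized header), never from a client-controlled value.
    let scoped = scope_to_tenant(all, "t1");
    assert_eq!(scoped.len(), 1);
    assert_eq!(scoped[0].name, "a");
}
```

The same filter (or a tenant-aware listing API, if one exists) would apply identically to the read handlers and to the `put_dataset_tags`/`put_dataset_labels` write paths.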
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/handlers/http/datasets.rs`:
- Around line 139-151: The current read-modify-write in put_dataset_tags uses
PARSEABLE.get_stream(...), calls stream.get_dataset_labels(), then
storage.update_dataset_tags_and_labels_in_stream(...) and finally
stream.set_dataset_tags(...), which allows TOCTOU race with concurrent
put_dataset_labels; fix by serializing updates per-stream (or by providing a
combined tags+labels atomic endpoint). Add a per-stream mutex/async lock on the
stream (e.g., a metadata_lock() or similar on the stream returned by
PARSEABLE.get_stream) and acquire it around the read-modify-write sequence that
uses stream.get_dataset_labels,
storage.update_dataset_tags_and_labels_in_stream, and stream.set_dataset_tags so
concurrent tag/label updates for the same stream are executed sequentially.
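The per-stream serialization this prompt asks for can be sketched with a lock registry: every read-modify-write for a given stream name acquires the same mutex. `StreamLocks` and its `lock_for` method are hypothetical illustrations, not Parseable's existing API, and a real async server would use `tokio::sync::Mutex` rather than the std one used here for a self-contained example.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical per-stream lock registry: all metadata updates for a given
// stream name go through one shared mutex, closing the TOCTOU window.
#[derive(Default)]
struct StreamLocks {
    locks: Mutex<HashMap<String, Arc<Mutex<()>>>>,
}

impl StreamLocks {
    fn lock_for(&self, stream: &str) -> Arc<Mutex<()>> {
        let mut map = self.locks.lock().unwrap();
        map.entry(stream.to_string())
            .or_insert_with(|| Arc::new(Mutex::new(())))
            .clone()
    }
}

fn main() {
    let registry = Arc::new(StreamLocks::default());
    let counter = Arc::new(Mutex::new(0u32));
    let mut handles = vec![];
    for _ in 0..8 {
        let registry = Arc::clone(&registry);
        let counter = Arc::clone(&counter);
        handles.push(thread::spawn(move || {
            let lock = registry.lock_for("app-logs");
            let _guard = lock.lock().unwrap();
            // Deliberately racy read-then-write; it is only safe because the
            // per-stream guard serializes the whole sequence.
            let current = *counter.lock().unwrap();
            *counter.lock().unwrap() = current + 1;
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    // No lost updates: all 8 increments survive.
    assert_eq!(*counter.lock().unwrap(), 8);
}
```

Without the per-stream guard, two concurrent tags/labels updates could interleave their reads and writes exactly as the review describes.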
---
Duplicate comments:
In `@src/handlers/http/datasets.rs`:
- Around line 41-88: get_correlated_datasets currently iterates the global
PARSEABLE.streams.list() and PARSEABLE.get_stream(...) without restricting to
the tenant, causing cross-tenant exposure; update the handler to derive the
tenant context (e.g., from request auth/headers or a tenant path param) and then
restrict both the initial stream lookup (PARSEABLE.get_stream(&dataset_name))
and the loop over streams to the same tenant: either call a tenant-scoped
listing API (e.g., PARSEABLE.streams.list_for_tenant(tenant) if available) or
filter the results of PARSEABLE.streams.list() by a tenant identity exposed on
each stream (e.g., s.get_tenant() == tenant) before skipping and comparing;
ensure the check for the target dataset also validates it belongs to the tenant
so only tenant-scoped streams are considered.
- PUT /api/v1/logstream/{name} accepts X-P-Dataset-Tags and X-P-Dataset-Labels headers (comma-separated) on stream creation
- PUT /api/prism/v1/datasets/{name} - update tags and labels
- GET /api/prism/v1/datasets/{name}/correlated - find datasets sharing tags or labels
- GET /api/prism/v1/datasets/tags/{tag} - find all datasets with a specific tag
- include tags and labels in home api response
Force-pushed from 419f53d to 4a3e23c.
Actionable comments posted: 3
♻️ Duplicate comments (2)
src/handlers/http/datasets.rs (2)
150-156: ⚠️ Potential issue | 🟡 Minor
Normalize label input in PUT body before dedupe.
Line 150-Line 156 deduplicates raw strings but does not trim/filter whitespace-only labels, so empty labels can still be persisted.
Suggested fix

```diff
 let final_labels = match body.labels {
     Some(labels) => labels
         .into_iter()
+        .map(|label| label.trim().to_string())
+        .filter(|label| !label.is_empty())
         .collect::<HashSet<_>>()
         .into_iter()
         .collect(),
     None => stream.get_dataset_labels(),
 };
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/handlers/http/datasets.rs` around lines 150 - 156, Normalize and filter incoming labels before deduplication: when building final_labels from body.labels (the match arm handling Some(labels)), map each label to trimmed string and filter out labels that are empty or whitespace-only (e.g., labels.into_iter().map(|s| s.trim().to_string()).filter(|s| !s.is_empty())...), then collect into a HashSet to dedupe and back into the desired collection; keep the None branch using stream.get_dataset_labels() unchanged.
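The trim/filter/dedupe pipeline this prompt describes is small enough to show as a standalone function. This is a sketch of the suggested normalization, not the handler's actual code; note that collecting through a `HashSet` discards input order, which the review explicitly leaves up to the caller.

```rust
use std::collections::HashSet;

// Sketch of the suggested normalization: trim each label, drop entries that
// are empty after trimming, then deduplicate via a HashSet.
fn normalize_labels(labels: Vec<String>) -> Vec<String> {
    labels
        .into_iter()
        .map(|l| l.trim().to_string())
        .filter(|l| !l.is_empty())
        .collect::<HashSet<_>>()
        .into_iter()
        .collect()
}

fn main() {
    let input = vec![
        " prod ".to_string(),
        "".to_string(),
        "  ".to_string(),
        "prod".to_string(),
    ];
    let out = normalize_labels(input);
    // Whitespace-only labels are gone and "prod" appears exactly once.
    assert_eq!(out, vec!["prod".to_string()]);
}
```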
142-172: ⚠️ Potential issue | 🔴 Critical
Concurrent partial metadata updates can still lose data.
Line 142-Line 172 does a read-modify-write using current in-memory values for absent fields, so two concurrent requests (tags-only and labels-only) can overwrite each other with stale counterparts.
Suggested direction
```diff
 pub async fn put_dataset_metadata(...) -> Result<HttpResponse, DatasetsError> {
     ...
     let stream = PARSEABLE
         .get_stream(&dataset_name, &tenant_id)
         .map_err(|_| DatasetsError::DatasetNotFound(dataset_name.clone()))?;
+
+    // Serialize metadata updates for this stream (or use atomic storage merge).
+    // let _guard = stream.metadata_lock().await;
     let final_tags = match body.tags { ... };
     let final_labels = match body.labels { ... };
     storage
         .update_dataset_tags_and_labels_in_stream(...)
         .await
         .map_err(DatasetsError::Storage)?;
     stream.set_dataset_tags(final_tags.clone());
     stream.set_dataset_labels(final_labels.clone());
```

If a stream-scoped lock is unavailable, prefer a storage-level atomic patch API that updates only provided fields server-side.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/handlers/http/datasets.rs` around lines 142 - 172, Current code computes final_tags/final_labels from in-memory stream values and then does a read-modify-write, which lets concurrent requests overwrite each other; instead either acquire the stream-scoped lock around the read-modify-write (use stream.get_dataset_tags/get_dataset_labels and stream.set_dataset_tags/set_dataset_labels inside the lock) or, preferably, change the update to an atomic storage-side patch that accepts Option<HashSet<_>> and only updates provided fields (modify the call to storage.update_dataset_tags_and_labels_in_stream to pass Option types and implement server-side merge), removing the reliance on in-memory reads to prevent lost updates.
🧹 Nitpick comments (1)
src/parseable/streams.rs (1)
982-988: Prefer atomic update for tags+labels to avoid transient mixed state.
Line 982 and Line 986 update related metadata in two separate writes. A single setter that updates both fields under one lock would avoid short-lived inconsistent reads.
♻️ Suggested refactor

```diff
+    pub fn set_dataset_metadata(&self, tags: Vec<DatasetTag>, labels: Vec<String>) {
+        let mut metadata = self.metadata.write().expect(LOCK_EXPECT);
+        metadata.dataset_tags = tags;
+        metadata.dataset_labels = labels;
+    }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/parseable/streams.rs` around lines 982 - 988, The two setters set_dataset_tags and set_dataset_labels perform separate metadata.write() calls causing transient inconsistent state; add a new atomic setter (e.g., set_dataset_tags_and_labels(&self, tags: Vec<DatasetTag>, labels: Vec<String>)) that acquires metadata.write() once and assigns both dataset_tags and dataset_labels inside the same lock, then update call sites to use the new method (or have the existing setters delegate to it) so modifications never occur in two separate writes; reference the metadata RwLock field and the existing set_dataset_tags/set_dataset_labels method names when making the change.
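The atomicity argument above can be demonstrated with a self-contained `RwLock` example: both fields change under one write-guard acquisition, so a reader can never observe new tags paired with old labels. The `Stream`/`Metadata` types below are simplified stand-ins for Parseable's, with `String` in place of `DatasetTag`.

```rust
use std::sync::RwLock;

// Simplified stand-ins for the stream metadata discussed in the review.
#[derive(Default)]
struct Metadata {
    dataset_tags: Vec<String>,
    dataset_labels: Vec<String>,
}

#[derive(Default)]
struct Stream {
    metadata: RwLock<Metadata>,
}

impl Stream {
    // One write-lock acquisition covers both assignments, so there is no
    // window where tags are updated but labels are still stale.
    fn set_dataset_metadata(&self, tags: Vec<String>, labels: Vec<String>) {
        let mut md = self.metadata.write().expect("metadata lock poisoned");
        md.dataset_tags = tags;
        md.dataset_labels = labels;
    }
}

fn main() {
    let stream = Stream::default();
    stream.set_dataset_metadata(vec!["k8s-monitoring".into()], vec!["team-a".into()]);
    let md = stream.metadata.read().unwrap();
    assert_eq!(md.dataset_tags, vec!["k8s-monitoring".to_string()]);
    assert_eq!(md.dataset_labels, vec!["team-a".to_string()]);
}
```

Having the existing `set_dataset_tags`/`set_dataset_labels` delegate to such a combined setter, as the prompt suggests, keeps old call sites compiling.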
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/handlers/http/modal/server.rs`:
- Around line 222-227: The route currently mounts only PUT /{name} using
http::datasets::put_dataset_metadata (authorized with Action::CreateStream), but
the API requires distinct replacement endpoints for tags and labels; add two
separate routes for "/{name}/tags" and "/{name}/labels" each using web::put()
and wired to the appropriate handlers (e.g., http::datasets::put_dataset_tags
and http::datasets::put_dataset_labels) with the same
authorize_for_resource(Action::CreateStream) policy, or if those handlers do not
exist, split http::datasets::put_dataset_metadata into two functions and route
them accordingly so clients can call PUT /datasets/{name}/tags and PUT
/datasets/{name}/labels.
In `@src/handlers/mod.rs`:
- Around line 95-114: The TryFrom<&str> for DatasetTag currently only accepts
the new literals (e.g., "agent-monitoring", "k8s-monitoring") causing older
values to fail; update the DatasetTag enum and its impl TryFrom to accept legacy
names by adding serde aliases on the enum variants (e.g., add #[serde(alias =
"agent-observability")] and #[serde(alias = "k8s-observability")] to
DatasetTag::AgentMonitoring and ::K8sMonitoring) and extend the match in impl
TryFrom (the function try_from) to also match the old strings
("agent-observability", "k8s-observability") mapping them to the corresponding
variants, keeping the existing error message unchanged.
In `@src/storage/mod.rs`:
- Around line 134-137: The struct currently only reads dataset_tags so legacy
single-key dataset_tag values are dropped; update deserialization/migration to
map dataset_tag -> dataset_tags by adding a compatibility conversion: in
src/storage/mod.rs ensure deserialization recognizes a legacy dataset_tag (alias
or custom Deserialize) and appends it into the dataset_tags Vec, or add a
migration step that, when bumping
CURRENT_OBJECT_STORE_VERSION/CURRENT_SCHEMA_VERSION, transforms any legacy
dataset_tag value into dataset_tags before validation; target the symbols
dataset_tags, dataset_tag, CURRENT_OBJECT_STORE_VERSION and
CURRENT_SCHEMA_VERSION when making this change so older metadata is preserved
and participates in tag-based APIs.
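Both backward-compatibility fixes above (accepting the legacy tag literals, and folding a legacy single `dataset_tag` into `dataset_tags`) can be sketched together. The variant names and old/new string literals come from the review comments; the `#[serde(alias = ...)]` attributes mentioned there are omitted so this example needs only the standard library, and `merge_legacy_tag` is an illustrative helper, not an existing function.

```rust
// Illustrative sketch of the legacy-compat handling suggested in the review.

#[derive(Debug, Clone, PartialEq)]
enum DatasetTag {
    AgentMonitoring,
    K8sMonitoring,
}

impl TryFrom<&str> for DatasetTag {
    type Error = String;
    fn try_from(value: &str) -> Result<Self, Self::Error> {
        match value {
            // Current literals plus the legacy spellings from older metadata.
            "agent-monitoring" | "agent-observability" => Ok(Self::AgentMonitoring),
            "k8s-monitoring" | "k8s-observability" => Ok(Self::K8sMonitoring),
            other => Err(format!("unknown dataset tag: {other}")),
        }
    }
}

/// Fold a legacy single `dataset_tag` value into the new `dataset_tags` list,
/// mirroring the migration step suggested for src/storage/mod.rs.
fn merge_legacy_tag(legacy: Option<String>, mut tags: Vec<String>) -> Vec<String> {
    if let Some(t) = legacy {
        if !tags.contains(&t) {
            tags.push(t);
        }
    }
    tags
}

fn main() {
    assert_eq!(
        DatasetTag::try_from("k8s-observability"),
        Ok(DatasetTag::K8sMonitoring)
    );
    assert!(DatasetTag::try_from("unknown").is_err());
    let merged = merge_legacy_tag(Some("agent-monitoring".into()), vec![]);
    assert_eq!(merged, vec!["agent-monitoring".to_string()]);
}
```

In the real codebase the same mapping would also need the serde aliases so legacy values survive deserialization, not just `TryFrom<&str>`.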
ℹ️ Review info
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
- `Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (15)
- src/connectors/kafka/processor.rs
- src/handlers/http/datasets.rs
- src/handlers/http/ingest.rs
- src/handlers/http/mod.rs
- src/handlers/http/modal/server.rs
- src/handlers/http/modal/utils/logstream_utils.rs
- src/handlers/mod.rs
- src/metadata.rs
- src/migration/mod.rs
- src/parseable/mod.rs
- src/parseable/streams.rs
- src/prism/home/mod.rs
- src/storage/field_stats.rs
- src/storage/mod.rs
- src/storage/object_storage.rs
🚧 Files skipped from review as they are similar to previous changes (3)
- src/handlers/http/mod.rs
- src/storage/field_stats.rs
- src/handlers/http/ingest.rs
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/handlers/http/modal/server.rs`:
- Around line 210-220: The two route registrations conflict because
"/tags/{tag}" will match "/tags/correlated" before "/{name}/correlated", causing
get_correlated_datasets to be unreachable for a dataset named "tags"; change the
correlated endpoint path to avoid structural collision (for example replace
"/{name}/correlated" with "/correlated/{name}" in the route registration where
web::get().to(http::datasets::get_correlated_datasets).authorize_for_resource(Action::GetStreamInfo)
is set) and update the path extractor in the get_correlated_datasets handler and
any clients/docs referencing the old path so they use the new
"/correlated/{name}" form.
- Around line 200-227: The post_datasets handler currently derives tenant from
BasicAuth via extract_session_key_from_req; change it to read the tenant using
get_tenant_id_from_request (the same middleware-normalized header used by
get_info and the other dataset handlers) to enforce tenant scoping and prevent
header spoofing: locate the post_datasets function in http::prism_logstream,
remove or stop using extract_session_key_from_req for tenant resolution, call
get_tenant_id_from_request(req) (or equivalent helper used elsewhere) and pass
that tenant downstream to any dataset creation/validation logic so it matches
the behavior of get_datasets_by_tag, get_correlated_datasets, and
put_dataset_metadata.
```diff
 pub fn get_prism_datasets() -> Scope {
-    web::scope("/datasets").route(
-        "",
-        web::post()
-            .to(http::prism_logstream::post_datasets)
-            .authorize_for_resource(Action::GetStreamInfo)
-            .authorize_for_resource(Action::GetStats)
-            .authorize_for_resource(Action::GetRetention),
-    )
+    web::scope("/datasets")
+        .route(
+            "",
+            web::post()
+                .to(http::prism_logstream::post_datasets)
+                .authorize_for_resource(Action::GetStreamInfo)
+                .authorize_for_resource(Action::GetStats)
+                .authorize_for_resource(Action::GetRetention),
+        )
+        .route(
+            "/tags/{tag}",
+            web::get()
+                .to(http::datasets::get_datasets_by_tag)
+                .authorize_for_resource(Action::GetStreamInfo),
+        )
+        .route(
+            "/{name}/correlated",
+            web::get()
+                .to(http::datasets::get_correlated_datasets)
+                .authorize_for_resource(Action::GetStreamInfo),
+        )
+        .route(
+            "/{name}",
+            web::put()
+                .to(http::datasets::put_dataset_metadata)
+                .authorize_for_resource(Action::CreateStream),
+        )
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# Search for the four handler implementations and how they extract tenant_id
echo "=== Searching for handler implementations ==="
echo -e "\n--- 1. prism_logstream::post_datasets ---"
rg -n "fn post_datasets" --type rust -A 10 src/handlers/http/ | head -40
echo -e "\n--- 2. datasets::get_datasets_by_tag ---"
rg -n "fn get_datasets_by_tag" --type rust -A 10 src/handlers/http/ | head -40
echo -e "\n--- 3. datasets::get_correlated_datasets ---"
rg -n "fn get_correlated_datasets" --type rust -A 10 src/handlers/http/ | head -40
echo -e "\n--- 4. datasets::put_dataset_metadata ---"
rg -n "fn put_dataset_metadata" --type rust -A 10 src/handlers/http/ | head -40
echo -e "\n--- Check for get_tenant_id_from_request usage in datasets.rs ---"
rg -n "get_tenant_id_from_request" --type rust src/handlers/http/datasets.rs
```

Repository: parseablehq/parseable
Length of output: 1362
🏁 Script executed:

```bash
#!/bin/bash
# Read datasets.rs around the get_tenant_id_from_request usages
echo "=== datasets.rs: lines 20-60 (first usage) ==="
sed -n '20,60p' src/handlers/http/datasets.rs
echo -e "\n=== datasets.rs: lines 95-110 (second usage) ==="
sed -n '95,110p' src/handlers/http/datasets.rs
echo -e "\n=== datasets.rs: lines 133-150 (third usage) ==="
sed -n '133,150p' src/handlers/http/datasets.rs
echo -e "\n=== prism_logstream.rs: full post_datasets function ==="
sed -n '41,80p' src/handlers/http/prism_logstream.rs
echo -e "\n=== Check if handlers exist in datasets.rs ==="
rg -n "pub async fn (get_datasets_by_tag|get_correlated_datasets|put_dataset_metadata)" --type rust src/handlers/http/datasets.rs
```

Repository: parseablehq/parseable
Length of output: 3385
🏁 Script executed:

```bash
#!/bin/bash
# Check extract_session_key_from_req and understand if it provides tenant scoping
echo "=== Search for extract_session_key_from_req definition ==="
rg -n "fn extract_session_key_from_req" --type rust -A 5 src/
echo -e "\n=== Check PrismDatasetRequest and get_datasets ==="
rg -n "impl.*PrismDatasetRequest\|fn get_datasets" --type rust -A 10 src/handlers/http/prism_logstream.rs | head -50
echo -e "\n=== Check PrismDatasetRequest struct ==="
rg -n "struct PrismDatasetRequest" --type rust -A 15 src/handlers/http/prism_logstream.rs
echo -e "\n=== Verify tenant scoping context for prism endpoints ==="
rg -n "get_prism_datasets|prism_logstream" --type rust src/handlers/http/modal/server.rs | head -20
```

Repository: parseablehq/parseable
Length of output: 1138
🏁 Script executed:

```bash
#!/bin/bash
# Check the exact route registration and actions for post_datasets
echo "=== server.rs post_datasets route registration (lines 200-227) ==="
sed -n '200,227p' src/handlers/http/modal/server.rs
echo -e "\n=== Search for PrismDatasetRequest and get_datasets method ==="
rg -n "struct PrismDatasetRequest\|fn get_datasets" --type rust -A 20 src/handlers/http/prism_logstream.rs | head -80
echo -e "\n=== Check what get_datasets does (full method) ==="
rg -n "async fn get_datasets\|async fn get_info" --type rust -A 30 src/handlers/http/prism_logstream.rs
```

Repository: parseablehq/parseable
Length of output: 1332
🏁 Script executed:

```bash
#!/bin/bash
# Broader search for PrismDatasetRequest and related types
echo "=== Count lines in prism_logstream.rs ==="
wc -l src/handlers/http/prism_logstream.rs
echo -e "\n=== Search for 'Request' in prism_logstream.rs ==="
rg -n "Request\|request" --type rust src/handlers/http/prism_logstream.rs | head -20
echo -e "\n=== Full prism_logstream.rs file (if small) ==="
wc -l src/handlers/http/prism_logstream.rs && cat -n src/handlers/http/prism_logstream.rs
echo -e "\n=== Check if PrismDatasetRequest is defined elsewhere ==="
rg -n "PrismDatasetRequest" --type rust src/
```

Repository: parseablehq/parseable
Length of output: 2896
🏁 Script executed:

```bash
#!/bin/bash
# Check the PrismDatasetRequest.get_datasets implementation
echo "=== PrismDatasetRequest struct and impl ==="
sed -n '211,250p' src/prism/logstream/mod.rs
echo -e "\n=== Full get_datasets method ==="
rg -n "fn get_datasets" --type rust -A 50 src/prism/logstream/mod.rs | head -80
```

Repository: parseablehq/parseable
Length of output: 3457
Use get_tenant_id_from_request in post_datasets to enforce middleware-normalized tenant scoping.
The handler at src/handlers/http/prism_logstream.rs:41 derives tenant from BasicAuth credentials via extract_session_key_from_req, but non-ingest handlers must extract tenant from the middleware-normalized request header using get_tenant_id_from_request. The get_info handler in the same file (line 30) correctly uses this pattern; post_datasets should do the same to prevent tenant header spoofing and remain consistent with the three other dataset handlers (get_datasets_by_tag, get_correlated_datasets, put_dataset_metadata).
```rust
.route(
    "/tags/{tag}",
    web::get()
        .to(http::datasets::get_datasets_by_tag)
        .authorize_for_resource(Action::GetStreamInfo),
)
.route(
    "/{name}/correlated",
    web::get()
        .to(http::datasets::get_correlated_datasets)
        .authorize_for_resource(Action::GetStreamInfo),
```
Route shadowing: GET /tags/correlated will never reach get_correlated_datasets for a dataset named "tags".
actix-web evaluates .route() registrations in order. /tags/{tag} (line 211) is matched before /{name}/correlated (line 217), so a request for GET /tags/correlated binds to get_datasets_by_tag with tag = "correlated" rather than get_correlated_datasets with name = "tags". In practice this only bites if a dataset is named "tags", but it is an invisible constraint on valid dataset names.
Reordering the routes so the more-specific two-segment pattern comes first does not help here because both are two-segment paths; the real fix is to use a path that avoids the structural collision, e.g. /correlated/{name} for the correlated endpoint:
🔀 Proposed fix: reorder / rename to eliminate shadowing
```diff
-    .route(
-        "/tags/{tag}",
-        web::get()
-            .to(http::datasets::get_datasets_by_tag)
-            .authorize_for_resource(Action::GetStreamInfo),
-    )
-    .route(
-        "/{name}/correlated",
-        web::get()
-            .to(http::datasets::get_correlated_datasets)
-            .authorize_for_resource(Action::GetStreamInfo),
-    )
+    .route(
+        "/correlated/{name}",
+        web::get()
+            .to(http::datasets::get_correlated_datasets)
+            .authorize_for_resource(Action::GetStreamInfo),
+    )
+    .route(
+        "/tags/{tag}",
+        web::get()
+            .to(http::datasets::get_datasets_by_tag)
+            .authorize_for_resource(Action::GetStreamInfo),
+    )
```

(Remember to update the corresponding path extractor in `get_correlated_datasets` and any clients/docs that reference the old path.)
Summary by CodeRabbit
- New Features
- User-facing Data Model
- Refactor