327 changes: 327 additions & 0 deletions develop-docs/sdk/foundations/client/data-collection/index.mdx
@@ -0,0 +1,327 @@
---
title: Data Collection
description: Configuration for what data SDKs collect by default — technical context, PII, and sensitive data.
spec_id: sdk/foundations/client/data-collection
spec_version: 1.0.0
spec_status: draft
spec_depends_on:
- id: sdk/foundations/client
version: ">=1.0.0"
spec_changelog:
- version: 1.0.0
date: 2025-03-05
summary: Initial spec; dataCollection config, three data tiers, cookies/headers denylist, replace sendDefaultPii.
sidebar_order: 1
---

<SpecRfcAlert />

<SpecMeta />

## Overview

This spec defines how SDKs control **what data is collected automatically** from the runtime (device, requests, responses, user context). It replaces the single `sendDefaultPii` (or platform-equivalent) flag with a structured `dataCollection` configuration so users can enable or restrict collection by category and by field.

Related specs:

- [Data Handling](/sdk/expected-features/data-handling/) — structuring data for scrubbing (spans, breadcrumbs), variable size limits
- [Client](/sdk/foundations/client/) — client lifecycle and event pipeline
- [Configuration](/sdk/foundations/client/configuration/) — top-level init options including `send_default_pii` (deprecated in favor of this spec)

---

## Concepts

<SpecSection id="data-tiers" status="draft" since="1.0.0">

### Data Tiers

Collected data is grouped into three tiers. SDKs **MUST** treat these tiers consistently when applying defaults and user configuration.

#### 1. Technical Context Data

Non-identifying context used for debugging and performance:
Member

m: I think it would be great to clearly state that these here are just examples, but not a complete list. I think it would be great to add links to references where you can find a complete list, or we add a complete list here and then link from the other places in the doc that this is now here, the spec for the complete list. This applies to all three data tiers.

Member

What is missing?

Member

The event context payload has way more properties: https://develop.sentry.dev/sdk/foundations/transport/event-payloads/contexts/

For example: culture context, GPU context, app context (version, permissions, view names), cloud resource context, memory info context, ...


- Device and environment context (OS, runtime, non-PII identifiers)
- Performance and error context (stack frames, breadcrumbs, span metadata)
- Framework/routing context where it does not contain PII or secrets
- AI agent messages (input, output, metadata)
Member

h: I think AI Agent Messages can actually contain PII. I think that should be under PII. We have that under PII in the user-facing docs, see for example Python.

Member

No, we explicitly want to emit those by default; otherwise our product becomes useless. Also, this is not PII. It is maybe PII, and everything is maybe PII.

Member

Okay, I get that. I'm fine with that, but I think we also need to align our docs because that confused me a bit that it was under PII for Python, for example. But that is out of scope of this PR.


This tier is **not** gated by the data collection configuration. SDKs **MAY** collect it by default.

#### 2. PII Data

Personally identifiable or user-linked data:

- User identifiers (user ID, username, email)
- IP address
- Cookies and headers that identify the user or session
- HTTP request data (TBD)
Contributor

h: What about request paths?
Some requests may be identifiable, like /user/USER_ID
Should we have a denylist/allowlist for url paths?

Member Author

Good question 🤔
What we currently do in JS: when we know a route is parameterized, we use the parameterized route name (e.g. user/:id) as the transaction name, but the full URL (e.g. user/123) is still added to the attributes. @cleptric Any opinions on that?


This tier **MUST** be off by default unless the user opts in via `includeUserInfo` and/or explicit `collect` allowlists. See [`includeUserInfo`](#include-user-info-behavior), [`collect` options](#collect-option-behavior), and [Default Denylist](#default-denylist).

#### 3. Sensitive Data

Credentials and secrets that **MUST** never be sent by default:

- Passwords, tokens, API keys, bearer tokens
- Header or cookie values that match known sensitive names (auth, token, secret, password, key, jwt, etc.)

SDKs **MUST** never send sensitive **values** through automatic instrumentation — values are replaced with `"[Filtered]"` while keys are retained (see [Default Denylist](#default-denylist)). Users can use `beforeSend` (or equivalent) to remove or redact keys if needed.

</SpecSection>

---

## Behavior

<SpecSection id="configuration-surface" status="draft" since="1.0.0">

### Configuration Requirements

All data-collection options live under a single top-level key: `dataCollection`. SDKs **MUST** support at least `includeUserInfo` and the `collect` object. SDKs **MAY** omit options that do not apply to the platform (e.g. no `outgoingRequestBody` on a browser-only SDK).

`dataCollection` accepts two fields:

- **`includeUserInfo`** — the primary toggle for Personally Identifiable Information (PII). Controls whether user-identity fields are included in automatic collection, and sets the default for PII-heavy `collect` options (such as HTTP request bodies - TBD). Defaults to `false`.
- **`collect`** — controls which categories of request/response and runtime data are gathered. See [`collect` Option Behavior](#collect-option-behavior) and [How Defaults Cascade](#how-defaults-cascade).
Member

h: The nested config with includeUserInfo cascading into collect defaults confuses me a bit. When I look at cookies: true, I can't tell what actually gets sent without also checking includeUserInfo — the same option behaves differently depending on another flag. For a privacy-sensitive config, I think it's important that users can look at their config and immediately know what's being collected.

I wonder if a flatter approach would work better:

1. A flat list of explicit options — each boolean or string list independently controls exactly what gets sent. No flag changes the behavior of other flags.

2. Presets / factory constructors that return a complete, resolved config:

const config = DataCollection.default();
// { userInfo: false, incomingRequestBody: false,
//   aiAgentMessages: true, stackFrameVariables: true,
//   cookies: ['locale', 'theme'],
//   httpHeaders: ['content-type', 'accept', 'x-request-id'],
//   queryParams: true, ... }

// Tweak individual fields — no surprises
config.userInfo = true;
config.incomingRequestBody = true;
config.httpHeaders.push('x-custom-trace');

init({ dsn: "...", dataCollection: config });

// Or start from a PII-inclusive preset
init({ dsn: "...", dataCollection: DataCollection.withPii() });

Why I think this could be better:

  • Transparency: Each option means exactly one thing. No implicit cascades to reason about.
  • Debuggability: SDK can log the resolved config at init. Support can ask "what's your config?" and get an unambiguous answer.
  • Simpler implementation: SDK devs check one value per data type — no "if includeUserInfo is false, then the effective default for incomingRequestBody is false, unless explicitly overridden" logic.
  • Presets handle the common cases: Most users want "default" or "with PII" — the factory gives them that in one call, and they can still tweak individual fields.

The tradeoff is slightly more verbose config for custom setups, but I think the clarity is worth it — especially when "I can't easily tell what's being sent" is a real problem.

Member Author

As discussed offline, this is a good suggestion that makes things clearer. I agree that the approach of "silently" overwriting the collection options can be confusing. I'll try to incorporate this into the spec 👍

Member Author

I initially had the idea of using presets that can be overwritten. It's the same concept but still a bit clumsy. Could look like this:

init({
  dataCollection: {
    preset: "default", // or "withPii"
    overrides: {
      userInfo: true,
      incomingRequestBody: true,
    },
  },
});

The approach you mentioned (pre-configured objects) is very straightforward and easily testable. Users would also be able to reuse the pre-configured options and, for example, filter on them.

This should also be possible in other languages. Examples:

  • JS/TS: static factory funcs returning plain objects
  • Java/C#: DataCollectionConfig.defaultConfig() + builder mutation/copy
  • Python: dataclass + classmethods
  • Ruby/PHP: hash/array factories


</SpecSection>

<SpecSection id="include-user-info" status="draft" since="1.0.0">

### `includeUserInfo` Behavior

`includeUserInfo` controls whether the SDK automatically attaches user identity fields to events (e.g. `user.id`, `user.email`, `user.username`, `user.ip_address`). This is the primary PII gate: its value also sets the effective default for PII-heavy `collect` options.

| Value | Behavior |
|-------|----------|
| `true` | Attach all user identity fields captured by automatic instrumentation. Equivalent to the legacy `sendDefaultPii` flag scoped to user data. |
| `false` | Do not attach user identity fields from automatic instrumentation. |

When user data is set **explicitly** on the scope (or equivalent), it is **always** attached regardless of this setting. See [User-Set Data and Scrubbing](#user-set-data-and-scrubbing).
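
The gating rule above can be sketched as follows. This is a minimal illustration, not SDK code: `UserInfo` and `resolveUser` are hypothetical names, and letting explicit fields win over automatically captured ones on conflict is an assumption the spec does not state.

```typescript
// Hypothetical shape for the user identity fields named in this section.
interface UserInfo {
  id?: string;
  email?: string;
  username?: string;
  ip_address?: string;
}

// Explicit user data is always attached; automatically captured user data
// is attached only when includeUserInfo is true.
function resolveUser(
  explicitUser: UserInfo | undefined,
  autoUser: UserInfo | undefined,
  includeUserInfo: boolean,
): UserInfo | undefined {
  const gatedAuto = includeUserInfo ? autoUser : undefined;
  if (!explicitUser && !gatedAuto) {
    return undefined;
  }
  // Assumption: on conflicting fields, the explicitly set value wins.
  return { ...(gatedAuto || {}), ...(explicitUser || {}) };
}
```

With `includeUserInfo: false`, an automatically detected IP address is dropped, while a user ID set on the scope is still attached.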

</SpecSection>

<SpecSection id="collect-options" status="draft" since="1.0.0">

### `collect` Option Behavior

Each key under `collect` maps to a category of automatically collected data and uses one of two option types, depending on whether the data is structured as key-value pairs.

**Boolean options** — used where data cannot be meaningfully filtered at the key level. The SDK either collects the entire category or skips it.

| Value | Behavior |
|-------|----------|
| `true` | Collect and attach this data category. |
| `false` | Do not collect this data category at all. |

**Collection options** — used for key-value data (cookies, headers, query params), where the SDK can inspect individual keys and apply filtering rules before attaching.

| Value | Behavior |
|-------|----------|
| `true` | Collect this category. Apply the default denylist — values for sensitive key names are replaced with `"[Filtered]"` (see [Default Denylist](#default-denylist)). |
| `false` | Do not collect this category at all. |
| `{ deny: string[] }` | Collect this category. Apply the default denylist **plus** these additional key names. |
| `{ allow: string[] }` | Collect **only** keys in this list. The default denylist is bypassed, but sensitive values **MUST** still be scrubbed regardless. |

> **Note:** Sensitive key **values** are always scrubbed — replaced with `"[Filtered]"` — regardless of collection option configuration. The allow/deny lists control which keys are included, not whether scrubbing applies.
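
One possible reading of the collection option table, as a sketch. The names `CollectionOption`, `applyCollectionOption`, and the shortened `SAMPLE_DENYLIST` are hypothetical. Two interpretations are assumed here: `deny` entries are treated like extra denylist terms (value replaced, key kept), and `allow` entries match key names exactly, ignoring case.

```typescript
type CollectionOption = boolean | { deny: string[] } | { allow: string[] };

// Shortened sample of the default denylist, for illustration only.
const SAMPLE_DENYLIST: string[] = ["auth", "token", "secret", "password", "key"];

function matchesAnyTerm(keyName: string, terms: string[]): boolean {
  const lowered = keyName.toLowerCase();
  return terms.some((term) => lowered.indexOf(term.toLowerCase()) !== -1);
}

// Returns undefined when the whole category is skipped.
function applyCollectionOption(
  pairs: Record<string, string>,
  option: CollectionOption,
): Record<string, string> | undefined {
  if (option === false) {
    return undefined;
  }
  const allowList = typeof option === "object" && "allow" in option ? option.allow : undefined;
  const extraDeny = typeof option === "object" && "deny" in option ? option.deny : [];
  const denylist = SAMPLE_DENYLIST.concat(extraDeny);

  const collected: Record<string, string> = {};
  for (const keyName of Object.keys(pairs)) {
    // With an allowlist, keys that are not listed are not collected at all.
    if (allowList && !allowList.some((allowed) => allowed.toLowerCase() === keyName.toLowerCase())) {
      continue;
    }
    // Sensitive values are scrubbed in every mode; the key itself is retained.
    collected[keyName] = matchesAnyTerm(keyName, denylist) ? "[Filtered]" : pairs[keyName];
  }
  return collected;
}
```

Under this reading, allowing `authorization` still yields `"[Filtered]"` for its value, which matches the note that allow/deny lists never disable scrubbing.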

</SpecSection>

<SpecSection id="how-defaults-cascade" status="draft" since="1.0.0">

### How Defaults Cascade

`includeUserInfo` determines the effective default for PII-related `collect` options. Explicitly set `collect` options always override this default.

| Option type | Default when `includeUserInfo: true` | Default when `includeUserInfo: false` |
|-------------|--------------------------------------|----------------------------------------|
| Collection (key-value pairs) | `true` — use default denylist | `true` — use default denylist, plus PII keys denied |
| PII Boolean (e.g. `incomingRequestBody`) | `true` — attach | `false` — do not attach |

Non-PII boolean options (e.g. `stackFrameVariables`) are not affected by `includeUserInfo`; they keep their own defaults unless set explicitly.
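
A sketch of how an SDK might resolve effective defaults under this cascade. The type and function names are hypothetical, `frameContextLines` is left out for brevity, and treating `outgoingRequestBody` as a PII boolean alongside `incomingRequestBody` is an assumption, since the spec lists both body defaults as TBD.

```typescript
type Collection = boolean | { deny: string[] } | { allow: string[] };

interface CollectOptions {
  cookies: Collection;
  httpHeaders: Collection;
  queryParams: Collection;
  aiAgentMessages: boolean;
  stackFrameVariables: boolean;
  incomingRequestBody: boolean;
  outgoingRequestBody: boolean;
}

interface DataCollectionInput {
  includeUserInfo?: boolean;
  collect?: Partial<CollectOptions>;
}

interface ResolvedDataCollection {
  includeUserInfo: boolean;
  collect: CollectOptions;
}

function resolveDataCollection(input: DataCollectionInput = {}): ResolvedDataCollection {
  const includeUserInfo = input.includeUserInfo === true;
  const defaults: CollectOptions = {
    cookies: true,
    httpHeaders: true,
    queryParams: true,
    aiAgentMessages: true,
    stackFrameVariables: true,
    // PII booleans follow includeUserInfo unless the user sets them.
    incomingRequestBody: includeUserInfo,
    outgoingRequestBody: includeUserInfo,
  };
  // Explicitly set collect options always override the cascaded default.
  return { includeUserInfo, collect: { ...defaults, ...(input.collect || {}) } };
}
```

Logging the resolved object at init would give users one unambiguous view of what is collected. The extra PII key terms applied to collection options when `includeUserInfo` is `false` belong to the denylist logic and are not modeled here.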

</SpecSection>

<SpecSection id="default-denylist" status="draft" since="1.0.0">

### Default Denylist

For key-value data (HTTP headers, cookies, URL query params), SDKs **MUST** apply a **default denylist** by key name: values for known-sensitive keys are replaced with `"[Filtered]"`; **keys are never scrubbed** by the SDK.

#### Matching Rule

SDKs **MUST** perform a **partial, case-insensitive match** when comparing key names against the denylist. A key is treated as sensitive if any denylist term appears as a substring in the key name (e.g. the term `auth` matches `Authorization` and `X-Auth-Token`).

#### Base Denylist (Sensitive Data)

The following terms **MUST** be included in the default denylist for headers, and **SHOULD** be applied to cookies and query params where applicable:

`["auth", "token", "secret", "password", "passwd", "pwd", "key", "jwt", "bearer", "sso", "saml", "csrf", "xsrf", "credentials", "session", "sid", "identity"]`
Contributor

m: We have some additional filtered headers on cocoa that may be relevant here (https://github.com/getsentry/sentry-cocoa/blob/main/Sources/Swift/Core/Tools/HTTPHeaderSanitizer.swift#L8): X-REAL-IP and REMOTE-ADDR

Member Author

Thanks, as x-real-ip and remote-addr are not considered "sensitive" but PII, I added those to the list of user-sensitive header snippets 👍


Values for keys that match **MUST** be replaced with `"[Filtered]"`.
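
The matching rule and value replacement can be sketched as below. The denylist terms are copied from this section; `isSensitiveKey` and `scrubValues` are hypothetical helper names, not part of any SDK.

```typescript
// Terms copied from the base denylist in this spec.
const BASE_DENYLIST: string[] = [
  "auth", "token", "secret", "password", "passwd", "pwd", "key", "jwt",
  "bearer", "sso", "saml", "csrf", "xsrf", "credentials", "session", "sid",
  "identity",
];

const FILTERED = "[Filtered]";

// Partial, case-insensitive match: a key is sensitive if any denylist term
// appears anywhere in the lowercased key name.
function isSensitiveKey(keyName: string, denylist: string[]): boolean {
  const lowered = keyName.toLowerCase();
  return denylist.some((term) => lowered.indexOf(term.toLowerCase()) !== -1);
}

// Keeps every key and replaces only the values of matching keys.
function scrubValues(
  pairs: Record<string, string>,
  denylist: string[] = BASE_DENYLIST,
): Record<string, string> {
  const scrubbed: Record<string, string> = {};
  for (const keyName of Object.keys(pairs)) {
    scrubbed[keyName] = isSensitiveKey(keyName, denylist) ? FILTERED : pairs[keyName];
  }
  return scrubbed;
}
```

For example, `scrubValues({ Authorization: "Bearer abc", Accept: "text/html" })` keeps both keys, replaces the `Authorization` value with `"[Filtered]"`, and leaves `Accept` unchanged.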

#### PII Denylist (when `includeUserInfo` is `false`)

When `includeUserInfo` is `false`, SDKs **MUST** apply the base denylist **and** additionally treat the following as sensitive:

- Any data that contains email, user ID, IP address, username, or machine name (if applicable)
- Any key containing **`x-forwarded-`** (e.g. `x-forwarded-for`, `x-forwarded-host`) — often carries client IP or host
- Any key ending with or containing **`-user`** (e.g. `x-user-id`, `remote-user`) — often carries user identifiers

Effective denylist when PII is disabled: base list + `["x-forwarded-", "-user"]` (partial match, case-insensitive).
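
A sketch of how the effective denylist could be assembled from the two lists above. `effectiveDenylist` and `isDeniedKey` are hypothetical names. Only the key-name terms are covered; the first bullet (data that contains an email, user ID, and so on) needs value-level detection that this sketch does not attempt.

```typescript
// Base terms copied from the Base Denylist section.
const BASE_TERMS: string[] = [
  "auth", "token", "secret", "password", "passwd", "pwd", "key", "jwt",
  "bearer", "sso", "saml", "csrf", "xsrf", "credentials", "session", "sid",
  "identity",
];

// Additional terms applied only when includeUserInfo is false.
const PII_TERMS: string[] = ["x-forwarded-", "-user"];

function effectiveDenylist(includeUserInfo: boolean): string[] {
  return includeUserInfo ? BASE_TERMS.slice() : BASE_TERMS.concat(PII_TERMS);
}

// Partial, case-insensitive match against the effective denylist.
function isDeniedKey(keyName: string, includeUserInfo: boolean): boolean {
  const lowered = keyName.toLowerCase();
  return effectiveDenylist(includeUserInfo).some((term) => lowered.indexOf(term) !== -1);
}
```

Here `X-Forwarded-For` and `Remote-User` are denied only while `includeUserInfo` is `false`, whereas `Authorization` is denied in both modes.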

#### Cookies and Cookie Headers

- SDKs **SHOULD** maintain a default denylist of cookie names using the same matching rule (e.g. `session`, `auth`, `identity`). Values for matching cookie names **MUST** be replaced with `"[Filtered]"`.
- **When individual cookie key-value pairs cannot be extracted** (e.g. malformed or opaque cookie string), the entire `Cookie` or `Set-Cookie` header value **MUST** be replaced with `"[Filtered]"`. Unfiltered raw cookie header values **MUST NOT** be sent. When in doubt, treat the whole cookie header as sensitive.
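
The fallback for unparseable cookie strings might look like the sketch below. `scrubCookieHeader` is a hypothetical helper, and its definition of "cannot be extracted" (any segment that lacks a `name=value` shape) is an assumption. It targets the request `Cookie` header only; a `Set-Cookie` value carries attributes such as `HttpOnly` and would need its own parser.

```typescript
const FILTERED_COOKIE = "[Filtered]";

// Returns a name -> value map, or "[Filtered]" when the header cannot be
// split into individual cookies.
function scrubCookieHeader(
  rawHeader: string,
  denylist: string[],
): Record<string, string> | string {
  const scrubbed: Record<string, string> = {};
  for (const part of rawHeader.split(";")) {
    const pair = part.trim();
    if (pair === "") {
      continue; // tolerate a trailing ";"
    }
    const separator = pair.indexOf("=");
    // Assumption: a segment without a name and "=" makes the header opaque,
    // so the whole value is treated as sensitive.
    if (separator <= 0) {
      return FILTERED_COOKIE;
    }
    const cookieName = pair.slice(0, separator);
    const lowered = cookieName.toLowerCase();
    const sensitive = denylist.some((term) => lowered.indexOf(term.toLowerCase()) !== -1);
    scrubbed[cookieName] = sensitive ? FILTERED_COOKIE : pair.slice(separator + 1);
  }
  return scrubbed;
}
```

Returning the filtered string early means no partially parsed cookie values leak when a later segment turns out to be malformed.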

#### Request Bodies
Contributor

m: Should the same apply for response bodies? This is (or will be, depends on the SDK) being recorded now for Session Replay

Contributor

This configuration is set in SessionReplay configuration, it may be worth aligning there

Member Author

I included them, but made the distinction that outgoing bodies are only applicable in a server environment.


When request or response bodies are collected (`incomingRequestBody` / `outgoingRequestBody`):

- **Parseable as JSON or form data:** SDKs **MAY** extract key-value pairs and apply the same denylist rules to keys. Values for matching keys **MUST** be replaced with `"[Filtered]"`. This allows selective scrubbing while retaining non-sensitive fields for debugging.
- **Not parseable (raw bodies):** The body **MUST NOT** be attached to the event. When the SDK cannot parse the body into key-value structure, the entire body **MUST** be replaced with `"[Filtered]"`.
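
The two body cases can be sketched for JSON as follows. `scrubBody` and its helpers are hypothetical. Recursing into nested objects and treating a bare JSON scalar as having no key-value structure are both assumptions; form data is not covered.

```typescript
const BODY_FILTERED = "[Filtered]";

function isSensitiveBodyKey(keyName: string, denylist: string[]): boolean {
  const lowered = keyName.toLowerCase();
  return denylist.some((term) => lowered.indexOf(term.toLowerCase()) !== -1);
}

// Recursively walks parsed JSON, replacing values stored under sensitive keys.
function scrubJsonValue(value: unknown, denylist: string[]): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => scrubJsonValue(item, denylist));
  }
  if (typeof value === "object" && value !== null) {
    const source = value as Record<string, unknown>;
    const scrubbed: Record<string, unknown> = {};
    for (const keyName of Object.keys(source)) {
      scrubbed[keyName] = isSensitiveBodyKey(keyName, denylist)
        ? BODY_FILTERED
        : scrubJsonValue(source[keyName], denylist);
    }
    return scrubbed;
  }
  return value;
}

function scrubBody(rawBody: string, denylist: string[]): unknown {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawBody);
  } catch {
    // Not parseable: never attach the raw body.
    return BODY_FILTERED;
  }
  // Assumption: a bare string or number has no keys to inspect.
  if (typeof parsed !== "object" || parsed === null) {
    return BODY_FILTERED;
  }
  return scrubJsonValue(parsed, denylist);
}
```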

No built-in option scrubs **keys**; users who need to hide header or cookie names **MUST** use `beforeSend` (or equivalent).

</SpecSection>

<SpecSection id="user-set-data-scrubbing" status="draft" since="1.0.0">

### User-Set Data and Scrubbing

When the user **explicitly** sets data on the scope (user, request, response, tags, contexts, etc.) or on a span, log, or other telemetry, that data is **not** gated by `dataCollection`. It **MUST** always be attached to outgoing telemetry. The same applies to data the user provides via `beforeSend` or event processors.
Contributor

Thanks for the clarification 👍


SDKs **SHOULD** only replace sensitive values with `"[Filtered]"` when the data is gathered **automatically** through instrumentation. If the user explicitly provides data (e.g. by setting a request object on the scope), the SDK **MUST NOT** modify it; the user is responsible for what they attach.

Users can register callbacks (e.g. `beforeSend`, event processors) to remove or redact any data — including keys — before events are sent. This spec does not replace those hooks; they remain the mechanism for custom filtering and key removal.
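
As an illustration of key removal in user code, the sketch below drops selected header names from an event before it is sent. `MinimalEvent` is a simplified stand-in for the SDK's event type and `dropHeaderKeys` is a hypothetical helper; a user would call something like it from their `beforeSend` callback.

```typescript
// Simplified stand-in for an SDK event; real events carry many more fields.
interface MinimalEvent {
  request?: {
    headers?: Record<string, string>;
  };
}

// Removes the named headers entirely (key and value), ignoring case.
function dropHeaderKeys(sentEvent: MinimalEvent, names: string[]): MinimalEvent {
  const headers = sentEvent.request && sentEvent.request.headers;
  if (!headers) {
    return sentEvent;
  }
  const lowered = names.map((headerName) => headerName.toLowerCase());
  const kept: Record<string, string> = {};
  for (const keyName of Object.keys(headers)) {
    if (lowered.indexOf(keyName.toLowerCase()) === -1) {
      kept[keyName] = headers[keyName];
    }
  }
  // Return a copy so the caller's event object is not mutated.
  return { ...sentEvent, request: { ...sentEvent.request, headers: kept } };
}
```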

</SpecSection>

---

## Public API

The `dataCollection` option is passed to the SDK's init function. All fields are optional; omitting a field uses the default.

```pseudocode
init({
  dataCollection: {
    includeUserInfo: boolean, // default: false
    collect: {
      cookies: Collection, // default: true
      httpHeaders: Collection, // default: true
      queryParams: Collection, // default: true
      aiAgentMessages: boolean, // default: true
      stackFrameVariables: boolean, // default: true
      incomingRequestBody: boolean, // default: TBD
      outgoingRequestBody: boolean, // default: TBD
      frameContextLines: number, // default: 5 (boolean fallback: true)
    },
  },
})
```

### `dataCollection.includeUserInfo`

| Property | Value |
|----------|-------|
| Type | Boolean |
| Default | `false` |
| Since | 1.0.0 |
| Description | Primary PII toggle. Enables automatic collection of user identity fields (`user.id`, `user.email`, `user.username`, `user.ip_address`). Also sets the effective default for PII-heavy `collect` options. |

### `dataCollection.collect` Options

| Key | Option Type | Default | Since | Description |
|-----|-------------|---------|-------|-------------|
| `cookies` | Collection | `true` | 1.0.0 | Include cookie values; keys filtered by the default denylist or by allow/deny lists. |
| `httpHeaders` | Collection | `true` | 1.0.0 | Include HTTP header values; keys filtered by the default denylist or by allow/deny lists. |
| `queryParams` | Collection | `true` | 1.0.0 | Include URL query parameter values; keys filtered by the default denylist or by allow/deny lists. |
| `aiAgentMessages` | Boolean | `true` | 1.0.0 | Include AI agent input and output messages. |
| `stackFrameVariables` | Boolean | `true` | 1.0.0 | Include local variable values captured within stack frames. |
| `incomingRequestBody` | Boolean | TBD | 1.0.0 | Include full body of the incoming HTTP request. |
| `outgoingRequestBody` | Boolean | TBD | 1.0.0 | Include full body of outgoing HTTP requests. |
| `frameContextLines` | Number (Boolean fallback) | `5` (`true`) | 1.0.0 | Number of lines of context to include around stack frames. |

<Expandable title="Why are some options boolean-only?">
Unlike cookies or headers, some data (e.g. request bodies) has no predictable key structure for the SDK to filter. Data can still be redacted in `beforeSend` or event processors if needed.
</Expandable>

---

## Examples

### Default Configuration

An explicit representation of all defaults (with `includeUserInfo: false`):

```typescript
init({
  dsn: "...",
  dataCollection: {
    includeUserInfo: false,
    collect: {
      cookies: true,
      httpHeaders: true,
      queryParams: true,
      aiAgentMessages: true,
      stackFrameVariables: true,
      incomingRequestBody: false,
      outgoingRequestBody: false,
      frameContextLines: 5,
    },
  },
});
```

### Maximum PII (Full Collection)

Enable full PII collection, including request bodies and AI messages:

```typescript
init({
  dsn: "...",
  dataCollection: {
    includeUserInfo: true,
    collect: {
      incomingRequestBody: true,
      outgoingRequestBody: true,
    },
  },
});
```

**Result:** Technical context and request/response data (headers, cookies, query params) are collected with the default denylist; request bodies, user identifiers, and AI agent messages are included; sensitive values are still replaced with `"[Filtered]"`.

### Granular Debugging

Include user info and only specific headers for debugging; exclude query params entirely:

```typescript
init({
  dsn: "...",
  dataCollection: {
    includeUserInfo: true,
    collect: {
      httpHeaders: { allow: ['x-request-id', 'x-trace-id', 'x-correlation-id'] },
      queryParams: false,
    },
  },
});
```

### Migration from `sendDefaultPii`

- **`sendDefaultPii: true`** (legacy) → `dataCollection: { includeUserInfo: true, collect: { aiAgentMessages: false } }`, keep most `collect` defaults
- **`sendDefaultPii: false`** (legacy) → `dataCollection: { includeUserInfo: false }` (or omit entirely — same as default)

SDKs **SHOULD** document this mapping and **MAY** implement `send_default_pii` as a compatibility shim that sets `includeUserInfo`.
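
A compatibility shim along these lines might look like the sketch below. `LegacyAwareOptions` and `resolveIncludeUserInfo` are hypothetical names, and giving an explicitly set `includeUserInfo` precedence over the legacy flag is an assumption the spec does not state.

```typescript
// Hypothetical init options that accept both the legacy and the new setting.
interface LegacyAwareOptions {
  sendDefaultPii?: boolean;
  dataCollection?: {
    includeUserInfo?: boolean;
  };
}

function resolveIncludeUserInfo(options: LegacyAwareOptions): boolean {
  const explicit = options.dataCollection && options.dataCollection.includeUserInfo;
  // Assumption: the new option wins whenever the user sets it.
  if (explicit !== undefined) {
    return explicit;
  }
  // Otherwise fall back to the legacy flag; both default to false.
  return options.sendDefaultPii === true;
}
```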
Contributor

@alexander-alderman-webb, Mar 11, 2026

Thanks for this, global options sound great!
What's the plan to migrate away from integration-level options? Such as recordInputs and recordOutputs in JavaScript AI integrations.

Member Author

Right now, I specified one option aiAgentMessages for all messages. Do you think it would make sense to split this into aiAgentInputMessages and aiAgentOutputMessages?

Contributor

to summarize in-person discussion:

  • Once the spec is merged and implemented, the integration-level options are deprecated and then removed in the next major of the respective SDKs. For finer control, users must use hooks. cc @nicohrubec
  • According to the LLM monitoring RFC we will have two options. See this commit. They are called record_inputs and record_outputs. Translating to global options (and camel case), I propose recordGenerativeAIInputs and recordGenerativeAIOutputs.

Member Author

Thanks for the suggestion. I'll call them generativeAIInputs and generativeAIOutputs; the "record" prefix is a bit repetitive, as all the options are about "recording" something.


---

## Changelog

<SpecChangelog />