Skip to content

docs(sdks): Add spec for dataCollection option to supersede sendDefaultPii#16796

Merged
s1gr1d merged 25 commits intomasterfrom
sig/data-collection-options
Apr 9, 2026
Merged

docs(sdks): Add spec for dataCollection option to supersede sendDefaultPii#16796
s1gr1d merged 25 commits intomasterfrom
sig/data-collection-options

Conversation

@s1gr1d
Copy link
Copy Markdown
Member

@s1gr1d s1gr1d commented Mar 5, 2026

This PR extends the Data Collection spec so it is the single place for what SDKs collect and how they scrub it. It adds concrete denylist behavior, request-body and cookie rules, and pulls in the relevant scrubbing behavior from the Data Handling doc.

IS YOUR CHANGE URGENT?

Help us prioritize incoming PRs by letting us know when the change needs to go live.

  • Urgent deadline (GA date, etc.):
  • Other deadline:
  • None: Not urgent, can wait up to 1 week+

SLA

  • Teamwork makes the dream work, so please add a reviewer to your PRs.
  • Please give the docs team up to 1 week to review your PR unless you've added an urgent due date to it.
    Thanks in advance for your help!

PRE-MERGE CHECKLIST

Make sure you've checked the following before merging your changes:

  • Checked Vercel preview for correctness, including links
  • PR was reviewed and approved by any necessary SMEs (subject matter experts)
  • PR was reviewed and approved by a member of the Sentry docs team

@vercel
Copy link
Copy Markdown

vercel bot commented Mar 5, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
develop-docs Ready Ready Preview, Comment Apr 9, 2026 1:18pm
1 Skipped Deployment
Project Deployment Actions Updated (UTC)
sentry-docs Ignored Ignored Preview Apr 9, 2026 1:18pm

Request Review

@s1gr1d s1gr1d changed the title docs(sdks): Add docs on dataCollection docs(sdks): Add spec for dataCollection option to supersede sendDefaultPii Mar 5, 2026
@s1gr1d s1gr1d requested a review from cleptric March 5, 2026 13:36
@s1gr1d s1gr1d force-pushed the sig/data-collection-options branch from 441172f to e99813b Compare March 5, 2026 13:58
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This page does not exist anymore - we moved the PII related content to
https://develop.sentry.dev/sdk/foundations/data-scrubbing/

@dingsdax dingsdax added the sdk-develop-docs PRs touching develop-docs/sdk label Mar 9, 2026
@cleptric
Copy link
Copy Markdown
Member

cleptric commented Mar 9, 2026

We should include some more general information, such as how we expect the default snippet too like now, which is very minimal and in-line with the behavior of sendDefualtPii.

sentry.init({
  dsn: '...',
})

// or with user information

sentry.init({
  dsn: '...',
  dataCollection: {
    includeUserInfo: true,     
  }
})

and also mention the reasoning of this change. In particular, we should highlight that we now include more context by default, without a change in our position to be privacy first.

Copy link
Copy Markdown
Contributor

@itaybre itaybre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This make sense, but got some questions here


The following terms **MUST** be included in the default denylist for headers, and **SHOULD** be applied to cookies and query params where applicable:

`["auth", "token", "secret", "password", "passwd", "pwd", "key", "jwt", "bearer", "sso", "saml", "csrf", "xsrf", "credentials", "session", "sid", "identity"]`
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m: We have some additional filtered headers on cocoa that may be relevant here (https://github.com/getsentry/sentry-cocoa/blob/main/Sources/Swift/Core/Tools/HTTPHeaderSanitizer.swift#L8): X-REAL-IP and REMOTE-ADDR

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, as x-real-ip and remote-addr are not considered "sensitive" but PII, I added those to the list of user-sensitive header snippets 👍

- SDKs **SHOULD** maintain a default denylist of cookie names using the same matching rule (e.g. `session`, `auth`, `identity`). Values for matching cookie names **MUST** be replaced with `"[Filtered]"`.
- **When individual cookie key-value pairs cannot be extracted** (e.g. malformed or opaque cookie string), the entire `Cookie` or `Set-Cookie` header value **MUST** be replaced with `"[Filtered]"`. Unfiltered raw cookie header values **MUST NOT** be sent. When in doubt, treat the whole cookie header as sensitive.

#### Request Bodies
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m: Should the same apply for response bodies? This is (or will be, depends on the SDK) being recorded now for Session Replay

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This configuration is set in SessionReplay configuration, it may be worth aligning there

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I included them, but made the distinction that outgoing bodies are only applicable in a server-environment.


### User-Set Data and Scrubbing

When the user **explicitly** sets data on the scope (user, request, response, tags, contexts, etc.) or on a span, log, or other telemetry, that data is **not** gated by `dataCollection`. It **MUST** always be attached to outgoing telemetry. The same applies to data the user provides via `beforeSend` or event processors.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification 👍

- User identifiers (user ID, username, email)
- IP address
- Cookies and headers that identify the user or session
- HTTP request data (TBD)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

h: What about request paths?
Some requests may be identifiable, like /user/USER_ID
Should we have a denylist/allowlist for url paths?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question 🤔
What we currently do in JS: When we either know it's a param route, we use the appropriate parametrized route name (e.g. user/:id) as the transaction name but the full URL (e.g. user/123) is still added in the attributes. @cleptric Any opinions on that?

Comment on lines +316 to +321
### Migration from `sendDefaultPii`

- **`sendDefaultPii: true`** (legacy) → `dataCollection: { includeUserInfo: true, collect: { aiAgentMessages: false } }`, keep most `collect` defaults
- **`sendDefaultPii: false`** (legacy) → `dataCollection: { includeUserInfo: false }` (or omit entirely — same as default)

SDKs **SHOULD** document this mapping and **MAY** implement `send_default_pii` as a compatibility shim that sets `includeUserInfo`.
Copy link
Copy Markdown
Contributor

@alexander-alderman-webb alexander-alderman-webb Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this, global options sound great!
What's the plan to migrate away from integration-level options? Such as recordInputs and recordOutputs in JavaScript AI integrations.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, I specified one option aiAgentMessages for all messages. Do you think it would make sense to split this into aiAgentInputMessages and aiAgendOutputMessages?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to summarize in-person discussion:

  • Once the spec is merged and implemented the integration level options are deprecated, and then removed in the next major of the respective SDKs. For finer controls users must use hooks. cc @nicohrubec
  • According to the LLM monitoring RFC we will have two options. See this commit. They are called record_inputs and record_outputs. Translating to global options (and camel case), I propose recordGenerativeAIInputs and recordGenerativeAIOutputs.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion. I'll call them generativeAIInputs and generativeAIOutputs - the record is a bit repetitive as all the options are about "recoding" something.

Copy link
Copy Markdown
Member

@philipphofmann philipphofmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for tackling this, @s1gr1d. I have one major, high-level, general concept question we should address before moving forward (#16796 (comment)).

- Device and environment context (OS, runtime, non-PII identifiers)
- Performance and error context (stack frames, breadcrumbs, span metadata)
- Framework/routing context where it does not contain PII or secrets
- AI agent messages (input, output, metadata)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

h: I think AI Agent Messages can actually contain PII. I think that should be under PII. We have that under PII in the user-facing docs, see for example Python.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we explicitly want to emit those by default, else our product becomes useless. Also, this is not Pii, this is maybe Pii, and everything is maybe Pii.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I get that. I'm fine with that, but I think we also need to align our docs because that confused me a bit that it was under PII for Python, for example. But that is out of scope of this PR.


#### 1. Technical Context Data

Non-identifying context used for debugging and performance:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m: I think it would be great to clearly state that these here are just examples, but not a complete list. I think it would be great to add links to references where you can find a complete list, or we add a complete list here and then link from the other places in the doc that this is now here, the spec for the complete list. This applies to all three data tiers.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is missing?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The event context payload has way more properties: https://develop.sentry.dev/sdk/foundations/transport/event-payloads/contexts/

For example: culture context, GPU context, app context (version, permissions, view names), cloud resource context, memory info context, ...

Comment on lines +84 to +87
`dataCollection` accepts two fields:

- **`includeUserInfo`** — the primary toggle for Personally Identifiable Information (PII). Controls whether user-identity fields are included in automatic collection, and sets the default for PII-heavy `collect` options (such as HTTP request bodies - TBD). Defaults to `false`.
- **`collect`** — controls which categories of request/response and runtime data are gathered. See [`collect` Option Behavior](#collect-option-behavior) and [How Defaults Cascade](#how-defaults-cascade).
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

h: The nested config with includeUserInfo cascading into collect defaults confuses me a bit. When I look at cookies: true, I can't tell what actually gets sent without also checking includeUserInfo — the same option behaves differently depending on another flag. For a privacy-sensitive config, I think it's important that users can look at their config and immediately know what's being collected.

I wonder if a flatter approach would work better:

1. A flat list of explicit options — each boolean or string list independently controls exactly what gets sent. No flag changes the behavior of other flags.

2. Presets / factory constructors that return a complete, resolved config:

const config = DataCollection.default();
// { userInfo: false, incomingRequestBody: false,
//   aiAgentMessages: true, stackFrameVariables: true,
//   cookies: ['locale', 'theme'],
//   httpHeaders: ['content-type', 'accept', 'x-request-id'],
//   queryParams: true, ... }

// Tweak individual fields — no surprises
config.userInfo = true;
config.incomingRequestBody = true;
config.httpHeaders.push('x-custom-trace');

init({ dsn: "...", dataCollection: config });

// Or start from a PII-inclusive preset
init({ dsn: "...", dataCollection: DataCollection.withPii() });

Why I think this could be better:

  • Transparency: Each option means exactly one thing. No implicit cascades to reason about.
  • Debuggability: SDK can log the resolved config at init. Support can ask "what's your config?" and get an unambiguous answer.
  • Simpler implementation: SDK devs check one value per data type — no "if includeUserInfo is false, then the effective default for incomingRequestBody is false, unless explicitly overridden" logic.
  • Presets handle the common cases: Most users want "default" or "with PII" — the factory gives them that in one call, and they can still tweak individual fields.

The tradeoff is slightly more verbose config for custom setups, but I think the clarity is worth it — especially when "I can't easily tell what's being sent" is a real problem.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As talked offline, this is a good suggestion that makes things more clear. I agree, the approach with "silently" overwriting the collection options can be confusing. I'll try to incorporate this into the spec 👍

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially had the idea of using presets that can be overwritten. It's the same concept but still a bit clumsy. Could look like this:

init({
  dataCollection: {
    preset: "default", // or "withPii"
    overrides: {
      userInfo: true,
      incomingRequestBody: true,
    },
  },
});

The approach you mentioned (have pre-configured objects) is very straight-forward and easily testable. Users would also be able to re-use the pre-configured options and e.g. filter on them.

Should also be possible with other languages Examples:

  • JS/TS: static factory funcs returning plain objects
  • Java/C#: DataCollectionConfig.defaultConfig() + builder mutation/copy
  • Python: dataclass + classmethods
  • Ruby/PHP: hash/array factories

Copy link
Copy Markdown
Contributor

@coolguyzone coolguyzone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@s1gr1d
Copy link
Copy Markdown
Member Author

s1gr1d commented Mar 27, 2026

I addressed the feedback, renamed some options and restructured the document a bit. I put a summary of the changes in the description.

Copy link
Copy Markdown
Collaborator

@giortzisg giortzisg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @s1gr1d. The spec makes a lot of sense. Just left a high-level comment.


<SpecSection id="option-types" status="draft" since="0.1.0">

### Option Types
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel that the Option Types are heavily optimized for Javascript. For statically typed languages this is really hard to implement. I would consider for this section to be kept as a semantic model for how the option types work and leave the implementation details to each SDK.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the options that would be hard to implement in Go for example?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Union and nullable types in general. In Go we need something like this:

KeyValueCollectionPolicy {
  mode: Off | DenyList | AllowList
  terms: string[]
}

Copy link
Copy Markdown
Collaborator

@giortzisg giortzisg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes, LGTM 🚀

Copy link
Copy Markdown
Member

@cleptric cleptric left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this!

Comment on lines +486 to +487
generativeAIInputs?: boolean, // default: true
generativeAIOutputs?: boolean, // default: true
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
generativeAIInputs?: boolean, // default: true
generativeAIOutputs?: boolean, // default: true
genAIInputs?: boolean, // default: true
genAIOutputs?: boolean, // default: true

@s1gr1d s1gr1d enabled auto-merge (squash) April 9, 2026 13:11
@s1gr1d s1gr1d merged commit 223147f into master Apr 9, 2026
18 checks passed
@s1gr1d s1gr1d deleted the sig/data-collection-options branch April 9, 2026 13:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

sdk-develop-docs PRs touching develop-docs/sdk

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants