Skip to content

fix: split inproc handler thread#1446

Merged
supervacuus merged 142 commits intomasterfrom
fix/split_inproc_handler_thread
Feb 18, 2026
Merged

fix: split inproc handler thread#1446
supervacuus merged 142 commits intomasterfrom
fix/split_inproc_handler_thread

Conversation

@supervacuus
Copy link
Copy Markdown
Collaborator

@supervacuus supervacuus commented Nov 10, 2025

This is a more elaborate, long-term fix to getsentry/sentry-java#4830 than #1444.

It also finishes the work done here: #1088
And fixes the issues raised here:

So, while the driver for this PR is a downstream issue that exposes the signal-unsafety of some parts of the current inproc implementation, it also addresses a much broader range of concerns that regularly affect inproc users on all platforms.

At a high level, it introduces a separate handler thread for inproc, which the signal handler (or UEF on Windows) wakes after it exchanges crash context data.

The idea is that we minimize signal handler/UEF to do the least amount of syscall stuff (or at least the subset documented in the signal-safety man-page), while the handler thread can execute functions outside that range (with limitations, since thread sync and heap allocations are still problematic). This allows us to reuse stdio functionality like formatters without running squarely into UB territory or having to rewrite all utilities to async-signal-safe versions, as in #1444.

There are a few considerable changes to mention:

  • since we run the event construction in a separate handler thread, the use of backtrace() or any unwinder that runs from the "current" instruction address is entirely useless (ignoring the fact that backtrace() was always signal-unsafe to begin with, which itself was the source of crashes, hangs or just empty stack traces).
  • this means we require a "user context"-based stack walker in inproc, which we already partially acknowledged in Using libunwind for mac, since backtrace do not expect thread context… #1088 and fix: support musl on Linux #1233.
  • on Linux, this PR requires libunwind (the nognu implementation, not the llvm one, which is a pure C++ exception unwinder), which is a breaking change (at least in the sense that users now require an additional dependency at build and runtime). This means that the "general" Linux usage is now the same as with the musl libc environments.
  • on macOS, we provide a user context stack-walker based on frame pointer records for arm64 and x86-64, and use the system-provided libunwind for the default stack-trace from a call-site. It turned out that the system-provided libunwind wasn't safe enough to use in the context of the signal handler (either led to hangs or had issues with escaping the trampoline). This means users can now use inproc on macOS again (if their code is compiled without omitting frame pointers, which is always the case by default on macOS).

Further improvements/fixes (summarizing the 30 commits, which I didn't want to squash):

  • the libunwind-based unwinder modules now also validate retrieved ucontext pointers against memory mapping (for Linux and macOS)
  • got rid of all remaining __sync functions and replaced them with __atomic (especially the signal handler blocking logic and the spinlock)
  • rectified the inconsistent usage of C++ new with std::nothrow throughout the affected backend code (including the initialization of crashpad_state_t, which still used malloc and memset although it has std::atomic members)
  • cleaned up the CMake configure phase of the integration test suite.
  • ensures that test fixtures do not end up in macOS bundles
  • fixes build issues with by-default PIE and LTO builds
  • musl is no longer a special case "Linux" in the build script
  • fixes a couple of warnings and test-case instabilities
  • introduce macos-26 build config

TODOs:

  • finish the x86-64 stackwalker for macOS (and clean up the code)
  • Figure out if we need the libbacktrace fallback at all and how to handle it.
  • provide a module-level description of the new mechanism in inproc
  • Decide on having the change
  • Update documentation

* use `std::nothrow` `new` consistently to keep exception-free semantics for allocation
* rename static crashpad_handler to have no module-public prefix
* use `nullptr` for arguments where we previously used 0 to clarify that those are pointers
* eliminate the `memset()` of the `crashpad_state_t` initialization since it now contains non-trivially constructable fields (`std::atomic`) and replace it with `new` and an empty value initializer.
…ld, since libraries like libunwind.a might be packaged without PIC.
…ms with architecture prefixes (32-bit Linux)
…stack

also ensure to get the first frame
harmonize libunwind usage
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

Comment thread CMakeLists.txt Outdated
jpnurmi added a commit to getsentry/sentry-dotnet that referenced this pull request Feb 18, 2026
Removed note about 'libunwind' requirement for Linux.
Update CHANGELOG with breaking changes and new features.
@supervacuus
Copy link
Copy Markdown
Collaborator Author

@jpnurmi, I saw you also ran against this branch. Any reasons not to merge this?

Copy link
Copy Markdown
Collaborator

@jpnurmi jpnurmi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

Tested with sentry-dotnet: getsentry/sentry-dotnet#4926

#endif

/**
* Inproc Backend Introduction
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👏

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I really tried not to be terse here because there is a lot of non-obvious, non-explicit stuff going on. At the same time, it required many explicit code changes to model the domain, which could feel overwhelming without the matching prose.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

Comment thread tests/cmake.py
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

Comment thread src/backends/sentry_backend_inproc.c
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

Comment thread src/sentry_sync.h
@supervacuus supervacuus merged commit eecc7fa into master Feb 18, 2026
46 checks passed
@supervacuus supervacuus deleted the fix/split_inproc_handler_thread branch February 18, 2026 12:22
@getsentry getsentry deleted a comment from github-actions Bot Feb 19, 2026
supervacuus added a commit to getsentry/sentry-docs that referenced this pull request Feb 20, 2026
## DESCRIBE YOUR PR
This PR updates the native docs wrt the changes introduced here:
getsentry/sentry-native#1446

## IS YOUR CHANGE URGENT?  

Help us prioritize incoming PRs by letting us know when the change needs
to go live.
- [ ] Urgent deadline (GA date, etc.): <!-- ENTER DATE HERE -->
- [ ] Other deadline: <!-- ENTER DATE HERE -->
- [x] None: Not urgent, can wait up to 1 week+

## SLA

- Teamwork makes the dream work, so please add a reviewer to your PRs.
- Please give the docs team up to 1 week to review your PR unless you've
added an urgent due date to it.
Thanks in advance for your help!

## PRE-MERGE CHECKLIST

*Make sure you've checked the following before merging your changes:*

- [x] Checked Vercel preview for correctness, including links
- [x] PR was reviewed and approved by any necessary SMEs (subject matter
experts)
- [ ] PR was reviewed and approved by a member of the [Sentry docs
team](https://github.com/orgs/getsentry/teams/docs)

## LEGAL BOILERPLATE

<!-- Sentry employees and contractors can delete or ignore this section.
-->

Look, I get it. The entity doing business as "Sentry" was incorporated
in the State of Delaware in 2015 as Functional Software, Inc. and is
gonna need some rights from me in order to utilize my contributions in
this here PR. So here's the deal: I retain all rights, title and
interest in and to my contributions, and by keeping this boilerplate
intact I confirm that Sentry can use, modify, copy, and redistribute my
contributions, under Sentry's choice of terms.

## EXTRA RESOURCES

- [Sentry Docs contributor guide](https://docs.sentry.io/contributing/)
jamescrosswell pushed a commit to getsentry/sentry-dotnet that referenced this pull request Feb 24, 2026
* deps(sentry-native): checkout fix/split_inproc_handler_thread

getsentry/sentry-native#1446

* ci: install libunwind-dev and liblzma-dev on Linux

* build: adapt Sentry.Native.targets

* Enable diagnostic verbosity

* sentry-native: vendor_libunwind

getsentry/sentry-native#1523

* Revert "ci: install libunwind-dev and liblzma-dev on Linux"

This reverts commit 5746009.

* Revert "Enable diagnostic verbosity"

This reverts commit 288a685.

* Adapt build-sentry-native.ps1 & Sentry.Native.targets

* sentry-native: switch back to fix/split_inproc_handler_thread

* Combine None/Include
jamescrosswell pushed a commit to getsentry/sentry-dotnet that referenced this pull request Feb 24, 2026
* chore: update modules/sentry-native to 0.13.0

* build: migrate to Native SDK v0.13.0 (#4926)

* deps(sentry-native): checkout fix/split_inproc_handler_thread

getsentry/sentry-native#1446

* ci: install libunwind-dev and liblzma-dev on Linux

* build: adapt Sentry.Native.targets

* Enable diagnostic verbosity

* sentry-native: vendor_libunwind

getsentry/sentry-native#1523

* Revert "ci: install libunwind-dev and liblzma-dev on Linux"

This reverts commit 5746009.

* Revert "Enable diagnostic verbosity"

This reverts commit 288a685.

* Adapt build-sentry-native.ps1 & Sentry.Native.targets

* sentry-native: switch back to fix/split_inproc_handler_thread

* Combine None/Include

---------

Co-authored-by: GitHub <noreply@github.com>
Co-authored-by: J-P Nurmi <jpnurmi@gmail.com>
jpnurmi added a commit that referenced this pull request Mar 11, 2026
SA_NODEFER (added in #1446) is incompatible with the CHAIN_AT_START
signal handler strategy. When chaining to the runtime's signal handler
(e.g. Mono), the runtime may reset the signal to SIG_DFL and re-raise.
With SA_NODEFER the re-raised signal is delivered immediately, killing
the process before our handler can regain control.

Without SA_NODEFER, the re-raised signal is blocked during handler
execution, allowing the runtime handler to return and sentry-native
to proceed with crash capture.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jpnurmi added a commit that referenced this pull request Mar 14, 2026
SA_NODEFER (added in #1446) is incompatible with the CHAIN_AT_START
signal handler strategy. When chaining to the runtime's signal handler
(e.g. Mono), the runtime may reset the signal to SIG_DFL and re-raise.
With SA_NODEFER the re-raised signal is delivered immediately, killing
the process before our handler can regain control.

Without SA_NODEFER, the re-raised signal is blocked during handler
execution, allowing the runtime handler to return and sentry-native
to proceed with crash capture.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jpnurmi added a commit that referenced this pull request Mar 17, 2026
SA_NODEFER (added in #1446) is incompatible with the CHAIN_AT_START
signal handler strategy. When chaining to the runtime's signal handler
(e.g. Mono), the runtime may reset the signal to SIG_DFL and re-raise.
With SA_NODEFER the re-raised signal is delivered immediately, killing
the process before our handler can regain control.

Without SA_NODEFER, the re-raised signal is blocked during handler
execution, allowing the runtime handler to return and sentry-native
to proceed with crash capture.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
BernhardMarconato pushed a commit to elgatosf/sentry-native that referenced this pull request Apr 21, 2026
* fix: split inproc with a handler thread

* prevent warning on write() return value

* get rid of unsused dispatch parameter

* and another unused return value for write()

* fix trivial windows compile def issues to run tests locally

* make tests a bit less brittle

* ensure that find_cp_path isn't built on windows.

* clean up backends

* use `std::nothrow` `new` consistently to keep exception-free semantics for allocation
* rename static crashpad_handler to have no module-public prefix
* use `nullptr` for arguments where we previously used 0 to clarify that those are pointers
* eliminate the `memset()` of the `crashpad_state_t` initialization since it now contains non-trivially constructable fields (`std::atomic`) and replace it with `new` and an empty value initializer.

* eliminate local handler strategy declaration in signal handler

* turn off painful warning noise

* add libunwind to the CI dependencies

* ensure we build the example with PIE disabled when doing a static build, since libraries like libunwind.a might be packaged without PIC.

* extend handler crash to windows

* further clean of inproc

* review and fix/remove remaining TODOs from branch changes

* eliminate multichar warning in crashpad backend when using GCC

* ensure libunwind in benchmark and codeql workflows

* fix Linux ARM64 build (warning in libunwind)

* provide an x86_64 ucontext stackwalker for macOS

* bump lower-end GCC to 10.5.0

* remove obsolete ASAN patch from CI

* ensure lower-end GCC also finds libunwind

* disable LTO for the example too

* use clang diagnostics only for __clang__

* install 32-bit libunwind package for 32-bit Linux CI test config

* fix weird linker asymmetry

* provide empty path-suffix so that CMake can find libunwind on platforms with architecture prefixes (32-bit Linux)

* remove can_lock mechanism from the handler block API

* use PRIx64 without ull cast in sentry__value_new_addr()

* fix off-by-one bug when returning the trace length after walking the stack

also ensure to get the first frame
harmonize libunwind usage

* clean up cmake configure stage in pytest

* skip tests if zlib wasn't found during cmake configure stage

* fix stupid comparison error that led to all failing Linux tests

* exclude `logger_level` from transport unit tests.

* let find_path search the library architecture

* support i386 multilib on Ubuntu/Debian

* add 32-bit lzma package to Linux 32-bit build

* typo

* make even more elaborate multilib triplet and suffice handling based on FORCE32

* be even more careful wrt to max_frames in the fp walker and check on every access  for overrun (+ track with while instead for-looping)

* eliminate the non-atomic access to g_handler_thread_ready from has_handler_thread_crashed

* bring spinlock and pageallocator into the atomic fold

* fix: add GNU-stack note to crashpad_info_note.S

* update crashpad

* remove deprecated macOS 13 and replace it with macOS 26

* update crashpad after merging to getsentry

* simplify signal handler blocker by leaving and entering around the dispatch to the handler thread

* remove left-over handler switch

* reintroduce handling thread into cleaned up signal blocker for the remaining backends

* add valgrind suppression for false-positive deep in libunwind

* track failing ARM64/Linux clang-tsan build across available toolchain versions

* let valgrind match any frame between the libunwind obj matcher and the error site.

* move the check for a crashed handler thread into dispatch_ucontext() where we execute the unsafe part directly in the signal-handler if the handler thread is dead as a last chance report

* isolate the libunwind walker as a tsan cause in CI

* exclude walker variables to prevent unused variables Werror

* isolation pinpoint fine, but only exclude when using ucontext.

* silence clang-20 warning for crashpad

* update `crashpad` after merging to `getsentry`

* we know the walk crashes in aarch64 tsan, now introduce stack bound reader in the libunwind walker for Linux and log as much as possible to understand where the actual crash happens

* format

* add unistd.h for musl

* eliminate the logs and ensure we never walk the callers if the SP is in unmapped memory

* format

* fix invalid `find_mem_range()` return value check

* remove macOS left-overs in the `libbacktrace` unwinder module

* extract fp_walk for the macOS unwinder so we can also provide a stack-trace from an arbitrary frame-pointer

* for targets that must use `backtrace()` as an unwinder we fall back und running the deferred code directly inside the signal handler. Nothing changes for them.

* instead of spinning on atomic in the signal-handler add ACK pipe/semaphore on the return channel and let the OS block and wait.

Also check the return value of startup_handler_thread in the initialization and propagate the failure.

* actually check the FP against mach_vm_region bounds in the validation

* check mach_vm_region bounds only on macOS builds

* ensure we conditionally return on acquire in block_for_signal_handler

* move chain-first strategy completely outside the signal handler reentrancy guard

* up to now, we've been serializing signal handling even though we didn't know whether it was a runtime signal or one we should be handling
* this meant that we blocked all our critical sections during a managed exception
* it also meant that we blocked any concurrent managed exceptions
* it also meant that we introduced a race window during the time when we chained, because incoming signal on other threads would have gotten next in line, before we even completed the current signal handler

by moving it completely outside our synchronization we truly chain at start and don't interfere until we know we must.

* update changelog

* add inproc module-level docs for developers

* fix: return the number of frames when doing a stack walk from an FP directly.

* clean up inline docs

* replace iOS build inproc for crashpad

* clean up PAC stripping and actually test it in an arm64e build config

* bulk commit of local branches:

* introduces state-machine for in-progress crash handling vs other crashing threads
* extract async-safe logging macros
* adds arm64e support PAC registers (access to opaque fp, lr, sp, and pc)
* skip on_crash if we handle a failing handler thread (TODO: also before_send)
* handle failed pipe and semaphore signaling to handler thread
* introduce even more fall backs to handling inside the signal handler where it makes sense (i.e. failed attempt to signal handler thread)

* gate UB read in the mac libunwind stack walker

* after switching to nothrow new in the crashpad database clean up we actually have to check return values.

* revert ci build for ios back to inproc (no clue why this was switched)

* remove stupid copypasta from ci workflow

* exclude breakpad from the arm64e tests on macOS

* strip PAC also from unit-test function pointers

* add inproc concurrent crash stress test

* split out build_config.py from cmake.py so we can build and run new integration test executables with the same parametrization as the core artifacts and "example".

* look up pthread correctly using find_package() rather than assuming pthread exists as a separate library

* allow the inproc stress tests to run on Android using adb shell

* make the inproc output test assertions a bit more defensive

* deduplicate CMAKE_DEFINES from build_config refactor

* print macOS architecture in CI

* bump PAC runner to macOS-26 (since macOS-15 counter to docs is x86-64)

* don't run the inproc stress tests via adb shell in CI, it is just too unstable... it works great locally and should remain for local testing and debugging though.

* mark test_inproc_handler_thread_crash flaky for now

* fix Windows build issue (noinline)

* ensure that on Windows the inproc stress test executable placed directly in the build directory

* provide a more comprehensive solution to recursive crash scenarios:

* reintroduce unblocked signal handling thread to sentry__enter_signal_handler()
* but also add a counter that tracks how often that thread entered the signal handler
* and(!) configure all signal handlers as SA_NODEFER to allow recursive detection
* add additional tests and test hook-points to both crash and abort() from within the signal handler
* reintroduce the skip_hook idea into inproc, where a handler on the second recursive crash no longer calls on_crash or before_send

* make the inproc stress tests widechar path friendly

* do not set SA_NODEFER for SIGABRT

* introduce sigaltstack also to the handler thread

and set up a sigabrt handler for Windows, so inproc can handle abort()

* disable abort pop up on windows which blocks the test in CI indefinitely

* disable any stdio logging in inproc signal handler fallback paths.

* don't allow recursive abort(), allowing a crash handler to handle a crash in the handler that is coming from an abort is extremely unstable and why might extend that to other signals as well, but those typically come from the kernel as hard-fault translations. However abort() comes from libc infra and is far from safe to keep running.

We always assumed reports for crashed handlers to be a best effort topic, now we have way to detect those in the recursion and can act on particularly bad crashes in the handler to just skip.

* guard the chain at start strategy against SIGABRT

* disable the abort dialog for Win32 in the example when abort() is triggered

* extract abort check into a function so we can handle UNIX/Windows differences there

* ensure handler thread started before we crash

* assume that inproc stress tests output UTF-8 on Windows and enable replace error handling

* temporarily disable CI gate for inproc stress on android

* use a time-based timeout for the init wait on inproc handler_thread_ready

* permanently disable CI gate for inproc stress on android

* clean up changelog

* temporarily remove SIGABRT recursion gate

* update changelog

* also hide is_abort() for all Werror builds

* delete data in crashpad_backend_free() rather than sentry_free since it was allocated with new.

* add signal-safe address formatter

* remove `abort()` recursion-gate completely

* update README.md to remove inproc exceptions.

* ensure top-frames are put verbatim into the stack-trace

* add libunwind as build dependency for inproc to the CONTRIBUTING.md

* Terminate abort() if only SIG_DFL or SIG_IGN are left.

* improve report by explicitly providing a string for STATUS_FATAL_APP_EXIT

* fully clean up after a handler thread start timed out

* reset previous SIGABRT handler on Windows unconditionally

* handle frame duplication when LR matches the LR in current frame record

* switch compile guard to only __APPLE__ because that is how we use it

* don't let the stack trace tests run on android (check for sys.platform is not enough, because the android tests run on macOS runners)

* On x86_64, without an LR register, there's no mechanism for the FP-based unwinder to recover a frame when the callee has no frame
 record. So, lets limit that particular test-case to aarch64.

* vendor libunwind for Linux (getsentry#1523)

* remove return value from sentry__addr_to_string()

* vendor libunwind with a minimal Linux cmake setup for Linux x86, x86_64 and aarch64

* set CMAKE_SYSTEM_PROCESSOR=i686 for SENTRY_BUILD_FORCE32

* build libunwind with C11 target props

* fix the source instead of setting C11 on target

* move unw_context_t out of the else block.

* enable ASM for libunwind

* remove the ASM in ENABLED_LANGUAGES check for the breakpad build

* remove unused files

* document all changes

* also get rid of humongous autoreconf generated release artifacts

* remove no-pie/no-lto options from example entirely

* Update CONTRIBUTING.md to remove libunwind note

Removed note about 'libunwind' requirement for Linux.

* Revise CHANGELOG for breaking changes

Update CHANGELOG with breaking changes and new features.

* test: eliminate `llvm-cov` flag duplication (cmake.py vs build_config.py)

* Changed depth >= 2 to `skip_hooks` in the Unix pipe-failure fallback path
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants