
[RFC] Fix CUtensorMap segfault due to alignas(64) vs Python allocator mismatch#1692

Draft
henrylhtsang wants to merge 2 commits into NVIDIA:main from henrylhtsang:fix/cutensormap-alignment-segfault

Conversation

@henrylhtsang

Description

CUtensorMap_st in cuda.h uses alignas(64), but when embedded inline as _pvt_val in a Cython cdef class, CPython's allocator (PyObject_Malloc) only
guarantees 16-byte alignment. The compiler generates aligned instructions (e.g. movaps) for the 64-byte-aligned member, which causes SIGSEGV on the unaligned
PyObject memory.

Fix: heap-allocate with posix_memalign(64) instead of embedding inline.

Note: We are not experts on cuda-python internals — this fix was developed with AI assistance. Happy to iterate on the approach.

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

CUtensorMap_st in cuda.h is declared with alignas(64), but when embedded
inline as _pvt_val in a Cython cdef class, the Python object allocator
(PyObject_Malloc) only guarantees 8-16 byte alignment. The compiler sees
alignas(64) and may generate aligned instructions (e.g. movaps) for
zero-initializing the struct, causing SIGSEGV on the unaligned memory.

Fix by heap-allocating the CUtensorMap buffer with posix_memalign(64)
instead of embedding it inline in the Python object.
@copy-pr-bot
Contributor

copy-pr-bot bot commented Feb 25, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@henrylhtsang
Author

Updated. Please feel free to push back since I am not familiar with the stack.

@rwgk
Collaborator

rwgk commented Feb 26, 2026

I'm not an expert here either. Sharing the observations below, with a summary:

  • Every 12.* file (12.0.1 through 12.9.1) has the CUtensorMap block with alignas(64) and _Alignas(64), and the comment says aligning to 64 bytes.
  • Every 13.* file (13.0.2, 13.1.1) has alignas(128) and _Alignas(128), and the comment says aligning to 128 bytes.

Question: Do we need to use 128 in this PR for CTK 13 (and 64 for CTK 12)?


All released CTK 12 versions have this block in cuda.h:

/**
 * Size of tensor map descriptor
 */
#define CU_TENSOR_MAP_NUM_QWORDS 16

/**
 * Tensor map descriptor. Requires compiler support for aligning to 64 bytes.
 */
typedef struct CUtensorMap_st {
#if __cplusplus >= 201103L
    alignas(64)
#elif __STDC_VERSION__ >= 201112L
    _Alignas(64)
#endif
    cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS];
} CUtensorMap;

All released CTK 13 versions have this block in cuda.h:

/**
 * Size of tensor map descriptor
 */
#define CU_TENSOR_MAP_NUM_QWORDS 16

/**
 * Tensor map descriptor. Requires compiler support for aligning to 128 bytes.
 */
typedef struct CUtensorMap_st {
#if defined(__cplusplus) && (__cplusplus >= 201103L)
    alignas(128)
#elif __STDC_VERSION__ >= 201112L
    _Alignas(128)
#endif
    cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS];
} CUtensorMap;

For completeness, these are the files I looked at:

rwgk-win11.localdomain:~/ctk_downloads/extracted $ find . -type f -name cuda.h | sort
./12.0.1_525.85.12/cuda_cudart/targets/x86_64-linux/include/cuda.h
./12.1.1_530.30.02/cuda_cudart/targets/x86_64-linux/include/cuda.h
./12.2.2_535.104.05/cuda_cudart/targets/x86_64-linux/include/cuda.h
./12.3.2_545.23.08/cuda_cudart/targets/x86_64-linux/include/cuda.h
./12.4.1_550.54.15/cuda_cudart/targets/x86_64-linux/include/cuda.h
./12.5.1_555.42.06/cuda_cudart/targets/x86_64-linux/include/cuda.h
./12.6.3_560.35.05/cuda_cudart/targets/x86_64-linux/include/cuda.h
./12.8.1_570.124.06/cuda_cudart/targets/x86_64-linux/include/cuda.h
./12.9.1_575.57.08/cuda_cudart/targets/x86_64-linux/include/cuda.h
./13.0.2_580.95.05/cuda_cudart/targets/x86_64-linux/include/cuda.h
./13.1.1_590.48.01/cuda_cudart/targets/x86_64-linux/include/cuda.h

Member

@leofang leofang left a comment


Hi @henrylhtsang 👋 Thanks for reaching out to us!

Apologies for the confusion, but unfortunately we currently cannot accept external contributions to cuda.bindings, due to license limitations as outlined here. Note that the files under cuda_bindings/ are subject to the NV Software License instead of Apache 2.0. Also, the relevant code is auto-generated, so we'll need to fix this in the (internal) codegen instead.

I am actually confused by this PR; see below. Could you please provide a reproducer for the segfault, and share details of your environment, such as:

  • cuda.bindings version
  • CUDA driver version (attaching the output of nvidia-smi is fine)
  • CUDA Toolkit version (and how was it installed?)

Thanks!

Review comment on this diff hunk:

    Get memory address of class instance
    """
    cdef cydriver.CUtensorMap_st _pvt_val
    cdef void* _pvt_buf
Member

@leofang leofang Feb 27, 2026


This is where the confusion arises. Given how Cython compiles this, the member maps to the actual CUtensorMap from cuda.h, which should carry the alignment information (since the headers are build-time dependencies for cuda.bindings and get included by the generated .cpp files). So it is unclear to me how this struct member, after Cython emits C++ code that transforms this cdef class into a C struct (plus free functions), could lose its alignment requirement.

Given what Ralf alluded to above (that the alignment changes between CUDA 12 (64B) and 13 (128B)), I suspect there is a nontrivial mix-n-match between 12/13 somewhere in your environment.

@henrylhtsang
Author

Thanks @leofang,

I am on cuda.bindings 12.9.5, CUDA driver 580.82.07, a B200, and CUDA Toolkit 12.8.

I was told it could also be due to how we build cuda.bindings. Apparently we build with static linking instead of shared linking.

@henrylhtsang
Author

Trying to resolve this through other means.

@henrylhtsang henrylhtsang marked this pull request as draft February 28, 2026 01:39
@leofang
Member

leofang commented Feb 28, 2026

Sounds good, thanks Henry! If you could share with us a reproducer & core dump, we could also try to investigate next week.
