
[RFC] Fix CUtensorMap segfault due to alignas(64) vs Python allocator mismatch#1692

Draft
henrylhtsang wants to merge 2 commits into NVIDIA:main from henrylhtsang:fix/cutensormap-alignment-segfault

Conversation

@henrylhtsang

Description

CUtensorMap_st in cuda.h uses alignas(64), but when embedded inline as _pvt_val in a Cython cdef class, CPython's allocator (PyObject_Malloc) only
guarantees 16-byte alignment. The compiler generates aligned instructions (e.g. movaps) for the 64-byte-aligned member, which causes SIGSEGV on the unaligned
PyObject memory.

Fix: heap-allocate with posix_memalign(64) instead of embedding inline.

Note: We are not experts on cuda-python internals — this fix was developed with AI assistance. Happy to iterate on the approach.

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

CUtensorMap_st in cuda.h is declared with alignas(64), but when embedded
inline as _pvt_val in a Cython cdef class, the Python object allocator
(PyObject_Malloc) only guarantees 8-16 byte alignment. The compiler sees
alignas(64) and may generate aligned instructions (e.g. movaps) for
zero-initializing the struct, causing SIGSEGV on the unaligned memory.

Fix by heap-allocating the CUtensorMap buffer with posix_memalign(64)
instead of embedding it inline in the Python object.
@copy-pr-bot
Contributor

copy-pr-bot bot commented Feb 25, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@henrylhtsang
Author

Updated. Please feel free to push back since I am not familiar with the stack.

@rwgk
Collaborator

rwgk commented Feb 26, 2026

I'm not an expert here either. Sharing the observations below, with a summary:

  • Every 12.* file (12.0.1 through 12.9.1) has the CUtensorMap block with alignas(64) and _Alignas(64), and the comment says aligning to 64 bytes.
  • Every 13.* file (13.0.2, 13.1.1) has alignas(128) and _Alignas(128), and the comment says aligning to 128 bytes.

Question: Do we need to use 128 in this PR for CTK 13 (and 64 for CTK 12)?


All released CTK 12 versions have this block in cuda.h:

/**
 * Size of tensor map descriptor
 */
#define CU_TENSOR_MAP_NUM_QWORDS 16

/**
 * Tensor map descriptor. Requires compiler support for aligning to 64 bytes.
 */
typedef struct CUtensorMap_st {
#if __cplusplus >= 201103L
    alignas(64)
#elif __STDC_VERSION__ >= 201112L
    _Alignas(64)
#endif
    cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS];
} CUtensorMap;

All released CTK 13 versions have this block in cuda.h:

/**
 * Size of tensor map descriptor
 */
#define CU_TENSOR_MAP_NUM_QWORDS 16

/**
 * Tensor map descriptor. Requires compiler support for aligning to 128 bytes.
 */
typedef struct CUtensorMap_st {
#if defined(__cplusplus) && (__cplusplus >= 201103L)
    alignas(128)
#elif __STDC_VERSION__ >= 201112L
    _Alignas(128)
#endif
    cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS];
} CUtensorMap;

For completeness, these are the files I looked at:

rwgk-win11.localdomain:~/ctk_downloads/extracted $ find . -type f -name cuda.h | sort
./12.0.1_525.85.12/cuda_cudart/targets/x86_64-linux/include/cuda.h
./12.1.1_530.30.02/cuda_cudart/targets/x86_64-linux/include/cuda.h
./12.2.2_535.104.05/cuda_cudart/targets/x86_64-linux/include/cuda.h
./12.3.2_545.23.08/cuda_cudart/targets/x86_64-linux/include/cuda.h
./12.4.1_550.54.15/cuda_cudart/targets/x86_64-linux/include/cuda.h
./12.5.1_555.42.06/cuda_cudart/targets/x86_64-linux/include/cuda.h
./12.6.3_560.35.05/cuda_cudart/targets/x86_64-linux/include/cuda.h
./12.8.1_570.124.06/cuda_cudart/targets/x86_64-linux/include/cuda.h
./12.9.1_575.57.08/cuda_cudart/targets/x86_64-linux/include/cuda.h
./13.0.2_580.95.05/cuda_cudart/targets/x86_64-linux/include/cuda.h
./13.1.1_590.48.01/cuda_cudart/targets/x86_64-linux/include/cuda.h

Member

@leofang leofang left a comment


Hi @henrylhtsang 👋 Thanks for reaching out to us!

Apologies for the confusion, but unfortunately we currently cannot accept external contributions to cuda.bindings, due to license limitations as outlined here. Note that the files under cuda_bindings/ are subject to the NV Software License instead of Apache 2.0. Also, the relevant code is auto-generated, so we'll need to fix this in the (internal) codegen instead.

I am actually confused by this PR; see below. Could you please provide a reproducer for the segfault, and share details of your environment, such as:

  • cuda.bindings version
  • CUDA driver version (attaching the output of nvidia-smi is fine)
  • CUDA Toolkit version (and how was it installed?)

Thanks!

Review comment on this diff hunk:

    Get memory address of class instance
    """
    cdef cydriver.CUtensorMap_st _pvt_val
    cdef void* _pvt_buf
Member

@leofang leofang Feb 27, 2026


This is where the confusion arises. Given how Cython compiles this, the member maps to the actual CUtensorMap from cuda.h, which should carry the alignment information (since the headers are build-time dependencies for cuda.bindings and get included by the generated .cpp files). So it is unclear to me how this struct member, after Cython emits C++ code that transforms this cdef class into a C struct (plus free functions), could lose its alignment requirement.

Given what Ralf alluded to above (that the alignment changes between CUDA 12 (64B) and 13 (128B)), I suspect there is a nontrivial mix-n-match between 12/13 somewhere in your environment.

@henrylhtsang
Author

Thanks @leofang,

I am on cuda.bindings 12.9.5, CUDA driver 580.82.07, a B200, and CUDA Toolkit 12.8.

I was told it could also be due to how we build cuda.bindings. Apparently we build with static linking instead of shared linking.

@henrylhtsang
Author

Trying to resolve this through other means.

@henrylhtsang henrylhtsang marked this pull request as draft February 28, 2026 01:39
@leofang
Member

leofang commented Feb 28, 2026

Sounds good, thanks Henry! If you could share with us a reproducer & core dump, we could also try to investigate next week.
