[RFC] Fix CUtensorMap segfault due to alignas(64) vs Python allocator mismatch #1692
henrylhtsang wants to merge 2 commits into NVIDIA:main from
Conversation
CUtensorMap_st in cuda.h is declared with alignas(64), but when embedded inline as _pvt_val in a Cython cdef class, the Python object allocator (PyObject_Malloc) only guarantees 8-16 byte alignment. The compiler sees alignas(64) and may generate aligned instructions (e.g. movaps) for zero-initializing the struct, causing SIGSEGV on the unaligned memory. Fix by heap-allocating the CUtensorMap buffer with posix_memalign(64) instead of embedding it inline in the Python object.
Updated. Please feel free to push back since I am not familiar with the stack.
I'm not an expert here either. Sharing the observations below, with a summary:

Question: Do we need to use …

All released CTK 12 versions have this block in `cuda.h`:

```c
/**
 * Size of tensor map descriptor
 */
#define CU_TENSOR_MAP_NUM_QWORDS 16

/**
 * Tensor map descriptor. Requires compiler support for aligning to 64 bytes.
 */
typedef struct CUtensorMap_st {
#if __cplusplus >= 201103L
    alignas(64)
#elif __STDC_VERSION__ >= 201112L
    _Alignas(64)
#endif
    cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS];
} CUtensorMap;
```

All released CTK 13 versions have this block in `cuda.h`:

```c
/**
 * Size of tensor map descriptor
 */
#define CU_TENSOR_MAP_NUM_QWORDS 16

/**
 * Tensor map descriptor. Requires compiler support for aligning to 128 bytes.
 */
typedef struct CUtensorMap_st {
#if defined(__cplusplus) && (__cplusplus >= 201103L)
    alignas(128)
#elif __STDC_VERSION__ >= 201112L
    _Alignas(128)
#endif
    cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS];
} CUtensorMap;
```

For completeness, these are the files I looked at:
leofang left a comment:
Hi @henrylhtsang 👋 Thanks for reaching out to us!
Apologies for the confusion, but we currently cannot accept external contributions to cuda.bindings, unfortunately, due to license limitations as outlined here. Note that the files under cuda_bindings/ are subject to the NV Software License instead of Apache 2.0. Also, the relevant code is auto-generated, so we'll need to fix it in the (internal) codegen instead.
I am actually confused by this PR, see below. Could you please provide a reproducer for the segfault, and share with us details of your environment, such as:

- `cuda.bindings` version
- CUDA driver version (attaching the output of `nvidia-smi` is fine)
- CUDA Toolkit version (and how was it installed?)
Thanks!
```cython
    Get memory address of class instance
    """
cdef cydriver.CUtensorMap_st _pvt_val
cdef void* _pvt_buf
```
This is where the confusion arises. In the way Cython compiles, this member maps to the actual CUtensorMap from cuda.h, which should carry the alignment information (since the headers are build-time dependencies for cuda.bindings and get included by the generated .cpp files). So it is unclear to me how this struct member, after Cython emits C++ code that transforms this cdef class to a C struct (+ free functions), could lose its alignment requirement.
Given what Ralf alluded to above (that the alignment changes between CUDA 12 (64B) and 13 (128B)), I suspect there is a nontrivial mix-n-match between 12/13 somewhere in your environment.
Thanks @leofang, I am on cuda.bindings 12.9.5, CUDA driver 580.82.07, B200, CUDA Toolkit 12.8. I was told it could also be due to how we build cuda.bindings; apparently we build it statically instead of linking it as a shared library.
Trying to resolve this through other means.
Sounds good, thanks Henry! If you could share with us a reproducer & core dump, we could also try to investigate next week.
Description
`CUtensorMap_st` in `cuda.h` uses `alignas(64)`, but when embedded inline as `_pvt_val` in a Cython `cdef class`, CPython's allocator (`PyObject_Malloc`) only guarantees 16-byte alignment. The compiler generates aligned instructions (e.g. `movaps`) for the 64-byte-aligned member, which causes SIGSEGV on the unaligned `PyObject` memory.

Fix: heap-allocate with `posix_memalign(64)` instead of embedding inline.

Note: We are not experts on cuda-python internals; this fix was developed with AI assistance. Happy to iterate on the approach.
Checklist