[Issue]: Deadlock/crash due to UAF in inspector plugin #2000

@RicoloveFeng

Description

How is this issue impacting you?

Application crash

Share Your Debug Logs

Recently, we encountered serious problems while trying to enhance the inspector plugin and integrate it into training.

Issue 1: Incorrect rwlock usage

At first, a GPU process hung with its utilization dropping to 0, and a watchdog error was then triggered. After some analysis with GDB, we found that the main thread was stuck while attempting to send a proxy operation to the progress thread.

Meanwhile, the proxy progress thread was deadlocked inside a call where the inspector records kernel channel status. (Line numbers may not correspond to the original code.)

[Images: two screenshots attached to the original issue]

Through logging, we discovered that the stuck proxy progress thread was processing a collInfo belonging to another communication group, which caused a deadlock. Upon reviewing the locking/unlocking mechanism, we identified the following issues:

  1. The inspectorPluginCollInfoDeRefSafe function is never called; it is effectively dead code.
inspectorResult_t inspectorPluginCollInfoDeRefSafe(struct inspectorCollInfo *collInfo) {
  inspectorLockWr(&collInfo->guard);
  inspectorResult_t res = inspectorPluginCollInfoDeRef(collInfo);
  inspectorUnlockRWLock(&collInfo->guard);
  return res;
}
  2. The inspector seems to assume that a lock can be destroyed while it is still held. In our environment, we printed the error code before each return of inspectorLockError and observed EDEADLK (errno 35) several times before the deadlock. Our platform may have allowed execution to continue after this error, so the problem did not surface immediately.
// line 385, inspectorPluginStopEvent
if (collInfo && collInfo->type == ncclProfileColl) {
      inspectorLockWr(&collInfo->guard); // ---> locked here
      // ...
      if ((collInfo->nKernelChCompleted == collInfo->nKernelChStarted)
          && (collInfo->nKernelChCompleted == collInfo->nChannels)) {
        // kernel completed...
        res = inspectorPluginCollInfoDeRef(collInfo);  // ---> tries to destroy the lock and free the memory
        if (res != inspectorReturn) {
          inspectorUnlockRWLock(&collInfo->guard);
        }

// line 163
inspectorResult_t inspectorPluginCollInfoDeRef(struct inspectorCollInfo *collInfo) {
  collInfo->refCount -= 1;
  if (collInfo->refCount == 0) {
    inspectorLockDestroy(&collInfo->guard); // ---> undefined behavior: destroy before unlock 
    memset(collInfo, 0, sizeof(struct inspectorCollInfo));
    free(collInfo);
    return inspectorReturn;
  }
  return inspectorSuccess;
}
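
The error-code logging described in item 2 can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: `DemoResult`, `demoLockWr`, and `demoUnlock` are hypothetical names that only loosely mirror `inspectorResult_t` and the inspector's lock helpers. The one point it relies on is that pthread functions return the error number directly instead of setting `errno`, so the return value is what has to be printed to see codes such as EDEADLK.

```cpp
#include <pthread.h>
#include <cassert>
#include <cstdio>
#include <cstring>

// Hypothetical result codes, loosely modeled on the plugin's inspectorResult_t.
enum DemoResult { demoSuccess = 0, demoLockError = 1 };

// Take the write lock and print the raw pthread return code on failure.
DemoResult demoLockWr(pthread_rwlock_t* lock) {
  assert(lock != nullptr);
  const int rc = pthread_rwlock_wrlock(lock);
  if (rc != 0) {
    // rc is the error number itself (for example EDEADLK); errno is not set.
    std::fprintf(stderr, "wrlock failed: rc=%d (%s)\n", rc, std::strerror(rc));
    return demoLockError;
  }
  return demoSuccess;
}

// Release the lock, again reporting the raw return code on failure.
DemoResult demoUnlock(pthread_rwlock_t* lock) {
  assert(lock != nullptr);
  const int rc = pthread_rwlock_unlock(lock);
  if (rc != 0) {
    std::fprintf(stderr, "unlock failed: rc=%d (%s)\n", rc, std::strerror(rc));
    return demoLockError;
  }
  return demoSuccess;
}
```

Whether a failed lock call should abort the caller or only be logged is a design decision this sketch leaves open.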

According to the POSIX documentation for pthread_rwlock_destroy(), a lock must be unlocked before it is destroyed; otherwise, the behavior is undefined.

The pthread_rwlock_destroy() function shall destroy the read-write lock object referenced by rwlock and release any resources used by the lock. The effect of subsequent use of the lock is undefined until the lock is reinitialized by another call to pthread_rwlock_init(). An implementation may cause pthread_rwlock_destroy() to set the object referenced by rwlock to an invalid value. Results are undefined if pthread_rwlock_destroy() is called when any thread holds rwlock. Attempting to destroy an uninitialized read-write lock results in undefined behavior.

We believe this is the root cause of the deadlock. By adding an unlock before destruction, we alleviated the deadlock issue, but we still encountered program crashes due to malloc data corruption.
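
The unlock-before-destroy ordering can be sketched as below. This is an illustration under stated assumptions, not the patch we applied: `DemoCollInfo`, `demoCollInfoCreate`, and `demoCollInfoRelease` are hypothetical stand-ins for `inspectorCollInfo` and its helpers, and the sketch assumes that once the reference count reaches zero no other thread can still reach the object.

```cpp
#include <pthread.h>
#include <cassert>
#include <cstdlib>

// Hypothetical stand-in for inspectorCollInfo; only the fields needed here.
struct DemoCollInfo {
  pthread_rwlock_t guard;
  int refCount;
};

enum DemoRelease { demoStillAlive = 0, demoFreed = 1 };

// Allocate a record holding `refs` references and an initialized lock.
DemoCollInfo* demoCollInfoCreate(int refs) {
  DemoCollInfo* info =
      static_cast<DemoCollInfo*>(std::calloc(1, sizeof(DemoCollInfo)));
  if (info == nullptr) return nullptr;
  if (pthread_rwlock_init(&info->guard, nullptr) != 0) {
    std::free(info);
    return nullptr;
  }
  info->refCount = refs;
  return info;
}

// Drop one reference. The caller must NOT hold info->guard.
DemoRelease demoCollInfoRelease(DemoCollInfo* info) {
  int rc = pthread_rwlock_wrlock(&info->guard);
  assert(rc == 0);
  info->refCount -= 1;
  const bool last = (info->refCount == 0);
  rc = pthread_rwlock_unlock(&info->guard);  // unlock BEFORE any destroy
  assert(rc == 0);
  if (!last) return demoStillAlive;
  // Assumption: zero references means no other thread can touch `info`.
  rc = pthread_rwlock_destroy(&info->guard);
  assert(rc == 0);
  (void)rc;  // keeps NDEBUG builds free of unused-variable warnings
  std::free(info);
  return demoFreed;
}
```

The ordering only removes the undefined behavior of destroying a held lock. It does not protect against a caller that still holds a stale pointer after the last release.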

It was also still puzzling that, according to our logs, proxy progress threads were dealing with a proxy op that belonged to another communicator.

Issue 2: Reusing already freed collInfo

Although our earlier fix resolved the locking problem, after training for some time we still hit heap corruption errors ("double free or corruption (!prev)" and "malloc(): unsorted double linked list corrupted"). We then enabled ASAN, which quickly reported use-after-free errors.

2026-01-23 17:18:26	==158983==ERROR: AddressSanitizer: heap-use-after-free on address 0x7d5e5f378588 at pc 0x7b1d954bcc1f bp 0x7b19bde1ccc0 sp 0x7b19bde1ccb0
2026-01-23 17:18:26	READ of size 8 at 0x7d5e5f378588 thread T80
2026-01-23 17:18:27	    #0 0x7b1d954bcc1e in inspectorPluginRecordEventState(void*, ncclProfilerEventState_t, ncclProfilerEventStateArgs_v5_t*) /root/nccl-2.28.9-1/ext-profiler/inspector/inspector_plugin.cc:459
2026-01-23 17:18:27	    #1 0x7b1dde2de15d in ncclProfilerStopKernelChEvent(ncclProxyArgs*, int, unsigned long) plugin/profiler.cc:611
2026-01-23 17:18:27	    #2 0x7b1dde295697 in profilerProxyProgress transport/profiler.cc:41
2026-01-23 17:18:27	    #3 0x7b1dde295697 in profilerProxyProgress transport/profiler.cc:21
2026-01-23 17:18:27	    #4 0x7b1dde2373d6 in progressOps /tmp/nccl/src/proxy.cc:761
2026-01-23 17:18:27	    #5 0x7b1dde2373d6 in ncclProxyProgress(void*) /tmp/nccl/src/proxy.cc:962
2026-01-23 17:18:27	    #6 0x7f1e5c957b69 in asan_thread_start(void*) (/usr/lib64/libasan.so.8+0x5eb69) (BuildId: 891749577baeaf8d94ea365be2f5f0a525944b41)
2026-01-23 17:18:27	    #7 0x7f1e5c6e11c9 in start_thread (/lib64/libpthread.so.0+0x81c9) (BuildId: bb27eb9d041cb2218a40476ce63385d9b6ccd567)
2026-01-23 17:18:27	    #8 0x7f1e5bbd08d2 in clone (/lib64/libc.so.6+0x398d2) (BuildId: 5d21e08f89469e4d1a4120e80635235a01ef86e6)
2026-01-23 17:18:27
2026-01-23 17:18:27	0x7d5e5f378588 is located 1160 bytes inside of 7344-byte region [0x7d5e5f378100,0x7d5e5f379db0)
2026-01-23 17:18:27	freed by thread T80 here:
2026-01-23 17:18:27	    #0 0x7f1e5ca17e1b in free.part.0 (/usr/lib64/libasan.so.8+0x11ee1b) (BuildId: 891749577baeaf8d94ea365be2f5f0a525944b41)
2026-01-23 17:18:27	    #1 0x7b1d954bc3f0 in inspectorPluginCollInfoDeRef(inspectorCollInfo*) /root/nccl-2.28.9-1/ext-profiler/inspector/inspector_plugin.cc:168
2026-01-23 17:18:27	    #2 0x7b1d954bc3f0 in inspectorPluginCollInfoDeRef(inspectorCollInfo*) /root/nccl-2.28.9-1/ext-profiler/inspector/inspector_plugin.cc:163
2026-01-23 17:18:27
2026-01-23 17:18:27	previously allocated by thread T69 (pt_autograd_6) here:
2026-01-23 17:18:27	    #0 0x7f1e5ca18c23 in calloc (/usr/lib64/libasan.so.8+0x11fc23) (BuildId: 891749577baeaf8d94ea365be2f5f0a525944b41)
2026-01-23 17:18:27	    #1 0x7b1d954bbca1 in inspectorPluginCollInfoInit /root/nccl-2.28.9-1/ext-profiler/inspector/inspector_plugin.cc:206
2026-01-23 17:18:27	    #2 0x7b1d954bbca1 in inspectorPluginStartEvent(void*, void**, ncclProfilerEventDescr_v5_t*) /root/nccl-2.28.9-1/ext-profiler/inspector/inspector_plugin.cc:324
2026-01-23 17:18:27

According to the ASAN report, the UAF happened in inspectorPluginRecordEventState: the parent collInfo that holds the kernelCh array had already been freed when the event was recorded.

// src/transport/profiler.cc
      // ...
      if (sub->transmitted < sub->nsteps && sub->base <= workCompleted[sub->channelId].data[sub->base%MAX_PROFILER_EVENTS_PER_CHANNEL].counter) {
        ncclProfilerStopKernelChEvent(args, s, workCompleted[sub->channelId].data[sub->base%MAX_PROFILER_EVENTS_PER_CHANNEL].timestamp); // ---> tries to stop a kernel channel event, but this channel has already been processed
        sub->transmitted = sub->nsteps;
        args->done++;
      }
      //...


// src/plugin/profiler.cc
ncclResult_t ncclProfilerStopKernelChEvent(struct ncclProxyArgs* args, int s, uint64_t stop) {
  if (__builtin_expect(ncclProfiler != NULL, 0)) {
    struct ncclProxySubArgs* sub = &args->subs[s];
    if (sub->kernelEventHandle) {
      ncclProfilerEventStateArgs_t a = { };
      a.kernelCh.pTimer = stop;
      ncclProfiler->recordEventState(sub->kernelEventHandle, ncclProfilerKernelChStop, &a); // ---> enter
      ncclProfiler->stopEvent(sub->kernelEventHandle);
    }
  }
  return ncclSuccess;
}

// inspector_plugin.cc
__hidden ncclResult_t inspectorPluginRecordEventState(void* eHandle,
                                                      ncclProfilerEventState_t eState,
                                                      ncclProfilerEventStateArgs_t* eStateArgs) {
  if (eHandle == nullptr || eStateArgs == nullptr)
    return ncclSuccess;

  uint64_t type = *(uint64_t *)eHandle; // <--- UAF here

However, the debugging results were confusing. The inspector treats a collective as complete when the number of started kernel channels equals the number of completed ones. The freed collInfo already satisfied this condition, yet one more kernel channel event still entered the record-and-stop flow. Because collInfo had already been freed, this leads to a serious UAF. Issue #1992 may also be a victim of this UAF.
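
The sequence can be modeled with a small, self-contained sketch. It is only a model of the bookkeeping, not the plugin's code: `DemoCounters`, `demoIsComplete`, and `demoOnKernelChStop` are hypothetical, and the `released` flag stands in for the point where the real collInfo is freed. The completion test mirrors the condition quoted from inspectorPluginStopEvent above.

```cpp
#include <cassert>

// Hypothetical model of the per-collective counters.
struct DemoCounters {
  int nChannels;
  int nKernelChStarted;
  int nKernelChCompleted;
  bool released;  // stands in for "collInfo has been freed"
};

// Same shape as the condition in the inspectorPluginStopEvent excerpt.
bool demoIsComplete(const DemoCounters& c) {
  return c.nKernelChCompleted == c.nKernelChStarted &&
         c.nKernelChCompleted == c.nChannels;
}

enum DemoStopResult { demoStopPending, demoStopReleased, demoStopLate };

// Process one kernel channel stop event against the model.
DemoStopResult demoOnKernelChStop(DemoCounters& c) {
  if (c.released) {
    // In the real plugin this is where freed memory would be touched.
    return demoStopLate;
  }
  // A stop event beyond nChannels would be a bookkeeping bug in this model.
  assert(c.nKernelChCompleted < c.nChannels);
  c.nKernelChCompleted += 1;
  if (demoIsComplete(c)) {
    c.released = true;
    return demoStopReleased;
  }
  return demoStopPending;
}
```

With two channels, the first stop event is pending, the second releases the record, and any third event is the late one that corresponds to the UAF in the ASAN report.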

This also explains why we saw both the deadlock and progress threads handling operations from another communicator: the memory was already corrupted by the time we tried to identify the thread holding the lock.

We worked around this by introducing an object pool, but we still cannot understand why the proxy progress thread tries to process an extra kernel channel event. We hope the dev team can investigate and fix this issue!
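
For readers unfamiliar with the idea, an object pool of this kind might look like the sketch below. It is not our actual workaround; `DemoPooledInfo`, `DemoInfoPool`, and the generation counter are hypothetical. The point is that slots are recycled instead of freed, so a stale handle still points at valid memory, and a generation number lets the caller detect that the slot has been reused.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical pooled record; the real inspectorCollInfo has many more fields.
struct DemoPooledInfo {
  uint64_t type;        // first field, read through the event handle
  uint64_t generation;  // bumped every time the slot is released
  bool inUse;
};

class DemoInfoPool {
 public:
  explicit DemoInfoPool(size_t capacity) : slots_(capacity) {
    freeList_.reserve(capacity);
    for (size_t i = 0; i < capacity; ++i) freeList_.push_back(i);
  }

  // Hand out a free slot, or nullptr when the pool is exhausted.
  DemoPooledInfo* acquire(uint64_t type) {
    std::lock_guard<std::mutex> lk(mu_);
    if (freeList_.empty()) return nullptr;
    const size_t i = freeList_.back();
    freeList_.pop_back();
    slots_[i].type = type;
    slots_[i].inUse = true;
    return &slots_[i];
  }

  // Return a slot to the pool. Returns false on a double release.
  bool release(DemoPooledInfo* info) {
    std::lock_guard<std::mutex> lk(mu_);
    assert(info >= slots_.data() && info < slots_.data() + slots_.size());
    if (!info->inUse) return false;
    info->inUse = false;
    info->generation += 1;
    freeList_.push_back(static_cast<size_t>(info - slots_.data()));
    return true;
  }

  // True only if the slot is in use and has not been recycled since
  // the caller recorded expectedGen.
  bool isLive(const DemoPooledInfo* info, uint64_t expectedGen) {
    std::lock_guard<std::mutex> lk(mu_);
    return info->inUse && info->generation == expectedGen;
  }

 private:
  std::mutex mu_;
  std::vector<DemoPooledInfo> slots_;  // never resized, so pointers stay valid
  std::vector<size_t> freeList_;
};
```

A pool like this only turns the UAF into a detectable stale access. It does not explain or remove the extra kernel channel event itself.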

Steps to Reproduce the Issue

Enabling ASAN and running a training task may reproduce this issue.

In our experiments, models with a long sequence length were more likely to trigger the ASAN UAF report. We were running with 32 ranks.

One factor that might affect reproduction is that we implemented P2P tracking, which may increase the frequency of memory allocation and free calls.

NCCL Version

2.28.9

Your platform details

ldd (GNU libc) 2.28

Error Message & Behavior

No response
