Question
Does using NVSHMEMI_THREADGROUP_THREAD as the default scope cause excessive redundant work? Specifically, when nvshmem_fence() is called from a warp or block, every thread executes nvshmemi_ibgda_fence() with index_in_scope == 0 and scope_size == 1, so each thread redundantly iterates over all DCIs and RC QPs and issues its own ibgda_quiet(qp) calls. Could this lead to significant performance overhead?
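
To make the concern concrete, here is a small host-side C++ model of the call count. It is not NVSHMEM code: the function name count_quiet_calls and its parameters are hypothetical, and it assumes the fence walks the QP list with a strided loop that starts at index_in_scope and steps by scope_size. It only counts how many ibgda_quiet(qp)-style calls would be issued and says nothing about what each call costs.

```cpp
// Hypothetical host-side model; none of these names come from NVSHMEM.
// Assumption: each calling thread visits QPs index_in_scope,
// index_in_scope + scope_size, ... and issues one quiet per QP visited.
long count_quiet_calls(int num_threads, int num_qps, bool thread_scope) {
    long calls = 0;
    for (int tid = 0; tid < num_threads; ++tid) {
        // Thread scope: every thread believes it is alone in its group.
        // Group scope (warp or block): threads split the QP list.
        const int index_in_scope = thread_scope ? 0 : tid;
        const int scope_size = thread_scope ? 1 : num_threads;
        for (int qp = index_in_scope; qp < num_qps; qp += scope_size) {
            ++calls;  // stands in for one ibgda_quiet(qp)
        }
    }
    return calls;
}
```

Under these assumptions, a 32-thread warp with 64 QPs gives count_quiet_calls(32, 64, true) == 2048 with thread scope, versus count_quiet_calls(32, 64, false) == 64 when the warp shares the loop. Whether that difference matters in practice depends on how cheap a quiet is on a QP with nothing outstanding, which this model does not capture.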