Skip to content

Implement OpenMP offload for do group update#1782

Draft
edoyango wants to merge 23 commits into
NOAA-GFDL:mainfrom
edoyango:ompoffload
Draft

Implement OpenMP offload for do group update#1782
edoyango wants to merge 23 commits into
NOAA-GFDL:mainfrom
edoyango:ompoffload

Conversation

@edoyango

@edoyango edoyango commented Oct 10, 2025

Copy link
Copy Markdown

Description
This adds the necessary code to make mpp_do_group_update work with arrays that are managed by NVIDIA's OpenMP offload runtime. This attempts to be minimally disruptive in that non-nvidia compilers will see the same behaviour as previously by adding macros around the relevant openmp directives.

Fixes #1771

How Has This Been Tested?
The OpenMP offload capability is currently tested on the "double gyre" case in MOM6-examples using the nvfortran compiler and a cuda-aware openmpi. We have some notes on how to run the gpu-enabled MOM6, but is outdated.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules
  • New check tests, if applicable, are included
  • make distcheck passes

do i = is, ie
pos = pos + 1
field(i,j,k) = buffer(pos)
idx = pos + (k-1)*nj*ni + (j-js)*ni + (i-is) + 1

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are these two implementations equivalent? Is new idx = old pos always?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, they're equivalent. For any iteration, idx = pos + (k-1)*nj*ni + (j-js)*ni + (i-is) + 1 produces the same value that pos would have had at that point. The formula accounts for all the iterations that would have occurred in the nested loops up to that (i,j,k) position.

The reason for the change is that each nested iteration is now independent and can be performed in parallel.

@marshallward

marshallward commented Jan 29, 2026

Copy link
Copy Markdown
Member

I've finally managed to put together a timing test for MOM6, and the timings for domain decomposition are promising. There is strong scaling for the larger domains as we saturate the GPU.

gpu_mpi_comm

Scaling is shown for up to 8 GPUs over 4 nodes on Ursa, so this method should work on our NOAA resources.

Ursa does not have Nvlink, so its comm time shouldn't be taken as reflective of production runs. But I was surprised that it was not too large, even after going through the PCIe and Infiniband layers.

I will look into including mpp tests which exercise this new code.

@edoyango

Copy link
Copy Markdown
Author

In 0cc2a77, compiler macros are replaced with the runtime OpenMP if to control whether CPU parallelism or GPU offload is used at runtime. However, removing the macros cause a significant slowdown in the packing and unpacking on GPU - even if the CPU parallelism is supposed to be disabled by the if clause.

For example, below halo communication times doing a self-communication (re-entrant domain) in MOM6 barotropic, where packing and unpacking are on GPU:

                                      hits          tmin          tmax          tavg          tstd  tfrac grain pemin pemax
(Ocean BT pre-step halo updates)       768    109.800896    109.800896    109.800896      0.000000  0.370    41     0     0
(Ocean BT stepping halo updates)       960     78.534592     78.534592     78.534592      0.000000  0.265    41     0     0

And with the CPU OpenMP directives not compiled via macros:

                                      hits          tmin          tmax          tavg          tstd  tfrac grain pemin pemax
(Ocean BT pre-step halo updates)       768      0.431521      0.431521      0.431521      0.000000  0.009    41     0     0
(Ocean BT stepping halo updates)       960      0.211485      0.211485      0.211485      0.000000  0.004    41     0     0

Hence, 0cc2a77 will be reverted.

@bensonr bensonr left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some of the target omp statements, the if (use_device_ptr) is prior to the target data mapping and other times after. Does it matter where in the !$omp specification the if-test occurs?

The preference for !$omp is default(none) with full specification of shared and private.

Comment on lines +21 to +25
#ifndef __NVCOMPILER_OPENMP_GPU
!$OMP parallel do default(none) shared(npack,group,ptr,nvector,ksize,buffer_start_pos) &
!$OMP private(buffer_pos,pos,m,is,ie,js,je,rotation, &
!$OMP ptr_field, ptr_fieldx, ptr_fieldy,n,k)
!$OMP ptr_field, ptr_fieldx, ptr_fieldy,n,k,ni,nj,idx)
#endif

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case where __NVCOMPILER_OPENMP_GPU is defined, the compiled library may offload the packing to the device (assuming use_device_ptr). But if use_device_ptr is false, then we get zero parallelism. It would be good to have both options available at the same time for maximum flexibility.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the plan is to rename this to something that is set at compile-time (i.e. cmake, autoconf).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marshallward - Not sure I understand your comment and I'm sure mine wasn't entirely clear. I think we need both a CPU and GPU version of routines like this to allow offload and traditional execution within the same programmatic unit. It isn't entirely clear we get this with such fine-grained GPU offloading and a mix of macro if-tests.

@marshallward marshallward Mar 20, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the brief reply above. As you say, the current form would only allow either CPU parallelism or GPU parallelism. I don't think we imagined using both at this stage of development, and assumed one would make this decision at compile-time.

Replacing __NVCOMPILER_OPENMP_GPU with a generic macro would decouple that decision from the choice of compiler, but I agree it doesn't permit the use of both in the same model.

We can create separate versions of group_update_[un]pack.inc rather than rely on preprocessing, if you are OK with the code duplication.

There are also bigger picture ideas here that could consolidate the two versions (e.g. team-parallelism of groups with thread-parallelism of buffer transfer), but AFAIK there are technical issues preventing this, such as the presence of Cray pointers.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marshallward - Not sure I understand your comment and I'm sure mine wasn't entirely clear. I think we need both a CPU and GPU version of routines like this to allow offload and traditional execution within the same programmatic unit. It isn't entirely clear we get this with such fine-grained GPU offloading and a mix of macro if-tests.

Is there a preference for how this is implemented? would something like below work? To avoid code duplication

! below include contains openmp offloaded directives but not cpu openmp directives
#include <group_update_pack.inc>
! undefine macro to enable compilation of cpu openmp parallel loop
#ifdef GPU_MACRO
#undef GPU_MACRO
if (.not.omp_offload) then
! contains cpu openmp directives without openmp offload directives
#include <group_update_pack.inc>
endif
#define GPU_MACRO
#endif

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@edoyango - macro-driven, templated Fortran is preferable to maintaining different, yet identical versions that differ by compile-time comment lines only. Please implement this - if you haven't already.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @bensonr I've attempted to implement the suggested macro solution, but nvfortran fails to run.

the fail happens when there's an openmp offload loop and a openmp cpu loop - either with default(none) - referencing the same cray pointer in the same procedure. For the latest compilers, whichever loop has default(none) either hangs or segfaults at runtime.

the workaround is to use default(shared) in both directives, or to put them into isolated subroutines.

reproducer:
test_cray_ptr_omp.txt
compile with nvfortran -mp=gpu and run with ./a.out 0/1/2/3. 0, 1 cases are gpu and cpu (respectively) with default(none), and 2, 3 are with default(shared).

Could a concession be made to use default(shared)? otherwise we can't make progress until cray pointers are removed.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@edoyango - Thanks for explaining the issue with default(none) and why you'd need default(shared) for now. It would be helpful to leave the default(none) logic in place along with extra comments for the people in the process of removing cray pointers.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bensonr I've:

  • added required shared/private to the openmp offloaded loops
  • swapped from default(none) to default(shared) in the packing/unpacking loops
  • added comments to the top of group_update_un/pack.inc files explaining the restrictions imposed by nvfortran with cray pointers
  • through macro manipulation, ensured openmp cpu parallelism is activated when ompoffload==.false.

I hope that's sufficient.

@marshallward

marshallward commented Mar 20, 2026

Copy link
Copy Markdown
Member

For some of the target omp statements, the if (use_device_ptr) is prior to the target data mapping and other times after. Does it matter where in the !$omp specification the if-test occurs?

AFAIK clause order is not important. I tend to prefer ending with if (similar to the one-line Python construct) but we can switch to your preference.

(Also note that some of the if (use_device_ptr) statements are actual Fortran if-blocks, rather than OMP directive clauses.)

The preference for !$omp is default(none) with full specification of shared and private.

👍

end subroutine MPP_SEND_

subroutine MPP_RECV_SCALAR_( get_data, from_pe, glen, block, tag, request )
subroutine MPP_RECV_SCALAR_( get_data, from_pe, glen, block, tag, request, omp_offload )

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@edoyango - I see you only added the offload option for the scalar version of the send/recv pairs. Is there a reason this is not being done for all of the send/recv routines?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried to keep the scope of this PR to only cover what's required to get do_group_update working and do_group_update only uses MPP_SEND/RECV_SCALAR_. I can add the omp_offload flag to the other implementations, but it would be untested.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. I'll discuss with the rest of the team how we want to handle the other routines.

Now that you've mentioned testing, will you be adding or adapting any of the existing unit tests to add GPU-specific tests?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

|> Now that you've mentioned testing, will you be adding or adapting any of the existing unit tests to add GPU-specific tests?

Sorry for the late reply! I just pushed some tests to this branch. It co-opts the existing mpp domain tests by adding a test_group_offload switch (thanks @rem1776 for mocking one together). The test passes for both nvhpc 26.3. gcc passes as well, although it doesn't actually offload to GPU.

Note that the domain tests include non-blocking tests, but the openmp offload version of those haven't been implemented, so those tests just run on CPU.

@rem1776

rem1776 commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

@edoyango Could you update this with main when you get a chance? Looks like its showing some conflicts that we should probably clear up before we get this reviewed.

@edoyango

Copy link
Copy Markdown
Author

@edoyango Could you update this with main when you get a chance? Looks like its showing some conflicts that we should probably clear up before we get this reviewed.

Hi @rem1776 i've tried rebasing on top of main locally, but nvfortran still can't build main because of the use of select rank in mpp/include/mpp_pack.fh

@rem1776

rem1776 commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

@edoyango Could you update this with main when you get a chance? Looks like its showing some conflicts that we should probably clear up before we get this reviewed.

Hi @rem1776 i've tried rebasing on top of main locally, but nvfortran still can't build main because of the use of select rank in mpp/include/mpp_pack.fh

Sorry about that! Thought the fix (#1873) was merged but looks like its still waiting for a review. It should get merged sometime this week, I'll let you know.

rem1776 and others added 20 commits June 26, 2026 16:41
* add multi gpu support

* address review comments, add helpful comment for the acc/mp runbtime call
To enable this,  had to be removed -
otherwise segfaults happen on the GPU.
This reverts commit 0cc2a77.
Having both the CPU and GPU OpenMP directives compiled caused
a significant slowdown in GPU packing/unpacking performance -
even if parallelism is controlled using OpenMP "if" clause.
Some very minor changes to the OpenMP target MPI PR:

* use_device_ptr -> use_device_addr

  This appears to be the updated form (or at least nvfortran says it is)

* Whitespace added to `!$ use omp_lib`

  Does not seem crucial but from our previous discussion it appears more
  correct.

* Removal of some trailing whitespace.
This patch refactors several lines to keep within the 121-character line
length limit prescribed by the FMS style guidelines.
The no-comm (no MPI) interface has been updated to support the new
omp_offload argument.
This ensures that (un)packing steps in do_group_update is performed
with openmp cpu parallelism if ompoffload=.false.. Previously it
would only do serial. This is implemented by undefining the GPU
macro (currently __NVCOMPILER_OPENMP_GPU) and re-including the
(un)packing files.

To make this work, the default(shared) was used in all the relevant
OpenMP directives. If default(none) is used, the loops would hang
or segfault.
edoyango added 2 commits June 26, 2026 16:41
Did't use the if(use_device_ptr) omp clause because
the tests would hang. Instead made explicit branches.

Left out mappings because gcc didn't like cray pointers
in maps
@edoyango

edoyango commented Jun 26, 2026

Copy link
Copy Markdown
Author

@rem1776 ok i've rebased on top of main as well as your select rank workaround and kept my tests.

I did notice that main + your workaround produces garbled terminal output with nvhpc though...

e.g. this is MOM6 standalone output:

P-�p@Xpr��������-�������@����+��:��;�pkhprL|�'�mgpr�:gp�gprP�\�Ɉhpr�gprpkhpr�\��\�gpr��\�Ɉhpr�
%���\���\��\�
             �ݞ@�gprL|�'�gpr(�\T�\�ohpr�gprL|�'p�\�P�\���hprpkhpr��\���hprohpr��\�pr
pkhpr�\��\�gpr��\�Ɉhpr�
%���\���\��\�
             �ݞ@�gprL|�'�gpr(�\T�\�ohpr�gprL|�'p�\�P�\���hprpkhpr��\���hprohpr��\�prNOTE: MPP_DOMAINS_SET_STACK_SIZE: stack size set to   955296.
S_SET_STACK_SIZE: stack size set to   955296.

Update test_mpp_domains to pass the OpenMP offload flag through the group
update path and stage test data on the device for offloaded runs.

Add explicit OpenMP maps and shared clauses for the offloaded setup and
reference-copy loops, including shifted CGRID and BGRID extents.

Unskip the group update OpenMP offload test in test_mpp_domains2.sh.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Enable OpenMP GPU-to-GPU MPI blocking transfers

6 participants