OGSTM-BFM is a coupled physical-biogeochemical model developed
at the National Institute of Oceanography and Applied Geophysics (OGS) [1, 2, 3, 4]
and is used for climate-related studies. Recent work within ESiWACE3 (Centre of
Excellence in Simulation of Weather and Climate in Europe) reported substantial
performance gains after porting the model to GPU architectures [5].
This thesis presents a reproducibility study of selected GPU performance results
of OGSTM-BFM on the Leonardo supercomputer, as well as further investigations
that were not included in previous ESiWACE3 reports. A significant acceleration
can be achieved when using GPUs, provided that appropriate MPI rank-to-GPU
mappings and Multi-Process Service (MPS) configurations are employed. Several
configurations were tested. Running the GPU version on two nodes, each with
four NVIDIA A100 GPUs and four MPI ranks per GPU, yields a speedup of 1.64
over two dual-socket nodes with Intel Sapphire Rapids CPUs. Enabling MPS raises
the speedup to 5.72. Different mappings of the ranks to the GPUs were also
tested: round-robin mapping combined with 50% MPS (each rank limited to about
50% of the GPU threads) further increases the speedup to 5.97. Lastly, adding
NUMA binding in the MPS launch path brings the speedup to 6.22.
We note that a speedup of 7.41 was reported in the ESiWACE3 Technical Report
[5]; however, we were not able to reproduce this result in this study.
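
To make the difference between rank-to-GPU mappings concrete, the following is a minimal sketch of a block mapping and a round-robin mapping for one node with four GPUs and four MPI ranks per GPU. The function names and the use of a node-local rank index are illustrative assumptions and are not taken from the OGSTM-BFM launch scripts.

```python
# Illustrative sketch only: function names and the node-local rank
# convention are assumptions, not the launch logic used in this study.

def block_gpu(local_rank: int, gpus_per_node: int, ranks_per_gpu: int) -> int:
    """Block mapping: consecutive ranks share a GPU (0..3 -> GPU 0, 4..7 -> GPU 1, ...)."""
    return (local_rank // ranks_per_gpu) % gpus_per_node


def round_robin_gpu(local_rank: int, gpus_per_node: int) -> int:
    """Round-robin mapping: consecutive ranks cycle over the GPUs (0 -> 0, 1 -> 1, ...)."""
    return local_rank % gpus_per_node


def ranks_on_each_gpu(mapping, n_ranks: int, gpus_per_node: int) -> dict:
    """Group node-local ranks by the GPU that the given mapping assigns them to."""
    groups = {gpu: [] for gpu in range(gpus_per_node)}
    for rank in range(n_ranks):
        groups[mapping(rank)].append(rank)
    return groups


if __name__ == "__main__":
    gpus, per_gpu = 4, 4  # node layout described in the text
    n_ranks = gpus * per_gpu
    print("block      :", ranks_on_each_gpu(lambda r: block_gpu(r, gpus, per_gpu), n_ranks, gpus))
    print("round-robin:", ranks_on_each_gpu(lambda r: round_robin_gpu(r, gpus), n_ranks, gpus))
```

Both mappings place four ranks on every GPU; they differ only in which ranks become neighbours on the same device, which may matter when neighbouring ranks exchange data or are bound to different NUMA domains.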
In addition, this work evaluates an alternative implementation of the vertical
diffusion tridiagonal solver using NVIDIA's cuSPARSE batched routines. The original
implementation solved a system with a tridiagonal matrix by a method that does
not allow for full parallelism. The algorithm, however, is highly specialized to
tridiagonal matrices. We developed benchmark tests both in isolation and integrated
into OGSTM-BFM. In isolation, the cuSPARSE-based variant is about 3.33 times
slower than the specialized version. When integrated into OGSTM-BFM, it makes
runs 8% slower than the baseline in the tested configuration.
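
The text above does not name the original solution method. As a hedged illustration, the sketch below assumes a Thomas-type elimination, a standard sequential algorithm for tridiagonal systems that fits the description: each row of the forward sweep depends on the previous row, so a single system cannot be fully parallelized, and parallelism comes only from solving many independent systems at once. The function names and the array convention (`lower[0]` and `upper[-1]` unused) are assumptions made for this example.

```python
# Illustrative sketch only: a Thomas-type solver is ASSUMED here; the
# original OGSTM-BFM routine is not reproduced in this document.

def thomas_solve(lower, diag, upper, rhs):
    """Solve one tridiagonal system; lower[0] and upper[-1] are ignored."""
    n = len(diag)
    c_prime = [0.0] * n
    d_prime = [0.0] * n

    # Forward sweep: row i needs the result of row i - 1 (sequential dependency).
    c_prime[0] = upper[0] / diag[0]
    d_prime[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c_prime[i - 1]
        c_prime[i] = upper[i] / denom if i < n - 1 else 0.0
        d_prime[i] = (rhs[i] - lower[i] * d_prime[i - 1]) / denom

    # Back substitution: row i needs the solution of row i + 1.
    x = [0.0] * n
    x[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return x


def solve_batch(systems):
    """Solve independent systems one by one; on a GPU each could map to one thread."""
    return [thomas_solve(*system) for system in systems]


if __name__ == "__main__":
    lower = [0.0, -1.0, -1.0, -1.0]
    diag = [2.0, 2.0, 2.0, 2.0]
    upper = [-1.0, -1.0, -1.0, 0.0]
    rhs = [0.0, 0.0, 0.0, 5.0]
    print(thomas_solve(lower, diag, upper, rhs))
```

The sweep performs only a handful of floating-point operations per row, so a batched general-purpose routine has to recover its extra work and memory traffic through additional parallelism before it can be faster.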
These results emphasize that increased parallelism alone does not guarantee
improved time-to-solution; they also show that launch-level tuning (MPS, mapping,
NUMA placement) is as important as kernel-level optimization. The thesis provides
practical insights into GPU reproducibility, solver integration, and performance
engineering for large-scale scientific applications.