
Performance Analysis and GPU Scalability of OGSTM-BFM

FEITOSA BENEVIDES, ANDRE'
2026-03-27
  • other

Abstract
OGSTM-BFM is a coupled physical-biogeochemical model developed at the National Institute of Oceanography and Applied Geophysics (OGS) [1, 2, 3, 4] and is used for climate-related studies. Recent work within ESiWACE3 (Centre of Excellence in Simulation of Weather and Climate in Europe) reported substantial performance gains after porting the model to GPU architectures [5]. This thesis presents a reproducibility study of selected GPU performance results of OGSTM-BFM on the Leonardo supercomputer, together with further investigations that were not included in previous ESiWACE3 reports. GPUs deliver a significant acceleration, provided that appropriate MPI rank-to-GPU mappings and Multi-Process Service (MPS) configurations are employed. Several configurations were tested. The GPU version running on two nodes, with four NVIDIA A100 GPUs per node and four MPI ranks per GPU, achieves a speedup of 1.64 over two dual-socket nodes with Intel Sapphire Rapids CPUs. Once MPS is enabled, performance increases markedly (speedup of 5.72). Different mappings of ranks to GPUs were tested: round-robin mapping combined with 50% MPS (each rank limited to ∼50% of the GPU threads) further increases the speedup to 5.97. Lastly, adding NUMA binding to the MPS launch path yields a speedup of 6.22. A speedup of 7.41 was reported in the ESiWACE3 Technical Report [5]; this study was unable to reproduce that result. In addition, this work evaluates an alternative implementation of the vertical diffusion tridiagonal solver using NVIDIA's cuSPARSE batched routines. The original implementation solves the tridiagonal system with a method that does not allow for full parallelism but is highly specialized to tridiagonal matrices. We developed benchmark tests both in isolation and integrated into OGSTM-BFM.
In isolation, the cuSPARSE-based variant is about 3.33 times slower than the specialized version. Integrated into OGSTM-BFM, it makes runs 8% slower than the baseline in the tested configuration. These results emphasize that increased parallelism alone does not guarantee improved time-to-solution; they also show that launch-level tuning (MPS, mapping, NUMA placement) is as important as kernel-level optimization. The thesis provides practical insights into GPU reproducibility, solver integration, and performance engineering for large-scale scientific applications.
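The rank-to-GPU mappings mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes 16 MPI ranks per node sharing 4 GPUs (four ranks per GPU, as in the abstract) and compares a block mapping with a round-robin mapping. The function names, the block mapping as the alternative, and the exact meaning of "round-robin" are assumptions made for illustration; the thesis's actual launch scripts are not shown in this record. `CUDA_VISIBLE_DEVICES` and `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` are standard CUDA/MPS environment variables.

```python
# Hypothetical helpers: names and defaults are illustrative, not taken from the thesis.

def block_gpu(local_rank, ranks_per_gpu=4):
    """Block mapping: consecutive local ranks share the same GPU."""
    return local_rank // ranks_per_gpu


def round_robin_gpu(local_rank, gpus_per_node=4):
    """Round-robin mapping: consecutive local ranks cycle through the GPUs."""
    return local_rank % gpus_per_node


def launch_env(local_rank, gpus_per_node=4, mps_thread_pct=50):
    """Environment a per-rank wrapper script might export before starting the model.

    Assumes round-robin mapping and an MPS active-thread limit of 50%.
    """
    return {
        "CUDA_VISIBLE_DEVICES": str(round_robin_gpu(local_rank, gpus_per_node)),
        "CUDA_MPS_ACTIVE_THREAD_PERCENTAGE": str(mps_thread_pct),
    }


if __name__ == "__main__":
    # Print both mappings for the 16 local ranks of one node.
    for rank in range(16):
        print(rank, block_gpu(rank), round_robin_gpu(rank))
```

In a real launch, the local rank would typically come from the scheduler or MPI runtime (for example `SLURM_LOCALID`), and the MPS control daemon would need to be running on each node before the ranks start.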
Archive
https://hdl.handle.net/20.500.11767/151812
https://ricerca.unityfvg.it/handle/20.500.11767/151812
Rights
open access
license: not specified
license uri: n/a