IAHR Document Library


Proceedings of the 39th IAHR World Congress (Granada, 2022)

Performance Improvement Strategies for an FVM-Based Shallow Water Flow Model on 2D Structured Grids

Author(s): Lennart Steffen; Finn Amann; Reinhard Hinkelmann

Linked Author(s): Lennart Steffen, Reinhard Hinkelmann

Keywords: Performance improvement; Finite volume method; High performance computing; Shallow water equations

Abstract: Different avenues for improving the computational performance of explicit 2D FVM simulations on structured grids are explored, using a MUSCL solver for the depth-averaged shallow water equations (SWE), written in C++. The aim is to provide an overview of possible improvement strategies and their relative impact on computational performance, measured as speedups over a baseline execution time. The key factors under consideration are parallelisation (both MIMD and SIMD) and cache utilisation. For MIMD-type parallelisation, the solver uses OpenMP and MPI, while for SIMD-type parallelisation, the linear algebra library Eigen is employed to facilitate vectorisation.

When designing a solver, the point within the algorithm at which MIMD-type parallelisation is injected has to be decided early on, as a large portion of the code depends on it. Three variants are compared: (1) the time-marching loop is executed serially, with each constituent step parallelised individually; (2) the time-marching loop is executed in parallel, with all constituent steps executed as a whole for each cell individually; (3) the time-marching loop is executed in parallel, with all constituent steps executed as a whole for blocks of cells. While variant 1 allows redundant flux computations to be eliminated, it leads to poor cache utilisation due to evictions between steps. Variant 2 leads to good cache utilisation, but each flux must be calculated twice; since flux computation takes up the largest share of the overall computation time, variant 2 is the slowest of the three. For variant 3, cache utilisation, the number of redundant flux computations and load balancing all depend on the block size. With the right block size, this variant shows the best performance, although it arguably leads to the most complex code.

Cache utilisation for variant 1 depends on the order in which edges are traversed during the flux computation step. Furthermore, concurrent write accesses for storing the flux computation results can be all but eliminated if the traversal pattern is specifically designed for it. A block-wise traversal pattern was found to show the best performance and scalability, while the pattern designed to eliminate concurrent write accesses proved slightly slower, likely due to load balancing issues. Even so, comparing the best cases of variants 1 and 3, variant 3 still showed a relative speedup of about 1.3x. Vectorisation of cell-based steps is relatively straightforward and has been carried out for all three variants. Enabling vectorisation for the edge-based flux computation is more challenging; so far it has only been carried out for variant 3, where it produced a significant additional speedup.
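The abstract describes the three MIMD injection points only in prose. The following is a minimal, hypothetical sketch of their structure using OpenMP, reduced to a first-order 1D update for brevity (the paper's solver is a 2D MUSCL scheme); all names (computeFlux, stepVariant1/2/3, blockSize) are illustrative and not taken from the authors' code.

```cpp
// Sketch of the three MIMD injection points; compile with -fopenmp.
#include <algorithm>
#include <cstddef>
#include <vector>

// Placeholder numerical flux between two neighbouring cell states.
static double computeFlux(double left, double right) {
    return 0.5 * (left + right);
}

// Variant 1: serial time loop; each constituent step is parallelised on
// its own. Every edge flux is computed exactly once, but the flux array
// is typically evicted from cache between the two loops.
void stepVariant1(std::vector<double>& u, std::vector<double>& flux, double dtdx) {
    const std::size_t nEdges = u.size() - 1;
    #pragma omp parallel for
    for (std::size_t e = 0; e < nEdges; ++e)
        flux[e] = computeFlux(u[e], u[e + 1]);   // edge e lies between cells e, e+1
    #pragma omp parallel for
    for (std::size_t c = 1; c < u.size() - 1; ++c)
        u[c] -= dtdx * (flux[c] - flux[c - 1]);
}

// Variant 2: one parallel loop over cells executing all steps per cell.
// Data stays cache-hot, but every interior flux is computed twice.
void stepVariant2(const std::vector<double>& uOld, std::vector<double>& uNew, double dtdx) {
    #pragma omp parallel for
    for (std::size_t c = 1; c < uOld.size() - 1; ++c) {
        const double fL = computeFlux(uOld[c - 1], uOld[c]);  // recomputed by neighbour
        const double fR = computeFlux(uOld[c], uOld[c + 1]);
        uNew[c] = uOld[c] - dtdx * (fR - fL);
    }
}

// Variant 3: one parallel loop over blocks of cells with a per-block flux
// buffer; only the fluxes on block boundaries are computed twice, and the
// block size trades cache footprint against redundancy and load balance.
void stepVariant3(const std::vector<double>& uOld, std::vector<double>& uNew,
                  double dtdx, std::size_t blockSize) {
    const std::size_t n = uOld.size();
    #pragma omp parallel for schedule(static)
    for (std::size_t b = 1; b < n - 1; b += blockSize) {
        const std::size_t end = std::min(b + blockSize, n - 1);
        std::vector<double> f(end - b + 1);      // fluxes for this block only
        for (std::size_t e = 0; e <= end - b; ++e)
            f[e] = computeFlux(uOld[b + e - 1], uOld[b + e]);
        for (std::size_t c = b; c < end; ++c)
            uNew[c] = uOld[c] - dtdx * (f[c - b + 1] - f[c - b]);
    }
}
```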
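The abstract does not specify the traversal patterns that were tested. As a generic illustration of how a traversal order can eliminate concurrent writes when each edge flux is scattered to both adjacent cells, the following hypothetical two-colour sweep processes only non-adjacent edges in each parallel phase; the authors' block-wise 2D patterns are more elaborate than this.

```cpp
// Two-colour edge sweep in 1D: edges of equal parity share no cell, so the
// scatter below is race-free without atomics; compile with -fopenmp.
#include <cstddef>
#include <vector>

// flux has one entry per edge; residual has one entry per cell (nEdges + 1).
void accumulateFluxes(const std::vector<double>& flux, std::vector<double>& residual) {
    const std::size_t nEdges = flux.size();      // edge e lies between cells e and e+1
    for (std::size_t colour = 0; colour < 2; ++colour) {
        #pragma omp parallel for
        for (std::size_t e = colour; e < nEdges; e += 2) {
            residual[e]     -= flux[e];          // outflow from the left cell
            residual[e + 1] += flux[e];          // inflow to the right cell
        }
    }
}
```

The price of such schemes is an extra pass with an implicit barrier in between, which can hurt load balance, consistent with the abstract's observation that the write-conflict-free pattern was slightly slower than the block-wise one.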
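For the cell-based steps, SIMD vectorisation via Eigen amounts to expressing per-cell arithmetic as whole-array operations, which Eigen's expression templates map to packed instructions. A minimal sketch, assuming a structure-of-arrays layout with illustrative names (h, hu, hv for the conserved SWE variables):

```cpp
#include <Eigen/Dense>

using Eigen::ArrayXd;

// Recover primitive variables and the gravity wave celerity from the
// conserved SWE variables for all cells at once; each statement compiles
// to a single vectorised loop over the whole array.
void primitiveUpdate(const ArrayXd& h, const ArrayXd& hu, const ArrayXd& hv,
                     ArrayXd& u, ArrayXd& v, ArrayXd& c, double g = 9.81) {
    u = hu / h;              // depth-averaged x-velocity
    v = hv / h;              // depth-averaged y-velocity
    c = (g * h).sqrt();      // wave celerity, used for the CFL time step
}
```

Edge-based flux computation is harder to vectorise because neighbouring edges share cell data; per the abstract, this was achieved only for the block-wise variant 3.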

DOI: https://doi.org/10.3850/IAHR-39WC252171192022753

Year: 2022

Copyright © 2024 International Association for Hydro-Environment Engineering and Research. All rights reserved.