From CPU to GPU in 80 Days
As of today, the project "From CPU to GPU in 80 days" is complete. All three examples (Taylor-Green vortex, flow through porous media, multi-component flow) run on GPU and reach a performance close to that of a simple demo code (STLBM), using the formalism of C++ Parallel Algorithms. The performance reached on an NVIDIA RTX 3090 GPU (a high-end gaming GPU) is summarized in the image below.
The dashed line represents the performance achieved by the STLBM code for the Taylor-Green vortex, showing that the integration of the formalism of parallel algorithms into the Palabos platform worked really well.
The GPU code also works with Palabos' MPI formalism, which allows multi-GPU execution. Multi-GPU performance metrics have not yet been obtained, though, and will be provided later.
The goal is to adapt substantial parts of the Palabos code base to run on GPU. In this context, GPU also means multi-GPU, because we keep the Palabos MPI framework, which decomposes the overall work into MPI processes, but allow specific processes to be GPU accelerated. Both the user interface and the interface for the development of new models should remain close to the original Palabos framework where possible.
The STLBM project (https://gitlab.com/unigehpfs/stlbm) serves as a preliminary feasibility study for the present project. It indicates how state-of-the-art GPU performance is achieved within the framework of C++ Parallel Algorithms, which is part of the C++17 standard: no external libraries or code annotations are required. It also provides guidelines regarding the choice of data alignment in memory.
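As a minimal sketch of what this formalism looks like (not code taken from STLBM or Palabos), the function below relaxes every value of a field toward a target with a single `std::for_each` call. The names `relaxAll`, `field`, `target`, and `omega`, as well as the range check on `omega`, are assumptions made for illustration. The execution policy `std::execution::par_unseq` is the only parallel construct. Whether it maps to CPU threads or to a GPU depends on the compiler and its flags (for example `nvc++ -stdpar=gpu` for NVIDIA GPUs, or a standard library with a threading backend on CPU).

```cpp
#include <algorithm>
#include <cassert>
#include <execution>
#include <vector>

// Hypothetical example: relax each entry of `field` toward `target`.
// Every element is updated independently of the others, so the loop
// body is safe to execute in parallel and in any order.
void relaxAll(std::vector<double>& field, double target, double omega)
{
    assert(omega > 0.0 && omega < 2.0);  // assumed valid relaxation range
    std::for_each(std::execution::par_unseq, field.begin(), field.end(),
                  [target, omega](double& value) {
                      value += omega * (target - value);
                  });
}
```

Calling `relaxAll(field, 1.0, 0.5)` moves every entry halfway toward 1.0. The lambda captures its parameters by value, which is the usual precaution when the body may execute on a device.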
The implementation strategy explored in this project is summarized as follows:
- As a supplement to the existing MultiBlockLattice, we develop the AcceleratedLattice, which has the same structure, but in which individual atomic BlockLattices can be offloaded to accelerators. MultiBlockLattices and AcceleratedLattices can co-exist in an application, and data can be transferred between them. To port an existing CPU application to GPU, for example, the problem setup can be left unchanged with a MultiBlockLattice, which is converted to an AcceleratedLattice before starting the time iterations.
- The data structure is different in the AcceleratedLattice: the MultiBlockLattice uses an array-of-structure format (which is inefficient on GPU) and a variant of the Swap pattern (which is not thread-safe, and therefore doesn't run out of the box on GPU). The AcceleratedLattice uses a structure-of-array format with the thread-safe AA-pattern.
- The approach to collision modeling is very flexible in Palabos, thanks to generic Dynamics objects which can be chained (e.g. Collision Model -> Force Model -> Subgrid-scale Model). This flexibility cannot be maintained in an efficient GPU code: the desired combinations of collision models need to be enumerated and rewritten as static functions. In other words, the code of the Dynamics classes cannot be reused. However, this code usually forwards most of its algorithmic work to generic templates, which will be reused.
- Data Processors in Palabos allow the implementation of non-local code portions. They make explicit assumptions about the data layout (array-of-structure) and can therefore no longer be used out of the box. However, the code of data processors usually forwards most of its algorithmic work to generic templates, which will be reused. Furthermore, data processors for problem setup and for data post-processing can be reused, because it is possible to switch back and forth between MultiBlockLattice and AcceleratedLattice.
- To write GPU code, we will try to use an approach based on C++ Parallel Algorithms, which may also allow us to target accelerators other than GPUs. In particular, this may lead to efficient multi-threading on a single multi-core CPU node. We will, however, also explore the need for explicit OpenACC statements, in particular to manage memory transfers.
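The difference between the two data layouts mentioned in the list above can be sketched with two index functions. This is an illustration under simplifying assumptions (a flat array of `numCells` cells with `q` populations each), not the actual Palabos data structure. The names `aosIndex`, `soaIndex`, and `aosToSoa` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array-of-structure: the q populations of one cell are contiguous.
inline std::size_t aosIndex(std::size_t cell, std::size_t pop, std::size_t q)
{
    return cell * q + pop;
}

// Structure-of-array: population `pop` of all cells is contiguous, so
// threads that handle neighboring cells access neighboring addresses.
inline std::size_t soaIndex(std::size_t cell, std::size_t pop,
                            std::size_t numCells)
{
    return pop * numCells + cell;
}

// Hypothetical helper: copy an AoS buffer into a new SoA buffer.
std::vector<double> aosToSoa(const std::vector<double>& aos,
                             std::size_t numCells, std::size_t q)
{
    assert(aos.size() == numCells * q);
    std::vector<double> soa(aos.size());
    for (std::size_t cell = 0; cell < numCells; ++cell)
        for (std::size_t pop = 0; pop < q; ++pop)
            soa[soaIndex(cell, pop, numCells)] = aos[aosIndex(cell, pop, q)];
    return soa;
}
```

With one GPU thread per cell, the structure-of-array ordering lets adjacent threads read adjacent memory locations for a given population, which is the usual argument for preferring it on GPU.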
The work is split into 5 work packages. Each WP will be announced on the forum, including indications for interested community members to participate in it. The results achieved at the end of each WP will again be posted on the forum.
Work package 1: Setup of test cases
|Start:||5th of May 2021|
|Community tasks:||Familiarize yourself with the basics of C++ Parallel Algorithms (see e.g. the STLBM project)|
We implement five test cases in the original Palabos, including a built-in performance measurement framework. By the end of the project, some or all of the cases should run on GPU.
|Resolved flow in a porous medium||Code|
|Multi-component flow: Rayleigh-Taylor instability.||Code|
Work package 2: AcceleratedLattice on CPU
|Start:||19th of May 2021|
We implement the AcceleratedLattice, with structure-of-array and AA-pattern data layout. At the end of WP2, test cases 1, 2, and 3 should run on CPU with the AcceleratedLattice, in MPI mode and in hybrid MPI - multi-threading mode.
The AcceleratedLattice is fully implemented and tested on CPU. For the first time in the history of Palabos, it is possible to run hybrid MPI / multi-threaded simulations. Multithreading is implemented both in terms of OpenMP and parallel algorithms. The AcceleratedLattice is implemented in terms of a double-population approach, instead of the originally planned AA-pattern, which turned out to be too technical for the strict time frame of the project.
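To illustrate the idea behind a double-population approach, here is a minimal sketch of a streaming step on a periodic 1D lattice with three populations per cell. It is not the AcceleratedLattice implementation. The type `TwoPopulationLattice` and its members are hypothetical, and the structure-of-array indexing is an assumption carried over from the description above. The lattice keeps two copies of the populations, one that is only read and one that is only written during a time step, at the price of twice the memory of an in-place scheme such as the AA-pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical 1D lattice with three populations per cell, stored in
// structure-of-array order: index = pop * numCells + cell.
// Population 0 is at rest, 1 moves by +1 cell, 2 moves by -1 cell.
struct TwoPopulationLattice {
    std::size_t numCells;
    std::vector<double> current;  // only read during a time step
    std::vector<double> next;     // only written during a time step

    explicit TwoPopulationLattice(std::size_t n)
        : numCells(n), current(3 * n, 0.0), next(3 * n, 0.0) {}

    // Streaming with periodic boundaries. Each cell writes only to its own
    // slots of `next` and reads only from `current`, so the cells can be
    // processed in any order without data races.
    void stream()
    {
        assert(numCells > 0);
        // Offset (modulo numCells) of the cell a population arrives from.
        const std::size_t sourceShift[3] = {0, numCells - 1, 1};
        for (std::size_t cell = 0; cell < numCells; ++cell) {
            for (std::size_t pop = 0; pop < 3; ++pop) {
                std::size_t source = (cell + sourceShift[pop]) % numCells;
                next[pop * numCells + cell] = current[pop * numCells + source];
            }
        }
        std::swap(current, next);  // the new state becomes the current one
    }
};
```

Because no cell ever overwrites data that another cell still needs to read, the outer loop could be handed to a parallel algorithm or an OpenMP loop without further synchronization.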
See the detailed results of the work package to learn how to use the AcceleratedLattice.
Work package 3: AcceleratedLattice on GPU
|Start:||2nd of June 2021|
Run test cases 1, 2, and 3 on GPU. At this stage, we still work with uniform collision terms and without data processors on AcceleratedLattices (except for the multi-component coupling of test-case 3).
All three test cases run on GPU with good performance. See the README file for details.
Work package 4: Framework for dynamics objects and data processors
|Start:||16th of June 2021|
Develop a framework to port a reasonable number of collision models, including chained collision terms, to GPU. Same goal for data processors. Attempt to port test cases 4 and 5 to GPU.
A framework was implemented which converts the array of dynamics objects into a tag matrix, allowing cell-based custom collision models without calls through function pointers (which are problematic on GPU). Furthermore, a technique was implemented to provide custom additional data to every cell, and to reduce the size of the produced CUDA kernel (see the README file for details).
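The principle of tag-based dispatch can be sketched as follows. This is a simplified illustration, not the framework described in the README. The enum `CollisionTag`, the functions `collideRelax`, `collideBounceBack`, and `collideAll`, and the three-population cell are all hypothetical, and the relaxation step uses a fixed zero-velocity equilibrium for brevity. Each cell carries a small integer tag, and a `switch` on that tag selects a statically known collision function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical tags, one per supported collision model.
enum class CollisionTag : std::uint8_t { Relax, BounceBack };

// Simplified relaxation toward a zero-velocity equilibrium
// (three populations: rest, +1, -1).
inline void collideRelax(double* f, double omega)
{
    const double weight[3] = {2.0 / 3.0, 1.0 / 6.0, 1.0 / 6.0};
    const double rho = f[0] + f[1] + f[2];
    for (int i = 0; i < 3; ++i)
        f[i] += omega * (weight[i] * rho - f[i]);
}

// Bounce-back: exchange the two populations with opposite velocities.
inline void collideBounceBack(double* f)
{
    std::swap(f[1], f[2]);
}

// The per-cell tag selects the model through a switch on plain data,
// not through a virtual call or a function pointer. The populations are
// stored cell by cell here only to keep the sketch short.
void collideAll(std::vector<double>& populations,
                const std::vector<CollisionTag>& tags, double omega)
{
    assert(populations.size() == 3 * tags.size());
    for (std::size_t cell = 0; cell < tags.size(); ++cell) {
        double* f = &populations[3 * cell];
        switch (tags[cell]) {
            case CollisionTag::Relax:      collideRelax(f, omega); break;
            case CollisionTag::BounceBack: collideBounceBack(f);   break;
        }
    }
}
```

Since all branches are known at compile time, the compiler can inline them into a single kernel. The trade-off is that every supported model must be listed in the `switch`.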
Work package 5: Improvement, acceleration
|Start:||30th of June 2021|
|End:||23rd of July 2021|
Extend the capabilities of the GPU framework, run benchmarks, propose performance improvements.
Single-GPU performance has been carefully optimized and is now close to that of the STLBM demo code (see graph at the beginning of this page). Multi-GPU capabilities of the code are available, but thorough performance measurements are postponed and will not be included in this project.