From CPU to GPU in 80 Days
|Goal:||Port substantial parts of Palabos to GPU|
|Dates:||5th of May 2021 - 23rd of July 2021|
|Kick-off event:||Palabos Online Seminar, 5th of May, 10 am CET|
|Community involvement:||Community members are encouraged to try out the code as it evolves, report issues, and suggest improvements. Specific suggested community tasks will be proposed at the beginning of each work package.|
The goal is to adapt substantial parts of the Palabos code base to run on GPU. In this context, GPU also means multi-GPU, because we keep the Palabos MPI framework, which decomposes the overall work among MPI processes, but allow individual processes to be GPU accelerated. Both the user interface and the interface for the development of new models should remain as close as possible to the original Palabos framework.
The STLBM project (https://gitlab.com/unigehpfs/stlbm) serves as a preliminary feasibility study for the present project. It indicates how state-of-the-art GPU performance is achieved within the framework of C++ Parallel Algorithms, which is part of the C++17 standard: no external libraries or code annotations are required. It also provides guidelines regarding the choice of data alignment in memory.
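To give a flavor of this approach, here is a minimal, self-contained sketch (not code from STLBM or Palabos; all names are illustrative) of a per-cell kernel launched through a C++17 parallel algorithm. Built with a stdpar-capable compiler such as nvc++ with -stdpar, the same loop is offloaded to the GPU; with a conventional compiler it runs multi-threaded on the CPU.

```cpp
#include <algorithm>
#include <cstddef>
#include <execution>
#include <numeric>
#include <vector>

int main() {
    std::size_t const numCells = 1'000'000;
    std::vector<double> rho(numCells, 1.0);

    // Explicit index range, since pure C++17 has no counting iterator.
    std::vector<std::size_t> cellIndices(numCells);
    std::iota(cellIndices.begin(), cellIndices.end(), 0);

    // Capture raw pointers by value: the lambda must not refer to host-only objects.
    double* rhoPtr = rho.data();
    std::for_each(std::execution::par_unseq,
                  cellIndices.begin(), cellIndices.end(),
                  [rhoPtr](std::size_t iCell) {
                      rhoPtr[iCell] *= 1.0;   // placeholder for the per-cell LBM work
                  });
}
```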
The implementation strategy explored in this project is summarized as follows:
- As a supplement to the existing MultiBlockLattice, we develop the AcceleratedLattice, which has the same structure but in which individual atomic BlockLattices can be offloaded to accelerators. MultiBlockLattices and AcceleratedLattices can co-exist in an application, and data can be transferred between them. To port an existing CPU application to GPU, for example, the problem setup can be left unchanged with a MultiBlockLattice, which is converted to an AcceleratedLattice before starting the time iterations.
- The data structure is different in the AcceleratedLattice: the MultiBlockLattice uses an array-of-structure format (which is inefficient on GPU) and a variant of the swap pattern (which is not thread safe and therefore does not run out of the box on GPU). The AcceleratedLattice uses a structure-of-array format with the thread-safe AA-pattern (see the layout sketch after this list).
- The approach to collision modeling is very flexible in Palabos, thanks to generic Dynamics objects which can be chained (e.g. Collision Model -> Force Model -> Subgrid-Scale Model). This flexibility cannot be preserved in an efficient GPU code: the desired combinations of collision models need to be enumerated and rewritten as static functions (a sketch follows after this list). In other words, the code of the Dynamics classes cannot be reused. However, this code usually forwards most of its algorithmic work to generic templates, which will be reused.
- Data Processors in Palabos allow the implementation of non-local code portions. They make explicit assumptions about the data layout (array-of-structure) and can therefore no longer be used out of the box. However, the code of data processors usually forwards most of its algorithmic work to generic templates, which will be reused. Furthermore, data processors for problem setup and for data post-processing can be reused, because it is possible to switch back and forth between MultiBlockLattice and AcceleratedLattice.
- To write GPU code, we will try to use an approach based on C++ Parallel Algorithms, which may also allow us to target accelerators other than GPUs. In particular, this may lead to efficient multi-threading on a single multi-core CPU node. We will, however, also explore the need for explicit OpenACC statements, in particular to manage memory transfers (see the data-management sketch after this list).
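The difference between the two data layouts can be illustrated as follows. This is a simplified sketch with hypothetical type names, not the actual Palabos classes, for a D3Q19 lattice with 19 populations per cell:

```cpp
#include <cstddef>
#include <vector>

// Array-of-structure (MultiBlockLattice style): the 19 populations of a cell
// are contiguous in memory; GPU threads working on neighboring cells touch
// memory locations that are far apart, which prevents coalesced accesses.
struct CellAoS {
    double f[19];
};
using LatticeAoS = std::vector<CellAoS>;   // population k of cell i: lattice[i].f[k]

// Structure-of-array (AcceleratedLattice style): the values of one population
// are contiguous across all cells, so neighboring GPU threads read and write
// neighboring memory locations.
struct LatticeSoA {
    std::size_t numCells;
    std::vector<double> f;                 // size 19 * numCells
    double& pop(std::size_t k, std::size_t iCell) {
        return f[k * numCells + iCell];
    }
};
```

With the AA-pattern, this single array is read and written in place, alternating between two kinds of time steps; no second copy of the populations is needed, and each cell update only writes to memory locations that no other thread touches, which is what makes it thread safe.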
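As an illustration of what "rewritten as static functions" means, here is a sketch of a plain BGK collision on a D2Q9 cell, written as a single free function with no virtual dispatch. This is illustrative code, not taken from Palabos; a forced or subgrid-scale variant would be another enumerated function of this kind rather than a chained Dynamics object:

```cpp
#include <array>

struct D2Q9 {
    static constexpr int q = 9;
    static constexpr double w[q] = {4./9., 1./9., 1./9., 1./9., 1./9.,
                                    1./36., 1./36., 1./36., 1./36.};
    static constexpr int c[q][2] = {{0,0}, {1,0}, {0,1}, {-1,0}, {0,-1},
                                    {1,1}, {-1,1}, {-1,-1}, {1,-1}};
};

// Fully static BGK collision: compute density and velocity, then relax the
// populations towards the second-order equilibrium with rate omega.
inline void bgkCollide(std::array<double, D2Q9::q>& f, double omega) {
    double rho = 0., ux = 0., uy = 0.;
    for (int k = 0; k < D2Q9::q; ++k) {
        rho += f[k];
        ux  += f[k] * D2Q9::c[k][0];
        uy  += f[k] * D2Q9::c[k][1];
    }
    ux /= rho; uy /= rho;
    double const uSqr = ux*ux + uy*uy;
    for (int k = 0; k < D2Q9::q; ++k) {
        double const cu  = 3. * (D2Q9::c[k][0]*ux + D2Q9::c[k][1]*uy);
        double const fEq = D2Q9::w[k] * rho * (1. + cu + 0.5*cu*cu - 1.5*uSqr);
        f[k] += omega * (fEq - f[k]);
    }
}
```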
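Finally, if explicit memory management turns out to be necessary, OpenACC data directives can keep the populations resident on the device across the whole time loop and copy them back only at the end. The following is a minimal sketch under that assumption; the function and variable names are illustrative:

```cpp
#include <cstddef>
#include <vector>

// Keep the population array on the device for the duration of the time loop,
// instead of transferring it back and forth at every iteration.
void runTimeLoop(std::vector<double>& f, int numSteps) {
    double* fPtr = f.data();
    std::size_t const n = f.size();

    #pragma acc enter data copyin(fPtr[0:n])
    for (int iT = 0; iT < numSteps; ++iT) {
        #pragma acc parallel loop present(fPtr[0:n])
        for (std::size_t i = 0; i < n; ++i) {
            fPtr[i] *= 1.0;   // placeholder for the collide-and-stream work
        }
    }
    #pragma acc exit data copyout(fPtr[0:n])
}
```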
The work is split into 5 work packages. Each WP will be announced on the forum, with indications on how interested community members can participate in it. The results achieved at the end of each WP will also be posted on the forum.
Work package 1: Setup of test cases
|Start:||5th of May 2021|
|Community tasks:||Familiarize yourself with the basics of C++ Parallel Algorithms (see e.g. the STLBM project)|
We implement five test cases in the original Palabos, including a built-in performance measurement framework. By the end of the project, some or all of the cases should run on GPU. The cases are:
- Taylor-Green vortex [Uniform collision model, no boundary condition].
- Resolved flow in a porous medium [Mesh-aligned inflow and outflow, bounce-back nodes].
- Multi-component flow segregation with pseudo-potential approach [Multi-phase coupling, no boundary condition].
- Flow around a sphere (no mesh refinement) [Off-lattice boundary condition around the obstacle, subgrid-scale model].
- Flow inside a tube (channel with circular cross-section) [Off-lattice boundary condition at the curved wall, subgrid-scale model].
Work package 2: AcceleratedLattice on CPU
|Start:||19th of May 2021|
Goal: implement the AcceleratedLattice, with structure-of-array and AA-pattern data layout. At the end of WP2, test cases 1, 2, and 3 should run on CPU with the AcceleratedLattice, in MPI mode and in hybrid MPI - multi-threading mode.
Work package 3: AcceleratedLattice on GPU
|Start:||2nd of June 2021|
Goal: run test cases 1, 2, and 3 on GPU. At this stage, we still work with uniform collision terms and without data processors on AcceleratedLattices (except for the multi-component coupling of test-case 3).
Work package 4: Framework for dynamics objects and data processors
|Start:||16th of June 2021|
Goal: develop a framework to port a reasonable number of collision models, including chained collision terms, to GPU. Same goal for data processors. Attempt to port test cases 4 and 5 to GPU.
Work package 5: Improvement, acceleration
|Start:||30th of June 2021|
|End:||23rd of July 2021|
Extend the capabilities of the GPU framework, run benchmarks, and propose performance improvements.