
From CPU to GPU in 80 Days

 
Goal: Port substantial parts of Palabos to GPU
Dates: 5th of May 2021 - 23rd of July 2021
Kick-off event: Palabos Online Seminar, 5th of May, 10 am CET
Repository: https://gitlab.com/unigehpfs/palabos
Detailed project results: Readme file
Benchmarks: Online Spreadsheet
Status: Project complete

Project Results

As of today, the project "From CPU to GPU in 80 days" is complete. All three examples (Taylor-Green vortex, flow through porous media, multi-component flow) can be executed on GPU and reach performance close to that of a simple demo code (STLBM) written with the formalism of C++ Parallel Algorithms. The performance reached on an NVIDIA RTX 3090 GPU (a high-end gaming GPU) is summarized in the image below.

[Figure: single-GPU performance of the three test cases on an NVIDIA RTX 3090 (graph2.png)]

The dashed line represents the performance achieved by the STLBM code for the Taylor-Green vortex, showing that the formalism of parallel algorithms was integrated into the Palabos platform with very little loss of performance.

The GPU code also works with Palabos' MPI formalism, which allows multi-GPU execution. Multi-GPU performance metrics have not yet been obtained, however, and will be provided later.

Overview

The goal is to adapt substantial parts of the Palabos code base to run on GPU. In this context, GPU also means multi-GPU, because we keep the Palabos MPI framework, which decomposes the overall work into MPI processes, and allow individual processes to be GPU accelerated. Both the user interface and the interface for the development of new models should remain close to the original Palabos framework where possible.

The STLBM project (https://gitlab.com/unigehpfs/stlbm) serves as a preliminary feasibility study for the present project. It indicates how state-of-the-art GPU performance is achieved within the framework of C++ Parallel Algorithms, which is part of the C++17 standard: no external libraries or code annotations are required. It also provides guidelines regarding the choice of data alignment in memory.
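As a minimal illustration of this formalism (a sketch written for this page, not the STLBM code itself), the loop below applies a per-cell kernel and a reduction with C++17 Parallel Algorithms. With a compiler that offloads parallel algorithms (for example nvc++ with -stdpar=gpu), the same code can run on a GPU without any annotations or external libraries.

    // Minimal sketch of the C++17 Parallel Algorithms formalism (not the STLBM code).
    // The same loop body runs sequentially, multi-threaded, or on GPU depending on
    // the execution policy and the compiler back-end (e.g. nvc++ -stdpar=gpu).
    #include <algorithm>
    #include <execution>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<double> rho(1000000, 1.0);  // stand-in for per-cell data

        // Apply a per-cell kernel in parallel; no annotations are required.
        std::for_each(std::execution::par_unseq, rho.begin(), rho.end(),
                      [](double& r) { r = 0.99 * r + 0.01; });  // dummy "collision"

        // Parallel reduction, e.g. to monitor the average density.
        double total = std::reduce(std::execution::par_unseq, rho.begin(), rho.end());
        (void)total;
        return 0;
    }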

The implementation strategy explored in this project is summarized as follows:

  • As a supplement to the existing MultiBlockLattice, we develop the AcceleratedLattice, which has the same structure, but in which individual atomic BlockLattices can be offloaded to accelerators. MultiBlockLattices and AcceleratedLattices can co-exist in an application, and data can be transferred between them. To port an existing CPU application to GPU, for example, the problem setup can be left unchanged with a MultiBlockLattice, which is converted to an AcceleratedLattice before starting the time iterations.
  • The data structure is different in the AcceleratedLattice: the MultiBlockLattice uses an array-of-structures format (which is inefficient on GPU) and a variant of the swap pattern (which is not thread safe, and therefore does not run out of the box on GPU). The AcceleratedLattice uses a structure-of-arrays format with the thread-safe AA-pattern (see the layout sketch after this list).
  • The approach to collision modeling is very flexible in Palabos, thanks to generic Dynamics objects which can be chained (e.g. Collision Model -> Force Model -> Subgrid-scale Model). This is no longer possible if the GPU code is to be efficient: the desired combinations of collision models need to be enumerated and rewritten as static functions. In other words, the code of the Dynamics classes cannot be reused. However, this code usually forwards most of its algorithmic work to generic templates, which will be reused.
  • Data Processors in Palabos allow the implementation of non-local code portions. They make explicit assumptions on the data layout (array-of-structures) and can therefore no longer be used out of the box. However, the code of data processors usually forwards most of its algorithmic work to generic templates, which will be reused. Furthermore, data processors for problem setup and for data post-processing can be reused, because it is possible to switch back and forth between MultiBlockLattice and AcceleratedLattice.
  • To write GPU code, we will try to use an approach based on C++ Parallel Algorithms, which may also allow us to target accelerators other than GPUs. In particular, this may lead to efficient multi-core multi-threading on a single CPU node. We will, however, also explore the need for explicit OpenACC statements, in particular to manage memory transfers.
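To make the layout difference mentioned in the second point concrete, here is a rough sketch of the two memory arrangements for a D3Q19 lattice. The names CellAoS and LatticeSoA are illustrative only and do not correspond to actual Palabos classes.

    // Rough sketch of the two memory layouts for a D3Q19 lattice
    // (illustrative only; the actual Palabos containers are more involved).
    #include <cstddef>
    #include <vector>

    constexpr int q = 19;  // populations per cell (D3Q19)

    // Array-of-structures (layout of the MultiBlockLattice): all populations of one
    // cell are contiguous. Convenient on CPU, but leads to strided accesses on GPU.
    struct CellAoS { double f[q]; };
    using LatticeAoS = std::vector<CellAoS>;

    // Structure-of-arrays (layout of the AcceleratedLattice): population i of all
    // cells is contiguous, so consecutive GPU threads touch consecutive addresses.
    struct LatticeSoA {
        std::size_t nCells;
        std::vector<double> f;  // size q * nCells
        double& pop(int i, std::size_t cell) { return f[std::size_t(i) * nCells + cell]; }
    };

    int main() {
        LatticeSoA lat{1000, std::vector<double>(q * 1000, 0.0)};
        lat.pop(5, 42) = 1.0;  // population 5 of cell 42
        return 0;
    }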

The work is split into 5 work packages. Each WP will be announced on the forum, including indications on how interested community members can participate. The results achieved at the end of each WP will again be posted on the forum.

Work package 1: Setup of test cases

Start: 5th of May 2021
Community tasks: Familiarize yourself with the basics of C++ Parallel Algorithms (see e.g. the STLBM project)


Goal:

We implement five test cases in the original Palabos, including a built-in performance measurement framework. By the end of the project, some or all of the cases should run on GPU.

Results:

  • Taylor-Green vortex: Code
  • Resolved flow in a porous medium: Code
  • Multi-component flow (Rayleigh-Taylor instability): Code

 

Work package 2: AcceleratedLattice on CPU

Start: 19th of May 2021
Community tasks:
  • Run the benchmark cases on your own hardware, and explore the efficiency (with MPI, on one or multiple nodes). If you let us know (email, forum), we can add your measurement to the performance spreadsheet.
  • Continue learning about the basics of C++ Parallel Algorithms (see e.g. the STLBM project). This topic will also be presented at the online AMS Seminar series on Thursday May 20.


Goal:

We implement the AcceleratedLattice, with structure-of-array and AA-pattern data layout. At the end of WP2, test cases 1, 2, and 3 should run on CPU with the AcceleratedLattice, in MPI mode and in hybrid MPI - multi-threading mode.

Results:

The AcceleratedLattice is fully implemented and tested on CPU. For the first time in the history of Palabos, it is possible to run hybrid MPI / multi-threaded simulations. Multithreading is implemented both in terms of OpenMP and parallel algorithms. The AcceleratedLattice is implemented in terms of a double-population approach, instead of the originally planned AA-pattern, which turned out to be too technical for the strict time frame of the project.
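The double-population scheme can be pictured as follows. This is a simplified toy sketch, not the Palabos implementation: two full copies of the populations are kept, each time step reads from one copy and writes to the other, and the copies are swapped afterwards, which makes the parallel per-cell update thread safe at the cost of a larger memory footprint than the AA-pattern.

    // Simplified sketch of the double-population scheme (not the Palabos code):
    // every time step reads from one buffer and writes to the other, so the
    // per-cell update is thread safe; the two buffers are swapped after each step.
    #include <algorithm>
    #include <execution>
    #include <numeric>
    #include <vector>

    int main() {
        constexpr int q = 19;                 // populations per cell
        constexpr std::size_t nCells = 1000;  // toy lattice size
        std::vector<double> fA(q * nCells, 1.0 / q), fB(q * nCells);

        std::vector<std::size_t> cells(nCells);
        std::iota(cells.begin(), cells.end(), std::size_t{0});

        for (int t = 0; t < 10; ++t) {
            // Dummy per-cell update standing in for collide-and-stream:
            // it reads only fA and writes only the current cell's entries of fB.
            std::for_each(std::execution::par_unseq, cells.begin(), cells.end(),
                          [fIn = fA.data(), fOut = fB.data()](std::size_t c) {
                              for (int i = 0; i < q; ++i)
                                  fOut[i * nCells + c] = 0.9 * fIn[i * nCells + c];
                          });
            std::swap(fA, fB);  // the output becomes the input of the next step
        }
        return 0;
    }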

See the detailed results of the work package to learn how to use the AcceleratedLattice.

Work package 3: AcceleratedLattice on GPU

Start: 2nd of June 2021
Community tasks: TBA


Goal:

Run test cases 1, 2, and 3 on GPU. At this stage, we still work with uniform collision terms and without data processors on AcceleratedLattices (except for the multi-component coupling of test case 3).

Results:

All three test cases run on GPU with good performance. See the README file for details.

Work package 4: Framework for dynamics objects and data processors

Start: 16th of June 2021
Community tasks: TBA


Goal:

Develop a framework to port a reasonable number of collision models, including chained collision terms, to GPU. Same goal for data processors. Attempt to port test cases 4 and 5 to GPU.

Results:

A framework was implemented which replaces the array of dynamics objects with a tag matrix, allowing custom cell-based collision models without function calls through function pointers (which are problematic on GPU). Furthermore, a technique was implemented to provide custom additional data to every cell, and to reduce the size of the produced CUDA kernels (see the README file for details).
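The tag-matrix idea can be illustrated as follows. This is a schematic sketch with made-up names (CollisionTag, BGK, BounceBack), not the actual Palabos framework: each cell carries a small integer tag, and the GPU kernel selects the corresponding static collision code in a switch statement instead of calling a virtual Dynamics method through a pointer.

    // Schematic sketch of tag-based collision dispatch (not the Palabos framework):
    // each cell stores an integer tag instead of a pointer to a Dynamics object,
    // and the kernel selects a static collision branch in a switch statement.
    #include <algorithm>
    #include <execution>
    #include <numeric>
    #include <vector>

    enum CollisionTag : int { BGK = 0, BounceBack = 1 };

    int main() {
        constexpr int q = 19;
        constexpr std::size_t nCells = 1000;
        std::vector<double> f(q * nCells, 1.0 / q);
        std::vector<int> tag(nCells, BGK);                      // per-cell tag matrix
        std::fill(tag.begin(), tag.begin() + 10, BounceBack);   // e.g. wall cells

        std::vector<std::size_t> cells(nCells);
        std::iota(cells.begin(), cells.end(), std::size_t{0});

        std::for_each(std::execution::par_unseq, cells.begin(), cells.end(),
                      [fp = f.data(), tp = tag.data()](std::size_t c) {
                          switch (tp[c]) {
                              case BGK:  // placeholder for a static BGK collision
                                  for (int i = 0; i < q; ++i) fp[i * nCells + c] *= 0.95;
                                  break;
                              case BounceBack:  // placeholder for bounce-back
                                  break;        // populations left unchanged in this toy example
                          }
                      });
        return 0;
    }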

Work package 5: Improvement, acceleration

Start: 30th of June 2021
End: 23rd of July 2021
Community tasks: TBA

 

Goal:
Extend the capabilities of the GPU framework, run benchmarks, and propose performance improvements.

Results:

Single-GPU performance has been carefully optimized and is now close to the best achievable (see the graph at the beginning of this page). Multi-GPU capabilities of the code are available, but thorough performance measurements are postponed and will not be included in this project.