

#### FERMILAB-SLIDES-23-038-CMS-CSAID

This manuscript has been authored by Fermi Research Alliance, LLC under Contract No. DE-AC02-07CH11359 with the U.S. Department of Energy, Office of Science, Office of High Energy Physics.

> **Evaluating Performance Portability with the CMS Heterogeneous Pixel Reconstruction code**

N. Andriotis<sup>1</sup>, A. Bocci<sup>2</sup>, E. Cano<sup>2</sup>, L. Cappelli<sup>3</sup>, M. Dewing<sup>4</sup>, T. Di Pilato<sup>5,6</sup>, J. Esseiva<sup>7</sup>, L. Ferragina<sup>8</sup>, G. Hugo<sup>2</sup>, **M. Kortelainen**<sup>9</sup>, M. Kwok<sup>9</sup>, J. J. Olivera Loyola<sup>10</sup>, F. Pantaleo<sup>2</sup>, A. Perego<sup>11</sup>, W. Redjeb<sup>2,12</sup>

<sup>1</sup>BSC, <sup>2</sup>CERN, <sup>3</sup>INFN Bologna, <sup>4</sup>ANL, <sup>5</sup>CASUS, <sup>6</sup>University of Geneva, <sup>7</sup>LBNL, <sup>8</sup>University of Bologna, <sup>9</sup>FNAL, <sup>10</sup>ITESM, <sup>11</sup>University of Milano Bicocca, <sup>12</sup>RWTH

CHEP 2023, 11 May 2023






# Introduction


- CMS uses GPUs as part of the High-Level Trigger farm in LHC Run 3
- GPU vendors provide their own APIs, which also differ from how CPUs are programmed
  - Want to minimize development and maintenance effort
  - CMS is moving to code that is portable between CPUs and NVIDIA and AMD GPUs via Alpaka
    - Want to stay aware of the other technologies on the market to guide long-term planning
- Used CMS heterogeneous pixel reconstruction (Patatrack) as a use case: a set of realistic algorithms that use GPUs effectively
- Measure the performance of direct, Alpaka, Kokkos, and SYCL versions on CPU, NVIDIA GPU, and AMD GPU
  - All versions give the same results (within reproducibility accuracy)
  - Results need to be interpreted with some caution
    - The versions using different portability technologies differ from each other in implementation details
- Report initial experience with std::par and OpenMP Target offload



# **CMS Heterogeneous Pixel Reconstruction**

- About 40 kernels organized in 5 "framework modules"
  - <u>arXiv:2008.13461</u>



- Kernels are short (a few µs to ~1 ms), so performance is sensitive to overheads
- Raw pixel detector data (~250 kB/event) transferred to the GPU
- Only final results transferred back to the CPU: ~4 MB for tracks, ~90 kB for vertices
  - Not considered in throughput measurements in this talk
- Extracted into a standalone program to enable rapid prototyping
  - Flexible GNU Make-based build system
  - Simple framework mimicking CMSSW's use of oneTBB tasks
  - Disk I/O contribution to time measurements is ignored
    - 1000 events from TTbar + pileup 50 simulation from <u>CMS Open Data</u> read at the beginning of the job and recycled
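The measurement setup described above (a sample preloaded into memory, recycled, with disk I/O kept out of the timing) could be sketched roughly as follows. This is a minimal illustration only: `Event`, `recycledIndex`, and `measureThroughput` are hypothetical names and are not taken from the standalone program.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one event's raw pixel data.
struct Event {
  std::vector<unsigned char> raw;
};

// Index of the preloaded event used for the i-th processed event:
// the in-memory sample is recycled once it has been exhausted.
std::size_t recycledIndex(std::size_t i, std::size_t sampleSize) {
  assert(sampleSize > 0);
  return i % sampleSize;
}

// Process nToProcess events from a preloaded sample and return events/s.
// The sample is read before this call, so disk I/O is outside the timed region.
template <typename Process>
double measureThroughput(const std::vector<Event>& sample, std::size_t nToProcess,
                         Process&& process) {
  assert(!sample.empty());
  const auto start = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < nToProcess; ++i) {
    process(sample[recycledIndex(i, sample.size())]);
  }
  const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
  return static_cast<double>(nToProcess) / elapsed.count();
}
```

With a 1000-event sample, event number 2500 of a job would reuse preloaded event 500.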





# Alpaka and Kokkos versions are most mature

- <u>Alpaka</u> (earlier reported in ACAT 21: <u>J. Phys. Conf. Ser. 2438 012058</u>)
  - Thin, header-only, templated C++ library, abstraction level similar to CUDA
    - Backends include serial, OpenMP 2, std::thread, CUDA, HIP, SYCL (experimental)
  - Flexible to work with
    - E.g. can build a single application that supports multiple GPU backends
  - Somewhat more verbose syntax compared to others
- Kokkos (earlier reported in vCHEP 21: EPJ Web Conf. 251 03034)
  - Templated C++ library, higher abstraction level than CUDA
    - Backends include serial, OpenMP, CUDA, HIP, HPX, OpenMP-Target, SYCL (experimental)
  - Implements a higher-than-CUDA level programming model on top of the low-level APIs
  - Constrains how the application code is built
  - Have had to understand what Kokkos does between developer and vendor API
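The general idea behind such portability layers, a kernel written once and dispatched to a backend chosen at compile time, can be illustrated with a small standard-C++ sketch. It deliberately uses neither the Alpaka nor the Kokkos API: `SerialBackend`, `ThreadsBackend`, `parallelFor`, and `saxpy` are hypothetical names introduced only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical backend tags; real libraries provide their own.
struct SerialBackend {};
struct ThreadsBackend {
  unsigned nThreads = 4;
};

// Run kernel(i) for every i in [0, n) on the calling thread.
template <typename Kernel>
void parallelFor(SerialBackend, std::size_t n, Kernel kernel) {
  for (std::size_t i = 0; i < n; ++i) kernel(i);
}

// Same interface, but the index range is split across std::thread workers.
template <typename Kernel>
void parallelFor(ThreadsBackend backend, std::size_t n, Kernel kernel) {
  assert(backend.nThreads > 0);
  const std::size_t chunk = (n + backend.nThreads - 1) / backend.nThreads;
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < backend.nThreads; ++t) {
    const std::size_t begin = std::min(n, t * chunk);
    const std::size_t end = std::min(n, begin + chunk);
    workers.emplace_back([=] {
      for (std::size_t i = begin; i < end; ++i) kernel(i);
    });
  }
  for (auto& w : workers) w.join();
}

// "Kernel" written once against the abstraction: out[i] = a * x[i] + y[i].
template <typename Backend>
std::vector<float> saxpy(Backend backend, float a, const std::vector<float>& x,
                         const std::vector<float>& y) {
  assert(x.size() == y.size());
  std::vector<float> out(x.size());
  float* o = out.data();
  const float* px = x.data();
  const float* py = y.data();
  parallelFor(backend, x.size(), [=](std::size_t i) { o[i] = a * px[i] + py[i]; });
  return out;
}
```

Calling `saxpy(SerialBackend{}, ...)` and `saxpy(ThreadsBackend{3}, ...)` runs the same kernel body on either backend. A GPU backend would add device memory management and kernel launches behind the same kind of interface.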





# **Performance measurements**

- Performance measurements done using the resources of the Joint Laboratory for System Evaluation at Argonne National Laboratory
- CPUs:
  - 2-socket Intel Xeon Platinum 8176 (Skylake): 28 cores and 56 threads x 2
  - 1-socket AMD EPYC 7532 (Milan): 32 cores and 32 threads
  - Measure total throughput of full node
    - N processes of M threads each, such that N × M = number of hardware threads
- GPUs:
  - NVIDIA: A100 (19.5 FP32 TFLOPS) and A40 (37.4 FP32 TFLOPS)
  - AMD: MI100 (23.1 FP32 TFLOPS) and MI250 (90.5 FP32 TFLOPS)
  - Measure the throughput on a single GPU by increasing the number of concurrent events
  - Node has no other activity
- Take average of 4 executions
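The full-node CPU configurations (N processes of M threads filling all hardware threads) can be enumerated with a short helper. This is a sketch under the assumption that every exact factorization is a candidate; the slides do not state which (N, M) combinations were actually scanned, and the function names are hypothetical.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hardware threads on a node built from identical sockets.
unsigned hardwareThreads(unsigned sockets, unsigned threadsPerSocket) {
  return sockets * threadsPerSocket;
}

// All (N processes, M threads per process) pairs with N * M == hwThreads.
std::vector<std::pair<unsigned, unsigned>> processThreadSplits(unsigned hwThreads) {
  assert(hwThreads > 0);
  std::vector<std::pair<unsigned, unsigned>> splits;
  for (unsigned n = 1; n <= hwThreads; ++n) {
    if (hwThreads % n == 0) splits.emplace_back(n, hwThreads / n);
  }
  return splits;
}
```

For the 2-socket Intel node, `hardwareThreads(2, 56)` gives 112 hardware threads, for which there are ten such splits, from 1 process × 112 threads to 112 processes × 1 thread.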



# Event processing throughput on CPU "serial backends"



"Serial" = one instance of the backend for each concurrent event






# Peak memory on CPU "serial backends"



# Event processing throughput on CPU "parallel backends"



One event in flight  $\rightarrow$  concurrent event processing is more useful than intra-algorithm parallelism in this case






# **Event processing throughput on NVIDIA GPU**



2023-05-11 Matti Kortelainen | Evaluating Performance Portability with the CMS Heterogeneous Pixel Reconstruction



# Mean GPU and CPU utilization on NVIDIA A40 GPU




# Peak memory usage on NVIDIA A40 GPU



As reported by nvidia-smi and /proc/<PID>/status. A100 shows similar behavior.
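On Linux, the host-side number can be read from the status file with a few lines of code. The sketch below parses text in the `/proc/<PID>/status` format. Which field the measurements used is not stated on the slide, so taking `VmHWM` (the peak resident set size) is an assumption, and `statusFieldKb` is a hypothetical helper name.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Return the value (in kB) of a field such as "VmHWM" from text in the
// format of /proc/<PID>/status, or -1 if the field is missing or malformed.
long statusFieldKb(std::istream& status, const std::string& field) {
  assert(!field.empty());
  const std::string prefix = field + ":";
  std::string line;
  while (std::getline(status, line)) {
    if (line.compare(0, prefix.size(), prefix) == 0) {
      std::istringstream rest(line.substr(prefix.size()));
      long kb = -1;
      if (rest >> kb) return kb;  // the trailing "kB" unit is ignored
      return -1;
    }
  }
  return -1;
}
```

A process can query itself by opening `/proc/self/status` with `std::ifstream` and passing the stream to this function.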




# **Event processing throughput on AMD GPUs**





# Host memory and CPU utilization on AMD MI100 GPU




MI250 shows similar behavior


# SYCL version: complete and runs on some hardware

- SYCL: Specification by the Khronos Group
  - Some notable implementations:
    - Intel's <u>oneAPI DPC++</u> and <u>open-source LLVM</u>
    - Open SYCL (not tested)
  - Allows simultaneous use of multiple backends
- Development of the SYCL version revealed many bugs in Intel LLVM
  - E.g. collective operations on CPU, block shared variables
- Could not replicate, on other machines (e.g. with an A100), the setup that produced a working executable
  - Also did not succeed in compiling for AMD GPUs
- Some kernels are slower than in CUDA; every operation creates a SYCL event, and SYCL events cannot be reused






# std::par version: technically complete

- STL parallel algorithms as implemented by NVIDIA in their HPC SDK
  - Relies on unified memory
- std::par version is complete, but testing is difficult because of compiler bugs
- Abstraction level much higher than Alpaka/Kokkos/SYCL
  - Low barrier for using GPUs in a new codebase
  - A large, optimized CUDA application is easier to map to Alpaka/Kokkos/SYCL
    - std::par requires some algorithmic changes and/or more kernels
    - Hierarchical parallelism, e.g. synchronizing threads of a block, not supported
      - Have to split or rework such kernels
    - No access to CUDA shared memory, need to use global memory and atomics
- Must compile the whole program with nvc++ when offloading for NVIDIA GPU
  - To avoid One Definition Rule violations with e.g. std::vector
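The kind of rework mentioned above, replacing block-level cooperation through shared memory with atomic updates of global memory in a flat loop, might look like the following histogram sketch. It is an illustration only and is not code from the std::par version. `histogram` is a hypothetical function, the input values are assumed to be non-negative, and the loop is left sequential so that it builds without a parallel STL backend. In a std::par build a parallel execution policy would be passed as the first argument of `std::for_each`.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Fill a histogram with one loop iteration per input element and no
// block-level cooperation: every increment is an atomic update of the
// global bin array.
std::vector<int> histogram(const std::vector<int>& values, std::size_t nBins) {
  assert(nBins > 0);
  std::vector<std::atomic<int>> bins(nBins);
  for (auto& b : bins) b.store(0);

  // Flat index space standing in for the GPU thread indices.
  std::vector<std::size_t> indices(values.size());
  std::iota(indices.begin(), indices.end(), std::size_t{0});

  std::atomic<int>* binsPtr = bins.data();
  const int* in = values.data();
  // Sequential here; the body is written so that iterations may run concurrently.
  std::for_each(indices.begin(), indices.end(), [=](std::size_t i) {
    const std::size_t bin = static_cast<std::size_t>(in[i]) % nBins;
    binsPtr[bin].fetch_add(1, std::memory_order_relaxed);
  });

  std::vector<int> result(nBins);
  for (std::size_t b = 0; b < nBins; ++b) result[b] = bins[b].load();
  return result;
}
```

Compared with a CUDA kernel that accumulates per-block partial histograms in shared memory, every update here goes to global memory.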


# **OpenMP Target offload: in progress**

- Compiler pragma-based approach, popular for multithreading e.g. in HPC
- Can use #pragma omp target offload in conjunction with multithreading with oneTBB
- Had lots of problems with compilers, especially in conjunction with Eigen
  - Mostly with LLVM (15, 16, main): targeting NVIDIA and AMD GPU backends
  - NVIDIA HPC SDK: compiles, fails at run time
  - AMD (AOMP, AFAR; amdclang underneath): compiler crashes
  - Intel oneAPI (icpx): compiles, but not pursued further yet
- Preliminary look at the performance of some individual kernels with Nsight Systems
  - OpenMP kernels are slower than corresponding CUDA kernels
  - Much more data movement in OpenMP version compared to direct CUDA version
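As a generic illustration of the pragma-based approach and of how data movement is usually controlled in it, the sketch below wraps two consecutive offloaded loops in one `target data` region, so that the arrays are transferred once rather than once per loop. It is not taken from the OpenMP version of the pixel reconstruction, and `scaleAddThenSquare` is a hypothetical function. When compiled without OpenMP support, the pragmas are ignored and the loops run serially on the host.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Two consecutive "kernels" sharing device-resident data:
// y = a * x + y, followed by y = y * y.
void scaleAddThenSquare(double a, const std::vector<double>& xv, std::vector<double>& yv) {
  assert(xv.size() == yv.size());
  const std::size_t n = xv.size();
  const double* x = xv.data();
  double* y = yv.data();

  // One data region around both kernels: x is copied to the device once,
  // y is copied in once and copied back once at the end of the region.
#pragma omp target data map(to: x[0:n]) map(tofrom: y[0:n])
  {
#pragma omp target teams distribute parallel for
    for (std::size_t i = 0; i < n; ++i) y[i] = a * x[i] + y[i];

#pragma omp target teams distribute parallel for
    for (std::size_t i = 0; i < n; ++i) y[i] = y[i] * y[i];
  }
}
```

Without the enclosing data region, each `target` construct would map the arrays it uses on its own, which is one common source of extra transfers.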





# Conclusions

- We have compared the performance of various versions of CMS Heterogeneous Pixel Reconstruction
  - Direct, Alpaka, Kokkos, SYCL on x86 CPU, NVIDIA GPU, and AMD GPU
- Overall, the best performance was achieved with Alpaka
- For this use case, Alpaka was also the easiest to work with
  - Flexible, few constraints added on top of the vendor APIs
- Kokkos: no concurrent instances of Serial backend (yet), often need to understand what Kokkos does in between developer and vendor API
- SYCL: compilation problems, overheads
- std::par: compilation problems, crashes, leads to many more kernels
- OpenMP Target offload: compilation problems, data movement is a concern






# **Related contributions**

- <u>M. Kortelainen: "Performance of Heterogeneous Algorithm Scheduling in CMSSW", Track X Tuesday 15:15</u>
- <u>A. Bocci: "Adoption of the alpaka performance portability library in the CMS software", Track 2 Tuesday 17:00</u>
- Other portability studies from HEP-CCE
  - <u>M. Kwok: "Application of performance portability solutions for GPUs and many-core CPUs to track reconstruction kernels", Track X Monday 11:00</u>
  - <u>M. Atif: "Porting ATLAS FastCaloSim to GPUs with OpenMP Target Offloading", Tuesday poster session</u>
  - V. Tsulaia: "Porting ATLAS FastCaloSim to GPUs with std::par and with Alpaka", Tuesday poster session
  - <u>"Porting ATLAS FastCaloSim to GPUs with Performance Portable Programming Models", Track X Tuesday 15:00</u>
  - <u>"Results from HEP-CCE", Track X Tuesday 11:00</u>





# **Spares**






# **Software versions**

|            | Direct                  | Alpaka<br><u>b518e8c9</u> | Kokkos<br>3.5 or 4.0                  | SYCL<br>Intel LLVM tag<br><u>2022-09</u><br>( <u>0f579ba</u> ) |
|------------|-------------------------|---------------------------|---------------------------------------|----------------------------------------------------------------|
| x86 CPU    | GCC 11.1                | GCC 11.1                  | GCC 11.1<br>Kokkos 3.5                | GCC 8.5                                                        |
| NVIDIA GPU | GCC 11.1<br>CUDA 11.6.2 | GCC 11.1<br>CUDA 11.6.2   | GCC 11.1<br>CUDA 11.6.2<br>Kokkos 3.5 | GCC 8.5<br>CUDA 11.8                                           |
| AMD GPU    | GCC 12.2<br>ROCm 5.4    | GCC 12.2<br>ROCm 5.4      | GCC 12.2<br>ROCm 5.4<br>Kokkos 4.0    |                                                                |

