¹ KTH Royal Institute of Technology, Stockholm, Sweden
² LeCAD, University of Ljubljana, Ljubljana, Slovenia
³ University of Oregon, Eugene, Oregon, The United States of America
⁴ Institute of Plasma Physics of the CAS, Prague, Czech Republic
⁵ Max Planck Computing and Data Facility, Garching and Greifswald, Germany

Understanding the Impact of openPMD on BIT1, a Particle-in-Cell Monte Carlo Code, through Instrumentation, Monitoring, and In-Situ Analysis

Jeremy J. Williams¹, Stefan Costea², Allen D. Malony³, David Tskhakaya⁴, Leon Kos², Ales Podolnik⁴, Jakub Hromadka⁴, Kevin Huck³, Erwin Laure⁵, Stefano Markidis¹

Abstract

Particle-in-Cell Monte Carlo simulations on large-scale systems play a fundamental role in understanding the complexities of plasma dynamics in fusion devices. Efficient handling and analysis of vast datasets are essential for advancing these simulations. Previously, we addressed this challenge by integrating openPMD with BIT1, a Particle-in-Cell Monte Carlo code, streamlining data streaming and storage. This integration not only enhanced data management but also improved write throughput and storage efficiency. In this work, we delve deeper into the impact of BIT1 openPMD BP4 instrumentation, monitoring, and in-situ analysis. Utilizing cutting-edge profiling and monitoring tools such as gprof, CrayPat, Cray Apprentice2, IPM, and Darshan, we dissect BIT1’s performance post-integration, shedding light on computation, communication, and I/O operations. Fine-grained instrumentation offers insights into BIT1’s runtime behavior, while immediate monitoring aids in understanding system dynamics and resource utilization patterns, facilitating proactive performance optimization. Advanced visualization techniques further enrich our understanding, enabling the optimization of BIT1 simulation workflows aimed at controlling plasma-material interfaces with improved data analysis and visualization at every checkpoint without causing any interruption to the simulation.

Keywords:

Performance Monitoring and Analysis, openPMD, Parallel I/O, ADIOS2, gprof, CrayPat, Cray Apprentice2, IPM, Darshan, Distributed Storage, Efficient Data Processing, In-Situ Analysis, Large-Scale PIC Simulations

1 Introduction

Particle-in-Cell (PIC) Monte Carlo (MC) simulations are critical for understanding plasma dynamics in fusion devices, requiring efficient data handling and analysis. Our prior work addressed the critical need for high-throughput parallel I/O in these simulations by integrating openPMD with the BIT1 code, enabling seamless streaming of particle and field information to storage systems. This integration not only improved data handling but also enhanced write throughput and storage efficiency. Building upon this, in this work, we investigate the impact of BIT1 openPMD BP4 instrumentation, monitoring, and in-situ analysis. We utilize state-of-the-art profiling tools like gprof, CrayPat, Cray Apprentice2, IPM, and Darshan to analyze BIT1’s performance post-integration, uncovering insights into computation, communication, and I/O operations. Thorough instrumentation provides fine-grained insights into BIT1’s runtime behavior, while immediate monitoring enhances understanding of system dynamics and resource utilization patterns, facilitating proactive performance tuning and optimization efforts. Advanced visualization techniques further aid in representing data flow, system interactions, and performance bottlenecks, empowering us to optimize BIT1 simulation workflows aimed at controlling plasma-material interfaces with improved data analysis and visualization at every checkpoint without causing any interruption to the simulation.

In this work, we focus on understanding the impact of openPMD enabling high-throughput parallel I/O in BIT1, achieved through comprehensive instrumentation, monitoring, and in-situ analysis. The contributions of this work include:

•

We identify the most computationally intensive parts of BIT1 by applying an I/O adaptor for the openPMD interface that uses ADIOS2 BP4 as the backend, which helps us understand the performance impact of running BIT1 openPMD BP4 on a single node.
•

We apply profiling and monitoring techniques to evaluate the impact of using openPMD to implement parallel I/O compared to traditional file I/O in BIT1 when diagnostics are activated in strong scaling tests.
•

We utilize a customized Python script with the openPMD API and ADIOS2 BP4 backend for real-time checkpoint access and visualization of BIT1 File I/O (from disk), without causing any interruption to the simulation.

2 Background

The PIC method serves as a numerical technique utilized in emulating plasma behaviors. This method governs particle dynamics across one to three dimensions in physical space and typically employs three dimensions in velocity space. In plasma edge scenarios, the PIC method often integrates MC routines to simulate particle collisions and their interactions with the walls of the plasma device chamber. The computational PIC cycle comprises five distinct phases: initiating plasma density calculations via particle-to-grid interpolation, executing a density smoothing operation to eliminate spurious frequencies, employing a field solver to tackle linear systems for electric and magnetic fields, managing particle collisions and wall interactions using MC techniques, and advancing particle positions and velocities over time, as detailed in [10].
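The five phases of the cycle can be pictured with a toy one-dimensional sketch. This is not BIT1's implementation: the function names, the nearest-grid-point deposit, the 1-2-1 filter, the simplified field integration, and the fixed removal probability standing in for the MC collision operator are all assumptions chosen only to show the order of the phases.

```python
import random


def deposit(positions, n_cells, length):
    """Phase 1: particle-to-grid interpolation (nearest grid point, assumed)."""
    dx = length / n_cells
    density = [0.0] * n_cells
    for x in positions:
        density[int(x / dx) % n_cells] += 1.0 / dx
    return density


def smooth(density):
    """Phase 2: 1-2-1 binomial filter on a periodic grid (assumed filter)."""
    n = len(density)
    return [0.25 * density[i - 1] + 0.5 * density[i] + 0.25 * density[(i + 1) % n]
            for i in range(n)]


def solve_field(density, length):
    """Phase 3: toy field solve, integrating the density deviation from its mean."""
    n = len(density)
    dx = length / n
    mean = sum(density) / n
    field, accumulated = [], 0.0
    for rho in density:
        accumulated += (rho - mean) * dx
        field.append(accumulated)
    return field


def collide(positions, velocities, removal_probability, rng):
    """Phase 4: Monte Carlo step; each particle is removed with a fixed probability."""
    kept = [(x, v) for x, v in zip(positions, velocities)
            if rng.random() >= removal_probability]
    return [x for x, _ in kept], [v for _, v in kept]


def push(positions, velocities, field, length, dt, charge_to_mass):
    """Phase 5: advance velocities, then positions, with periodic wrapping."""
    n = len(field)
    dx = length / n
    new_x, new_v = [], []
    for x, v in zip(positions, velocities):
        v = v + charge_to_mass * field[int(x / dx) % n] * dt
        new_v.append(v)
        new_x.append((x + v * dt) % length)
    return new_x, new_v


def pic_mc_step(positions, velocities, rng, n_cells=16, length=1.0, dt=0.01,
                removal_probability=0.05, charge_to_mass=-1.0):
    """One cycle in the order given in the text; all defaults are illustrative."""
    density = smooth(deposit(positions, n_cells, length))
    field = solve_field(density, length)
    positions, velocities = collide(positions, velocities, removal_probability, rng)
    return push(positions, velocities, field, length, dt, charge_to_mass)
```

Calling `pic_mc_step` repeatedly with a seeded `random.Random` advances the toy system by one cycle per call; a production code would replace each phase with its own physics and parallel decomposition.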

BIT1 stands out as a tool of choice designed for accurately describing atomic processes and collisions during plasma-wall interactions. BIT1 is an electrostatic PIC code optimized for plasma edge modeling, with the key goal of enabling full-scale kinetic modeling of the plasma edge in next-generation fusion devices like ITER and DEMO. The input to BIT1 is a relatively small file (around 3 kB) that is read by all processes. The output analysis corresponds to two critical input parameters [11]:

•

mvflag: Represents a flag for activating and enabling time-dependent diagnostics of plasma profiles and particle angular, velocity, and energy distribution functions. If > 0, it specifies the number of time steps over which the time-dependent diagnostics are averaged.
•

mvStep: Counts the time steps for the interval between time-dependent diagnostics.

While the original version of BIT1 boasts robust serial I/O functionality, the need for parallel I/O capabilities has become apparent, particularly as simulations scale up in size and complexity. BIT1’s serial I/O encountered challenges beyond certain thresholds, becoming time-consuming and prone to file corruption, requiring the implementation of novel parallel I/O methods. Introducing new libraries and tools can enhance certain aspects of the code but may also introduce challenges in other areas. To address these issues and ensure the accuracy of output files while optimizing performance for extensive simulations, it’s imperative to implement novel parallel I/O methods in BIT1.

2.1 ADIOS Version 2

ADIOS2 (Adaptable Input/Output System 2) is a high-performance I/O library designed for managing data movement in scientific simulations and applications, offering flexibility and efficiency [4]. ADIOS2 supports many engines; BP4 (Binary Pack 4) is one of the supported data formats, optimized for performance and scalability, especially for large-scale runs. BP4 enables reduced storage requirements, faster read/write speeds, and compatibility with parallel I/O operations. When ADIOS2 is configured with the BP4 backend, the library uses the BP4 format for storing and retrieving data, a configuration particularly beneficial in scientific computing scenarios where performance and scalability are paramount.

2.2 openPMD Standard & openPMD-api Integration

The openPMD standard, short for "open Particle-Mesh Data," is a standardized format engineered for efficiently storing and exchanging data from scientific simulations involving particles and meshes [5]. Its streamlined format not only facilitates the storage and exchange of data but also ensures that vital metadata is retained, enabling effective interpretation and analysis [11]. Moreover, openPMD supports a wide array of backends for data storage, including popular formats like HDF5, ADIOS1, ADIOS2, and JSON. This adaptability empowers scientists to seamlessly integrate openPMD into their existing workflows, choosing the backend that aligns best with their specific needs and preferences.
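Backend selection in openPMD-api can be expressed through a JSON options string. The sketch below only builds such a string with the standard library; the key layout (`adios2.engine.type`, `adios2.engine.parameters`, `adios2.dataset.operators`) follows the openPMD-api backend configuration documentation as understood here, while the helper name and the example values are assumptions and are not taken from the BIT1 setup.

```python
import json


def bp4_backend_options(num_aggregators, compressor=None):
    """Build a JSON options string requesting the ADIOS2 BP4 engine.

    num_aggregators and compressor are illustrative knobs (assumed values);
    ADIOS2 engine parameters are passed as strings.
    """
    options = {
        "adios2": {
            "engine": {
                "type": "bp4",
                "parameters": {"NumAggregators": str(num_aggregators)},
            }
        }
    }
    if compressor is not None:
        # Optional compression operator applied to datasets, e.g. "blosc".
        options["adios2"]["dataset"] = {"operators": [{"type": compressor}]}
    return json.dumps(options)
```

The resulting string would typically be handed to the `Series` constructor as its options argument when the series is created; whether BIT1 configures its output this way is not stated in this work.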

3 Methodology & Experimental Setup

In this work, we aim to understand the impact of integrating openPMD with BIT1 and to determine its associated performance characteristics. To this end, we employ a suite of profiling and monitoring tools. Specifically, we utilize the following tools:

•

gprof is an open-source profiling tool that collects execution time data and identifies the functions most frequently used by the processor. Since each MPI process generates a separate gprof output, these individual profiling results are consolidated into a single report encompassing all statistics.
•

CrayPat & Cray Apprentice2 are powerful profiling tools for investigating and optimizing the performance of parallel applications on Cray architectures [1]. CrayPat is used to instrument the code, while Cray Apprentice2 enables interactive, graphical performance analysis and visualization, specifically for our strong scaling experiments.
•

IPM (Integrated Performance Monitoring) is a performance profiling tool that captures the computation and communication activities of parallel programs. It provides detailed reports on MPI calls and buffer sizes [3].
•

Darshan is a performance monitoring tool specifically designed for analyzing serial and parallel I/O workloads [8]. We assess the I/O performance of BIT1 in terms of write throughput by using Darshan logs to extract the achieved throughput and the amount of data stored in each file on the file system.
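The consolidation of per-process gprof results mentioned in the list above can be pictured as a sum over ranks followed by a normalization. gprof itself offers a `-s` option that sums several profile data files; the sketch below is a separate, hypothetical illustration that operates on already-parsed self times, with made-up function timings in the usage.

```python
from collections import defaultdict


def consolidate_profiles(per_rank_profiles):
    """Merge per-rank {function: self_seconds} mappings into one report.

    Returns (function, total_seconds, percent_of_total) rows, sorted by time.
    The input layout is an assumption, not gprof's native format.
    """
    totals = defaultdict(float)
    for profile in per_rank_profiles:
        for function, seconds in profile.items():
            totals[function] += seconds
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    report = [(function, seconds, 100.0 * seconds / grand_total)
              for function, seconds in totals.items()]
    report.sort(key=lambda row: row[1], reverse=True)
    return report
```

For example, `consolidate_profiles([{"arrj": 3.0, "move0": 1.0}, {"arrj": 5.0, "move0": 1.0}])` ranks `arrj` first because it accumulates the most self time across both hypothetical ranks.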

3.1 Use Case & Experimental Environment

We focus on exploring the impact of openPMD on enabling high-throughput parallel I/O in BIT1. Our simulations target neutral particle ionization resulting from interactions with electrons in upcoming magnetic confinement fusion devices such as ITER and DEMO. The scenario involves an unbounded unmagnetized plasma consisting of electrons, $D^{+}$ ions and $D$ neutrals. Due to ionization, the neutral concentration decreases with time according to $\partial n/\partial t=-nn_{e}R$, where $n$, $n_{e}$ and $R$ are the neutral particle density, the plasma density and the ionization rate coefficient, respectively. We use a one-dimensional geometry with 100K cells, three plasma species ($e$ electrons, $D^{+}$ ions and $D$ neutrals), and 10M particles per species (on average 100 particles per cell per species). The total number of particles in the system is 30M. Unless otherwise specified, we simulate up to 200K time steps. An important point of this test is that it does not use the field solver and smoother phases (shown in [10]).
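Since the neutral density decays at a rate proportional to $n n_{e} R$, a constant plasma density gives the closed form $n(t)=n_{0}\exp(-n_{e}Rt)$. The sketch below compares that expression with a simple explicit time integration. Holding $n_{e}$ fixed is an assumption made only for illustration (in the simulation the electron population grows as neutrals ionize), and the function names and nondimensional inputs are hypothetical.

```python
import math


def neutral_density(n0, n_e, rate, t):
    """Closed-form decay n(t) = n0 * exp(-n_e * R * t) for constant n_e (assumed)."""
    return n0 * math.exp(-n_e * rate * t)


def integrate_neutrals(n0, n_e, rate, t_end, steps):
    """Explicit Euler integration of dn/dt = -n * n_e * R over [0, t_end]."""
    dt = t_end / steps
    n = n0
    for _ in range(steps):
        n += dt * (-n * n_e * rate)
    return n
```

With a sufficiently small time step the two functions agree closely, which offers a quick sanity check when the total neutral count is monitored over time.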

We simulate and evaluate the impact of openPMD enabling high-throughput parallel I/O in BIT1 on the following three distinct systems:

•

Dardel, an HPE Cray EX supercomputer, has 1270 compute nodes, each with two AMD EPYC™ Zen2 2.25 GHz 64-core processors, 256 GB DRAM. Nodes are connected via HPE Slingshot network (200 GiB/s) with Dragonfly topology. Storage includes a 12 PB LFS with 48 OSTs. The OS is SUSE Linux Enterprise Server 15 SP3, with applications compiled using GCC 11.2, openPMD 0.15.2, ADIOS2 2.10.0 (Blosc and bzip2 enabled), and Cray MPICH 8.1.
•

Discoverer, a petascale EuroHPC supercomputer, has 1128 compute nodes, each with two AMD EPYC 7H12 64-Core processors, 256 GB DDR4 SDRAM (regular nodes), or 1TB DDR4 SDRAM (fat nodes). Nodes are connected via Ethernet Controller I350 (10 GiB/s) and Mellanox ConnectX-6 InfiniBand (200 GiB/s) with Dragonfly+ topology. Storage includes a 4.4 TB Network File System and a 2.1 PB Lustre File System (LFS) with 4 Object Storage Targets (OSTs). The OS is Red Hat Enterprise Linux 8.4, with applications compiled using GCC 11.4.0 and MPICH 4.1.2.
•

Vega, a petascale EuroHPC supercomputer, has 960 compute nodes, each with two AMD EPYC 7H12 64-Core processors, 256 GB DDR4 SDRAM (80% nodes), or 1TB DDR4 SDRAM (20% nodes). Nodes are connected via Mellanox ConnectX-6 InfiniBand HDR100 (500 GiB/s) with Dragonfly+ topology. Storage includes a 23 PB Ceph File System (CephFS) and a 1 PB LFS with 80 OSTs. The OS is Red Hat Enterprise Linux 8, with applications compiled using GCC 12.3.0 and OpenMPI 4.1.2.1.

3.2 BIT1 openPMD BP4 I/O Workflow

As outlined by Williams et al. [10, 12], BIT1 performs serial I/O operations throughout each simulation. Similar to the process described in [7], a workflow has been established using specific ADIOS2 engines along with the requisite output extensions (.bp, .bp4, and .bp5 respectively). For each extension, a distinct ADIOS2 file (or directory) is generated, containing one or multiple data files (data.0, data.1 ... data.N, data.N+1), a metadata file (md.0), an index table (md.idx), and, if enabled, a profiling file (profiling.json). The BIT1 I/O workflow using openPMD with the ADIOS2 BP4 engine employs the output directory data_file.bp4 [11].
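The directory layout described above can be checked with a few lines of standard-library Python. This is a minimal sketch that only looks for the file names listed in the text (data.N, md.0, md.idx, profiling.json); the helper name is hypothetical and the sketch does not parse any ADIOS2 content.

```python
from pathlib import Path


def describe_bp4_directory(path):
    """Summarize which of the expected BP4 files exist under the given directory."""
    root = Path(path)
    # Keep only data.<number> files and order them by their numeric suffix.
    data_files = sorted(
        (p.name for p in root.glob("data.*") if p.name.split(".", 1)[1].isdigit()),
        key=lambda name: int(name.split(".", 1)[1]),
    )
    return {
        "data_files": data_files,
        "has_metadata": (root / "md.0").is_file(),
        "has_index": (root / "md.idx").is_file(),
        "has_profiling": (root / "profiling.json").is_file(),
    }
```

Calling `describe_bp4_directory("data_file.bp4")` after a run would show how many data files were produced and whether the optional profiling file is present.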

4 Performance Results & Analysis

In this work, we investigate the impact of integrating openPMD with BIT1 using the ADIOS2 BP4 backend.

Figure 1: BIT1 Original File I/O Write Throughput, on Discoverer, Dardel and Vega CPU LFS, up to 100 Nodes (12,800 MPI processes), measured in GiB/s [11].

4.1 BIT1 openPMD BP4 Instrumentation & Monitoring

We begin by utilizing Darshan, a performance monitoring tool tailored for analyzing I/O workloads; we assess BIT1’s write throughput by extracting data from Darshan logs.

Fig. 1 displays the Write Throughput (GiB/s) across three unique CPU LFS supercomputers: Dardel, Discoverer, and Vega. As the number of nodes increases, we observe varied performance trends across each system. Discoverer’s performance shows fluctuations, with a slight initial increase followed by a decrease and then a minor increase again. Dardel exhibits generally increasing performance with the number of nodes. Notably, Dardel achieves the highest throughput among the three systems, reaching 0.74 GiB/s with 40 nodes. Vega’s performance also demonstrates an upward trajectory overall, although with some fluctuations, especially evident at higher node counts. Despite these differences, Dardel consistently outperforms both Discoverer and Vega CPUs, making it the most promising option for further work. Its superior performance, as seen in Fig. 2 where BIT1 openPMD + BP4 maintains stable throughput, indicates its suitability for tasks requiring high-throughput and efficiency. Therefore, it’s recommended to continue our investigation on the Dardel CPU LFS Supercomputer to capitalize on its outstanding performance characteristics.
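A write throughput figure in GiB/s can be derived from per-rank byte counts and I/O times. The sketch below uses one common convention, total bytes divided by the time of the slowest rank; whether the values reported here were computed exactly this way is an assumption, and the input layout is hypothetical rather than Darshan's log format.

```python
GIB = 2 ** 30  # bytes per GiB


def aggregate_write_throughput(per_rank):
    """Estimate aggregate write throughput in GiB/s.

    per_rank: iterable of (bytes_written, seconds_spent_in_io), one per MPI rank.
    The slowest rank bounds the elapsed time of a collective write phase.
    """
    per_rank = list(per_rank)
    total_bytes = sum(nbytes for nbytes, _ in per_rank)
    slowest = max(seconds for _, seconds in per_rank)
    return total_bytes / slowest / GIB
```

Dividing by the slowest rank's time rather than the mean keeps the estimate conservative when I/O time is unevenly distributed across ranks.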

Next, we utilized gprof, an open-source profiling tool, to analyze execution time and identify the most frequently used functions across MPI processes. The consolidated gprof report offers a detailed performance analysis of BIT1 with and without openPMD.

Fig. 3 compares the performance of different operations between "BIT1 openPMD BP4" and "Original BIT1" configurations, based on the percentage of total time spent on each operation. In the Original BIT1 configuration, the "arrj" function consumes 75.5% of the total time. With BIT1 openPMD BP4, this drops to 65.5%, a reduction of 10 percentage points, likely due to better data management and optimized processing.

Other notable observations include a slight increase in time spent on "move0" from 18% to 20%, possibly due to overhead from the new implementation. A significant decrease in "rempar2" from 14% to 7.7% suggests improved parallelization or data partitioning. There is a slight increase in "nmove" from 9.6% to 10.8%, which might be a trade-off for other improvements. Increases in "avq_mpi" and "accum_mpi" from 4% to 9.8% and 4% to 10.4%, respectively, indicate enhanced MPI operations due to better communication and data exchange mechanisms. Despite slight increases in some operations, the overall time distribution indicates a more balanced and optimized use of computational resources.
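Because the gprof figures are shares of total time, a change can be read either in percentage points or relative to the original share. The short sketch below computes both for the values quoted above; the helper name is hypothetical and the numbers are copied from the text.

```python
# Shares of total time (percent) quoted in the text: (Original BIT1, BIT1 openPMD BP4).
GPROF_SHARES = {
    "arrj": (75.5, 65.5),
    "move0": (18.0, 20.0),
    "rempar2": (14.0, 7.7),
    "nmove": (9.6, 10.8),
    "avq_mpi": (4.0, 9.8),
    "accum_mpi": (4.0, 10.4),
}


def share_change(before, after):
    """Return (change in percentage points, change relative to 'before' in percent)."""
    points = after - before
    relative = 100.0 * points / before
    return points, relative


def summarize(shares=GPROF_SHARES):
    """Map each function name to its (percentage-point, relative) change."""
    return {name: share_change(before, after)
            for name, (before, after) in shares.items()}
```

For "arrj", the drop of 10 percentage points corresponds to a relative reduction of roughly 13% of its original share, which is why the two readings should not be used interchangeably.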

In addition to gprof, CrayPat & Cray Apprentice2 were employed to investigate the performance of the BIT1 openPMD BP4 application across various scales, ranging from small-scale runs on a single node to large-scale runs on up to 100 nodes. CrayPat facilitated code instrumentation, while Cray Apprentice2 supported interactive, graphical performance analysis, and visualization, particularly for strong scaling experiments.

Fig. 4 displays the performance of BIT1 on small to large scale runs, specifically focusing on the impact of the openPMD BP4 backend. For small-scale runs, CrayPat and Cray Apprentice2 provided a detailed breakdown of function calls with significant exclusive sample hits, averaged across ranks. Notably, functions such as “arrj” (19.8%), “adios2::AggregateCollectiveMetadata” (11.9%), “MPI_Wait” (25.5%), “adios2::helper::CommReqImplMPI::Wait” (5.5%), and “avq_0” (10.2%) exhibited substantial percentages of sample hits, revealing their impact on the overall performance of BIT1 openPMD BP4 runs. Additionally, “move0” (9.7%) and “nmove” (5.2%) were significant contributors to execution time. The remaining functions collectively accounted for 2.8% of the sample hits.

Upon scaling up to 100 nodes, a different performance profile emerged. Despite the reduced total sample hits compared to small-scale runs, functions like “arrj” (40%), “avq_0” (21.3%), “MPI_Wait” (12.9%), “move0” (9.3%), and “nmove” (4.5%) remained prominent contributors to the overall execution time. Notably, “adios2::AggregateCollectiveMetadata” (3.4%) demonstrated a decrease in its impact. Interestingly, the “MPI_Wait” function decreased from 25.5% on a single node to 12.9% on 100 nodes, which differs from the traditional expectation of increased MPI communication with node count. This decrease in MPI communication is attributed to the utilization of openPMD with the ADIOS2 BP4 backend and its aggregation capabilities, optimizing MPI communication, and enhancing overall performance compared to the original BIT1, as presented by Williams et al. [10].

Based on the CrayPat & Cray Apprentice2 results, we further study overall MPI communication and load balancing in BIT1 openPMD BP4 simulations to investigate if there is a trade-off effect for this enhancement in the “MPI_Wait” function. Fig. 5 displays the MPI aggregated communication time for the BIT1 openPMD BP4 simulation on 100 nodes for a total of 12,800 cores. “MPI_Gatherv” (67.65%) dominates the communication time, indicating a need to optimize data gathering processes. Significant time is also spent in “MPI_Recv” (19.04%) and “MPI_Wait” (5.76%), indicating potential inefficiencies in message handling and synchronization on large runs. In Fig. 5, we also show the amount of memory consumed per node. There is a balanced use of compute nodes: the largest usage of memory per node is approximately 33 GB while the smallest is approximately 29 GB.
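The figures in this paragraph allow two quick derived quantities: the share of communication time left for all other MPI calls, and the relative spread of memory use across nodes. The sketch below computes both from the numbers quoted above; the helper names are hypothetical.

```python
# Shares of aggregated MPI communication time (percent) quoted in the text.
MPI_SHARES = {"MPI_Gatherv": 67.65, "MPI_Recv": 19.04, "MPI_Wait": 5.76}


def remaining_share(shares=MPI_SHARES):
    """Percent of communication time not covered by the listed MPI calls."""
    return 100.0 - sum(shares.values())


def relative_spread(largest, smallest):
    """Spread between largest and smallest usage, as a fraction of the largest."""
    return (largest - smallest) / largest
```

With the quoted values, the three listed calls cover all but about 7.55% of the communication time, and `relative_spread(33.0, 29.0)` gives a memory spread of about 12% between the most and least loaded nodes.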

4.2 BIT1 openPMD BP4 In-Situ Analysis

In-situ analysis facilitates real-time assessment of data directly within the environment where it is generated, without the need for extensive data transfers or storage. BIT1 can operate with minimal diagnostics, tracking the total particle number over time. Depending on input parameters, it can additionally log particle and power fluxes to the wall with minimal computational overhead. It also supports periodic system state saving for checkpointing and restoration.

Fig. 6 shows a real-time checkpoint analysis of BIT1 File I/O stored in the output directory, data_file.bp4, facilitated by a customized Python script utilizing the openPMD API and the ADIOS2 BP4 backend for 12,800 MPI Processes. Key visualizations include profiles of electric potential, plasma species densities, and temperatures, providing insights into plasma sheath presence and self-consistent electric field utilization.

Additionally, workload distribution among MPI ranks is visualized to ensure balanced computation load. Assessment of simulation steady-state is conducted through the time evolution of total particles of each species, aiding in understanding plasma sources and sinks dynamics. In the ionization case, the increase in electron and ion numbers due to the ionization of neutrals leads to a decrease in neutral particles over time. Using in-situ analysis offers immediate insights into system states, enabling efficient adjustments to simulation parameters. This real-time capability ensures the analysis occurs promptly at every checkpoint without causing any interruption to the simulation.
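The checkpoint access described in this section can be pictured with a small helper that reads one mesh component from the most recent iteration of an opened series. This is a sketch under stated assumptions, not the customized script used in this work: the helper name and the mesh name in the usage comment are hypothetical, and the function is written against the access pattern of the openPMD-api Python bindings (`series.iterations`, `meshes[...]`, `load_chunk()`, `flush()`) without importing the library itself.

```python
def latest_checkpoint_profile(series, mesh_name, component):
    """Return (iteration index, min, mean, max) for one mesh component.

    'series' is expected to behave like an openPMD-api Series opened read-only;
    iterating over series.iterations is assumed to yield integer indices.
    """
    latest = max(series.iterations)
    iteration = series.iterations[latest]
    chunk = iteration.meshes[mesh_name][component].load_chunk()
    series.flush()  # loaded data should only be read after the flush
    values = [float(v) for v in chunk]  # one-dimensional profile assumed
    return latest, min(values), sum(values) / len(values), max(values)


# Hypothetical usage with openPMD-api installed (mesh name is an assumption):
#   import openpmd_api as io
#   series = io.Series("data_file.bp4", io.Access.read_only)
#   print(latest_checkpoint_profile(series, "potential",
#                                   io.Mesh_Record_Component.SCALAR))
```

Because the helper only reads from the output directory, it runs as a separate process from the simulation; the returned values could then be passed to any plotting library to produce profiles like those described above.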

5 Related Work

PIC codes, crucial in simulating plasma physics and related fields, are increasingly undergoing performance and scalability enhancements. One approach involves integrating ADIOS2 and openPMD. ADIOS2, facilitating seamless data movement across various channels like files, networks, and direct memory, addresses the challenge of managing substantial data volumes generated by parallel simulations [4]. openPMD, an open-source initiative [5], aims to standardize particle and mesh data file formats for diverse simulations, fostering interoperability and simplifying data analysis and visualization. Understanding openPMD’s impact on BIT1, a PIC MC code, requires utilizing instrumentation, monitoring, and in-situ analysis techniques. Previous research highlights the significance of leveraging HPC profiling and tracing tools to analyze simulation performance, covering single-node, multiple-node, and I/O aspects [10]. Additionally, optimizing BIT1 through OpenMP/OpenACC and GPU acceleration enhances its capabilities [12]. Williams et al. [9] demonstrated the importance of optimizing iPIC3D for large-scale 3D plasma simulations, offering practical recommendations to enhance performance and address Geospace Environmental Modeling magnetic reconnection challenges. Faj et al. [2] analyzed Vlasiator’s MPI performance, highlighting MPI nonblocking communication’s dominance in communication time and advocating for OpenMP to eliminate intra-node communication, crucial for optimizing Vlasiator for Exascale machines. Poeschel et al. emphasized the openPMD-api as a valuable tool for describing scientific data according to the openPMD standard [6, 7]. These studies collectively emphasize the importance of utilizing instrumentation, monitoring, and in-situ analysis capabilities to identify key areas for optimization, enhancement, and enablement in PIC MC codes like BIT1.

6 Discussion and Conclusion

The integration of openPMD with BIT1 marks a significant leap in plasma dynamics simulations for fusion devices. These advancements pave the way for further optimizations, boosting the accuracy and efficiency of models for plasma-material interfaces. Comprehensive instrumentation, monitoring, and visualization have provided valuable insights into BIT1’s performance, particularly in computation, communication, and I/O operations.

This work emphasizes openPMD’s central role in facilitating high-throughput parallel I/O in BIT1, which is essential for handling vast data from plasma simulations. Profiling tools like gprof, CrayPat, Cray Apprentice2, IPM, and Darshan have helped identify areas for improvement. Comparing the original BIT1 setup with openPMD BP4 integration reveals significant efficiency gains, especially in data management and processing. The integration of the ADIOS2 BP4 backend with openPMD notably reduces write throughput bottlenecks and enhances performance across various computational platforms. Analyzing MPI communication and load balancing in BIT1 openPMD BP4 simulations offers insights into optimization strategies, particularly in data gathering and message handling. Visualization has played a pivotal role, providing intuitive representations of data flow and system interactions. In-situ analysis of electric potential profiles and workload distribution among MPI ranks has provided valuable real-time understanding of plasma dynamics and simulation behavior.

Future research will investigate the decrease in MPI communication when parallel I/O is considered. It can also explore integrating high-performance streaming using the Sustainable Staging Transport (SST) backend, extending real-time checkpoint analysis capabilities in memory rather than file I/O. Improvements in checkpoint restart and load balancing can enhance efficiency, resilience, and fault tolerance. Further integration can enable in-situ visualization with ParaView Catalyst 2 and ADIOS2 for efficient data transfer and non-blocking visualization capabilities, contributing to achieving exascale computing and addressing BIT1’s grand challenge of controlling plasma-material interfaces.

Acknowledgments. Funded by the European Union. This work has received funding from the European High Performance Computing Joint Undertaking (JU) and Sweden, Finland, Germany, Greece, France, Slovenia, Spain, and Czech Republic under grant agreement No 101093261. The computations/data handling were/was enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council through grant agreement no. 2022-06725.

References

[1] Budiardja, R., et al.: Using CAASCADE and CrayPat for analysis of HPC applications. Tech. rep., Oak Ridge National Lab. (ORNL), Oak Ridge, TN (USA) (2018)
[2] Faj, J., et al.: MPI performance analysis in Vlasiator: Unraveling communication bottlenecks. In: SC23: The International Conference for High Performance Computing, Networking, Storage, and Analysis (2023)
[3] Fuerlinger, K., et al.: Effective performance measurement at petascale using IPM. In: 2010 IEEE 16th International Conference on Parallel and Distributed Systems. pp. 373–380. IEEE (2010)
[4] Godoy, W.F., et al.: ADIOS 2: The adaptable input output system. A framework for high-performance data management. SoftwareX 12, 100561 (2020)
[5] Huebl, A., et al.: openPMD: A meta data standard for particle and mesh based data. https://github.com/openPMD (2015). https://doi.org/10.5281/zenodo.591699, https://www.openPMD.org
[6] Huebl, A., et al.: openPMD-api: C++ & Python API for Scientific I/O with openPMD (2018). https://doi.org/10.14278/rodare.27, https://github.com/openPMD/openPMD-api
[7] Poeschel, F., et al.: Transitioning from file-based HPC workflows to streaming data pipelines with openPMD and ADIOS2. In: Smoky Mountains Computational Sciences and Engineering Conference. pp. 99–118. Springer (2021)
[8] Snyder, S., et al.: Modular HPC I/O characterization with Darshan. In: 2016 5th Workshop on Extreme-Scale Programming Tools (ESPT). pp. 9–17. IEEE (2016)
[9] Williams, J.J., et al.: Characterizing the performance of the implicit massively parallel Particle-in-Cell iPIC3D code. In: SC23: The International Conference for High Performance Computing, Networking, Storage, and Analysis (2023)
[10] Williams, J.J., et al.: Leveraging HPC profiling and tracing tools to understand the performance of Particle-in-Cell Monte Carlo simulations. In: European Conference on Parallel Processing. pp. 123–134. Springer (2023)
[11] Williams, J.J., et al.: Enabling high-throughput parallel I/O in PIC MC simulations using openPMD. Manuscript submitted for publication (2024)
[12] Williams, J.J., et al.: Optimizing BIT1, a Particle-in-Cell Monte Carlo code, with OpenMP/OpenACC and GPU acceleration. arXiv preprint arXiv:2404.10270 (2024)