subscribe to arXiv mailings

Understanding Large-Scale Plasma Simulation Challenges for Fusion Energy on Supercomputers

Authors: Jeremy J. Williams, Ashish Bhole, Dylan Kierans, Matthias Hoelzl, Ihor Holod, Weikang Tang, David Tskhakaya, Stefan Costea, Leon Kos, Ales Podolnik, Jakub Hromadka, JOREK Team, Erwin Laure, Stefano Markidis

Abstract: Understanding plasma instabilities is essential for achieving sustainable fusion energy, with large-scale plasma simulations playing a crucial role in both the design and development of next-generation fusion energy devices and the modelling of industrial plasmas. To achieve sustainable fusion energy, it is essential to accurately model and predict plasma behavior under extreme conditions, requiri… ▽ More Understanding plasma instabilities is essential for achieving sustainable fusion energy, with large-scale plasma simulations playing a crucial role in both the design and development of next-generation fusion energy devices and the modelling of industrial plasmas. To achieve sustainable fusion energy, it is essential to accurately model and predict plasma behavior under extreme conditions, requiring sophisticated simulation codes capable of capturing the complex interaction between plasma dynamics, magnetic fields, and material surfaces. In this work, we conduct a comprehensive HPC analysis of two prominent plasma simulation codes, BIT1 and JOREK, to advance understanding of plasma behavior in fusion energy applications. Our focus is on evaluating JOREK's computational efficiency and scalability for simulating non-linear MHD phenomena in tokamak fusion devices. The motivation behind this work stems from the urgent need to advance our understanding of plasma instabilities in magnetically confined fusion devices. Enhancing JOREK's performance on supercomputers improves fusion plasma code predictability, enabling more accurate modelling and faster optimization of fusion designs, thereby contributing to sustainable fusion energy. In prior studies, we analysed BIT1, a massively parallel Particle-in-Cell (PIC) code for studying plasma-material interactions in fusion devices. Our investigations into BIT1's computational requirements and scalability on advanced supercomputing architectures yielded valuable insights. Through detailed profiling and performance analysis, we have identified the primary bottlenecks and implemented optimization strategies, significantly enhancing parallel performance. This previous work serves as a foundation for our present endeavours. △ Less

Submitted 29 June, 2024; originally announced July 2024.

Comments: Accepted by EPS PLASMA 2024 (50th European Physical Society Conference on Plasma Physics), prepared in the standardized EPS conference proceedings format and consists of 4 pages, which includes the main text, references, and figures

arXiv:2406.19058 [pdf, other]

Understanding the Impact of openPMD on BIT1, a Particle-in-Cell Monte Carlo Code, through Instrumentation, Monitoring, and In-Situ Analysis

Authors: Jeremy J. Williams, Stefan Costea, Allen D. Malony, David Tskhakaya, Leon Kos, Ales Podolnik, Jakub Hromadka, Kevin Huck, Erwin Laure, Stefano Markidis

Abstract: Particle-in-Cell Monte Carlo simulations on large-scale systems play a fundamental role in understanding the complexities of plasma dynamics in fusion devices. Efficient handling and analysis of vast datasets are essential for advancing these simulations. Previously, we addressed this challenge by integrating openPMD with BIT1, a Particle-in-Cell Monte Carlo code, streamlining data streaming and s… ▽ More Particle-in-Cell Monte Carlo simulations on large-scale systems play a fundamental role in understanding the complexities of plasma dynamics in fusion devices. Efficient handling and analysis of vast datasets are essential for advancing these simulations. Previously, we addressed this challenge by integrating openPMD with BIT1, a Particle-in-Cell Monte Carlo code, streamlining data streaming and storage. This integration not only enhanced data management but also improved write throughput and storage efficiency. In this work, we delve deeper into the impact of BIT1 openPMD BP4 instrumentation, monitoring, and in-situ analysis. Utilizing cutting-edge profiling and monitoring tools such as gprof, CrayPat, Cray Apprentice2, IPM, and Darshan, we dissect BIT1's performance post-integration, shedding light on computation, communication, and I/O operations. Fine-grained instrumentation offers insights into BIT1's runtime behavior, while immediate monitoring aids in understanding system dynamics and resource utilization patterns, facilitating proactive performance optimization. Advanced visualization techniques further enrich our understanding, enabling the optimization of BIT1 simulation workflows aimed at controlling plasma-material interfaces with improved data analysis and visualization at every checkpoint without causing any interruption to the simulation. △ Less

Submitted 27 June, 2024; originally announced June 2024.

Comments: Accepted by the Euro-Par 2024 workshops (PHYSHPC 2024), prepared in the standardized Springer LNCS format and consists of 12 pages, which includes the main text, references, and figures

arXiv:2406.14297 [pdf, other]

AI in Space for Scientific Missions: Strategies for Minimizing Neural-Network Model Upload

Authors: Jonah Ekelund, Ricardo Vinuesa, Yuri Khotyaintsev, Pierre Henri, Gian Luca Delzanno, Stefano Markidis

Abstract: Artificial Intelligence (AI) has the potential to revolutionize space exploration by delegating several spacecraft decisions to an onboard AI instead of relying on ground control and predefined procedures. It is likely that there will be an AI/ML Processing Unit onboard the spacecraft running an inference engine. The neural-network will have pre-installed parameters that can be updated onboard by… ▽ More Artificial Intelligence (AI) has the potential to revolutionize space exploration by delegating several spacecraft decisions to an onboard AI instead of relying on ground control and predefined procedures. It is likely that there will be an AI/ML Processing Unit onboard the spacecraft running an inference engine. The neural-network will have pre-installed parameters that can be updated onboard by uploading, by telecommands, parameters obtained by training on the ground. However, satellite uplinks have limited bandwidth and transmissions can be costly. Furthermore, a mission operating with a suboptimal neural network will miss out on valuable scientific data. Smaller networks can thereby decrease the uplink cost, while increasing the value of the scientific data that is downloaded. In this work, we evaluate and discuss the use of reduced-precision and bare-minimum neural networks to reduce the time for upload. As an example of an AI use case, we focus on the NASA's Magnetosperic MultiScale (MMS) mission. We show how an AI onboard could be used in the Earth's magnetosphere to classify data to selectively downlink higher value data or to recognize a region-of-interest to trigger a burst-mode, collecting data at a high-rate. Using a simple filtering scheme and algorithm, we show how the start and end of a region-of-interest can be detected in on a stream of classifications. To provide the classifications, we use an established Convolutional Neural Network (CNN) trained to an accuracy >94%. We also show how the network can be reduced to a single linear layer and trained to the same accuracy as the established CNN. Thereby, reducing the overall size of the model by up to 98.9%. We further show how each network can be reduced by up to 75% of its original size, by using lower-precision formats to represent the network parameters, with a change in accuracy of less than 0.6 percentage points. △ Less

Submitted 20 June, 2024; originally announced June 2024.

arXiv:2405.07222 [pdf, other]

What is Quantum Parallelism, Anyhow?

Authors: Stefano Markidis

Abstract: Central to the power of quantum computing is the concept of quantum parallelism: quantum systems can explore and process multiple computational paths simultaneously. In this paper, we discuss the elusive nature of quantum parallelism, drawing parallels with classical parallel computing models to elucidate its fundamental characteristics and implications for algorithmic performance. We begin by def… ▽ More Central to the power of quantum computing is the concept of quantum parallelism: quantum systems can explore and process multiple computational paths simultaneously. In this paper, we discuss the elusive nature of quantum parallelism, drawing parallels with classical parallel computing models to elucidate its fundamental characteristics and implications for algorithmic performance. We begin by defining quantum parallelism as arising from the superposition of quantum states, allowing for the exploration of multiple computational paths in parallel. To quantify and visualize quantum parallelism, we introduce the concept of quantum dataflow diagrams, which provide a graphical representation of quantum algorithms and their parallel execution paths. We demonstrate how quantum parallelism can be measured and assessed by analyzing quantum algorithms such as the Quantum Fourier Transform (QFT) and Amplitude Amplification (AA) iterations using quantum dataflow diagrams. Furthermore, we examine the interplay between quantum parallelism and classical parallelism laws, including Amdahl's and Gustafson's laws. While these laws were originally formulated for classical parallel computing systems, we reconsider their applicability in the quantum computing domain. We argue that while classical parallelism laws offer valuable insights, their direct application to quantum computing is limited due to the unique characteristics of quantum parallelism, including the role of destructive interference and the inherent limitations of classical-quantum I/O. Our analysis highlights the need for an increased understanding of quantum parallelism and its implications for algorithm design and performance. △ Less

Submitted 12 May, 2024; originally announced May 2024.

Comments: Accepted at ISC HPC 2024 conference

arXiv:2405.05640 [pdf, other]

Experience and Analysis of Scalable High-Fidelity Computational Fluid Dynamics on Modular Supercomputing Architectures

Authors: Martin Karp, Estela Suarez, Jan H. Meinke, Måns I. Andersson, Philipp Schlatter, Stefano Markidis, Niclas Jansson

Abstract: The never-ending computational demand from simulations of turbulence makes computational fluid dynamics (CFD) a prime application use case for current and future exascale systems. High-order finite element methods, such as the spectral element method, have been gaining traction as they offer high performance on both multicore CPUs and modern GPU-based accelerators. In this work, we assess how high… ▽ More The never-ending computational demand from simulations of turbulence makes computational fluid dynamics (CFD) a prime application use case for current and future exascale systems. High-order finite element methods, such as the spectral element method, have been gaining traction as they offer high performance on both multicore CPUs and modern GPU-based accelerators. In this work, we assess how high-fidelity CFD using the spectral element method can exploit the modular supercomputing architecture at scale through domain partitioning, where the computational domain is split between a Booster module powered by GPUs and a Cluster module with conventional CPU nodes. We investigate several different flow cases and computer systems based on the modular supercomputing architecture (MSA). We observe that for our simulations, the communication overhead and load balancing issues incurred by incorporating different computing architectures are seldom worthwhile, especially when I/O is also considered, but when the simulation at hand requires more than the combined global memory on the GPUs, utilizing additional CPUs to increase the available memory can be fruitful. We support our results with a simple performance model to assess when running across modules might be beneficial. As MSA is becoming more widespread and efforts to increase system utilization are growing more important our results give insight into when and how a monolithic application can utilize and spread out to more than one module and obtain a faster time to solution. △ Less

Submitted 9 May, 2024; originally announced May 2024.

Comments: 13 pages, 5 figures, 3 tables, preprint

ACM Class: J.2; C.1.4; G.4

arXiv:2405.05639 [pdf, other]

Supercomputers as a Continous Medium

Authors: Martin Karp, Niclas Jansson, Philipp Schlatter, Stefano Markidis

Abstract: As supercomputers' complexity has grown, the traditional boundaries between processor, memory, network, and accelerators have blurred, making a homogeneous computer model, in which the overall computer system is modeled as a continuous medium with homogeneously distributed computational power, memory, and data movement transfer capabilities, an intriguing and powerful abstraction. By applying a ho… ▽ More As supercomputers' complexity has grown, the traditional boundaries between processor, memory, network, and accelerators have blurred, making a homogeneous computer model, in which the overall computer system is modeled as a continuous medium with homogeneously distributed computational power, memory, and data movement transfer capabilities, an intriguing and powerful abstraction. By applying a homogeneous computer model to algorithms with a given I/O complexity, we recover from first principles, other discrete computer models, such as the roofline model, parallel computing laws, such as Amdahl's and Gustafson's laws, and phenomenological observations, such as super-linear speedup. One of the homogeneous computer model's distinctive advantages is the capability of directly linking the performance limits of an application to the physical properties of a classical computer system. Applying the homogeneous computer model to supercomputers, such as Frontier, Fugaku, and the Nvidia DGX GH200, shows that applications, such as Conjugate Gradient (CG) and Fast Fourier Transforms (FFT), are rapidly approaching the fundamental classical computational limits, where the performance of even denser systems in terms of compute and memory are fundamentally limited by the speed of light. △ Less

Submitted 9 May, 2024; originally announced May 2024.

Comments: 10 pages, 8 figures, 3 tables

ACM Class: F.1; F.2; I.6

arXiv:2405.04161 [pdf, other]

Opportunities for machine learning in scientific discovery

Authors: Ricardo Vinuesa, Jean Rabault, Hossein Azizpour, Stefan Bauer, Bingni W. Brunton, Arne Elofsson, Elias Jarlebring, Hedvig Kjellstrom, Stefano Markidis, David Marlevi, Paola Cinnella, Steven L. Brunton

Abstract: Technological advancements have substantially increased computational power and data availability, enabling the application of powerful machine-learning (ML) techniques across various fields. However, our ability to leverage ML methods for scientific discovery, {\it i.e.} to obtain fundamental and formalized knowledge about natural processes, is still in its infancy. In this review, we explore how… ▽ More Technological advancements have substantially increased computational power and data availability, enabling the application of powerful machine-learning (ML) techniques across various fields. However, our ability to leverage ML methods for scientific discovery, {\it i.e.} to obtain fundamental and formalized knowledge about natural processes, is still in its infancy. In this review, we explore how the scientific community can increasingly leverage ML techniques to achieve scientific discoveries. We observe that the applicability and opportunity of ML depends strongly on the nature of the problem domain, and whether we have full ({\it e.g.}, turbulence), partial ({\it e.g.}, computational biochemistry), or no ({\it e.g.}, neuroscience) {\it a-priori} knowledge about the governing equations and physical properties of the system. Although challenges remain, principled use of ML is opening up new avenues for fundamental scientific discoveries. Throughout these diverse fields, there is a theme that ML is enabling researchers to embrace complexity in observational data that was previously intractable to classic analysis and numerical investigations. △ Less

Submitted 7 May, 2024; originally announced May 2024.

arXiv:2404.10270 [pdf, other]

doi 10.1007/978-3-031-63749-0_22

Optimizing BIT1, a Particle-in-Cell Monte Carlo Code, with OpenMP/OpenACC and GPU Acceleration

Authors: Jeremy J. Williams, Felix Liu, David Tskhakaya, Stefan Costea, Ales Podolnik, Stefano Markidis

Abstract: On the path toward developing the first fusion energy devices, plasma simulations have become indispensable tools for supporting the design and development of fusion machines. Among these critical simulation tools, BIT1 is an advanced Particle-in-Cell code with Monte Carlo collisions, specifically designed for modeling plasma-material interaction and, in particular, analyzing the power load distri… ▽ More On the path toward developing the first fusion energy devices, plasma simulations have become indispensable tools for supporting the design and development of fusion machines. Among these critical simulation tools, BIT1 is an advanced Particle-in-Cell code with Monte Carlo collisions, specifically designed for modeling plasma-material interaction and, in particular, analyzing the power load distribution on tokamak divertors. The current implementation of BIT1 relies exclusively on MPI for parallel communication and lacks support for GPUs. In this work, we address these limitations by designing and implementing a hybrid, shared-memory version of BIT1 capable of utilizing GPUs. For shared-memory parallelization, we rely on OpenMP and OpenACC, using a task-based approach to mitigate load-imbalance issues in the particle mover. On an HPE Cray EX computing node, we observe an initial performance improvement of approximately 42%, with scalable performance showing an enhancement of about 38% when using 8 MPI ranks. Still relying on OpenMP and OpenACC, we introduce the first version of BIT1 capable of using GPUs. We investigate two different data movement strategies: unified memory and explicit data movement. Overall, we report BIT1 data transfer findings during each PIC cycle. Among BIT1 GPU implementations, we demonstrate performance improvement through concurrent GPU utilization, especially when MPI ranks are assigned to dedicated GPUs. Finally, we analyze the performance of the first BIT1 GPU porting with the NVIDIA Nsight tools to further our understanding of BIT1 computational efficiency for large-scale plasma simulations, capable of exploiting current supercomputer infrastructures. △ Less

Submitted 15 April, 2024; originally announced April 2024.

Comments: Accepted by ICCS 2024 (The 24th International Conference on Computational Science), prepared in English, formatted according to the Springer LNCS templates and consists of 15 pages, which includes the main text, references, and figures

arXiv:2401.15661 [pdf, other]

Brain-Inspired Physics-Informed Neural Networks: Bare-Minimum Neural Architectures for PDE Solvers

Authors: Stefano Markidis

Abstract: Physics-Informed Neural Networks (PINNs) have emerged as a powerful tool for solving partial differential equations~(PDEs) in various scientific and engineering domains. However, traditional PINN architectures typically rely on large, fully connected multilayer perceptrons~(MLPs), lacking the sparsity and modularity inherent in many traditional numerical solvers. An unsolved and critical question… ▽ More Physics-Informed Neural Networks (PINNs) have emerged as a powerful tool for solving partial differential equations~(PDEs) in various scientific and engineering domains. However, traditional PINN architectures typically rely on large, fully connected multilayer perceptrons~(MLPs), lacking the sparsity and modularity inherent in many traditional numerical solvers. An unsolved and critical question for PINN is: What is the minimum PINN complexity regarding nodes, layers, and connections needed to provide acceptable performance? To address this question, this study investigates a novel approach by merging established PINN methodologies with brain-inspired neural network techniques. We use Brain-Inspired Modular Training~(BIMT), leveraging concepts such as locality, sparsity, and modularity inspired by the organization of the brain. With brain-inspired PINN, we demonstrate the evolution of PINN architectures from large, fully connected structures to bare-minimum, compact MLP architectures, often consisting of a few neural units! Moreover, using brain-inspired PINN, we showcase the spectral bias phenomenon occurring on the PINN architectures: bare-minimum architectures solving problems with high-frequency components require more neural units than PINN solving low-frequency problems. Finally, we derive basic PINN building blocks through BIMT training on simple problems akin to convolutional and attention modules in deep neural networks, enabling the construction of modular PINN architectures. Our experiments show that brain-inspired PINN training leads to PINN architectures that minimize the computing and memory resources yet provide accurate results. △ Less

Submitted 19 April, 2024; v1 submitted 28 January, 2024; originally announced January 2024.

Comments: Accepted at the 24th International Conference on Computational Science (ICCS)

arXiv:2401.14576 [pdf]

Accelerating Scientific Application through Transparent I/O Interposition

Authors: Steven W. D. Chien, Kento Sato, Artur Podobas, Niclas Jansson, Stefano Markidis, Michio Honda

Abstract: The ability to handle a large volume of data generated by scientific applications is crucial. We have seen an increase in the heterogeneity of storage technologies available to scientific applications, such as burst buffers, local temporary block storage, managed cloud parallel file systems (PFS), and non-POSIX object stores. However, scientific applications designed for traditional HPC systems ca… ▽ More The ability to handle a large volume of data generated by scientific applications is crucial. We have seen an increase in the heterogeneity of storage technologies available to scientific applications, such as burst buffers, local temporary block storage, managed cloud parallel file systems (PFS), and non-POSIX object stores. However, scientific applications designed for traditional HPC systems can not easily exploit those storage systems due to cost, throughput, and programming model challenges. We present iFast, a new library-level approach to transparently accelerating scientific applications based on MPI-IO. It decouples application I/O, data caching, and data storage to support heterogeneous storage models. Design decisions of iFast are based on a strong emphasis on deployability. It is highly general with only MPI as a core dependency, allowing users to run unmodified MPI-based applications with unmodified MPI implementations - even proprietary ones like IntelMPI and Cray MPICH. Our approach supports a wide range of networked storage, including traditional PFS, ordinary NFS, and S3-based cloud storage. Unlike previous approaches, iFast ensures crash consistency even across compute nodes. We demonstrate iFast in cloud HPC platform, small local cluster, and hybrid of both to show its generality. Our results show that iFast reduces end-to-end execution time by 13-26% for three popular scientific applications on the cloud. It also outperforms the state-of-the-art system, SymphonyFS, a filesystem-based approach for similar goals but without crash consistency, by 12-23%. △ Less

Submitted 25 January, 2024; originally announced January 2024.

Comments: Submitted to HPDC 2024

arXiv:2308.00763 [pdf, other]

Boosting the Performance of Object Tracking with a Half-Precision Particle Filter on GPU

Authors: Gabin Schieffer, Nattawat Pornthisan, Daniel Araújo de Medeiros, Stefano Markidis, Jacob Wahlgren, Ivy Peng

Abstract: High-performance GPU-accelerated particle filter methods are critical for object detection applications, ranging from autonomous driving, robot localization, to time-series prediction. In this work, we investigate the design, development and optimization of particle-filter using half-precision on CUDA cores and compare their performance and accuracy with single- and double-precision baselines on N… ▽ More High-performance GPU-accelerated particle filter methods are critical for object detection applications, ranging from autonomous driving, robot localization, to time-series prediction. In this work, we investigate the design, development and optimization of particle-filter using half-precision on CUDA cores and compare their performance and accuracy with single- and double-precision baselines on Nvidia V100, A100, A40 and T4 GPUs. To mitigate numerical instability and precision losses, we introduce algorithmic changes in the particle filters. Using half-precision leads to a performance improvement of 1.5-2x and 2.5-4.6x with respect to single- and double-precision baselines respectively, at the cost of a relatively small loss of accuracy. △ Less

Submitted 1 August, 2023; originally announced August 2023.

Comments: 12 pages, 8 figures, conference. To be published in The 21st International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms (HeteroPar2023)

arXiv:2308.00637 [pdf, other]

Krylov Solvers for Interior Point Methods with Applications in Radiation Therapy and Support Vector Machines

Authors: Felix Liu, Albin Fredriksson, Stefano Markidis

Abstract: Interior point methods are widely used for different types of mathematical optimization problems. Many implementations of interior point methods in use today rely on direct linear solvers to solve systems of equations in each iteration. The need to solve ever larger optimization problems more efficiently and the rise of hardware accelerators for general purpose computing has led to a large interes… ▽ More Interior point methods are widely used for different types of mathematical optimization problems. Many implementations of interior point methods in use today rely on direct linear solvers to solve systems of equations in each iteration. The need to solve ever larger optimization problems more efficiently and the rise of hardware accelerators for general purpose computing has led to a large interest in using iterative linear solvers instead, with the major issue being inevitable ill-conditioning of the linear systems arising as the optimization progresses. We investigate the use of Krylov solvers for interior point methods in solving optimization problems from radiation therapy and support vector machines. We implement a prototype interior point method using a so called doubly augmented formulation of the Karush-Kuhn-Tucker linear system of equations, originally proposed by Forsgren and Gill, and evaluate its performance on real optimization problems from radiation therapy and support vector machines. Crucially, our implementation uses a preconditioned conjugate gradient method with Jacobi preconditioning internally. Our measurements of the conditioning of the linear systems indicate that the Jacobi preconditioner improves the conditioning of the systems to a degree that they can be solved iteratively, but there is room for further improvement in that regard. Furthermore, profiling of our prototype code shows that it is suitable for GPU acceleration, which may further improve its performance in practice. Overall, our results indicate that our method can find solutions of acceptable accuracy in reasonable time, even with a simple Jacobi preconditioner. △ Less

Submitted 26 February, 2024; v1 submitted 1 August, 2023; originally announced August 2023.

arXiv:2308.00497 [pdf, other]

Leveraging MLIR for Loop Vectorization and GPU Porting of FFT Libraries

Authors: Yifei He, Artur Podobas, Stefano Markidis

Abstract: FFTc is a Domain-Specific Language (DSL) for designing and generating Fast Fourier Transforms (FFT) libraries. The FFTc uniqueness is that it leverages and extend Multi-Level Intermediate Representation (MLIR) dialects to optimize FFT code generation. In this work, we present FFTc extensions and improvements such as the possibility of using different data layout for complex-value arrays, and spars… ▽ More FFTc is a Domain-Specific Language (DSL) for designing and generating Fast Fourier Transforms (FFT) libraries. The FFTc uniqueness is that it leverages and extend Multi-Level Intermediate Representation (MLIR) dialects to optimize FFT code generation. In this work, we present FFTc extensions and improvements such as the possibility of using different data layout for complex-value arrays, and sparsification to enable efficient vectorization, and a seamless porting of FFT libraries to GPU systems. We show that, on CPUs, thanks to vectorization, the performance of the FFTc-generated FFT is comparable to performance of FFTW, a state-of-the-art FFT libraries. We also present the initial performance results for FFTc on Nvidia GPUs. △ Less

Submitted 1 August, 2023; originally announced August 2023.

arXiv:2307.14860 [pdf, other]

Quantum Computer Simulations at Warp Speed: Assessing the Impact of GPU Acceleration

Authors: Jennifer Faj, Ivy Peng, Jacob Wahlgren, Stefano Markidis

Abstract: Quantum computer simulators are crucial for the development of quantum computing. In this work, we investigate the suitability and performance impact of GPU and multi-GPU systems on a widely used simulation tool - the state vector simulator Qiskit Aer. In particular, we evaluate the performance of both Qiskit's default Nvidia Thrust backend and the recent Nvidia cuQuantum backend on Nvidia A100 GP… ▽ More Quantum computer simulators are crucial for the development of quantum computing. In this work, we investigate the suitability and performance impact of GPU and multi-GPU systems on a widely used simulation tool - the state vector simulator Qiskit Aer. In particular, we evaluate the performance of both Qiskit's default Nvidia Thrust backend and the recent Nvidia cuQuantum backend on Nvidia A100 GPUs. We provide a benchmark suite of representative quantum applications for characterization. For simulations with a large number of qubits, the two GPU backends can provide up to 14x speedup over the CPU backend, with Nvidia cuQuantum providing further 1.5-3x speedup over the default Thrust backend. Our evaluation on a single GPU identifies the most important functions in Nvidia Thrust and cuQuantum for different quantum applications and their compute and memory bottlenecks. We also evaluate the gate fusion and cache-blocking optimizations on different quantum applications. Finally, we evaluate large-number qubit quantum applications on multi-GPU and identify data movement between host and GPU as the limiting factor for the performance. △ Less

Submitted 27 July, 2023; originally announced July 2023.

arXiv:2306.16512 [pdf, other]

doi 10.1007/978-3-031-50684-0_10

Leveraging HPC Profiling & Tracing Tools to Understand the Performance of Particle-in-Cell Monte Carlo Simulations

Authors: Jeremy J. Williams, David Tskhakaya, Stefan Costea, Ivy B. Peng, Marta Garcia-Gasulla, Stefano Markidis

Abstract: Large-scale plasma simulations are critical for designing and developing next-generation fusion energy devices and modeling industrial plasmas. BIT1 is a massively parallel Particle-in-Cell code designed for specifically studying plasma material interaction in fusion devices. Its most salient characteristic is the inclusion of collision Monte Carlo models for different plasma species. In this work… ▽ More Large-scale plasma simulations are critical for designing and developing next-generation fusion energy devices and modeling industrial plasmas. BIT1 is a massively parallel Particle-in-Cell code designed for specifically studying plasma material interaction in fusion devices. Its most salient characteristic is the inclusion of collision Monte Carlo models for different plasma species. In this work, we characterize single node, multiple nodes, and I/O performances of the BIT1 code in two realistic cases by using several HPC profilers, such as perf, IPM, Extrae/Paraver, and Darshan tools. We find that the BIT1 sorting function on-node performance is the main performance bottleneck. Strong scaling tests show a parallel performance of 77% and 96% on 2,560 MPI ranks for the two test cases. We demonstrate that communication, load imbalance and self-synchronization are important factors impacting the performance of the BIT1 on large-scale runs. △ Less

Submitted 28 June, 2023; originally announced June 2023.

Comments: Accepted by the Euro-Par 2023 workshops (TDLPP 2023), prepared in the standardized Springer LNCS format and consists of 12 pages, which includes the main text, references, and figures

arXiv:2305.09419 [pdf, other]

doi 10.1145/3587135.3592191

QHDL: a Low-Level Circuit Description Language for Quantum Computing

Authors: Gilbert Netzer, Stefano Markidis

Abstract: This paper proposes a descriptive language called QHDL, akin to VHDL, to program gate-based quantum computing systems. Unlike other popular quantum programming languages, QHDL targets low-level quantum computing programming and aims to provide a common framework for programming FPGAs and gate-based quantum computing systems. The paper presents an initial implementation and design principles of the… ▽ More This paper proposes a descriptive language called QHDL, akin to VHDL, to program gate-based quantum computing systems. Unlike other popular quantum programming languages, QHDL targets low-level quantum computing programming and aims to provide a common framework for programming FPGAs and gate-based quantum computing systems. The paper presents an initial implementation and design principles of the QHDL framework, including a compiler and quantum computer simulator. We discuss the challenges of low-level integration of streaming models and quantum computing for programming FPGAs and gate-based quantum computing systems. △ Less

Submitted 16 May, 2023; originally announced May 2023.

Comments: 4 pages, 7 figures, to be published in Proceedings of the 20th ACM International Conference on Computing Frontiers, May 9-11, 2023, Bologna, Italy

arXiv:2305.04635 [pdf, other]

Parallel Cholesky Factorization for Banded Matrices using OpenMP Tasks

Authors: Felix Liu, Albin Fredriksson, Stefano Markidis

Abstract: Cholesky factorization is a widely used method for solving linear systems involving symmetric, positive-definite matrices, and can be an attractive choice in applications where a high degree of numerical stability is needed. One such application is numerical optimization, where direct methods for solving linear systems are widely used and often a significant performance bottleneck. An example wher… ▽ More Cholesky factorization is a widely used method for solving linear systems involving symmetric, positive-definite matrices, and can be an attractive choice in applications where a high degree of numerical stability is needed. One such application is numerical optimization, where direct methods for solving linear systems are widely used and often a significant performance bottleneck. An example where this is the case, and the specific type of optimization problem motivating this work, is radiation therapy treatment planning, where numerical optimization is used to create individual treatment plans for patients. To address this bottleneck, we propose a task-based multi-threaded method for Cholesky factorization of banded matrices with medium-sized bands. We implement our algorithm using OpenMP tasks and compare our performance with state-of-the-art libraries such as Intel MKL. Our performance measurements show a performance that is on par or better than Intel MKL (up to ~26%) for a wide range of matrix bandwidths on two different Intel CPU systems. △ Less

Submitted 8 May, 2023; originally announced May 2023.

arXiv:2302.12164 [pdf, other]

doi 10.1016/j.future.2023.06.01

Making Applications Faster by Asynchronous Execution: Slowing Down Processes or Relaxing MPI Collectives

Authors: Ayesha Afzal, Georg Hager, Stefano Markidis, Gerhard Wellein

Abstract: Comprehending the performance bottlenecks at the core of the intricate hardware-software interactions exhibited by highly parallel programs on HPC clusters is crucial. This paper sheds light on the issue of automatically asynchronous MPI communication in memory-bound parallel programs on multicore clusters and how it can be facilitated. For instance, slowing down MPI processes by deliberate inject… ▽ More Comprehending the performance bottlenecks at the core of the intricate hardware-software interactions exhibited by highly parallel programs on HPC clusters is crucial. This paper sheds light on the issue of automatically asynchronous MPI communication in memory-bound parallel programs on multicore clusters and how it can be facilitated. For instance, slowing down MPI processes by deliberate injection of delays can improve performance if certain conditions are met. This leads to the counter-intuitive conclusion that noise, independent of its source, is not always detrimental but can be leveraged for performance improvements. We employ phase-space graphs as a new tool to visualize parallel program dynamics. They are useful in spotting certain patterns in parallel execution that will easily go unnoticed with traditional tracing tools. We investigate five different microbenchmarks and applications on different supercomputer platforms: an MPI-augmented STREAM Triad, two implementations of Lattice-Boltzmann fluid solvers, and the LULESH and HPCG proxy applications. △ Less

Submitted 24 February, 2023; v1 submitted 23 February, 2023; originally announced February 2023.

Comments: 18 pages, 14 figures, 7 tables. Corrected Fig. 4 layout

arXiv:2209.14754 [pdf, other]

doi 10.3389/fams.2022.1036711

On Physics-Informed Neural Networks for Quantum Computers

Authors: Stefano Markidis

Abstract: Physics-Informed Neural Networks (PINN) emerged as a powerful tool for solving scientific computing problems, ranging from the solution of Partial Differential Equations to data assimilation tasks. One of the advantages of using PINN is to leverage the usage of Machine Learning computational frameworks relying on the combined usage of CPUs and co-processors, such as accelerators, to achieve maximu… ▽ More Physics-Informed Neural Networks (PINN) emerged as a powerful tool for solving scientific computing problems, ranging from the solution of Partial Differential Equations to data assimilation tasks. One of the advantages of using PINN is to leverage the usage of Machine Learning computational frameworks relying on the combined usage of CPUs and co-processors, such as accelerators, to achieve maximum performance. This work investigates the design, implementation, and performance of PINNs, using the Quantum Processing Unit (QPU) co-processor. We design a simple Quantum PINN to solve the one-dimensional Poisson problem using a Continuous Variable (CV) quantum computing framework. We discuss the impact of different optimizers, PINN residual formulation, and quantum neural network depth on the quantum PINN accuracy. We show that the optimizer exploration of the training landscape in the case of quantum PINN is not as effective as in classical PINN, and basic Stochastic Gradient Descent (SGD) optimizers outperform adaptive and high-order optimizers. Finally, we highlight the difference in methods and algorithms between quantum and classical PINNs and outline future research challenges for quantum PINN development. △ Less

Submitted 17 October, 2022; v1 submitted 28 September, 2022; originally announced September 2022.

Comments: Updated the previous work section and abstract, fixed typos, and changed the title

Journal ref: https://www.frontiersin.org/articles/10.3389/fams.2022.1036711/full

arXiv:2208.13658 [pdf, other]

doi 10.1007/978-3-031-30442-2_25

Breaking Down the Parallel Performance of GROMACS, a High-Performance Molecular Dynamics Software

Authors: Måns I. Andersson, N. Arul Murugan, Artur Podobas, Stefano Markidis

Abstract: GROMACS is one of the most widely used HPC software packages using the Molecular Dynamics (MD) simulation technique. In this work, we quantify GROMACS parallel performance using different configurations, HPC systems, and FFT libraries (FFTW, Intel MKL FFT, and FFT PACK). We break down the cost of each GROMACS computational phase and identify non-scalable stages, such as MPI communication during th… ▽ More GROMACS is one of the most widely used HPC software packages using the Molecular Dynamics (MD) simulation technique. In this work, we quantify GROMACS parallel performance using different configurations, HPC systems, and FFT libraries (FFTW, Intel MKL FFT, and FFT PACK). We break down the cost of each GROMACS computational phase and identify non-scalable stages, such as MPI communication during the 3D FFT computation when using a large number of processes. We show that the Particle-Mesh Ewald phase and the 3D FFT calculation significantly impact the GROMACS performance. Finally, we discuss performance opportunities with a particular interest in developing GROMACS for the FFT calculations. △ Less

Submitted 29 August, 2022; originally announced August 2022.

arXiv:2208.11395 [pdf, other]

doi 10.1007/978-3-031-30442-2_29

Distributed Objective Function Evaluation for Optimization of Radiation Therapy Treatment Plans

Authors: Felix Liu, Måns I. Andersson, Albin Fredriksson, Stefano Markidis

Abstract: The modern workflow for radiation therapy treatment planning involves mathematical optimization to determine optimal treatment machine parameters for each patient case. The optimization problems can be computationally expensive, requiring iterative optimization algorithms to solve. In this work, we investigate a method for distributing the calculation of objective functions and gradients for radia… ▽ More The modern workflow for radiation therapy treatment planning involves mathematical optimization to determine optimal treatment machine parameters for each patient case. The optimization problems can be computationally expensive, requiring iterative optimization algorithms to solve. In this work, we investigate a method for distributing the calculation of objective functions and gradients for radiation therapy optimization problems across computational nodes. We test our approach on the TROTS dataset -- which consists of optimization problems from real clinical patient cases -- using the IPOPT optimization solver in a leader/follower type approach for parallelization. We show that our approach can utilize multiple computational nodes efficiently, with a speedup of approximately 2-3.5 times compared to the serial version. △ Less

Submitted 24 August, 2022; originally announced August 2022.

Comments: Accepted for publication at the PPAM22 conference

Journal ref: Parallel Processing and Applied Mathematics. PPAM 2022. Lecture Notes in Computer Science, vol 13826. Springer, Cham

arXiv:2207.07098 [pdf, other]

Large-Scale Direct Numerical Simulations of Turbulence Using GPUs and Modern Fortran

Authors: Martin Karp, Daniele Massaro, Niclas Jansson, Alistair Hart, Jacob Wahlgren, Philipp Schlatter, Stefano Markidis

Abstract: We present our approach to making direct numerical simulations of turbulence with applications in sustainable shipping. We use modern Fortran and the spectral element method to leverage and scale on supercomputers powered by the Nvidia A100 and the recent AMD Instinct MI250X GPUs, while still providing support for user software developed in Fortran. We demonstrate the efficiency of our approach by… ▽ More We present our approach to making direct numerical simulations of turbulence with applications in sustainable shipping. We use modern Fortran and the spectral element method to leverage and scale on supercomputers powered by the Nvidia A100 and the recent AMD Instinct MI250X GPUs, while still providing support for user software developed in Fortran. We demonstrate the efficiency of our approach by performing the world's first direct numerical simulation of the flow around a Flettner rotor at Re=30'000 and its interaction with a turbulent boundary layer. We present one of the first performance comparisons between the AMD Instinct MI250X and Nvidia A100 GPUs for scalable computational fluid dynamics. Our results show that one MI250X offers performance on par with two A100 GPUs and has a similar power efficiency. △ Less

Submitted 23 June, 2022; originally announced July 2022.

Comments: 13 pages, 7 figures

ACM Class: G.4; J.2

arXiv:2207.06803 [pdf, other]

FFTc: An MLIR Dialect for Developing HPC Fast Fourier Transform Libraries

Authors: Yifei He, Artur Podobas, Måns I. Andersson, Stefano Markidis

Abstract: Discrete Fourier Transform (DFT) libraries are one of the most critical software components for scientific computing. Inspired by FFTW, a widely used library for DFT HPC calculations, we apply compiler technologies for the development of HPC Fourier transform libraries. In this work, we introduce FFTc, a domain-specific language, based on Multi-Level Intermediate Representation (MLIR), for express… ▽ More Discrete Fourier Transform (DFT) libraries are one of the most critical software components for scientific computing. Inspired by FFTW, a widely used library for DFT HPC calculations, we apply compiler technologies for the development of HPC Fourier transform libraries. In this work, we introduce FFTc, a domain-specific language, based on Multi-Level Intermediate Representation (MLIR), for expressing Fourier Transform algorithms. We present the initial design, implementation, and preliminary results of FFTc. △ Less

Submitted 26 July, 2022; v1 submitted 14 July, 2022; originally announced July 2022.

arXiv:2206.14103 [pdf, other]

Workflows to driving high-performance interactive supercomputing for urgent decision making

Authors: Nick Brown, Rupert Nash, Gordon Gibb, Evgenij Belikov, Artur Podobas, Wei Der Chien, Stefano Markidis, Markus Flatken, Andreas Gerndt

Abstract: Interactive urgent computing is a small but growing user of supercomputing resources. However there are numerous technical challenges that must be overcome to make supercomputers fully suited to the wide range of urgent workloads which could benefit from the computational power delivered by such instruments. An important question is how to connect the different components of an urgent workload; na… ▽ More Interactive urgent computing is a small but growing user of supercomputing resources. However there are numerous technical challenges that must be overcome to make supercomputers fully suited to the wide range of urgent workloads which could benefit from the computational power delivered by such instruments. An important question is how to connect the different components of an urgent workload; namely the users, the simulation codes, and external data sources, together in a structured and accessible manner. In this paper we explore the role of workflows from both the perspective of marshalling and control of urgent workloads, and at the individual HPC machine level. Ultimately requiring two workflow systems, by using a space weather prediction urgent use-cases, we explore the benefit that these two workflow systems provide especially when one exploits the flexibility enabled by them interoperating. △ Less

Submitted 28 June, 2022; originally announced June 2022.

Comments: Pre-print of paper accepted to the InteractiveHPC workshop of ISC2022

arXiv:2205.13963 [pdf, other]

doi 10.1007/978-3-031-30442-2_12

Exploring Techniques for the Analysis of Spontaneous Asynchronicity in MPI-Parallel Applications

Authors: Ayesha Afzal, Georg Hager, Gerhard Wellein, Stefano Markidis

Abstract: This paper studies the utility of using data analytics and machine learning techniques for identifying, classifying, and characterizing the dynamics of large-scale parallel (MPI) programs. To this end, we run microbenchmarks and realistic proxy applications with the regular compute-communicate structure on two different supercomputing platforms and choose the per-process performance and MPI time p… ▽ More This paper studies the utility of using data analytics and machine learning techniques for identifying, classifying, and characterizing the dynamics of large-scale parallel (MPI) programs. To this end, we run microbenchmarks and realistic proxy applications with the regular compute-communicate structure on two different supercomputing platforms and choose the per-process performance and MPI time per time step as relevant observables. Using principal component analysis, clustering techniques, correlation functions, and a new "phase space plot," we show how desynchronization patterns (or lack thereof) can be readily identified from a data set that is much smaller than a full MPI trace. Our methods also lead the way towards a more general classification of parallel program dynamics. △ Less

Submitted 27 May, 2022; originally announced May 2022.

Comments: 12 pages, 9 figures, 1 table

arXiv:2112.00116 [pdf, ps, other]

A Review on Parallel Virtual Screening Softwares for High Performance Computers

Authors: Natarajan Arul Murugan, Artur Podobas, Davide Gadioli, Emanuele Vitali, Gianluca Palermo, Stefano Markidis

Abstract: Drug discovery is the most expensive, time demanding and challenging project in biopharmaceutical companies which aims at the identification and optimization of lead compounds from large-sized chemical libraries. The lead compounds should have high affinity binding and specificity for a target associated with a disease and in addition they should have favorable pharmacodynamic and pharmacokinetic… ▽ More Drug discovery is the most expensive, time demanding and challenging project in biopharmaceutical companies which aims at the identification and optimization of lead compounds from large-sized chemical libraries. The lead compounds should have high affinity binding and specificity for a target associated with a disease and in addition they should have favorable pharmacodynamic and pharmacokinetic properties (grouped as ADMET properties). Overall, drug discovery is a multivariable optimization and can be carried out in supercomputers using a reliable scoring function which is a measure of binding affinity or inhibition potential of the drug-like compound. The major problem is that the number of compounds in the chemical spaces is huge making the computational drug discovery very demanding. However, it is cheaper and less time consuming when compared to experimental high throughput screening. As the problem is to find the most stable (global) minima for numerous protein-ligand complexes (at the order of 10$^6$ to 10$^{12}$), the parallel implementation of in-silico virtual screening can be exploited to make the drug discovery in affordable time. In this review, we discuss such implementations of parallelization algorithms in virtual screening programs. The nature of different scoring functions and search algorithms are discussed, together with a performance analysis of several docking softwares ported on high-performance computing architectures. △ Less

Submitted 30 November, 2021; originally announced December 2021.

Comments: Submitted to Pharmaceuticals, MPDI journal

arXiv:2111.05654 [pdf, other]

Utilising urgent computing to tackle the spread of mosquito-borne diseases

Authors: Nick Brown, Rupert Nash, Piero Poletti, Giorgio Guzzetta, Mattia Manica, Agnese Zardini, Markus Flatken, Jules Vidal, Charles Gueunet, Evgenij Belikov, Julien Tierny, Artur Podobas, Wei Der Chien, Stefano Markidis, Andreas Gerndt

Abstract: It is estimated that around 80\% of the world's population live in areas susceptible to at-least one major vector borne disease, and approximately 20% of global communicable diseases are spread by mosquitoes. Furthermore, the outbreaks of such diseases are becoming more common and widespread, with much of this driven in recent years by socio-demographic and climatic factors. These trends are causi… ▽ More It is estimated that around 80\% of the world's population live in areas susceptible to at-least one major vector borne disease, and approximately 20% of global communicable diseases are spread by mosquitoes. Furthermore, the outbreaks of such diseases are becoming more common and widespread, with much of this driven in recent years by socio-demographic and climatic factors. These trends are causing significant worry to global health organisations, including the CDC and WHO, and-so an important question is the role that technology can play in addressing them. In this work we describe the integration of an epidemiology model, which simulates the spread of mosquito-borne diseases, with the VESTEC urgent computing ecosystem. The intention of this work is to empower human health professionals to exploit this model and more easily explore the progression of mosquito-borne diseases. Traditionally in the domain of the few research scientists, by leveraging state of the art visualisation and analytics techniques, all supported by running the computational workloads on HPC machines in a seamless fashion, we demonstrate the significant advantages that such an integration can provide. Furthermore we demonstrate the benefits of using an ecosystem such as VESTEC, which provides a framework for urgent computing, in supporting the easy adoption of these technologies by the epidemiologists and disaster response professionals more widely. △ Less

Submitted 10 November, 2021; originally announced November 2021.

Comments: Preprint of paper in 2021 IEEE/ACM HPC for Urgent Decision Making (UrgentHPC)

arXiv:2109.03592 [pdf, ps, other]

Strong Scaling of OpenACC enabled Nek5000 on several GPU based HPC systems

Authors: Jonathan Vincent, Jing Gong, Martin Karp, Adam Peplinski, Niclas Jansson, Artur Podobas, Andreas Jocksch, Jie Yao, Fazle Hussain, Stefano Markidis, Matts Karlsson, Dirk Pleiter, Erwin Laure, Philipp Schlatter

Abstract: We present new results on the strong parallel scaling for the OpenACC-accelerated implementation of the high-order spectral element fluid dynamics solver Nek5000. The test case considered consists of a direct numerical simulation of fully-developed turbulent flow in a straight pipe, at two different Reynolds numbers $Re_τ=360$ and $Re_τ=550$, based on friction velocity and pipe radius. The strong… ▽ More We present new results on the strong parallel scaling for the OpenACC-accelerated implementation of the high-order spectral element fluid dynamics solver Nek5000. The test case considered consists of a direct numerical simulation of fully-developed turbulent flow in a straight pipe, at two different Reynolds numbers $Re_τ=360$ and $Re_τ=550$, based on friction velocity and pipe radius. The strong scaling is tested on several GPU-enabled HPC systems, including the Swiss Piz Daint system, TACC's Longhorn, Jülich's JUWELS Booster, and Berzelius in Sweden. The performance results show that speed-up between 3-5 can be achieved using the GPU accelerated version compared with the CPU version on these different systems. The run-time for 20 timesteps reduces from 43.5 to 13.2 seconds with increasing the number of GPUs from 64 to 512 for $Re_τ=550$ case on JUWELS Booster system. This illustrates the GPU accelerated version the potential for high throughput. At the same time, the strong scaling limit is significantly larger for GPUs, at about $2000-5000$ elements per rank; compared to about $50-100$ for a CPU-rank. △ Less

Submitted 4 November, 2021; v1 submitted 8 September, 2021; originally announced September 2021.

Comments: 9 pages, 8 figures. Submitted to HPC-Asia 2022 conference, updated to address reviewers comments

ACM Class: G.4; J.2; C.1

arXiv:2108.12188 [pdf, ps, other]

doi 10.1145/3492805.3492808

A High-Fidelity Flow Solver for Unstructured Meshes on Field-Programmable Gate Arrays

Authors: Martin Karp, Artur Podobas, Tobias Kenter, Niclas Jansson, Christian Plessl, Philipp Schlatter, Stefano Markidis

Abstract: The impending termination of Moore's law motivates the search for new forms of computing to continue the performance scaling we have grown accustomed to. Among the many emerging Post-Moore computing candidates, perhaps none is as salient as the Field-Programmable Gate Array (FPGA), which offers the means of specializing and customizing the hardware to the computation at hand. In this work, we de… ▽ More The impending termination of Moore's law motivates the search for new forms of computing to continue the performance scaling we have grown accustomed to. Among the many emerging Post-Moore computing candidates, perhaps none is as salient as the Field-Programmable Gate Array (FPGA), which offers the means of specializing and customizing the hardware to the computation at hand. In this work, we design a custom FPGA-based accelerator for a computational fluid dynamics (CFD) code. Unlike prior work -- which often focuses on accelerating small kernels -- we target the entire Poisson solver on unstructured meshes based on the high-fidelity spectral element method (SEM) used in modern state-of-the-art CFD systems. We model our accelerator using an analytical performance model based on the I/O cost of the algorithm. We empirically evaluate our accelerator on a state-of-the-art Intel Stratix 10 FPGA in terms of performance and power consumption and contrast it against existing solutions on general-purpose processors (CPUs). Finally, we propose a data movement-reducing technique where we compute geometric factors on the fly, which yields significant (700+ Gflop/s) single-precision performance and an upwards of 2x reduction in runtime for the local evaluation of the Laplace operator. We end the paper by discussing the challenges and opportunities of using reconfigurable architecture in the future, particularly in the light of emerging (not yet available) technologies. △ Less

Submitted 2 November, 2021; v1 submitted 27 August, 2021; originally announced August 2021.

Comments: 12 pages, 3 figures, 3 tables, Accepted to HPC Asia 2022

ACM Class: G.4; J.2; C.1

arXiv:2107.06676 [pdf, other]

Higgs Boson Classification: Brain-inspired BCPNN Learning with StreamBrain

Authors: Martin Svedin, Artur Podobas, Steven W. D. Chien, Stefano Markidis

Abstract: One of the most promising approaches for data analysis and exploration of large data sets is Machine Learning techniques that are inspired by brain models. Such methods use alternative learning rules potentially more efficiently than established learning rules. In this work, we focus on the potential of brain-inspired ML for exploiting High-Performance Computing (HPC) resources to solve ML problem… ▽ More One of the most promising approaches for data analysis and exploration of large data sets is Machine Learning techniques that are inspired by brain models. Such methods use alternative learning rules potentially more efficiently than established learning rules. In this work, we focus on the potential of brain-inspired ML for exploiting High-Performance Computing (HPC) resources to solve ML problems: we discuss the BCPNN and an HPC implementation, called StreamBrain, its computational cost, suitability to HPC systems. As an example, we use StreamBrain to analyze the Higgs Boson dataset from High Energy Physics and discriminate between background and signal classes in collisions of high-energy particle colliders. Overall, we reach up to 69.15% accuracy and 76.4% Area Under the Curve (AUC) performance. △ Less

Submitted 17 August, 2021; v1 submitted 14 July, 2021; originally announced July 2021.

Comments: Accepted for publication at The 2nd Workshop on Artificial Intelligence and Machine Learning for Scientific Applications (AI4S 2021)

arXiv:2107.02232 [pdf, other]

A Deep Learning-Based Particle-in-Cell Method for Plasma Simulations

Authors: Xavier Aguilar, Stefano Markidis

Abstract: We design and develop a new Particle-in-Cell (PIC) method for plasma simulations using Deep-Learning (DL) to calculate the electric field from the electron phase space. We train a Multilayer Perceptron (MLP) and a Convolutional Neural Network (CNN) to solve the two-stream instability test. We verify that the DL-based MLP PIC method produces the correct results using the two-stream instability: the… ▽ More We design and develop a new Particle-in-Cell (PIC) method for plasma simulations using Deep-Learning (DL) to calculate the electric field from the electron phase space. We train a Multilayer Perceptron (MLP) and a Convolutional Neural Network (CNN) to solve the two-stream instability test. We verify that the DL-based MLP PIC method produces the correct results using the two-stream instability: the DL-based PIC provides the expected growth rate of the two-stream instability. The DL-based PIC does not conserve the total energy and momentum. However, the DL-based PIC method is stable against the cold-beam instability, affecting traditional PIC methods. This work shows that integrating DL technologies into traditional computational methods is a viable approach for developing next-generation PIC algorithms. △ Less

Submitted 5 July, 2021; originally announced July 2021.

Comments: Submitted to AI4S Workshop at Cluster Conference

arXiv:2107.01243 [pdf]

Neko: A Modern, Portable, and Scalable Framework for High-Fidelity Computational Fluid Dynamics

Authors: Niclas Jansson, Martin Karp, Artur Podobas, Stefano Markidis, Philipp Schlatter

Abstract: Recent trends and advancement in including more diverse and heterogeneous hardware in High-Performance Computing is challenging software developers in their pursuit for good performance and numerical stability. The well-known maxim "software outlives hardware" may no longer necessarily hold true, and developers are today forced to re-factor their codebases to leverage these powerful new systems. C… ▽ More Recent trends and advancement in including more diverse and heterogeneous hardware in High-Performance Computing is challenging software developers in their pursuit for good performance and numerical stability. The well-known maxim "software outlives hardware" may no longer necessarily hold true, and developers are today forced to re-factor their codebases to leverage these powerful new systems. CFD is one of the many application domains affected. In this paper, we present Neko, a portable framework for high-order spectral element flow simulations. Unlike prior works, Neko adopts a modern object-oriented approach, allowing multi-tier abstractions of the solver stack and facilitating hardware backends ranging from general-purpose processors down to exotic vector processors and FPGAs. We show that Neko's performance and accuracy are comparable to NekRS, and thus on-par with Nek5000's successor on modern CPU machines. Furthermore, we develop a performance model, which we use to discuss challenges and opportunities for high-order solvers on emerging hardware. △ Less

Submitted 2 July, 2021; originally announced July 2021.

arXiv:2106.05373 [pdf, other]

doi 10.1145/3468044.3468052

StreamBrain: An HPC Framework for Brain-like Neural Networks on CPUs, GPUs and FPGAs

Authors: Artur Podobas, Martin Svedin, Steven W. D. Chien, Ivy B. Peng, Naresh Balaji Ravichandran, Pawel Herman, Anders Lansner, Stefano Markidis

Abstract: The modern deep learning method based on backpropagation has surged in popularity and has been used in multiple domains and application areas. At the same time, there are other -- less-known -- machine learning algorithms with a mature and solid theoretical foundation whose performance remains unexplored. One such example is the brain-like Bayesian Confidence Propagation Neural Network (BCPNN). In… ▽ More The modern deep learning method based on backpropagation has surged in popularity and has been used in multiple domains and application areas. At the same time, there are other -- less-known -- machine learning algorithms with a mature and solid theoretical foundation whose performance remains unexplored. One such example is the brain-like Bayesian Confidence Propagation Neural Network (BCPNN). In this paper, we introduce StreamBrain -- a framework that allows neural networks based on BCPNN to be practically deployed in High-Performance Computing systems. StreamBrain is a domain-specific language (DSL), similar in concept to existing machine learning (ML) frameworks, and supports backends for CPUs, GPUs, and even FPGAs. We empirically demonstrate that StreamBrain can train the well-known ML benchmark dataset MNIST within seconds, and we are the first to demonstrate BCPNN on STL-10 size networks. We also show how StreamBrain can be used to train with custom floating-point formats and illustrate the impact of using different bfloat variations on BCPNN using FPGAs. △ Less

Submitted 9 June, 2021; originally announced June 2021.

Comments: Accepted for publication at the International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies (HEART 2021)

arXiv:2103.09683 [pdf, other]

Accelerating Radiation Therapy Dose Calculation with Nvidia GPUs

Authors: Felix Liu, Niclas Jansson, Artur Podobas, Albin Fredriksson, Stefano Markidis

Abstract: Radiation Treatment Planning (RTP) is the process of planning the appropriate external beam radiotherapy to combat cancer in human patients. RTP is a complex and compute-intensive task, which often takes a long time (several hours) to compute. Reducing this time allows for higher productivity at clinics and more sophisticated treatment planning, which can materialize in better treatments. The stat… ▽ More Radiation Treatment Planning (RTP) is the process of planning the appropriate external beam radiotherapy to combat cancer in human patients. RTP is a complex and compute-intensive task, which often takes a long time (several hours) to compute. Reducing this time allows for higher productivity at clinics and more sophisticated treatment planning, which can materialize in better treatments. The state-of-the-art in medical facilities uses general-purpose processors (CPUs) to perform many steps in the RTP process. In this paper, we explore the use of accelerators to reduce RTP calculating time. We focus on the step that calculates the dose using the Graphics Processing Unit (GPU), which we believe is an excellent candidate for this computation type. Next, we create a highly optimized implementation for a custom Sparse Matrix-Vector Multiplication (SpMV) that operates on numerical formats unavailable in state-of-the-art SpMV libraries (e.g., Ginkgo and cuSPARSE). We show that our implementation is several times faster than the baseline (up-to 4x) and has a higher operational intensity than similar (but different) versions such as Ginkgo and cuSPARSE. △ Less

Submitted 19 September, 2021; v1 submitted 17 March, 2021; originally announced March 2021.

arXiv:2103.09655 [pdf, other]

The Old and the New: Can Physics-Informed Deep-Learning Replace Traditional Linear Solvers?

Authors: Stefano Markidis

Abstract: Physics-Informed Neural Networks (PINN) are neural networks encoding the problem governing equations, such as Partial Differential Equations (PDE), as a part of the neural network. PINNs have emerged as a new essential tool to solve various challenging problems, including computing linear systems arising from PDEs, a task for which several traditional methods exist. In this work, we focus first on… ▽ More Physics-Informed Neural Networks (PINN) are neural networks encoding the problem governing equations, such as Partial Differential Equations (PDE), as a part of the neural network. PINNs have emerged as a new essential tool to solve various challenging problems, including computing linear systems arising from PDEs, a task for which several traditional methods exist. In this work, we focus first on evaluating the potential of PINNs as linear solvers in the case of the Poisson equation, an omnipresent equation in scientific computing. We characterize PINN linear solvers in terms of accuracy and performance under different network configurations (depth, activation functions, input data set distribution). We highlight the critical role of transfer learning. Our results show that low-frequency components of the solution converge quickly as an effect of the F-principle. In contrast, an accurate solution of the high frequencies requires an exceedingly long time. To address this limitation, we propose integrating PINNs into traditional linear solvers. We show that this integration leads to the development of new solvers whose performance is on par with other high-performance solvers, such as PETSc conjugate gradient linear solvers, in terms of performance and accuracy. Overall, while the accuracy and computational performance are still a limiting factor for the direct use of PINN linear solvers, hybrid strategies combining old traditional linear solver approaches with new emerging deep-learning techniques are among the most promising methods for developing a new class of linear solvers. △ Less

Submitted 5 July, 2021; v1 submitted 11 March, 2021; originally announced March 2021.

Comments: Preprint - submitted to Frontiers

arXiv:2010.13463 [pdf]

High-Performance Spectral Element Methods on Field-Programmable Gate Arrays

Authors: Martin Karp, Artur Podobas, Niclas Jansson, Tobias Kenter, Christian Plessl, Philipp Schlatter, Stefano Markidis

Abstract: Improvements in computer systems have historically relied on two well-known observations: Moore's law and Dennard's scaling. Today, both these observations are ending, forcing computer users, researchers, and practitioners to abandon the general-purpose architectures' comforts in favor of emerging post-Moore systems. Among the most salient of these post-Moore systems is the Field-Programmable Gate… ▽ More Improvements in computer systems have historically relied on two well-known observations: Moore's law and Dennard's scaling. Today, both these observations are ending, forcing computer users, researchers, and practitioners to abandon the general-purpose architectures' comforts in favor of emerging post-Moore systems. Among the most salient of these post-Moore systems is the Field-Programmable Gate Array (FPGA), which strikes a convenient balance between complexity and performance. In this paper, we study modern FPGAs' applicability in accelerating the Spectral Element Method (SEM) core to many computational fluid dynamics (CFD) applications. We design a custom SEM hardware accelerator operating in double-precision that we empirically evaluate on the latest Stratix 10 GX-series FPGAs and position its performance (and power-efficiency) against state-of-the-art systems such as ARM ThunderX2, NVIDIA Pascal/Volta/Ampere Tesla-series cards, and general-purpose manycore CPUs. Finally, we develop a performance model for our SEM-accelerator, which we use to project future FPGAs' performance and role to accelerate CFD applications, ultimately answering the question: what characteristics would a perfect FPGA for CFD applications have? △ Less

Submitted 4 May, 2021; v1 submitted 26 October, 2020; originally announced October 2020.

Comments: 10 pages, IEEE International Parallel and Distributed Processing Symposium 2021 (IPDPS'21)

ACM Class: G.4; J.2; C.1

arXiv:2010.05348 [pdf, other]

Automatic Particle Trajectory Classification in Plasma Simulations

Authors: Stefano Markidis, Ivy Peng, Artur Podobas, Itthinat Jongsuebchoke, Gabriel Bengtsson, Pawel Herman

Abstract: Numerical simulations of plasma flows are crucial for advancing our understanding of microscopic processes that drive the global plasma dynamics in fusion devices, space, and astrophysical systems. Identifying and classifying particle trajectories allows us to determine specific on-going acceleration mechanisms, shedding light on essential plasma processes. Our overall goal is to provide a gener… ▽ More Numerical simulations of plasma flows are crucial for advancing our understanding of microscopic processes that drive the global plasma dynamics in fusion devices, space, and astrophysical systems. Identifying and classifying particle trajectories allows us to determine specific on-going acceleration mechanisms, shedding light on essential plasma processes. Our overall goal is to provide a general workflow for exploring particle trajectory space and automatically classifying particle trajectories from plasma simulations in an unsupervised manner. We combine pre-processing techniques, such as Fast Fourier Transform (FFT), with Machine Learning methods, such as Principal Component Analysis (PCA), k-means clustering algorithms, and silhouette analysis. We demonstrate our workflow by classifying electron trajectories during magnetic reconnection problem. Our method successfully recovers existing results from previous literature without a priori knowledge of the underlying system. Our workflow can be applied to analyzing particle trajectories in different phenomena, from magnetic reconnection, shocks to magnetospheric flows. The workflow has no dependence on any physics model and can identify particle trajectories and acceleration mechanisms that were not detected before. △ Less

Submitted 11 October, 2020; originally announced October 2020.

Comments: Accepted for publication at AI4S: Workshop on Artificial Intelligence and Machine Learning for Scientific Applications

arXiv:2008.04397 [pdf, other]

doi 10.1109/SBAC-PAD49847.2020.00030

sputniPIC: an Implicit Particle-in-Cell Code for Multi-GPU Systems

Authors: Steven W. D. Chien, Jonas Nylund, Gabriel Bengtsson, Ivy B. Peng, Artur Podobas, Stefano Markidis

Abstract: Large-scale simulations of plasmas are essential for advancing our understanding of fusion devices, space, and astrophysical systems. Particle-in-Cell (PIC) codes have demonstrated their success in simulating numerous plasma phenomena on HPC systems. Today, flagship supercomputers feature multiple GPUs per compute node to achieve unprecedented computing power at high power efficiency. PIC codes re… ▽ More Large-scale simulations of plasmas are essential for advancing our understanding of fusion devices, space, and astrophysical systems. Particle-in-Cell (PIC) codes have demonstrated their success in simulating numerous plasma phenomena on HPC systems. Today, flagship supercomputers feature multiple GPUs per compute node to achieve unprecedented computing power at high power efficiency. PIC codes require new algorithm design and implementation for exploiting such accelerated platforms. In this work, we design and optimize a three-dimensional implicit PIC code, called sputniPIC, to run on a general multi-GPU compute node. We introduce a particle decomposition data layout, in contrast to domain decomposition on CPU-based implementations, to use particle batches for overlapping communication and computation on GPUs. sputniPIC also natively supports different precision representations to achieve speed up on hardware that supports reduced precision. We validate sputniPIC through the well-known GEM challenge and provide performance analysis. We test sputniPIC on three multi-GPU platforms and report a 200-800x performance improvement with respect to the sputniPIC CPU OpenMP version performance. We show that reduced precision could further improve performance by 45% to 80% on the three platforms. Because of these performance improvements, on a single node with multiple GPUs, sputniPIC enables large-scale three-dimensional PIC simulations that were only possible using clusters. △ Less

Submitted 10 August, 2020; originally announced August 2020.

Comments: Accepted for publication at the 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2020)

arXiv:2008.04395 [pdf, other]

doi 10.1109/CLUSTER49012.2020.00046

tf-Darshan: Understanding Fine-grained I/O Performance in Machine Learning Workloads

Authors: Steven W. D. Chien, Artur Podobas, Ivy B. Peng, Stefano Markidis

Abstract: Machine Learning applications on HPC systems have been gaining popularity in recent years. The upcoming large scale systems will offer tremendous parallelism for training through GPUs. However, another heavy aspect of Machine Learning is I/O, and this can potentially be a performance bottleneck. TensorFlow, one of the most popular Deep-Learning platforms, now offers a new profiler interface and al… ▽ More Machine Learning applications on HPC systems have been gaining popularity in recent years. The upcoming large scale systems will offer tremendous parallelism for training through GPUs. However, another heavy aspect of Machine Learning is I/O, and this can potentially be a performance bottleneck. TensorFlow, one of the most popular Deep-Learning platforms, now offers a new profiler interface and allows instrumentation of TensorFlow operations. However, the current profiler only enables analysis at the TensorFlow platform level and does not provide system-level information. In this paper, we extend TensorFlow Profiler and introduce tf-Darshan, both a profiler and tracer, that performs instrumentation through Darshan. We use the same Darshan shared instrumentation library and implement a runtime attachment without using a system preload. We can extract Darshan profiling data structures during TensorFlow execution to enable analysis through the TensorFlow profiler. We visualize the performance results through TensorBoard, the web-based TensorFlow visualization tool. At the same time, we do not alter Darshan's existing implementation. We illustrate tf-Darshan by performing two case studies on ImageNet image and Malware classification. We show that by guiding optimization using data from tf-Darshan, we increase POSIX I/O bandwidth by up to 19% by selecting data for staging on fast tier storage. We also show that Darshan has the potential of being used as a runtime library for profiling and providing information for future optimization. △ Less

Submitted 11 August, 2020; v1 submitted 10 August, 2020; originally announced August 2020.

Comments: Accepted for publication at the 2020 International Conference on Cluster Computing (CLUSTER 2020)

arXiv:2005.13425 [pdf]

Optimization of Tensor-product Operations in Nekbone on GPUs

Authors: Martin Karp, Niclas Jansson, Artur Podobas, Philipp Schlatter, Stefano Markidis

Abstract: In the CFD solver Nek5000, the computation is dominated by the evaluation of small tensor operations. Nekbone is a proxy app for Nek5000 and has previously been ported to GPUs with a mixed OpenACC and CUDA approach. In this work, we continue this effort and optimize the main tensor-product operation in Nekbone further. Our optimization is done in CUDA and uses a different, 2D, thread structure to… ▽ More In the CFD solver Nek5000, the computation is dominated by the evaluation of small tensor operations. Nekbone is a proxy app for Nek5000 and has previously been ported to GPUs with a mixed OpenACC and CUDA approach. In this work, we continue this effort and optimize the main tensor-product operation in Nekbone further. Our optimization is done in CUDA and uses a different, 2D, thread structure to make the computations layer by layer. This enables us to use loop unrolling as well as utilize registers and shared memory efficiently. Our implementation is then compared on both the Pascal and Volta GPU architectures to previous GPU versions of Nekbone as well as a measured roofline. The results show that our implementation outperforms previous GPU Nekbone implementations by 6-10%. Compared to the measured roofline, we obtain 77 - 92% of the peak performance for both Nvidia P100 and V100 GPUs for inputs with 1024 - 4096 elements and polynomial degree 9. △ Less

Submitted 27 May, 2020; originally announced May 2020.

Comments: 4 pages, 4 figures

ACM Class: G.4; J.2

arXiv:1910.09598 [pdf, other]

doi 10.1109/MCHPC49590.2019.00014

Performance Evaluation of Advanced Features in CUDA Unified Memory

Authors: Steven W. D. Chien, Ivy B. Peng, Stefano Markidis

Abstract: CUDA Unified Memory improves the GPU programmability and also enables GPU memory oversubscription. Recently, two advanced memory features, memory advises and asynchronous prefetch, have been introduced. In this work, we evaluate the new features on two platforms that feature different CPUs, GPUs, and interconnects. We derive a benchmark suite for the experiments and stress the memory system to eva… ▽ More CUDA Unified Memory improves the GPU programmability and also enables GPU memory oversubscription. Recently, two advanced memory features, memory advises and asynchronous prefetch, have been introduced. In this work, we evaluate the new features on two platforms that feature different CPUs, GPUs, and interconnects. We derive a benchmark suite for the experiments and stress the memory system to evaluate both in-memory and oversubscription performance. The results show that memory advises on the Intel-Volta/Pascal-PCIe platform bring negligible improvement for in-memory executions. However, when GPU memory is oversubscribed by about 50%, using memory advises results in up to 25% performance improvement compared to the basic CUDA Unified Memory. In contrast, the Power9-Volta-NVLink platform can substantially benefit from memory advises, achieving up to 34% performance gain for in-memory executions. However, when GPU memory is oversubscribed on this platform, using memory advises increases GPU page faults and results in considerable performance loss. The CUDA prefetch also shows different performance impact on the two platforms. It improves performance by up to 50% on the Intel-Volta/Pascal-PCI-E platform but brings little benefit to the Power9-Volta-NVLink platform. △ Less

Submitted 21 October, 2019; originally announced October 2019.

Comments: Accepted for publication at Workshop on Memory Centric High Performance Computing (MCHPC'19) in SC19

arXiv:1908.05715 [pdf, other]

doi 10.1029/2021JA029620

Automated classification of plasma regions using 3D particle energy distributions

Authors: Vyacheslav Olshevsky, Yuri V. Khotyaintsev, Ahmad Lalti, Andrey Divin, Gian Luca Delzanno, Sven Anderzen, Pawel Herman, Steven W. D. Chien, Levon Avanov, Andrew P. Dimmock, Stefano Markidis

Abstract: We investigate the properties of the ion sky maps produced by the Dual Ion Spectrometers (DIS) from the Fast Plasma Investigation (FPI). We have trained a convolutional neural network classifier to predict four regions crossed by the MMS on the dayside magnetosphere: solar wind, ion foreshock, magnetosheath, and magnetopause using solely DIS spectrograms. The accuracy of the classifier is >98%. We… ▽ More We investigate the properties of the ion sky maps produced by the Dual Ion Spectrometers (DIS) from the Fast Plasma Investigation (FPI). We have trained a convolutional neural network classifier to predict four regions crossed by the MMS on the dayside magnetosphere: solar wind, ion foreshock, magnetosheath, and magnetopause using solely DIS spectrograms. The accuracy of the classifier is >98%. We use the classifier to detect mixed plasma regions, in particular to find the bow shock regions. A similar approach can be used to identify the magnetopause crossings and reveal regions prone to magnetic reconnection. Data processing through the trained classifier is fast and efficient and thus can be used for classification for the whole MMS database. △ Less

Submitted 21 September, 2021; v1 submitted 15 August, 2019; originally announced August 2019.

Comments: Accepted to JGR: Space Physics

arXiv:1907.05917 [pdf, other]

doi 10.1007/978-3-030-43229-4_26

Posit NPB: Assessing the Precision Improvement in HPC Scientific Applications

Authors: Steven W. D. Chien, Ivy B. Peng, Stefano Markidis

Abstract: Floating-point operations can significantly impact the accuracy and performance of scientific applications on large-scale parallel systems. Recently, an emerging floating-point format called Posit has attracted attention as an alternative to the standard IEEE floating-point formats because it could enable higher precision than IEEE formats using the same number of bits. In this work, we first expl… ▽ More Floating-point operations can significantly impact the accuracy and performance of scientific applications on large-scale parallel systems. Recently, an emerging floating-point format called Posit has attracted attention as an alternative to the standard IEEE floating-point formats because it could enable higher precision than IEEE formats using the same number of bits. In this work, we first explored the feasibility of Posit encoding in representative HPC applications by providing a 32-bit Posit NAS Parallel Benchmark (NPB) suite. Then, we evaluate the accuracy improvement in different HPC kernels compared to the IEEE 754 format. Our results indicate that using Posit encoding achieves optimized precision, ranging from 0.6 to 1.4 decimal digit, for all tested kernels and proxy-applications. Also, we quantified the overhead of the current software implementation of Posit encoding as 4x-19x that of IEEE 754 hardware implementation. Our study highlights the potential of hardware implementations of Posit to benefit a broad range of HPC applications. △ Less

Submitted 12 July, 2019; originally announced July 2019.

Comments: Accepted for publication in PPAM 2019 conference

arXiv:1904.03684 [pdf, other]

doi 10.1007/978-3-030-22750-0_58

Multi-GPU Acceleration of the iPIC3D Implicit Particle-in-Cell Code

Authors: Chaitanya Prasad Sishtla, Steven W. D. Chien, Vyacheslav Olshevsky, Erwin Laure, Stefano Markidis

Abstract: iPIC3D is a widely used massively parallel Particle-in-Cell code for the simulation of space plasmas. However, its current implementation does not support execution on multiple GPUs. In this paper, we describe the porting of iPIC3D particle mover to GPUs and the optimization steps to increase the performance and parallel scaling on multiple GPUs. We analyze the strong scaling of the mover on two G… ▽ More iPIC3D is a widely used massively parallel Particle-in-Cell code for the simulation of space plasmas. However, its current implementation does not support execution on multiple GPUs. In this paper, we describe the porting of iPIC3D particle mover to GPUs and the optimization steps to increase the performance and parallel scaling on multiple GPUs. We analyze the strong scaling of the mover on two GPU clusters and evaluate its performance and acceleration. The optimized GPU version which uses pinned memory and asynchronous data prefetching outperform their corresponding CPU versions by 5-10x on two different systems equipped with NVIDIA K80 and V100 GPUs. △ Less

Submitted 7 April, 2019; originally announced April 2019.

Comments: Accepted for publication in ICCS 2019

arXiv:1903.04364 [pdf, other]

doi 10.1109/IPDPSW.2019.00092

TensorFlow Doing HPC

Authors: Steven W. D. Chien, Stefano Markidis, Vyacheslav Olshevsky, Yaroslav Bulatov, Erwin Laure, Jeffrey S. Vetter

Abstract: TensorFlow is a popular emerging open-source programming framework supporting the execution of distributed applications on heterogeneous hardware. While TensorFlow has been initially designed for developing Machine Learning (ML) applications, in fact TensorFlow aims at supporting the development of a much broader range of application kinds that are outside the ML domain and can possibly include HP… ▽ More TensorFlow is a popular emerging open-source programming framework supporting the execution of distributed applications on heterogeneous hardware. While TensorFlow has been initially designed for developing Machine Learning (ML) applications, in fact TensorFlow aims at supporting the development of a much broader range of application kinds that are outside the ML domain and can possibly include HPC applications. However, very few experiments have been conducted to evaluate TensorFlow performance when running HPC workloads on supercomputers. This work addresses this lack by designing four traditional HPC benchmark applications: STREAM, matrix-matrix multiply, Conjugate Gradient (CG) solver and Fast Fourier Transform (FFT). We analyze their performance on two supercomputers with accelerators and evaluate the potential of TensorFlow for developing HPC applications. Our tests show that TensorFlow can fully take advantage of high performance networks and accelerators on supercomputers. Running our TensorFlow STREAM benchmark, we obtain over 50% of theoretical communication bandwidth on our testing platform. We find an approximately 2x, 1.7x and 1.8x performance improvement when increasing the number of GPUs from two to four in the matrix-matrix multiply, CG and FFT applications respectively. All our performance results demonstrate that TensorFlow has high potential of emerging also as HPC programming framework for heterogeneous supercomputers. △ Less

Submitted 11 March, 2019; originally announced March 2019.

Comments: Accepted for publication at The Ninth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES'19)

arXiv:1810.04150 [pdf, other]

Exploring the Vision Processing Unit as Co-processor for Inference

Authors: Sergio Rivas-Gomez, Antonio J. Peña, David Moloney, Erwin Laure, Stefano Markidis

Abstract: The success of the exascale supercomputer is largely debated to remain dependent on novel breakthroughs in technology that effectively reduce the power consumption and thermal dissipation requirements. In this work, we consider the integration of co-processors in high-performance computing (HPC) to enable low-power, seamless computation offloading of certain operations. In particular, we explore t… ▽ More The success of the exascale supercomputer is largely debated to remain dependent on novel breakthroughs in technology that effectively reduce the power consumption and thermal dissipation requirements. In this work, we consider the integration of co-processors in high-performance computing (HPC) to enable low-power, seamless computation offloading of certain operations. In particular, we explore the so-called Vision Processing Unit (VPU), a highly-parallel vector processor with a power envelope of less than 1W. We evaluate this chip during inference using a pre-trained GoogLeNet convolutional network model and a large image dataset from the ImageNet ILSVRC challenge. Preliminary results indicate that a multi-VPU configuration provides similar performance compared to reference CPU and GPU implementations, while reducing the thermal-design power (TDP) up to 8x in comparison. △ Less

Submitted 9 October, 2018; originally announced October 2018.

arXiv:1810.04146 [pdf, other]

Decoupled Strategy for Imbalanced Workloads in MapReduce Frameworks

Authors: Sergio Rivas-Gomez, Sai Narasimhamurthy, Keeran Brabazon, Oliver Perks, Erwin Laure, Stefano Markidis

Abstract: In this work, we consider the integration of MPI one-sided communication and non-blocking I/O in HPC-centric MapReduce frameworks. Using a decoupled strategy, we aim to overlap the Map and Reduce phases of the algorithm by allowing processes to communicate and synchronize using solely one-sided operations. Hence, we effectively increase the performance in situations where the workload per process… ▽ More In this work, we consider the integration of MPI one-sided communication and non-blocking I/O in HPC-centric MapReduce frameworks. Using a decoupled strategy, we aim to overlap the Map and Reduce phases of the algorithm by allowing processes to communicate and synchronize using solely one-sided operations. Hence, we effectively increase the performance in situations where the workload per process is unexpectedly unbalanced. Using a Word-Count implementation and a large dataset from the Purdue MapReduce Benchmarks Suite (PUMA), we demonstrate that our approach can provide up to 23% performance improvement on average compared to a reference MapReduce implementation that uses state-of-the-art MPI collective communication and I/O. △ Less

Submitted 9 October, 2018; originally announced October 2018.

arXiv:1810.04110 [pdf, other]

MPI Windows on Storage for HPC Applications

Authors: Sergio Rivas-Gomez, Roberto Gioiosa, Ivy Bo Peng, Gokcen Kestor, Sai Narasimhamurthy, Erwin Laure, Stefano Markidis

Abstract: Upcoming HPC clusters will feature hybrid memories and storage devices per compute node. In this work, we propose to use the MPI one-sided communication model and MPI windows as unique interface for programming memory and storage. We describe the design and implementation of MPI storage windows, and present its benefits for out-of-core execution, parallel I/O and fault-tolerance. In addition, we e… ▽ More Upcoming HPC clusters will feature hybrid memories and storage devices per compute node. In this work, we propose to use the MPI one-sided communication model and MPI windows as unique interface for programming memory and storage. We describe the design and implementation of MPI storage windows, and present its benefits for out-of-core execution, parallel I/O and fault-tolerance. In addition, we explore the integration of heterogeneous window allocations, where memory and storage share a unified virtual address space. When performing large, irregular memory operations, we verify that MPI windows on local storage incurs a 55% performance penalty on average. When using a Lustre parallel file system, asymmetric performance is observed with over 90% degradation in writing operations. Nonetheless, experimental results of a Distributed Hash Table, the HACC I/O kernel mini-application, and a novel MapReduce implementation based on the use of MPI one-sided communication, indicate that the overall penalty of MPI windows on storage can be negligible in most cases in real-world applications. △ Less

Submitted 9 October, 2018; originally announced October 2018.

arXiv:1810.03035 [pdf, other]

doi 10.1109/PDSW-DISCS.2018.00011

Characterizing Deep-Learning I/O Workloads in TensorFlow

Authors: Steven W. D. Chien, Stefano Markidis, Chaitanya Prasad Sishtla, Luis Santos, Pawel Herman, Sai Narasimhamurthy, Erwin Laure

Abstract: The performance of Deep-Learning (DL) computing frameworks rely on the performance of data ingestion and checkpointing. In fact, during the training, a considerable high number of relatively small files are first loaded and pre-processed on CPUs and then moved to accelerator for computation. In addition, checkpointing and restart operations are carried out to allow DL computing frameworks to resta… ▽ More The performance of Deep-Learning (DL) computing frameworks rely on the performance of data ingestion and checkpointing. In fact, during the training, a considerable high number of relatively small files are first loaded and pre-processed on CPUs and then moved to accelerator for computation. In addition, checkpointing and restart operations are carried out to allow DL computing frameworks to restart quickly from a checkpoint. Because of this, I/O affects the performance of DL applications. In this work, we characterize the I/O performance and scaling of TensorFlow, an open-source programming framework developed by Google and specifically designed for solving DL problems. To measure TensorFlow I/O performance, we first design a micro-benchmark to measure TensorFlow reads, and then use a TensorFlow mini-application based on AlexNet to measure the performance cost of I/O and checkpointing in TensorFlow. To improve the checkpointing performance, we design and implement a burst buffer. We find that increasing the number of threads increases TensorFlow bandwidth by a maximum of 2.3x and 7.8x on our benchmark environments. The use of the tensorFlow prefetcher results in a complete overlap of computation on accelerator and input pipeline on CPU eliminating the effective cost of I/O on the overall performance. The use of a burst buffer to checkpoint to a fast small capacity storage and copy asynchronously the checkpoints to a slower large capacity storage resulted in a performance improvement of 2.6x with respect to checkpointing directly to slower storage on our benchmark environment. △ Less

Submitted 6 October, 2018; originally announced October 2018.

Comments: Accepted for publication at pdsw-DISCS 2018

arXiv:1807.03632 [pdf, other]

doi 10.1145/3203217.3205341

The SAGE Project: a Storage Centric Approach for Exascale Computing

Authors: Sai Narasimhamurthy, Nikita Danilov, Sining Wu, Ganesan Umanesan, Steven Wei-der Chien, Sergio Rivas-Gomez, Ivy Bo Peng, Erwin Laure, Shaun de Witt, Dirk Pleiter, Stefano Markidis

Abstract: SAGE (Percipient StorAGe for Exascale Data Centric Computing) is a European Commission funded project towards the era of Exascale computing. Its goal is to design and implement a Big Data/Extreme Computing (BDEC) capable infrastructure with associated software stack. The SAGE system follows a "storage centric" approach as it is capable of storing and processing large data volumes at the Exascale r… ▽ More SAGE (Percipient StorAGe for Exascale Data Centric Computing) is a European Commission funded project towards the era of Exascale computing. Its goal is to design and implement a Big Data/Extreme Computing (BDEC) capable infrastructure with associated software stack. The SAGE system follows a "storage centric" approach as it is capable of storing and processing large data volumes at the Exascale regime. SAGE addresses the convergence of Big Data Analysis and HPC in an era of next-generation data centric computing. This convergence is driven by the proliferation of massive data sources, such as large, dispersed scientific instruments and sensors where data needs to be processed, analyzed and integrated into simulations to derive scientific and innovative insights. A first prototype of the SAGE system has been been implemented and installed at the Julich Supercomputing Center. The SAGE storage system consists of multiple types of storage device technologies in a multi-tier I/O hierarchy, including flash, disk, and non-volatile memory technologies. The main SAGE software component is the Seagate Mero Object Storage that is accessible via the Clovis API and higher level interfaces. The SAGE project also includes scientific applications for the validation of the SAGE concepts. The objective of this paper is to present the SAGE project concepts, the prototype of the SAGE platform and discuss the software architecture of the SAGE system. △ Less

Submitted 6 July, 2018; originally announced July 2018.

Comments: Submitted to Computing Frontiers 2018. arXiv admin note: substantial text overlap with arXiv:1805.00556

Showing 1–50 of 59 results for author: Markidis, S