-
Interactive and Urgent HPC: Challenges and Opportunities
Authors:
Albert Reuther,
Nick Brown,
William Arndt,
Johannes Blaschke,
Christian Boehme,
Antony Chazapis,
Bjoern Enders,
Robert Henschel,
Julian Kunkel,
Maxime Martinasso
Abstract:
As a broader set of applications from simulations to data analysis and machine learning require more parallel computational capability, the demand for interactive and urgent high performance computing (HPC) continues to increase. This paper overviews the progress made so far and elucidates the challenges and opportunities for greater integration of interactive and urgent HPC policies, techniques,…
▽ More
As a broader set of applications from simulations to data analysis and machine learning require more parallel computational capability, the demand for interactive and urgent high performance computing (HPC) continues to increase. This paper overviews the progress made so far and elucidates the challenges and opportunities for greater integration of interactive and urgent HPC policies, techniques, and technologies into HPC ecosystems.
△ Less
Submitted 25 January, 2024;
originally announced January 2024.
-
Oceananigans.jl: A model that achieves breakthrough resolution, memory and energy efficiency in global ocean simulations
Authors:
Simone Silvestri,
Gregory Wagner,
Christopher Hill,
Matin Raayai Ardakani,
Johannes Blaschke,
Jean-Michel Campin,
Valentin Churavy,
Navid Constantinou,
Alan Edelman,
John Marshall,
Ali Ramadhan,
Andre Souza,
Raffaele Ferrari
Abstract:
Climate models must simulate hundreds of future scenarios for hundreds of years at coarse resolutions, and a handful of high-resolution decadal simulations to resolve localized extreme events. Using Oceananigans.jl, written from scratch in Julia, we report several achievements: First, a global ocean simulation with breakthrough horizontal resolution -- 488m -- reaching 15 simulated days per day (0…
▽ More
Climate models must simulate hundreds of future scenarios for hundreds of years at coarse resolutions, and a handful of high-resolution decadal simulations to resolve localized extreme events. Using Oceananigans.jl, written from scratch in Julia, we report several achievements: First, a global ocean simulation with breakthrough horizontal resolution -- 488m -- reaching 15 simulated days per day (0.04 simulated years per day; SYPD). Second, Oceananigans simulates the global ocean at 488m with breakthrough memory efficiency on just 768 Nvidia A100 GPUs, a fraction of the resources available on current and upcoming exascale supercomputers. Third, and arguably most significant for climate modeling, Oceananigans achieves breakthrough energy efficiency reaching 0.95 SYPD at 1.7 km on 576 A100s and 9.9 SYPD at 10 km on 68 A100s -- the latter representing the highest horizontal resolutions employed by current IPCC-class ocean models. Routine climate simulations with 10 km ocean components are within reach.
△ Less
Submitted 12 September, 2023;
originally announced September 2023.
-
Bridging HPC Communities through the Julia Programming Language
Authors:
Valentin Churavy,
William F Godoy,
Carsten Bauer,
Hendrik Ranocha,
Michael Schlottke-Lakemper,
Ludovic Räss,
Johannes Blaschke,
Mosè Giordano,
Erik Schnetter,
Samuel Omlin,
Jeffrey S. Vetter,
Alan Edelman
Abstract:
The Julia programming language has evolved into a modern alternative to fill existing gaps in scientific computing and data science applications. Julia leverages a unified and coordinated single-language and ecosystem paradigm and has a proven track record of achieving high performance without sacrificing user productivity. These aspects make Julia a viable alternative to high-performance computin…
▽ More
The Julia programming language has evolved into a modern alternative to fill existing gaps in scientific computing and data science applications. Julia leverages a unified and coordinated single-language and ecosystem paradigm and has a proven track record of achieving high performance without sacrificing user productivity. These aspects make Julia a viable alternative to high-performance computing's (HPC's) existing and increasingly costly many-body workflow composition strategy in which traditional HPC languages (e.g., Fortran, C, C++) are used for simulations, and higher-level languages (e.g., Python, R, MATLAB) are used for data analysis and interactive computing. Julia's rapid growth in language capabilities, package ecosystem, and community make it a promising universal language for HPC. This paper presents the views of a multidisciplinary group of researchers from academia, government, and industry that advocate for an HPC software development paradigm that emphasizes developer productivity, workflow portability, and low barriers for entry. We believe that the Julia programming language, its ecosystem, and its community provide modern and powerful capabilities that enable this group's objectives. Crucially, we believe that Julia can provide a feasible and less costly approach to programming scientific applications and workflows that target HPC facilities. In this work, we examine the current practice and role of Julia as a common, end-to-end programming model to address major challenges in scientific reproducibility, data-driven AI/machine learning, co-design and workflows, scalability and performance portability in heterogeneous computing, network communication, data management, and community education. As a result, the diversification of current investments to fulfill the needs of the upcoming decade is crucial as more supercomputing centers prepare for the exascale era.
△ Less
Submitted 10 November, 2022; v1 submitted 4 November, 2022;
originally announced November 2022.
-
The LBNL Superfacility Project Report
Authors:
Deborah Bard,
Cory Snavely,
Lisa Gerhardt,
Jason Lee,
Becci Totzke,
Katie Antypas,
William Arndt,
Johannes Blaschke,
Suren Byna,
Ravi Cheema,
Shreyas Cholia,
Mark Day,
Bjoern Enders,
Aditi Gaur,
Annette Greiner,
Taylor Groves,
Mariam Kiran,
Quincey Koziol,
Tom Lehman,
Kelly Rowland,
Chris Samuel,
Ashwin Selvarajan,
Alex Sim,
David Skinner,
Laurie Stephey
, et al. (2 additional authors not shown)
Abstract:
The Superfacility model is designed to leverage HPC for experimental science. It is more than simply a model of connected experiment, network, and HPC facilities; it encompasses the full ecosystem of infrastructure, software, tools, and expertise needed to make connected facilities easy to use. The three-year Lawrence Berkeley National Laboratory (LBNL) Superfacility project was initiated in 2019…
▽ More
The Superfacility model is designed to leverage HPC for experimental science. It is more than simply a model of connected experiment, network, and HPC facilities; it encompasses the full ecosystem of infrastructure, software, tools, and expertise needed to make connected facilities easy to use. The three-year Lawrence Berkeley National Laboratory (LBNL) Superfacility project was initiated in 2019 to coordinate work being performed at LBNL to support this model, and to provide a coherent and comprehensive set of science requirements to drive existing and new work.
A key component of the project was the in-depth engagements with eight science teams that represent challenging use cases across the DOE Office of Science. By the close of the project, we met our project goal by enabling our science application engagements to demonstrate automated pipelines that analyze data from remote facilities at large scale, without routine human intervention. In several cases, we have gone beyond demonstrations and now provide production-level services. To achieve this goal, the Superfacility team developed tools, infrastructure, and policies for near-real-time computing support, dynamic high-performance networking, data management and movement tools, API-driven automation, HPC-scale notebooks via Jupyter, authentication using Federated Identity and container-based edge services supported.
The lessons we learned during this project provide a valuable model for future large, complex, cross-disciplinary collaborations. There is a pressing need for a coherent computing infrastructure across national facilities, and LBNL's Superfacility project is a unique model for success in tackling the challenges that will be faced in hardware, software, policies, and services across multiple science domains.
△ Less
Submitted 27 June, 2022; v1 submitted 23 June, 2022;
originally announced June 2022.
-
Accelerating X-Ray Tracing for Exascale Systems using Kokkos
Authors:
Felix Wittwer,
Nicholas K. Sauter,
Derek Mendez,
Billy K. Poon,
Aaron S. Brewster,
James M. Holton,
Michael E. Wall,
William E. Hart,
Deborah J. Bard,
Johannes P. Blaschke
Abstract:
The upcoming exascale computing systems Frontier and Aurora will draw much of their computing power from GPU accelerators. The hardware for these systems will be provided by AMD and Intel, respectively, each supporting their own GPU programming model. The challenge for applications that harness one of these exascale systems will be to avoid lock-in and to preserve performance portability.
We rep…
▽ More
The upcoming exascale computing systems Frontier and Aurora will draw much of their computing power from GPU accelerators. The hardware for these systems will be provided by AMD and Intel, respectively, each supporting their own GPU programming model. The challenge for applications that harness one of these exascale systems will be to avoid lock-in and to preserve performance portability.
We report here on our results of using Kokkos to accelerate a real-world application on NERSC's Perlmutter Phase 1 (using NVIDIA A100 accelerators) and the testbed system for OLCF's Frontier (using AMD MI250X). By porting to Kokkos, we were able to successfully run the same X-ray tracing code on both systems and achieved speed-ups between 13% and 66% compared to the original CUDA code. These results are a highly encouraging demonstration of using Kokkos to accelerate production science code.
△ Less
Submitted 16 May, 2022;
originally announced May 2022.
-
Real-Time XFEL Data Analysis at SLAC and NERSC: a Trial Run of Nascent Exascale Experimental Data Analysis
Authors:
Johannes P. Blaschke,
Aaron S. Brewster,
Daniel W. Paley,
Derek Mendez,
Asmit Bhowmick,
Nicholas K. Sauter,
Wilko Kröger,
Murali Shankar,
Bjoern Enders,
Deborah Bard
Abstract:
X-ray scattering experiments using Free Electron Lasers (XFELs) are a powerful tool to determine the molecular structure and function of unknown samples (such as COVID-19 viral proteins). XFEL experiments are a challenge to computing in two ways: i) due to the high cost of running XFELs, a fast turnaround time from data acquisition to data analysis is essential to make informed decisions on experi…
▽ More
X-ray scattering experiments using Free Electron Lasers (XFELs) are a powerful tool to determine the molecular structure and function of unknown samples (such as COVID-19 viral proteins). XFEL experiments are a challenge to computing in two ways: i) due to the high cost of running XFELs, a fast turnaround time from data acquisition to data analysis is essential to make informed decisions on experimental protocols; ii) data collection rates are growing exponentially, requiring new scalable algorithms. Here we report our experiences analyzing data from two experiments at the Linac Coherent Light Source (LCLS) during September 2020. Raw data were analyzed on NERSC's Cori XC40 system, using the Superfacility paradigm: our workflow automatically moves raw data between LCLS and NERSC, where it is analyzed using the software package CCTBX. We achieved real time data analysis with a turnaround time from data acquisition to full molecular reconstruction in as little as 10 min -- sufficient time for the experiment's operators to make informed decisions. By hosting the data analysis on Cori, and by automating LCLS-NERSC interoperability, we achieved a data analysis rate which matches the data acquisition rate. Completing data analysis with 10 mins is a first for XFEL experiments and an important milestone if we are to keep up with data collection trends.
△ Less
Submitted 31 December, 2023; v1 submitted 21 June, 2021;
originally announced June 2021.
-
Accelerating GMRES with Deep Learning in Real-Time
Authors:
Kevin Luna,
Katherine Klymko,
Johannes P. Blaschke
Abstract:
GMRES is a powerful numerical solver used to find solutions to extremely large systems of linear equations. These systems of equations appear in many applications in science and engineering. Here we demonstrate a real-time machine learning algorithm that can be used to accelerate the time-to-solution for GMRES. Our framework is novel in that is integrates the deep learning algorithm in an in situ…
▽ More
GMRES is a powerful numerical solver used to find solutions to extremely large systems of linear equations. These systems of equations appear in many applications in science and engineering. Here we demonstrate a real-time machine learning algorithm that can be used to accelerate the time-to-solution for GMRES. Our framework is novel in that is integrates the deep learning algorithm in an in situ fashion: the AI-accelerator gradually learns how to optimizes the time to solution without requiring user input (such as a pre-trained data set). We describe how our algorithm collects data and optimizes GMRES. We demonstrate our algorithm by implementing an accelerated (MLGMRES) solver in Python. We then use MLGMRES to accelerate a solver for the Poisson equation -- a class of linear problems that appears in may applications.
Informed by the properties of formal solutions to the Poisson equation, we test the performance of different neural networks. Our key takeaway is that networks which are capable of learning non-local relationships perform well, without needing to be scaled with the input problem size, making them good candidates for the extremely large problems encountered in high-performance computing. For the inputs studied, our method provides a roughly 2$\times$ acceleration.
△ Less
Submitted 19 March, 2021;
originally announced March 2021.
-
cuFINUFFT: a load-balanced GPU library for general-purpose nonuniform FFTs
Authors:
Yu-hsuan Shih,
Garrett Wright,
Joakim Andén,
Johannes Blaschke,
Alex H. Barnett
Abstract:
Nonuniform fast Fourier transforms dominate the computational cost in many applications including image reconstruction and signal processing. We thus present a general-purpose GPU-based CUDA library for type 1 (nonuniform to uniform) and type 2 (uniform to nonuniform) transforms in dimensions 2 and 3, in single or double precision. It achieves high performance for a given user-requested accuracy,…
▽ More
Nonuniform fast Fourier transforms dominate the computational cost in many applications including image reconstruction and signal processing. We thus present a general-purpose GPU-based CUDA library for type 1 (nonuniform to uniform) and type 2 (uniform to nonuniform) transforms in dimensions 2 and 3, in single or double precision. It achieves high performance for a given user-requested accuracy, regardless of the distribution of nonuniform points, via cache-aware point reordering, and load-balanced blocked spreading in shared memory. At low accuracies, this gives on-GPU throughputs around $10^9$ nonuniform points per second, and (even including host-device transfer) is typically 4-10$\times$ faster than the latest parallel CPU code FINUFFT (at 28 threads). It is competitive with two established GPU codes, being up to 90$\times$ faster at high accuracy and/or type 1 clustered point distributions. Finally we demonstrate a 5-12$\times$ speedup versus CPU in an X-ray diffraction 3D iterative reconstruction task at $10^{-12}$ accuracy, observing excellent multi-GPU weak scaling up to one rank per GPU.
△ Less
Submitted 25 March, 2021; v1 submitted 16 February, 2021;
originally announced February 2021.