subscribe to arXiv mailings

Evaluation of Programming Models and Performance for Stencil Computation on Current GPU Architectures

Authors: Baodi Shan, Mauricio Araya-Polo

Abstract: Accelerated computing is widely used in high-performance computing. Therefore, it is crucial to experiment and discover how to better utilize GPUGPUs latest generations on relevant applications. In this paper, we present results and share insights about highly tuned stencil-based kernels for NVIDIA Ampere (A100) and Hopper (GH200) architectures. Performance results yield useful insights into the b… ▽ More Accelerated computing is widely used in high-performance computing. Therefore, it is crucial to experiment and discover how to better utilize GPUGPUs latest generations on relevant applications. In this paper, we present results and share insights about highly tuned stencil-based kernels for NVIDIA Ampere (A100) and Hopper (GH200) architectures. Performance results yield useful insights into the behavior of this type of algorithms for these new accelerators. This knowledge can be leveraged by many scientific applications which involve stencils computations. Further, evaluation of three different programming models: CUDA, OpenACC, and OpenMP target offloading is conducted on aforementioned accelerators. We extensively study the performance and portability of various kernels under each programming model and provide corresponding optimization recommendations. Furthermore, we compare the performance of different programming models on the mentioned architectures. Up to 58% performance improvement was achieved against the previous GPGPU's architecture generation for an highly optimized kernel of the same class, and up to 42% for all classes. In terms of programming models, and keeping portability in mind, optimized OpenACC implementation outperforms OpenMP implementation by 33%. If portability is not a factor, our best tuned CUDA implementation outperforms the optimized OpenACC one by 2.1x. △ Less

Submitted 5 April, 2024; originally announced April 2024.

arXiv:2312.01267 [pdf, other]

Distributed Reinforcement Learning for Molecular Design: Antioxidant case

Authors: Huanyi Qin, Denis Akhiyarov, Sophie Loehle, Kenneth Chiu, Mauricio Araya-Polo

Abstract: Deep reinforcement learning has successfully been applied for molecular discovery as shown by the Molecule Deep Q-network (MolDQN) algorithm. This algorithm has challenges when applied to optimizing new molecules: training such a model is limited in terms of scalability to larger datasets and the trained model cannot be generalized to different molecules in the same dataset. In this paper, a distr… ▽ More Deep reinforcement learning has successfully been applied for molecular discovery as shown by the Molecule Deep Q-network (MolDQN) algorithm. This algorithm has challenges when applied to optimizing new molecules: training such a model is limited in terms of scalability to larger datasets and the trained model cannot be generalized to different molecules in the same dataset. In this paper, a distributed reinforcement learning algorithm for antioxidants, called DA-MolDQN is proposed to address these problems. State-of-the-art bond dissociation energy (BDE) and ionization potential (IP) predictors are integrated into DA-MolDQN, which are critical chemical properties while optimizing antioxidants. Training time is reduced by algorithmic improvements for molecular modifications. The algorithm is distributed, scalable for up to 512 molecules, and generalizes the model to a diverse set of molecules. The proposed models are trained with a proprietary antioxidant dataset. The results have been reproduced with both proprietary and public datasets. The proposed molecules have been validated with DFT simulations and a subset of them confirmed in public "unseen" datasets. In summary, DA-MolDQN is up to 100x faster than previous algorithms and can discover new optimized molecules from proprietary and public antioxidants. △ Less

Submitted 2 December, 2023; originally announced December 2023.

arXiv:2311.06297 [pdf, other]

STRIDE: Structure-guided Generation for Inverse Design of Molecules

Authors: Shehtab Zaman, Denis Akhiyarov, Mauricio Araya-Polo, Kenneth Chiu

Abstract: Machine learning and especially deep learning has had an increasing impact on molecule and materials design. In particular, given the growing access to an abundance of high-quality small molecule data for generative modeling for drug design, results for drug discovery have been promising. However, for many important classes of materials such as catalysts, antioxidants, and metal-organic frameworks… ▽ More Machine learning and especially deep learning has had an increasing impact on molecule and materials design. In particular, given the growing access to an abundance of high-quality small molecule data for generative modeling for drug design, results for drug discovery have been promising. However, for many important classes of materials such as catalysts, antioxidants, and metal-organic frameworks, such large datasets are not available. Such families of molecules with limited samples and structural similarities are especially prevalent for industrial applications. As is well-known, retraining and even fine-tuning are challenging on such small datasets. Novel, practically applicable molecules are most often derivatives of well-known molecules, suggesting approaches to addressing data scarcity. To address this problem, we introduce $\textbf{STRIDE}$, a generative molecule workflow that generates novel molecules with an unconditional generative model guided by known molecules without any retraining. We generate molecules outside of the training data from a highly specialized set of antioxidant molecules. Our generated molecules have on average 21.7% lower synthetic accessibility scores and also reduce ionization potential by 5.9% of generated molecules via guiding. △ Less

Submitted 6 November, 2023; originally announced November 2023.

arXiv:2311.00107 [pdf, other]

Deep Compressed Learning for 3D Seismic Inversion

Authors: Maayan Gelboim, Amir Adler, Yen Sun, Mauricio Araya-Polo

Abstract: We consider the problem of 3D seismic inversion from pre-stack data using a very small number of seismic sources. The proposed solution is based on a combination of compressed-sensing and machine learning frameworks, known as compressed-learning. The solution jointly optimizes a dimensionality reduction operator and a 3D inversion encoder-decoder implemented by a deep convolutional neural network… ▽ More We consider the problem of 3D seismic inversion from pre-stack data using a very small number of seismic sources. The proposed solution is based on a combination of compressed-sensing and machine learning frameworks, known as compressed-learning. The solution jointly optimizes a dimensionality reduction operator and a 3D inversion encoder-decoder implemented by a deep convolutional neural network (DCNN). Dimensionality reduction is achieved by learning a sparse binary sensing layer that selects a small subset of the available sources, then the selected data is fed to a DCNN to complete the regression task. The end-to-end learning process provides a reduction by an order-of-magnitude in the number of seismic records used during training, while preserving the 3D reconstruction quality comparable to that obtained by using the entire dataset. △ Less

Submitted 31 October, 2023; originally announced November 2023.

Comments: Presented at The International Meeting for Applied Geoscience & Energy (IMAGE23)

arXiv:2310.04430 [pdf, other]

Joint inversion of Time-Lapse Surface Gravity and Seismic Data for Monitoring of 3D CO$_2$ Plumes via Deep Learning

Authors: Adrian Celaya, Mauricio Araya-Polo

Abstract: We introduce a fully 3D, deep learning-based approach for the joint inversion of time-lapse surface gravity and seismic data for reconstructing subsurface density and velocity models. The target application of this proposed inversion approach is the prediction of subsurface CO2 plumes as a complementary tool for monitoring CO2 sequestration deployments. Our joint inversion technique outperforms de… ▽ More We introduce a fully 3D, deep learning-based approach for the joint inversion of time-lapse surface gravity and seismic data for reconstructing subsurface density and velocity models. The target application of this proposed inversion approach is the prediction of subsurface CO2 plumes as a complementary tool for monitoring CO2 sequestration deployments. Our joint inversion technique outperforms deep learning-based gravity-only and seismic-only inversion models, achieving improved density and velocity reconstruction, accurate segmentation, and higher R-squared coefficients. These results indicate that deep learning-based joint inversion is an effective tool for CO$_2$ storage monitoring. Future work will focus on validating our approach with larger datasets, simulations with other geological storage sites, and ultimately field data. △ Less

Submitted 24 September, 2023; originally announced October 2023.

arXiv:2309.04671 [pdf, other]

A Portable Framework for Accelerating Stencil Computations on Modern Node Architectures

Authors: Ryuichi Sai, John Mellor-Crummey, Jinfan Xu, Mauricio Araya-Polo

Abstract: Finite-difference methods based on high-order stencils are widely used in seismic simulations, weather forecasting, computational fluid dynamics, and other scientific applications. Achieving HPC-level stencil computations on one architecture is challenging, porting to other architectures without sacrificing performance requires significant effort, especially in this golden age of many distinctive… ▽ More Finite-difference methods based on high-order stencils are widely used in seismic simulations, weather forecasting, computational fluid dynamics, and other scientific applications. Achieving HPC-level stencil computations on one architecture is challenging, porting to other architectures without sacrificing performance requires significant effort, especially in this golden age of many distinctive architectures. To help developers achieve performance, portability, and productivity with stencil computations, we developed StencilPy. With StencilPy, developers write stencil computations in a high-level domain-specific language, which promotes productivity, while its backends generate efficient code for existing and emerging architectures, including modern many-core CPUs (such as AMD Genoa-X, Fujitsu A64FX, and Intel Sapphire Rapids), latest generations of GPUs (including NVIDIA H100 and A100, AMD MI200, and Intel Ponte Vecchio), and accelerators (including Cerebras and STX). StencilPy demonstrates promising performance results on par with hand-written code, maintains cross-architectural performance portability, and enhances productivity. Its modular design enables easy configuration, customization, and extension. A 25-point star-shaped stencil written in StencilPy is one-quarter of the length of a hand-crafted CUDA code and achieves similar performance on an NVIDIA H100 GPU. In addition, the same kernel written using our tool is 7x shorter than hand-optimized code written in Cerebras Software Language (CSL), and it delivers comparable performance that code on a Cerebras CS-2. △ Less

Submitted 7 July, 2024; v1 submitted 8 September, 2023; originally announced September 2023.

arXiv:2306.09648 [pdf, other]

Learning CO$_2$ plume migration in faulted reservoirs with Graph Neural Networks

Authors: Xin Ju, François P. Hamon, Gege Wen, Rayan Kanfar, Mauricio Araya-Polo, Hamdi A. Tchelepi

Abstract: Deep-learning-based surrogate models provide an efficient complement to numerical simulations for subsurface flow problems such as CO$_2$ geological storage. Accurately capturing the impact of faults on CO$_2$ plume migration remains a challenge for many existing deep learning surrogate models based on Convolutional Neural Networks (CNNs) or Neural Operators. We address this challenge with a graph… ▽ More Deep-learning-based surrogate models provide an efficient complement to numerical simulations for subsurface flow problems such as CO$_2$ geological storage. Accurately capturing the impact of faults on CO$_2$ plume migration remains a challenge for many existing deep learning surrogate models based on Convolutional Neural Networks (CNNs) or Neural Operators. We address this challenge with a graph-based neural model leveraging recent developments in the field of Graph Neural Networks (GNNs). Our model combines graph-based convolution Long-Short-Term-Memory (GConvLSTM) with a one-step GNN model, MeshGraphNet (MGN), to operate on complex unstructured meshes and limit temporal error accumulation. We demonstrate that our approach can accurately predict the temporal evolution of gas saturation and pore pressure in a synthetic reservoir with impermeable faults. Our results exhibit a better accuracy and a reduced temporal error accumulation compared to the standard MGN model. We also show the excellent generalizability of our algorithm to mesh configurations, boundary conditions, and heterogeneous permeability fields not included in the training set. This work highlights the potential of GNN-based methods to accurately and rapidly model subsurface flow with complex faults and fractures. △ Less

Submitted 16 June, 2023; originally announced June 2023.

arXiv:2304.11274 [pdf, other]

Massively Distributed Finite-Volume Flux Computation

Authors: Ryuichi Sai, Mathias Jacquelin, François P. Hamon, Mauricio Araya-Polo, Randolph R. Settgast

Abstract: Designing large-scale geological carbon capture and storage projects and ensuring safe long-term CO2 containment - as a climate change mitigation strategy - requires fast and accurate numerical simulations. These simulations involve solving complex PDEs governing subsurface fluid flow using implicit finite-volume schemes widely based on Two-Point Flux Approximation (TPFA). This task is computation… ▽ More Designing large-scale geological carbon capture and storage projects and ensuring safe long-term CO2 containment - as a climate change mitigation strategy - requires fast and accurate numerical simulations. These simulations involve solving complex PDEs governing subsurface fluid flow using implicit finite-volume schemes widely based on Two-Point Flux Approximation (TPFA). This task is computationally and memory expensive, especially when performed on highly detailed geomodels. In most current HPC architectures, memory hierarchy and data management mechanism are insufficient to overcome the challenges of large scale numerical simulations. Therefore, it is crucial to design algorithms that can exploit alternative and more balanced paradigms, such as dataflow and in-memory computing. This work introduces an algorithm for TPFA computations that leverages effectively a dataflow architecture, such as Cerebras CS2, which helps to significantly minimize memory bottlenecks. Our implementation achieves two orders of magnitude speedup compared to multiple reference implementations running on NVIDIA A100 GPUs. △ Less

Submitted 21 April, 2023; originally announced April 2023.

Comments: 10 pages excl. bibliography. Submitted to SuperComputing 2023

arXiv:2304.10867 [pdf, other]

doi 10.1109/QCE57702.2023.00046

Application of quantum-inspired generative models to small molecular datasets

Authors: C. Moussa, H. Wang, M. Araya-Polo, T. Bäck, V. Dunjko

Abstract: Quantum and quantum-inspired machine learning has emerged as a promising and challenging research field due to the increased popularity of quantum computing, especially with near-term devices. Theoretical contributions point toward generative modeling as a promising direction to realize the first examples of real-world quantum advantages from these technologies. A few empirical studies also demons… ▽ More Quantum and quantum-inspired machine learning has emerged as a promising and challenging research field due to the increased popularity of quantum computing, especially with near-term devices. Theoretical contributions point toward generative modeling as a promising direction to realize the first examples of real-world quantum advantages from these technologies. A few empirical studies also demonstrate such potential, especially when considering quantum-inspired models based on tensor networks. In this work, we apply tensor-network-based generative models to the problem of molecular discovery. In our approach, we utilize two small molecular datasets: a subset of $4989$ molecules from the QM9 dataset and a small in-house dataset of $516$ validated antioxidants from TotalEnergies. We compare several tensor network models against a generative adversarial network using different sample-based metrics, which reflect their learning performances on each task, and multiobjective performances using $3$ relevant molecular metrics per task. We also combined the output of the models and demonstrate empirically that such a combination can be beneficial, advocating for the unification of classical and quantum(-inspired) generative learning. △ Less

Submitted 21 April, 2023; originally announced April 2023.

Comments: First version

Journal ref: 2023 IEEE International Conference on Quantum Computing and Engineering (QCE), Bellevue, WA, USA, 2023, pp. 342-348

arXiv:2211.08506 [pdf, other]

ParticleGrid: Enabling Deep Learning using 3D Representation of Materials

Authors: Shehtab Zaman, Ethan Ferguson, Cecile Pereira, Denis Akhiyarov, Mauricio Araya-Polo, Kenneth Chiu

Abstract: From AlexNet to Inception, autoencoders to diffusion models, the development of novel and powerful deep learning models and learning algorithms has proceeded at breakneck speeds. In part, we believe that rapid iteration of model architecture and learning techniques by a large community of researchers over a common representation of the underlying entities has resulted in transferable deep learning… ▽ More From AlexNet to Inception, autoencoders to diffusion models, the development of novel and powerful deep learning models and learning algorithms has proceeded at breakneck speeds. In part, we believe that rapid iteration of model architecture and learning techniques by a large community of researchers over a common representation of the underlying entities has resulted in transferable deep learning knowledge. As a result, model scale, accuracy, fidelity, and compute performance have dramatically increased in computer vision and natural language processing. On the other hand, the lack of a common representation for chemical structure has hampered similar progress. To enable transferable deep learning, we identify the need for a robust 3-dimensional representation of materials such as molecules and crystals. The goal is to enable both materials property prediction and materials generation with 3D structures. While computationally costly, such representations can model a large set of chemical structures. We propose $\textit{ParticleGrid}$, a SIMD-optimized library for 3D structures, that is designed for deep learning applications and to seamlessly integrate with deep learning frameworks. Our highly optimized grid generation allows for generating grids on the fly on the CPU, reducing storage and GPU compute and memory requirements. We show the efficacy of 3D grids generated via $\textit{ParticleGrid}$ and accurately predict molecular energy properties using a 3D convolutional neural network. Our model is able to get 0.006 mean square error and nearly match the values calculated using computationally costly density functional theory at a fraction of the time. △ Less

Submitted 15 November, 2022; originally announced November 2022.

Comments: Published in the 2022 IEEE 18th International Conference on eScience (eScience)

arXiv:2209.02850 [pdf, other]

doi 10.1109/TGRS.2023.3273149

Inversion of Time-Lapse Surface Gravity Data for Detection of 3D CO$_2$ Plumes via Deep Learning

Authors: Adrian Celaya, Bertrand Denel, Yen Sun, Mauricio Araya-Polo, Antony Price

Abstract: We introduce three algorithms that invert simulated gravity data to 3D subsurface rock/flow properties. The first algorithm is a data-driven, deep learning-based approach, the second mixes a deep learning approach with physical modeling into a single workflow, and the third considers the time dependence of surface gravity monitoring. The target application of these proposed algorithms is the predi… ▽ More We introduce three algorithms that invert simulated gravity data to 3D subsurface rock/flow properties. The first algorithm is a data-driven, deep learning-based approach, the second mixes a deep learning approach with physical modeling into a single workflow, and the third considers the time dependence of surface gravity monitoring. The target application of these proposed algorithms is the prediction of subsurface CO$_2$ plumes as a complementary tool for monitoring CO$_2$ sequestration deployments. Each proposed algorithm outperforms traditional inversion methods and produces high-resolution, 3D subsurface reconstructions in near real-time. Our proposed methods achieve Dice scores of up to 0.8 for predicted plume geometry and near perfect data misfit in terms of $μ$Gals. These results indicate that combining 4D surface gravity monitoring with deep learning techniques represents a low-cost, rapid, and non-intrusive method for monitoring CO$_2$ storage sites. △ Less

Submitted 6 September, 2022; originally announced September 2022.

arXiv:2207.14789 [pdf, other]

Encoder-Decoder Architecture for 3D Seismic Inversion

Authors: Maayan Gelboim, Amir Adler, Yen Sun, Mauricio Araya-Polo

Abstract: Inverting seismic data to build 3D geological structures is a challenging task due to the overwhelming amount of acquired seismic data, and the very-high computational load due to iterative numerical solutions of the wave equation, as required by industry-standard tools such as Full Waveform Inversion (FWI). For example, in an area with surface dimensions of 4.5km $\times$ 4.5km, hundreds of seism… ▽ More Inverting seismic data to build 3D geological structures is a challenging task due to the overwhelming amount of acquired seismic data, and the very-high computational load due to iterative numerical solutions of the wave equation, as required by industry-standard tools such as Full Waveform Inversion (FWI). For example, in an area with surface dimensions of 4.5km $\times$ 4.5km, hundreds of seismic shot-gather cubes are required for 3D model reconstruction, leading to Terabytes of recorded data. This paper presents a deep learning solution for the reconstruction of realistic 3D models in the presence of field noise recorded in seismic surveys. We implement and analyze a convolutional encoder-decoder architecture that efficiently processes the entire collection of hundreds of seismic shot-gather cubes. The proposed solution demonstrates that realistic 3D models can be reconstructed with a structural similarity index measure (SSIM) of 0.8554 (out of 1.0) in the presence of field noise at 10dB signal-to-noise ratio. △ Less

Submitted 29 July, 2022; originally announced July 2022.

arXiv:2204.03775 [pdf, other]

Massively scalable stencil algorithm

Authors: Mathias Jacquelin, Mauricio Araya-Polo, Jie Meng

Abstract: Stencil computations lie at the heart of many scientific and industrial applications. Unfortunately, stencil algorithms perform poorly on machines with cache based memory hierarchy, due to low re-use of memory accesses. This work shows that for stencil computation a novel algorithm that leverages a localized communication strategy effectively exploits the Cerebras WSE-2, which has no cache hierarc… ▽ More Stencil computations lie at the heart of many scientific and industrial applications. Unfortunately, stencil algorithms perform poorly on machines with cache based memory hierarchy, due to low re-use of memory accesses. This work shows that for stencil computation a novel algorithm that leverages a localized communication strategy effectively exploits the Cerebras WSE-2, which has no cache hierarchy. This study focuses on a 25-point stencil finite-difference method for the 3D wave equation, a kernel frequently used in earth modeling as numerical simulation. In essence, the algorithm trades memory accesses for data communication and takes advantage of the fast communication fabric provided by the architecture. The algorithm -- historically memory bound -- becomes compute bound. This allows the implementation to achieve near perfect weak scaling, reaching up to 503 TFLOPs on WSE-2, a figure that only full clusters can eventually yield. △ Less

Submitted 7 April, 2022; originally announced April 2022.

Comments: 10 pages excl. bibliography. Submitted to SuperComputing 2022

arXiv:2009.04619 [pdf, other]

Accelerating High-Order Stencils on GPUs

Authors: Ryuichi Sai, John Mellor-Crummey, Xiaozhu Meng, Mauricio Araya-Polo, Jie Meng

Abstract: Stencil computations are widely used in HPC applications. Today, many HPC platforms use GPUs as accelerators. As a result, understanding how to perform stencil computations fast on GPUs is important. While implementation strategies for low-order stencils on GPUs have been well-studied in the literature, not all of proposed enhancements work well for high-order stencils, such as those used for seis… ▽ More Stencil computations are widely used in HPC applications. Today, many HPC platforms use GPUs as accelerators. As a result, understanding how to perform stencil computations fast on GPUs is important. While implementation strategies for low-order stencils on GPUs have been well-studied in the literature, not all of proposed enhancements work well for high-order stencils, such as those used for seismic modeling. Furthermore, coping with boundary conditions often requires different computational logic, which complicates efficient exploitation of the thread-level parallelism on GPUs. In this paper, we study high-order stencils and their unique characteristics on GPUs. We manually crafted a collection of implementations of a 25-point seismic modeling stencil in CUDA and related boundary conditions. We evaluate their code shapes, memory hierarchy usage, data-fetching patterns, and other performance attributes. We conducted an empirical evaluation of these stencils using several mature and emerging tools and discuss our quantitative findings. Among our implementations, we achieve twice the performance of a proprietary code developed in C and mapped to GPUs using OpenACC. Additionally, several of our implementations have excellent performance portability. △ Less

Submitted 15 September, 2020; v1 submitted 9 September, 2020; originally announced September 2020.

arXiv:2007.06048 [pdf, other]

Minimod: A Finite Difference solver for Seismic Modeling

Authors: Jie Meng, Andreas Atle, Henri Calandra, Mauricio Araya-Polo

Abstract: This article introduces a benchmark application for seismic modeling using finite difference method, which is namedMiniMod, a mini application for seismic modeling. The purpose is to provide a benchmark suite that is, on one hand easy to build and adapt to the state of the art in programming models and changing high performance hardware landscape. On the other hand, the intention is to have a prox… ▽ More This article introduces a benchmark application for seismic modeling using finite difference method, which is namedMiniMod, a mini application for seismic modeling. The purpose is to provide a benchmark suite that is, on one hand easy to build and adapt to the state of the art in programming models and changing high performance hardware landscape. On the other hand, the intention is to have a proxy application to actual production geophysical exploration workloads for Oil & Gas exploration, and other geosciences applications based on the wave equation. From top to bottom, we describe the design concepts, algorithms, code structure of the application, and present the benchmark results on different current computer architectures. △ Less

Submitted 12 July, 2020; originally announced July 2020.

arXiv:2004.03040 [pdf, other]

Deep Neural Network Learning with Second-Order Optimizers -- a Practical Study with a Stochastic Quasi-Gauss-Newton Method

Authors: Christopher Thiele, Mauricio Araya-Polo, Detlef Hohl

Abstract: Training in supervised deep learning is computationally demanding, and the convergence behavior is usually not fully understood. We introduce and study a second-order stochastic quasi-Gauss-Newton (SQGN) optimization method that combines ideas from stochastic quasi-Newton methods, Gauss-Newton methods, and variance reduction to address this problem. SQGN provides excellent accuracy without the nee… ▽ More Training in supervised deep learning is computationally demanding, and the convergence behavior is usually not fully understood. We introduce and study a second-order stochastic quasi-Gauss-Newton (SQGN) optimization method that combines ideas from stochastic quasi-Newton methods, Gauss-Newton methods, and variance reduction to address this problem. SQGN provides excellent accuracy without the need for experimenting with many hyper-parameter configurations, which is often computationally prohibitive given the number of combinations and the cost of each training process. We discuss the implementation of SQGN with TensorFlow, and we compare its convergence and computational performance to selected first-order methods using the MNIST benchmark and a large-scale seismic tomography application from Earth science. △ Less

Submitted 30 June, 2020; v1 submitted 6 April, 2020; originally announced April 2020.

Comments: 8 pages, 3 figures; added reference to code, fixed formatting of title

ACM Class: G.1.6; I.2.10; I.2.6

arXiv:1608.00636 [pdf, ps, other]

A survey of sparse matrix-vector multiplication performance on large matrices

Authors: Max Grossman, Christopher Thiele, Mauricio Araya-Polo, Florian Frank, Faruk O. Alpak, Vivek Sarkar

Abstract: We contribute a third-party survey of sparse matrix-vector (SpMV) product performance on industrial-strength, large matrices using: (1) The SpMV implementations in Intel MKL, the Trilinos project (Tpetra subpackage), the CUSPARSE library, and the CUSP library, each running on modern architectures. (2) NVIDIA GPUs and Intel multi-core CPUs (supported by each software package). (3) The CSR, BSR, COO… ▽ More We contribute a third-party survey of sparse matrix-vector (SpMV) product performance on industrial-strength, large matrices using: (1) The SpMV implementations in Intel MKL, the Trilinos project (Tpetra subpackage), the CUSPARSE library, and the CUSP library, each running on modern architectures. (2) NVIDIA GPUs and Intel multi-core CPUs (supported by each software package). (3) The CSR, BSR, COO, HYB, and ELL matrix formats (supported by each software package). △ Less

Submitted 1 August, 2016; originally announced August 2016.

Comments: Rice Oil & Gas High Performance Computing Workshop. March 2016

arXiv:1603.03971 [pdf, other]

Performance Analysis and Optimization of a Hybrid Distributed Reverse Time Migration Application

Authors: Sri Raj Paul, John Mellor-Crummey, Mauricio Araya-Polo, Detlef Hohl

Abstract: Applications to process seismic data employ scalable parallel systems to produce timely results. To fully exploit emerging processor architectures, application will need to employ threaded parallelism within a node and message passing across nodes. Today, MPI+OpenMP is the preferred programming model for this task. However, tuning hybrid programs for clusters is difficult. Performance tools can he… ▽ More Applications to process seismic data employ scalable parallel systems to produce timely results. To fully exploit emerging processor architectures, application will need to employ threaded parallelism within a node and message passing across nodes. Today, MPI+OpenMP is the preferred programming model for this task. However, tuning hybrid programs for clusters is difficult. Performance tools can help users identify bottlenecks and uncover opportunities for improvement. This poster describes our experiences of applying Rice University's HPCToolkit and hardware performance counters to gain insight into an MPI+OpenMP code that performs Reverse Time Migration (RTM) on a cluster of multicore processors. The tools provided us with insights into the effectiveness of the domain decomposition strategy, the use of threaded parallelism, and functional unit utilization in individual cores. By applying insights obtained from the tools, we were able to improve the performance of the RTM code by roughly 30 percent. △ Less

Submitted 12 March, 2016; originally announced March 2016.

Comments: 2 page extended abstract presented at The International Conference for High Performance Computing, Networking, Storage and Analysis (SC) 2015 for ACM Student Research Competition

arXiv:1506.05439 [pdf, other]

Learning with a Wasserstein Loss

Authors: Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya-Polo, Tomaso Poggio

Abstract: Learning to predict multi-label outputs is challenging, but in many problems there is a natural metric on the outputs that can be used to improve predictions. In this paper we develop a loss function for multi-label learning, based on the Wasserstein distance. The Wasserstein distance provides a natural notion of dissimilarity for probability measures. Although optimizing with respect to the exact… ▽ More Learning to predict multi-label outputs is challenging, but in many problems there is a natural metric on the outputs that can be used to improve predictions. In this paper we develop a loss function for multi-label learning, based on the Wasserstein distance. The Wasserstein distance provides a natural notion of dissimilarity for probability measures. Although optimizing with respect to the exact Wasserstein distance is costly, recent work has described a regularized approximation that is efficiently computed. We describe an efficient learning algorithm based on this regularization, as well as a novel extension of the Wasserstein distance from probability measures to unnormalized measures. We also describe a statistical learning bound for the loss. The Wasserstein loss can encourage smoothness of the predictions with respect to a chosen metric on the output space. We demonstrate this property on a real-data tag prediction problem, using the Yahoo Flickr Creative Commons dataset, outperforming a baseline that doesn't use the metric. △ Less

Submitted 29 December, 2015; v1 submitted 17 June, 2015; originally announced June 2015.

Comments: NIPS 2015; v3 updates Algorithm 1 and Equations 6, 8

arXiv:cs/0606042 [pdf, ps, other]

Enabling user-driven Checkpointing strategies in Reverse-mode Automatic Differentiation

Authors: Laurent Hascoet, Mauricio Araya-Polo

Abstract: This paper presents a new functionality of the Automatic Differentiation (AD) tool Tapenade. Tapenade generates adjoint codes which are widely used for optimization or inverse problems. Unfortunately, for large applications the adjoint code demands a great deal of memory, because it needs to store a large set of intermediates values. To cope with that problem, Tapenade implements a sub-optimal v… ▽ More This paper presents a new functionality of the Automatic Differentiation (AD) tool Tapenade. Tapenade generates adjoint codes which are widely used for optimization or inverse problems. Unfortunately, for large applications the adjoint code demands a great deal of memory, because it needs to store a large set of intermediates values. To cope with that problem, Tapenade implements a sub-optimal version of a technique called checkpointing, which is a trade-off between storage and recomputation. Our long-term goal is to provide an optimal checkpointing strategy for every code, not yet achieved by any AD tool. Towards that goal, we first introduce modifications in Tapenade in order to give the user the choice to select the checkpointing strategy most suitable for their code. Second, we conduct experiments in real-size scientific codes in order to gather hints that help us to deduce an optimal checkpointing strategy. Some of the experimental results show memory savings up to 35% and execution time up to 90%. △ Less

Submitted 9 June, 2006; originally announced June 2006.

Showing 1–20 of 20 results for author: Araya-Polo, M