ConvBench: A Comprehensive Benchmark for 2D Convolution Primitive Evaluation

Authors: Lucas Alvarenga, Victor Ferrari, Rafael Souza, Marcio Pereira, Guido Araujo

Abstract: Convolution is a compute-intensive operation placed at the heart of Convolution Neural Networks (CNNs). It has led to the development of many high-performance algorithms, such as Im2col-GEMM, Winograd, and Direct-Convolution. However, the comparison of different convolution algorithms is an error-prone task as it requires specific data layouts and system resources. Failure to address these requirements might lead to unwanted time penalties. Thus, considering all processing steps within convolution algorithms is essential to comprehensively evaluate and fairly compare their performance. Furthermore, most known convolution benchmarking adopts ad-hoc testing suites with limited coverage and handmade operations. This paper proposes ConvBench, a primitive-level benchmark for the evaluation and comparison of convolution algorithms. It assesses 9243 convolution operations derived from 1097 real-world deep learning models, resulting in performance and execution breakdown graphs for a detailed evaluation. ConvBench capability is evaluated across the Sliced Convolution (SConv) algorithm. The experiments showed results faster than Im2col-GEMM in 93.6% of the convolutions. However, the use of ConvBench allowed the delving into the remaining 6.4% underperforming convolutions, uncovering a critical slowdown of 79.5% on average of SConv's packing step. This analysis underscores a potential source of optimization for SConv, opening up new paths for convolution designers to improve their algorithms.

Submitted 15 July, 2024; originally announced July 2024.

Comments: 5 pages, 3 figures, presented on MLArchSys workshop of ISCA'2024
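
Illustrative note: the abstract above contrasts Im2col-GEMM with direct convolution and stresses that the data-layout (packing) step must be counted when comparing them. The following is a minimal sketch of that contrast, assuming a single-channel image, a single filter, unit stride, and no padding, so the GEMM degenerates into a matrix-vector product. The function names are hypothetical and are not taken from ConvBench or SConv.

```python
def direct_conv2d(image, kernel):
    """Direct 2D convolution (cross-correlation form), stride 1, no padding."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = h - kh + 1, w - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def im2col(image, kh, kw):
    """Packing step: unroll every kh x kw patch of the image into one row."""
    h, w = len(image), len(image[0])
    return [
        [image[i + di][j + dj] for di in range(kh) for dj in range(kw)]
        for i in range(h - kh + 1)
        for j in range(w - kw + 1)
    ]


def im2col_gemm_conv2d(image, kernel):
    """Im2col followed by a matrix product with the flattened kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_w = len(image[0]) - kw + 1
    patches = im2col(image, kh, kw)                    # extra memory and time
    flat_kernel = [v for row in kernel for v in row]
    flat_out = [sum(p * k for p, k in zip(patch, flat_kernel))
                for patch in patches]
    # Reshape the flat result back into out_h rows of out_w values.
    return [flat_out[r:r + out_w] for r in range(0, len(flat_out), out_w)]
```

Both functions return the same output; the difference is that the Im2col variant first materializes a patch matrix, which is the kind of packing cost a benchmark has to include for the comparison to be fair.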

arXiv:2406.17523 [pdf, other]

On the consistency of hyper-parameter selection in value-based deep reinforcement learning

Authors: Johan Obando-Ceron, João G. M. Araújo, Aaron Courville, Pablo Samuel Castro

Abstract: Deep reinforcement learning (deep RL) has achieved tremendous success on various domains through a combination of algorithmic design and careful selection of hyper-parameters. Algorithmic improvements are often the result of iterative enhancements built upon prior approaches, while hyper-parameter choices are typically inherited from previous methods or fine-tuned specifically for the proposed technique. Despite their crucial impact on performance, hyper-parameter choices are frequently overshadowed by algorithmic advancements. This paper conducts an extensive empirical study focusing on the reliability of hyper-parameter selection for value-based deep reinforcement learning agents, including the introduction of a new score to quantify the consistency and reliability of various hyper-parameters. Our findings not only help establish which hyper-parameters are most critical to tune, but also help clarify which tunings remain consistent across different training regimes.

Submitted 2 July, 2024; v1 submitted 25 June, 2024; originally announced June 2024.

arXiv:2406.17268 [pdf, other]

Search-based Trace Diagnostic

Authors: Gabriel Araujo, Ricardo Caldas, Federico Formica, Genaína Rodrigues, Patrizio Pelliccione, Claudio Menghi

Abstract: Cyber-physical systems (CPS) development requires verifying whether system behaviors violate their requirements. This analysis often considers system behaviors expressed by execution traces and requirements expressed by signal-based temporal properties. When an execution trace violates a requirement, engineers need to solve the trace diagnostic problem: They need to understand the cause of the breach. Automated trace diagnostic techniques aim to support engineers in the trace diagnostic activity. This paper proposes search-based trace-diagnostic (SBTD), a novel trace-diagnostic technique for CPS requirements. Unlike existing techniques, SBTD relies on evolutionary search. SBTD starts from a set of candidate diagnoses, applies an evolutionary algorithm iteratively to generate new candidate diagnoses (via mutation, recombination, and selection), and uses a fitness function to determine the qualities of these solutions. Then, a diagnostic generator step is performed to explain the cause of the trace violation. We implemented Diagnosis, an SBTD tool for signal-based temporal logic requirements expressed using the Hybrid Logic of Signals (HLS). We evaluated Diagnosis by performing 34 experiments for 17 trace-requirements combinations leading to a property violation and by assessing the effectiveness of SBTD in producing informative diagnoses and its efficiency in generating them on a time basis. Our results confirm that Diagnosis can produce informative diagnoses in practical time for most of our experiments (33 out of 34).

Submitted 25 June, 2024; originally announced June 2024.

Comments: 14 pages plus two for references

arXiv:2406.04267 [pdf, other]

Transformers need glasses! Information over-squashing in language tasks

Authors: Federico Barbero, Andrea Banino, Steven Kapturowski, Dharshan Kumaran, João G. M. Araújo, Alex Vitvitskyi, Razvan Pascanu, Petar Veličković

Abstract: We study how information propagates in decoder-only Transformers, which are the architectural backbone of most existing frontier large language models (LLMs). We rely on a theoretical signal propagation analysis -- specifically, we analyse the representations of the last token in the final layer of the Transformer, as this is the representation used for next-token prediction. Our analysis reveals a representational collapse phenomenon: we prove that certain distinct sequences of inputs to the Transformer can yield arbitrarily close representations in the final token. This effect is exacerbated by the low-precision floating-point formats frequently used in modern LLMs. As a result, the model is provably unable to respond to these sequences in different ways -- leading to errors in, e.g., tasks involving counting or copying. Further, we show that decoder-only Transformer language models can lose sensitivity to specific tokens in the input, which relates to the well-known phenomenon of over-squashing in graph neural networks. We provide empirical evidence supporting our claims on contemporary LLMs. Our theory also points to simple solutions towards ameliorating these issues.

Submitted 6 June, 2024; originally announced June 2024.
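
Illustrative note: the representational collapse described in the abstract above can be pictured with a toy model. The sketch below is not the paper's proof or its experimental setup; it assumes, purely for illustration, that the final-token representation is a uniform average of scalar token values, and it uses hypothetical helper names. Two distinct sequences (n ones followed by a zero, and n+1 ones followed by a zero) then differ by exactly 1/((n+1)(n+2)), which shrinks toward zero as n grows and vanishes entirely once the values are rounded to half precision.

```python
import struct
from fractions import Fraction


def mean_pooled(tokens):
    """Toy stand-in for attention: a uniform average over all token values."""
    return Fraction(sum(tokens), len(tokens))


def gap(n):
    """Exact distance between the pooled values of two distinct sequences."""
    seq_a = [1] * n + [0]          # n ones, then a zero
    seq_b = [1] * (n + 1) + [0]    # n + 1 ones, then a zero
    return abs(mean_pooled(seq_a) - mean_pooled(seq_b))


def to_half(x):
    """Round a value to IEEE 754 half precision and back to a Python float."""
    return struct.unpack("e", struct.pack("e", float(x)))[0]


def collapsed_in_half_precision(n):
    """True when both sequences map to the same half-precision value."""
    seq_a = [1] * n + [0]
    seq_b = [1] * (n + 1) + [0]
    return to_half(mean_pooled(seq_a)) == to_half(mean_pooled(seq_b))
```

In exact arithmetic the two sequences stay distinguishable for every n, only by a shrinking margin; in this toy setting the low-precision rounding is what makes them identical, which mirrors the abstract's remark that low-precision formats exacerbate the effect.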

arXiv:2403.13115 [pdf]

Professional Insights into Benefits and Limitations of Implementing MLOps Principles

Authors: Gabriel Araujo, Marcos Kalinowski, Markus Endler, Fabio Calefato

Abstract: Context: Machine Learning Operations (MLOps) has emerged as a set of practices that combines development, testing, and operations to deploy and maintain machine learning applications. Objective: In this paper, we assess the benefits and limitations of using the MLOps principles in online supervised learning. Method: We conducted two focus group sessions on the benefits and limitations of applying MLOps principles for online machine learning applications with six experienced machine learning developers. Results: The focus group revealed that machine learning developers see many benefits of using MLOps principles but also that these do not apply to all the projects they worked on. According to experts, this investment tends to pay off for larger applications with continuous deployment that require well-prepared automated processes. However, for initial versions of machine learning applications, the effort taken to implement the principles could enlarge the project's scope and increase the time needed to deploy a first version to production. The discussion brought up that most of the benefits are related to avoiding error-prone manual steps, enabling to restore the application to a previous state, and having a robust continuous automated deployment pipeline. Conclusions: It is important to balance the trade-offs of investing time and effort in implementing the MLOps principles considering the scope and needs of the project, favoring such investments for larger applications with continuous model deployment requirements.

Submitted 19 March, 2024; originally announced March 2024.

Comments: Author version of paper accepted for publication at ICEIS 2024

arXiv:2402.15332 [pdf, ps, other]

Position: Categorical Deep Learning is an Algebraic Theory of All Architectures

Authors: Bruno Gavranović, Paul Lessard, Andrew Dudzik, Tamara von Glehn, João G. M. Araújo, Petar Veličković

Abstract: We present our position on the elusive quest for a general-purpose framework for specifying and studying deep learning architectures. Our opinion is that the key attempts made so far lack a coherent bridge between specifying constraints which models must satisfy and specifying their implementations. Focusing on building such a bridge, we propose to apply category theory -- precisely, the universal algebra of monads valued in a 2-category of parametric maps -- as a single theory elegantly subsuming both of these flavours of neural network design. To defend our position, we show how this theory recovers constraints induced by geometric deep learning, as well as implementations of many architectures drawn from the diverse landscape of neural networks, such as RNNs. We also illustrate how the theory naturally encodes many standard constructs in computer science and automata theory.

Submitted 5 June, 2024; v1 submitted 23 February, 2024; originally announced February 2024.

Comments: To appear in ICML 2024. Comments welcome. More info at categoricaldeeplearning.com

arXiv:2402.03046 [pdf, other]

Open RL Benchmark: Comprehensive Tracked Experiments for Reinforcement Learning

Authors: Shengyi Huang, Quentin Gallouédec, Florian Felten, Antonin Raffin, Rousslan Fernand Julien Dossa, Yanxiao Zhao, Ryan Sullivan, Viktor Makoviychuk, Denys Makoviichuk, Mohamad H. Danesh, Cyril Roumégous, Jiayi Weng, Chufan Chen, Md Masudur Rahman, João G. M. Araújo, Guorui Quan, Daniel Tan, Timo Klein, Rujikorn Charakorn, Mark Towers, Yann Berthelot, Kinal Mehta, Dipam Chakraborty, Arjun KG, Valentin Charraut, et al. (8 additional authors not shown)

Abstract: In many Reinforcement Learning (RL) papers, learning curves are useful indicators to measure the effectiveness of RL algorithms. However, the complete raw data of the learning curves are rarely available. As a result, it is usually necessary to reproduce the experiments from scratch, which can be time-consuming and error-prone. We present Open RL Benchmark, a set of fully tracked RL experiments, including not only the usual data such as episodic return, but also all algorithm-specific and system metrics. Open RL Benchmark is community-driven: anyone can download, use, and contribute to the data. At the time of writing, more than 25,000 runs have been tracked, for a cumulative duration of more than 8 years. Open RL Benchmark covers a wide range of RL libraries and reference implementations. Special care is taken to ensure that each experiment is precisely reproducible by providing not only the full parameters, but also the versions of the dependencies used to generate it. In addition, Open RL Benchmark comes with a command-line interface (CLI) for easy fetching and generating figures to present the results. In this document, we include two case studies to demonstrate the usefulness of Open RL Benchmark in practice. To the best of our knowledge, Open RL Benchmark is the first RL benchmark of its kind, and the authors hope that it will improve and facilitate the work of researchers in the field.

Submitted 5 February, 2024; originally announced February 2024.

Comments: Under review

arXiv:2311.15497 [pdf, other]

Adaptive Image Registration: A Hybrid Approach Integrating Deep Learning and Optimization Functions for Enhanced Precision

Authors: Gabriel De Araujo, Shanlin Sun, Xiaohui Xie

Abstract: Image registration has traditionally been done using two distinct approaches: learning-based methods, relying on robust deep neural networks, and optimization-based methods, applying complex mathematical transformations to warp images accordingly. Of course, both paradigms offer advantages and disadvantages, and, in this work, we seek to combine their respective strengths into a single streamlined framework, using the outputs of the learning-based method as initial parameters for optimization while prioritizing computational power for the image pairs that offer the greatest loss. Our investigations showed improvements of up to 1.6% in test data, while maintaining the same inference time, and a substantial 1.0% points performance gain in deformation field smoothness.

Submitted 18 January, 2024; v1 submitted 26 November, 2023; originally announced November 2023.

arXiv:2305.18236 [pdf, ps, other]

doi 10.1002/spe.3214

Fast Matrix Multiplication via Compiler-only Layered Data Reorganization and Intrinsic Lowering

Authors: Braedy Kuzma, Ivan Korostelev, João P. L. de Carvalho, José E. Moreira, Christopher Barton, Guido Araujo, José Nelson Amaral

Abstract: The resurgence of machine learning has increased the demand for high-performance basic linear algebra subroutines (BLAS), which have long depended on libraries to achieve peak performance on commodity hardware. High-performance BLAS implementations rely on a layered approach that consists of tiling and packing layers, for data (re)organization, and micro kernels that perform the actual computations. The creation of high-performance micro kernels requires significant development effort to write tailored assembly code for each architecture. This hand optimization task is complicated by the recent introduction of matrix engines by IBM's POWER10 MMA, Intel AMX, and Arm ME to deliver high-performance matrix operations. This paper presents a compiler-only alternative to the use of high-performance libraries by incorporating, to the best of our knowledge and for the first time, the automatic generation of the layered approach into LLVM, a production compiler. Modular design of the algorithm, such as the use of LLVM's matrix-multiply intrinsic for a clear interface between the tiling and packing layers and the micro kernel, makes it easy to retarget the code generation to multiple accelerators. The use of intrinsics enables a comprehensive performance study. In processors without hardware matrix engines, the tiling and packing delivers performance up to 22x (Intel), for small matrices, and more than 6x (POWER9), for large matrices, faster than PLuTo, a widely used polyhedral optimizer. The performance also approaches high-performance libraries and is only 34% slower than OpenBLAS and on-par with Eigen for large matrices. With MMA in POWER10 this solution is, for large matrices, over 2.6x faster than the vector-extension solution, matches Eigen performance, and achieves up to 96% of BLAS peak performance.

Submitted 15 May, 2023; originally announced May 2023.

ACM Class: C.4

arXiv:2304.03013 [pdf, other]

doi 10.1016/j.jpdc.2022.12.008

Tensor Slicing and Optimization for Multicore NPUs

Authors: Rafael Sousa, Marcio Pereira, Yongin Kwon, Taeho Kim, Namsoon Jung, Chang Soo Kim, Michael Frank, Guido Araujo

Abstract: Although code generation for Convolution Neural Network (CNN) models has been extensively studied, performing efficient data slicing and parallelization for highly-constrained Multicore Neural Processor Units (NPUs) is still a challenging problem. Given the size of convolutions' input/output tensors and the small footprint of NPU on-chip memories, minimizing memory transactions while maximizing parallelism and MAC utilization are central to any effective solution. This paper proposes a TensorFlow XLA/LLVM compiler optimization pass for Multicore NPUs, called Tensor Slicing Optimization (TSO), which: (a) maximizes convolution parallelism and memory usage across NPU cores; and (b) reduces data transfers between host and NPU on-chip memories by using DRAM memory burst time estimates to guide tensor slicing. To evaluate the proposed approach, a set of experiments was performed using the NeuroMorphic Processor (NMP), a multicore NPU containing 32 RISC-V cores extended with novel CNN instructions. Experimental results show that TSO is capable of identifying the best tensor slicing that minimizes execution time for a set of CNN models. Speed-ups of up to 21.7% result when comparing the TSO burst-based technique to a no-burst data slicing approach. To validate the generality of the TSO approach, the algorithm was also ported to the Glow Machine Learning framework. The performance of the models was measured on both Glow and TensorFlow XLA/LLVM compilers, revealing similar results.

Submitted 6 April, 2023; originally announced April 2023.

Journal ref: Journal of Parallel and Distributed Computing, Volume 175, May 2023, Pages 66-79

arXiv:2303.04739 [pdf, other]

Advancing Direct Convolution using Convolution Slicing Optimization and ISA Extensions

Authors: Victor Ferrari, Rafael Sousa, Marcio Pereira, João P. L. de Carvalho, José Nelson Amaral, José Moreira, Guido Araujo

Abstract: Convolution is one of the most computationally intensive operations that must be performed for machine-learning model inference. A traditional approach to compute convolutions is known as the Im2Col + BLAS method. This paper proposes SConv: a direct-convolution algorithm based on a MLIR/LLVM code-generation toolchain that can be integrated into machine-learning compilers. This algorithm introduces: (a) Convolution Slicing Analysis (CSA) - a convolution-specific 3D cache-blocking analysis pass that focuses on tile reuse over the cache hierarchy; (b) Convolution Slicing Optimization (CSO) - a code-generation pass that uses CSA to generate a tiled direct-convolution macro-kernel; and (c) Vector-Based Packing (VBP) - an architecture-specific optimized input-tensor packing solution based on vector-register shift instructions for convolutions with unitary stride. Experiments conducted on 393 convolutions from full ONNX-MLIR machine-learning models indicate that the elimination of the Im2Col transformation and the use of fast packing routines result in a total packing time reduction, on full model inference, of 2.0x - 3.9x on Intel x86 and 3.6x - 7.2x on IBM POWER10. The speed-up over an Im2Col + BLAS method based on current BLAS implementations for end-to-end machine-learning model inference is in the range of 9% - 25% for Intel x86 and 10% - 42% for IBM POWER10 architectures. The total convolution speedup for model inference is 12% - 27% on Intel x86 and 26% - 46% on IBM POWER10. SConv also outperforms BLAS GEMM, when computing pointwise convolutions, in more than 83% of the 219 tested instances.

Submitted 8 March, 2023; originally announced March 2023.

Comments: 15 pages, 11 figures

arXiv:2301.10835 [pdf, other]

When Layers Play the Lottery, all Tickets Win at Initialization

Authors: Artur Jordao, George Correa de Araujo, Helena de Almeida Maia, Helio Pedrini

Abstract: Pruning is a standard technique for reducing the computational cost of deep networks. Many advances in pruning leverage concepts from the Lottery Ticket Hypothesis (LTH). LTH reveals that inside a trained dense network exist sparse subnetworks (tickets) able to achieve similar accuracy (i.e., win the lottery - winning tickets). Pruning at initialization focuses on finding winning tickets without training a dense network. Studies on these concepts share the trend that subnetworks come from weight or filter pruning. In this work, we investigate LTH and pruning at initialization from the lens of layer pruning. First, we confirm the existence of winning tickets when the pruning process removes layers. Leveraged by this observation, we propose to discover these winning tickets at initialization, eliminating the requirement of heavy computational resources for training the initial (over-parameterized) dense network. Extensive experiments show that our winning tickets notably speed up the training phase and reduce up to 51% of carbon emission, an important step towards democratization and green Artificial Intelligence. Beyond computational benefits, our winning tickets exhibit robustness against adversarial and out-of-distribution examples. Finally, we show that our subnetworks easily win the lottery at initialization while tickets from filter removal (the standard structured LTH) hardly become winning tickets.

Submitted 19 March, 2024; v1 submitted 25 January, 2023; originally announced January 2023.

Comments: Published at International Conference on Computer Vision Workshop (ICCV), 2023

arXiv:2210.03743 [pdf, other]

Single Image Super-Resolution Based on Capsule Neural Networks

Authors: George Corrêa de Araújo, Helio Pedrini

Abstract: Single image super-resolution (SISR) is the process of obtaining one high-resolution version of a low-resolution image by increasing the number of pixels per unit area. This method has been actively investigated by the research community, due to the wide variety of real-world problems where it can be applied, from aerial and satellite imaging to compressed image and video enhancement. Despite the improvements achieved by deep learning in the field, the vast majority of the used networks are based on traditional convolutions, with the solutions focusing on going deeper and/or wider, and innovations coming from jointly employing successful concepts from other fields. In this work, we decided to step up from the traditional convolutions and adopt the concept of capsules. Since their overwhelming results both in image classification and segmentation problems, we question how suitable they are for SISR. We also verify that different solutions share most of their configurations, and argue that this trend leads to fewer explorations of network varieties. During our experiments, we check various strategies to improve results, ranging from new and different loss functions to changes in the capsule layers. Our network achieved good results with fewer convolutional-based layers, showing that capsules might be a concept worth applying in the image super-resolution problem.

Submitted 6 October, 2022; originally announced October 2022.

Comments: 19 pages, 13 figures

ACM Class: I.2.10; I.4.3; I.5.1

arXiv:2207.05677 [pdf, other]

doi 10.1145/3547276.3548444

The OpenMP Cluster Programming Model

Authors: Hervé Yviquel, Marcio Pereira, Emílio Francesquini, Guilherme Valarini, Gustavo Leite, Pedro Rosso, Rodrigo Ceccato, Carla Cusihualpa, Vitoria Dias, Sandro Rigo, Alan Souza, Guido Araujo

Abstract: Despite the various research initiatives and proposed programming models, efficient solutions for parallel programming in HPC clusters still rely on a complex combination of different programming models (e.g., OpenMP and MPI), languages (e.g., C++ and CUDA), and specialized runtimes (e.g., Charm++ and Legion). On the other hand, task parallelism has been shown to be an efficient and seamless programming model for clusters. This paper introduces OpenMP Cluster (OMPC), a task-parallel model that extends OpenMP for cluster programming. OMPC leverages OpenMP's offloading standard to distribute annotated regions of code across the nodes of a distributed system. To achieve that, it hides MPI-based data distribution and load-balancing mechanisms behind OpenMP task dependencies. Given its compliance with OpenMP, OMPC allows applications to use the same programming model to exploit intra- and inter-node parallelism, thus simplifying the development process and maintenance. We evaluated OMPC using Task Bench, a synthetic benchmark focused on task parallelism, comparing its performance against other distributed runtimes. Experimental results show that OMPC can deliver up to 1.53x and 2.43x better performance than Charm++ on CCR and scalability experiments, respectively. Experiments also show that OMPC performance weakly scales for both Task Bench and a real-world seismic imaging application.

Submitted 13 August, 2022; v1 submitted 12 July, 2022; originally announced July 2022.

Comments: 12 pages, 7 figures, 1 listing, to be published in the 51st International Conference on Parallel Processing Workshop Proceedings (ICPP Workshops 22)

ACM Class: D.4.1; D.3.2

arXiv:2207.02700 [pdf, other]

Channel Estimation in RIS-Assisted MIMO Systems Operating Under Imperfections

Authors: Paulo R. B. Gomes, Gilderlan T. de Araújo, Bruno Sokal, André L. F. de Almeida, Behrooz Makki, Gábor Fodor

Abstract: Reconfigurable intelligent surface is a potential technology component of future wireless networks due to its capability of shaping the wireless environment. The promising MIMO systems in terms of extended coverage and enhanced capacity are, however, critically dependent on the accuracy of the channel state information. However, traditional channel estimation schemes are not applicable in RIS-assisted MIMO networks, since passive RISs typically lack the signal processing capabilities that are assumed by channel estimation algorithms. This becomes most problematic when physical imperfections or electronic impairments affect the RIS due to its exposure to different environmental effects or caused by hardware limitations from the circuitry. While these real-world effects are typically ignored in the literature, in this paper we propose efficient channel estimation schemes for RIS-assisted MIMO systems taking different imperfections into account. Specifically, we propose two sets of tensor-based algorithms, based on the parallel factor analysis decomposition schemes. First, by assuming a long-term model in which the RIS imperfections, modeled as unknown phase shifts, are static within the channel coherence time, we formulate an iterative alternating least squares (ALS)-based algorithm for the joint estimation of the communication channels and the unknown phase deviations. Next, we develop the short-term imperfection model, which allows both amplitude and phase RIS imperfections to be non-static with respect to the channel coherence time. We propose two iterative ALS-based and closed-form higher order singular value decomposition-based algorithms for the joint estimation of the channels and the unknown impairments. Moreover, we analyze the identifiability and computational complexity of the proposed algorithms and study the effects of various imperfections on the channel estimation quality.

Submitted 6 July, 2022; originally announced July 2022.

Comments: arXiv admin note: text overlap with arXiv:2206.03557

arXiv:2205.10290 [pdf, other]

Semi-Blind Joint Channel and Symbol Estimation for IRS-Assisted MIMO Systems

Authors: Gilderlan Tavares de Araújo, André Lima Férrer de Almeida, Rémy Boyer, Gábor Fodor

Abstract: Intelligent reflecting surface (IRS) is a promising technology for the 6th generation of wireless systems, realizing the smart radio environment concept. In this paper, we present a novel tensor-based receiver for IRS-assisted multiple-input multiple-output communications capable of jointly estimating the channels and the transmitted data streams in a semi-blind fashion. Assuming a fully passive IRS architecture and introducing a simple space-time coding scheme at the transmitter, the received signal model can be advantageously built using the PARATUCK tensor model, which can be seen as a hybrid of parallel factor analysis and Tucker models. Exploiting the algebraic structure of the PARATUCK tensor model, a semi-blind receiver is derived. The proposed receiver is based on a trilinear alternating least squares method that iteratively estimates the two involved communication channels (IRS-base station and user terminal-IRS) and the transmitted symbol matrix. We discuss identifiability conditions that ensure the joint semi-blind recovery of the involved channel and symbol matrices, and propose a joint design of the coding and IRS reflection matrices to optimize the receiver performance. For the proposed semi-blind receiver, the derivation of the expected Cramér-Rao lower bound is also provided. A numerical performance evaluation of the proposed receiver design corroborates its superior performance in terms of the normalized mean squared error of the estimated channels and the achieved symbol error rate.

Submitted 20 May, 2022; originally announced May 2022.

arXiv:2204.06514 [pdf, other]

Scalable Training of Language Models using JAX pjit and TPUv4

Authors: Joanna Yoo, Kuba Perlin, Siddhartha Rao Kamalakara, João G. M. Araújo

Abstract: Modern large language models require distributed training strategies due to their size. The challenges of efficiently and robustly training them are met with rapid developments on both software and hardware frontiers. In this technical report, we explore challenges and design decisions associated with developing a scalable training framework, and present a quantitative analysis of efficiency improvements coming from adopting new software and hardware solutions.

Submitted 13 April, 2022; originally announced April 2022.

Comments: 5 pages, 4 figures

arXiv:2202.11087 [pdf, other]

Semi-Blind Joint Channel and Symbol Estimation in IRS-Assisted Multi-User MIMO Networks

Authors: Gilderlan Tavares de Araújo, Paulo Ricardo Brboza Gomes, André Lima Férrer de Almeida, Gabor Fodor, Behrooz Makki

Abstract: Intelligent reflecting surface (IRS) is a promising technology for beyond 5th Generation of the wireless communications. In fully passive IRS-assisted systems, channel estimation is challenging and should be carried out only at the base station or at the terminals since the elements of the IRS are incapable of processing signals. In this letter, we formulate a tensor-based semi-blind receiver that solves the joint channel and symbol estimation problem in an IRS-assisted multi-user multiple-input multiple-output system. The proposed approach relies on a generalized PARATUCK tensor model of the signals reflected by the IRS, based on a two-stage closed-form semi-blind receiver using Khatri-Rao and Kronecker factorizations. Simulation results demonstrate the superior performance of the proposed semi-blind receiver, in terms of the normalized mean squared error and symbol error rate, as well as a lower computational complexity, compared to recently proposed parallel factor analysis-based receivers.

Submitted 4 May, 2022; v1 submitted 22 February, 2022; originally announced February 2022.

arXiv:2202.04153 [pdf, other]

Source Matching and Rewriting

Authors: Vinicius Couto, Luciano Zago, Hervé Yviquel, Guido Araújo

Abstract: A typical compiler flow relies on a uni-directional sequence of translation/optimization steps that lower the program abstract representation, making it hard to preserve higher-level program information across each transformation step. On the other hand, modern ISA extensions and hardware accelerators can benefit from the compiler's ability to detect and raise program idioms to acceleration instructions or optimized library calls. Although recent works based on Multi-Level IR (MLIR) have been proposed for code raising, they rely on specialized languages, compiler recompilation, or in-depth dialect knowledge. This paper presents Source Matching and Rewriting (SMR), a user-oriented source-code-based approach for MLIR idiom matching and rewriting that does not require a compiler expert's intervention. SMR uses a two-phase automaton-based DAG-matching algorithm inspired by early work on tree-pattern matching. First, the idiom Control-Dependency Graph (CDG) is matched against the program's CDG to rule out code fragments that do not have a control-flow structure similar to the desired idiom. Second, candidate code fragments from the previous phase have their Data-Dependency Graphs (DDGs) constructed and matched against the idiom DDG. Experimental results show that SMR can effectively match idioms from Fortran (FIR) and C (CIL) programs while raising them as BLAS calls to improve performance.

Submitted 4 February, 2022; originally announced February 2022.

Comments: 10 pages, 7 figures

arXiv:2110.12609 [pdf, other]

No News is Good News: A Critique of the One Billion Word Benchmark

Authors: Helen Ngo, João G. M. Araújo, Jeffrey Hui, Nicholas Frosst

Abstract: The One Billion Word Benchmark is a dataset derived from the WMT 2011 News Crawl, commonly used to measure language modeling ability in natural language processing. We train models solely on Common Crawl web scrapes partitioned by year, and demonstrate that they perform worse on this task over time due to distributional shift. Analysis of this corpus reveals that it contains several examples of harmful text, as well as outdated references to current events. We suggest that the temporal nature of news and its distribution shift over time makes it poorly suited for measuring language modeling ability, and discuss potential impact and considerations for researchers building language models and evaluation datasets.

Submitted 24 October, 2021; originally announced October 2021.

arXiv:2108.07790 [pdf, other]

Mitigating harm in language models with conditional-likelihood filtration

Authors: Helen Ngo, Cooper Raterink, João G. M. Araújo, Ivan Zhang, Carol Chen, Adrien Morisot, Nicholas Frosst

Abstract: Language models trained on large-scale unfiltered datasets curated from the open web acquire systemic biases, prejudices, and harmful views from their training data. We present a methodology for programmatically identifying and removing harmful text from web-scale datasets. A pretrained language model is used to calculate the log-likelihood of researcher-written trigger phrases conditioned on a specific document, which is used to identify and filter documents from the dataset. We demonstrate that models trained on this filtered dataset exhibit lower propensity to generate harmful text, with a marginal decrease in performance on standard language modeling benchmarks compared to unfiltered baselines. We provide a partial explanation for this performance gap by surfacing examples of hate speech and other undesirable content from standard language modeling benchmarks. Finally, we discuss the generalization of this method and how trigger phrases which reflect specific values can be used by researchers to build language models which are more closely aligned with their values.

Submitted 27 November, 2021; v1 submitted 4 August, 2021; originally announced August 2021.

arXiv:2107.00715 [pdf, other]

doi 10.1016/j.comnet.2023.109949

NDN4IVC: A Framework for Simulating and Testing of Applications in Vehicular Named Data Networking

Authors: Guilherme B. Araujo, Maycon L. M. Peixoto, Leobino N. Sampaio

Abstract: This paper presents a customized framework (NDN4IVC) for simulating and testing intelligent transportation systems and applications in vehicular named-data networking (V-NDN). The project uses two popular simulators in the literature for VANET simulation, a network simulator based on discrete events (Ns-3), with ndnSIM module installed, and Sumo, a simulator for urban mobility. NDN4IVC allows bidirectional communication between Sumo and Ns-3 and integrates the NDN stack and the NFD (NDN Forwarding Daemon) code. The project also brings together a comprehensive set of codes, models, functionalities, and technologies to improve proposals and protocols in V-NDN.

Submitted 9 July, 2021; v1 submitted 1 July, 2021; originally announced July 2021.

Report number: Computer Networks 1389-1286

Journal ref: 2023

arXiv:2103.10573 [pdf, other]

Enabling OpenMP Task Parallelism on Multi-FPGAs

Authors: R. Nepomuceno, R. Sterle, G. Valarini, M. Pereira, H. Yviquel, G. Araujo

Abstract: FPGA-based hardware accelerators have received increasing attention mainly due to their ability to accelerate deep pipelined applications, thus resulting in higher computational performance and energy efficiency. Nevertheless, the amount of resources available on even the most powerful FPGA is still not enough to speed up very large modern workloads. To achieve that, FPGAs need to be interconnected in a Multi-FPGA architecture capable of accelerating a single application. However, programming such architecture is a challenging endeavor that still requires additional research. This paper extends the OpenMP task-based computation offloading model to enable a number of FPGAs to work together as a single Multi-FPGA architecture. Experimental results for a set of OpenMP stencil applications running on a Multi-FPGA platform consisting of 6 Xilinx VC709 boards interconnected through fiber-optic links have shown close to linear speedups as the number of FPGAs and IP-cores per FPGA increase.

Submitted 21 March, 2021; v1 submitted 18 March, 2021; originally announced March 2021.

arXiv:2011.03070 [pdf, ps, other]

doi 10.1007/978-3-031-10461-9_11

Multicloud API binding generation from documentation

Authors: Michał J. Gajda, Vitor Vitali Barrozzi, Gabriel Araujo

Abstract: We present industry experience from implementing a retargetable cloud API binding generator. The analysis is implemented in Haskell, using type classes, types à la carte, and a code generation monad. It also targets Haskell, and allows us to bind cloud APIs on short notice and at unprecedented scale.

Submitted 5 November, 2020; originally announced November 2020.

Comments: Presented on XP 2020: Agility in Microservices workshop

arXiv:2007.14863 [pdf, other]

Automatic Detection of Aedes aegypti Breeding Grounds Based on Deep Networks with Spatio-Temporal Consistency

Authors: Wesley L. Passos, Gabriel M. Araujo, Amaro A. de Lima, Sergio L. Netto, Eduardo A. B. da Silva

Abstract: Every year, the Aedes aegypti mosquito infects millions of people with diseases such as dengue, zika, chikungunya, and urban yellow fever. The main form to combat these diseases is to avoid mosquito reproduction by searching for and eliminating the potential mosquito breeding grounds. In this work, we introduce a comprehensive dataset of aerial videos, acquired with an unmanned aerial vehicle, containing possible mosquito breeding sites. All frames of the video dataset were manually annotated with bounding boxes identifying all objects of interest. This dataset was employed to develop an automatic detection system of such objects based on deep convolutional networks. We propose the exploitation of the temporal information contained in the videos by the incorporation, in the object detection pipeline, of a spatio-temporal consistency module that can register the detected objects, minimizing most false-positive and false-negative occurrences. Also, we experimentally show that using videos is more beneficial than only composing a mosaic using the frames. Using the ResNet-50-FPN as a backbone, we achieve F$_1$-scores of 0.65 and 0.77 on the object-level detection of `tires' and `water tanks', respectively, illustrating the system capabilities to properly locate potential mosquito breeding objects.

Submitted 27 November, 2021; v1 submitted 29 July, 2020; originally announced July 2020.

arXiv:1611.00960 [pdf]

doi 10.1117/12.632674

Adaptive mixed norm optical flow estimation

Authors: Vania V. Estrela, Matthias O. Franz, Ricardo T. Lopes, G. P. De Araujo

Abstract: The pel-recursive computation of 2-D optical flow has been extensively studied in computer vision to estimate motion from image sequences, but it still raises a wealth of issues, such as the treatment of outliers, motion discontinuities and occlusion. It relies on spatio-temporal brightness variations due to motion. Our proposed adaptive regularized approach deals with these issues within a common framework. It relies on the use of a data-driven technique called Mixed Norm (MN) to estimate the best motion vector for a given pixel. In our model, various types of noise can be handled, representing different sources of error. The motion vector estimation takes into consideration local image properties and it results from the minimization of a mixed norm functional with a regularization parameter depending on the kurtosis. This parameter determines the relative importance of the fourth norm and makes the functional convex. The main advantage of the developed procedure is that no knowledge of the noise distribution is necessary. Experiments indicate that this approach provides robust estimates of the optical flow.

Submitted 3 November, 2016; originally announced November 2016.

Comments: 8 pages, 4 figures. arXiv admin note: text overlap with arXiv:1403.7365

Journal ref: Proc. SPIE 5960, Visual Communications and Image Processing 2005, 59603W, July 31, 2006, Beijing, China

arXiv:1505.05135 [pdf]

Network Simulator - Visão Geral da Ferramenta de Simulação de Redes

Authors: Marcos Portnoi, Rafael Gonçalves Bezerra de Araújo

Abstract: This paper describes NS - Network Simulator, the computer networks simulation tool. We offer an overview of NS, and also analyze its characteristics and functions. Finally, we present in detail all steps for preparing a simulation of a simple model in NS.

Submitted 27 April, 2015; originally announced May 2015.

Comments: in Portuguese, Seminário Estudantil de Produção Acadêmica, 2002

Showing 1–27 of 27 results for author: Araujo, G