subscribe to arXiv mailings

HLSFactory: A Framework Empowering High-Level Synthesis Datasets for Machine Learning and Beyond

Authors: Stefan Abi-Karam, Rishov Sarkar, Allison Seigler, Sean Lowe, Zhigang Wei, Hanqiu Chen, Nanditha Rao, Lizy John, Aman Arora, Cong Hao

Abstract: Machine learning (ML) techniques have been applied to high-level synthesis (HLS) flows for quality-of-result (QoR) prediction and design space exploration (DSE). Nevertheless, the scarcity of accessible high-quality HLS datasets and the complexity of building such datasets present challenges. Existing datasets have limitations in terms of benchmark coverage, design space enumeration, vendor extens… ▽ More Machine learning (ML) techniques have been applied to high-level synthesis (HLS) flows for quality-of-result (QoR) prediction and design space exploration (DSE). Nevertheless, the scarcity of accessible high-quality HLS datasets and the complexity of building such datasets present challenges. Existing datasets have limitations in terms of benchmark coverage, design space enumeration, vendor extensibility, or lack of reproducible and extensible software for dataset construction. Many works also lack user-friendly ways to add more designs, limiting wider adoption of such datasets. In response to these challenges, we introduce HLSFactory, a comprehensive framework designed to facilitate the curation and generation of high-quality HLS design datasets. HLSFactory has three main stages: 1) a design space expansion stage to elaborate single HLS designs into large design spaces using various optimization directives across multiple vendor tools, 2) a design synthesis stage to execute HLS and FPGA tool flows concurrently across designs, and 3) a data aggregation stage for extracting standardized data into packaged datasets for ML usage. This tripartite architecture ensures broad design space coverage via design space expansion and supports multiple vendor tools. Users can contribute to each stage with their own HLS designs and synthesis results and extend the framework itself with custom frontends and tool flows. We also include an initial set of built-in designs from common HLS benchmarks curated open-source HLS designs. We showcase the versatility and multi-functionality of our framework through six case studies: I) Design space sampling; II) Fine-grained parallelism backend speedup; III) Targeting Intel's HLS flow; IV) Adding new auxiliary designs; V) Integrating published HLS data; VI) HLS tool version regression benchmarking. Code at https://github.com/sharc-lab/HLSFactory. △ Less

Submitted 17 May, 2024; v1 submitted 1 May, 2024; originally announced May 2024.

Comments: Edit to "Section V.E" for proper attribution of open-source HLSyn, AutoDSE, and the Merlin compiler

arXiv:2311.11384 [pdf, other]

PIMSAB: A Processing-In-Memory System with Spatially-Aware Communication and Bit-Serial-Aware Computation

Authors: Aman Arora, Jian Weng, Siyuan Ma, Tony Nowatzki, Lizy K. John

Abstract: Bit-serial Processing-In-Memory (PIM) is an attractive paradigm for accelerator architectures, for parallel workloads such as Deep Learning (DL), because of its capability to achieve massive data parallelism at a low area overhead and provide orders-of-magnitude data movement savings by moving computational resources closer to the data. While many PIM architectures have been proposed, improvements… ▽ More Bit-serial Processing-In-Memory (PIM) is an attractive paradigm for accelerator architectures, for parallel workloads such as Deep Learning (DL), because of its capability to achieve massive data parallelism at a low area overhead and provide orders-of-magnitude data movement savings by moving computational resources closer to the data. While many PIM architectures have been proposed, improvements are needed in communicating intermediate results to consumer kernels, for communication between tiles at scale, for reduction operations, and for efficiently performing bit-serial operations with constants. We present PIMSAB, a scalable architecture that provides spatially aware communication network for efficient intra-tile and inter-tile data movement and provides efficient computation support for generally inefficient bit-serial compute patterns. Our architecture consists of a massive hierarchical array of compute-enabled SRAMs (CRAMs) and is codesigned with a compiler to achieve high utilization. The key novelties of our architecture are: (1) providing efficient support for spatially-aware communication by providing local H-tree network for reductions, by adding explicit hardware for shuffling operands, and by deploying systolic broadcasting, and (2) taking advantage of the divisible nature of bit-serial computations through adaptive precision, bit-slicing and efficient handling of constant operations. When compared against a similarly provisioned modern Tensor Core GPU (NVIDIA A100), across common DL kernels and an end-to-end DL network (Resnet18), PIMSAB outperforms the GPU by 3x, and reduces energy by 4.2x. We compare PIMSAB with similarly provisioned state-of-the-art SRAM PIM (Duality Cache) and DRAM PIM (SIMDRAM) and observe a speedup of 3.7x and 3.88x respectively. △ Less

Submitted 19 November, 2023; originally announced November 2023.

Comments: Aman Arora and Jian Weng are co-first authors with equal contribution

arXiv:2304.10618 [pdf, other]

ULEEN: A Novel Architecture for Ultra Low-Energy Edge Neural Networks

Authors: Zachary Susskind, Aman Arora, Igor D. S. Miranda, Alan T. L. Bacellar, Luis A. Q. Villon, Rafael F. Katopodis, Leandro S. de Araujo, Diego L. C. Dutra, Priscila M. V. Lima, Felipe M. G. Franca, Mauricio Breternitz Jr., Lizy K. John

Abstract: The deployment of AI models on low-power, real-time edge devices requires accelerators for which energy, latency, and area are all first-order concerns. There are many approaches to enabling deep neural networks (DNNs) in this domain, including pruning, quantization, compression, and binary neural networks (BNNs), but with the emergence of the "extreme edge", there is now a demand for even more ef… ▽ More The deployment of AI models on low-power, real-time edge devices requires accelerators for which energy, latency, and area are all first-order concerns. There are many approaches to enabling deep neural networks (DNNs) in this domain, including pruning, quantization, compression, and binary neural networks (BNNs), but with the emergence of the "extreme edge", there is now a demand for even more efficient models. In order to meet the constraints of ultra-low-energy devices, we propose ULEEN, a model architecture based on weightless neural networks. Weightless neural networks (WNNs) are a class of neural model which use table lookups, not arithmetic, to perform computation. The elimination of energy-intensive arithmetic operations makes WNNs theoretically well suited for edge inference; however, they have historically suffered from poor accuracy and excessive memory usage. ULEEN incorporates algorithmic improvements and a novel training strategy inspired by BNNs to make significant strides in improving accuracy and reducing model size. We compare FPGA and ASIC implementations of an inference accelerator for ULEEN against edge-optimized DNN and BNN devices. On a Xilinx Zynq Z-7045 FPGA, we demonstrate classification on the MNIST dataset at 14.3 million inferences per second (13 million inferences/Joule) with 0.21 $μ$s latency and 96.2% accuracy, while Xilinx FINN achieves 12.3 million inferences per second (1.69 million inferences/Joule) with 0.31 $μ$s latency and 95.83% accuracy. In a 45nm ASIC, we achieve 5.1 million inferences/Joule and 38.5 million inferences/second at 98.46% accuracy, while a quantized Bit Fusion model achieves 9230 inferences/Joule and 19,100 inferences/second at 99.35% accuracy. In our search for ever more efficient edge devices, ULEEN shows that WNNs are deserving of consideration. △ Less

Submitted 20 April, 2023; originally announced April 2023.

Comments: 14 pages, 14 figures Portions of this article draw heavily from arXiv:2203.01479, most notably sections 5E and 5F.2

arXiv:2302.10977 [pdf, other]

HLSDataset: Open-Source Dataset for ML-Assisted FPGA Design using High Level Synthesis

Authors: Zhigang Wei, Aman Arora, Ruihao Li, Lizy K. John

Abstract: Machine Learning (ML) has been widely adopted in design exploration using high level synthesis (HLS) to give a better and faster performance, and resource and power estimation at very early stages for FPGA-based design. To perform prediction accurately, high-quality and large-volume datasets are required for training ML models.This paper presents a dataset for ML-assisted FPGA design using HLS, ca… ▽ More Machine Learning (ML) has been widely adopted in design exploration using high level synthesis (HLS) to give a better and faster performance, and resource and power estimation at very early stages for FPGA-based design. To perform prediction accurately, high-quality and large-volume datasets are required for training ML models.This paper presents a dataset for ML-assisted FPGA design using HLS, called HLSDataset. The dataset is generated from widely used HLS C benchmarks including Polybench, Machsuite, CHStone and Rossetta. The Verilog samples are generated with a variety of directives including loop unroll, loop pipeline and array partition to make sure optimized and realistic designs are covered. The total number of generated Verilog samples is nearly 9,000 per FPGA type. To demonstrate the effectiveness of our dataset, we undertake case studies to perform power estimation and resource usage estimation with ML models trained with our dataset. All the codes and dataset are public at the github repo.We believe that HLSDataset can save valuable time for researchers by avoiding the tedious process of running tools, scripting and parsing files to generate the dataset, and enable them to spend more time where it counts, that is, in training ML models. △ Less

Submitted 21 August, 2023; v1 submitted 17 February, 2023; originally announced February 2023.

Comments: 8 pages, 5 figures

arXiv:2203.12521 [pdf, other]

CoMeFa: Compute-in-Memory Blocks for FPGAs

Authors: Aman Arora, Tanmay Anand, Aatman Borda, Rishabh Sehgal, Bagus Hanindhito, Jaydeep Kulkarni, Lizy K. John

Abstract: Block RAMs (BRAMs) are the storage houses of FPGAs, providing extensive on-chip memory bandwidth to the compute units implemented using Logic Blocks (LBs) and Digital Signal Processing (DSP) slices. We propose modifying BRAMs to convert them to CoMeFa (Compute-In-Memory Blocks for FPGAs) RAMs. These RAMs provide highly-parallel compute-in-memory by combining computation and storage capabilities in… ▽ More Block RAMs (BRAMs) are the storage houses of FPGAs, providing extensive on-chip memory bandwidth to the compute units implemented using Logic Blocks (LBs) and Digital Signal Processing (DSP) slices. We propose modifying BRAMs to convert them to CoMeFa (Compute-In-Memory Blocks for FPGAs) RAMs. These RAMs provide highly-parallel compute-in-memory by combining computation and storage capabilities in one block. CoMeFa RAMs utilize the true dual port nature of FPGA BRAMs and contain multiple programmable single-bit bit-serial processing elements. CoMeFa RAMs can be used to compute in any precision, which is extremely important for evolving applications like Deep Learning. Adding CoMeFa RAMs to FPGAs significantly increases their compute density. We explore and propose two architectures of these RAMs: CoMeFa-D (optimized for delay) and CoMeFa-A (optimized for area). Compared to existing proposals, CoMeFa RAMs do not require changing the underlying SRAM technology like simultaneously activating multiple rows on the same port, and are practical to implement. CoMeFa RAMs are versatile blocks that find applications in numerous diverse parallel applications like Deep Learning, signal processing, databases, etc. By augmenting an Intel Arria-10-like FPGA with CoMeFa-D (CoMeFa-A) RAMs at the cost of 3.8% (1.2%) area, and with algorithmic improvements and efficient mapping, we observe a geomean speedup of 2.5x (1.8x), across several representative benchmarks. Replacing all or some BRAMs with CoMeFa RAMs in FPGAs can make them better accelerators of modern compute-intensive workloads. △ Less

Submitted 23 March, 2022; originally announced March 2022.

Comments: 10 pages, 12 figures, 4 tables, FCCM conference

arXiv:2203.01479 [pdf, other]

Weightless Neural Networks for Efficient Edge Inference

Authors: Zachary Susskind, Aman Arora, Igor Dantas Dos Santos Miranda, Luis Armando Quintanilla Villon, Rafael Fontella Katopodis, Leandro Santiago de Araujo, Diego Leonel Cadette Dutra, Priscila Machado Vieira Lima, Felipe Maia Galvao Franca, Mauricio Breternitz Jr., Lizy K. John

Abstract: Weightless Neural Networks (WNNs) are a class of machine learning model which use table lookups to perform inference. This is in contrast with Deep Neural Networks (DNNs), which use multiply-accumulate operations. State-of-the-art WNN architectures have a fraction of the implementation cost of DNNs, but still lag behind them on accuracy for common image recognition tasks. Additionally, many existi… ▽ More Weightless Neural Networks (WNNs) are a class of machine learning model which use table lookups to perform inference. This is in contrast with Deep Neural Networks (DNNs), which use multiply-accumulate operations. State-of-the-art WNN architectures have a fraction of the implementation cost of DNNs, but still lag behind them on accuracy for common image recognition tasks. Additionally, many existing WNN architectures suffer from high memory requirements. In this paper, we propose a novel WNN architecture, BTHOWeN, with key algorithmic and architectural improvements over prior work, namely counting Bloom filters, hardware-friendly hashing, and Gaussian-based nonlinear thermometer encodings to improve model accuracy and reduce area and energy consumption. BTHOWeN targets the large and growing edge computing sector by providing superior latency and energy efficiency to comparable quantized DNNs. Compared to state-of-the-art WNNs across nine classification datasets, BTHOWeN on average reduces error by more than than 40% and model size by more than 50%. We then demonstrate the viability of the BTHOWeN architecture by presenting an FPGA-based accelerator, and compare its latency and resource usage against similarly accurate quantized DNN accelerators, including Multi-Layer Perceptron (MLP) and convolutional models. The proposed BTHOWeN models consume almost 80% less energy than the MLP models, with nearly 85% reduction in latency. In our quest for efficient ML on the edge, WNNs are clearly deserving of additional attention. △ Less

Submitted 2 March, 2022; originally announced March 2022.

arXiv:2109.06133 [pdf, other]

Neuro-Symbolic AI: An Emerging Class of AI Workloads and their Characterization

Authors: Zachary Susskind, Bryce Arden, Lizy K. John, Patrick Stockton, Eugene B. John

Abstract: Neuro-symbolic artificial intelligence is a novel area of AI research which seeks to combine traditional rules-based AI approaches with modern deep learning techniques. Neuro-symbolic models have already demonstrated the capability to outperform state-of-the-art deep learning models in domains such as image and video reasoning. They have also been shown to obtain high accuracy with significantly l… ▽ More Neuro-symbolic artificial intelligence is a novel area of AI research which seeks to combine traditional rules-based AI approaches with modern deep learning techniques. Neuro-symbolic models have already demonstrated the capability to outperform state-of-the-art deep learning models in domains such as image and video reasoning. They have also been shown to obtain high accuracy with significantly less training data than traditional models. Due to the recency of the field's emergence and relative sparsity of published results, the performance characteristics of these models are not well understood. In this paper, we describe and analyze the performance characteristics of three recent neuro-symbolic models. We find that symbolic models have less potential parallelism than traditional neural models due to complex control flow and low-operational-intensity operations, such as scalar multiplication and tensor addition. However, the neural aspect of computation dominates the symbolic part in cases where they are clearly separable. We also find that data movement poses a potential bottleneck, as it does in many ML workloads. △ Less

Submitted 13 September, 2021; originally announced September 2021.

Comments: 11 pages, 7 figures

ACM Class: C.4; I.2.m

arXiv:2107.09178 [pdf, other]

Compute RAMs: Adaptable Compute and Storage Blocks for DL-Optimized FPGAs

Authors: Aman Arora, Bagus Hanindhito, Lizy K. John

Abstract: The configurable building blocks of current FPGAs -- Logic blocks (LBs), Digital Signal Processing (DSP) slices, and Block RAMs (BRAMs) -- make them efficient hardware accelerators for the rapid-changing world of Deep Learning (DL). Communication between these blocks happens through an interconnect fabric consisting of switching elements spread throughout the FPGA. In this paper, a new block, Comp… ▽ More The configurable building blocks of current FPGAs -- Logic blocks (LBs), Digital Signal Processing (DSP) slices, and Block RAMs (BRAMs) -- make them efficient hardware accelerators for the rapid-changing world of Deep Learning (DL). Communication between these blocks happens through an interconnect fabric consisting of switching elements spread throughout the FPGA. In this paper, a new block, Compute RAM, is proposed. Compute RAMs provide highly-parallel processing-in-memory (PIM) by combining computation and storage capabilities in one block. Compute RAMs can be integrated in the FPGA fabric just like the existing FPGA blocks and provide two modes of operation (storage or compute) that can be dynamically chosen. They reduce power consumption by reducing data movement, provide adaptable precision support, and increase the effective on-chip memory bandwidth. Compute RAMs also help increase the compute density of FPGAs. In our evaluation of addition, multiplication and dot-product operations across multiple data precisions (int4, int8 and bfloat16), we observe an average savings of 80% in energy consumption, and an improvement in execution time ranging from 20% to 80%. Adding Compute RAMs can benefit non-DL applications as well, and make FPGAs more efficient, flexible, and performant accelerators. △ Less

Submitted 30 September, 2021; v1 submitted 19 July, 2021; originally announced July 2021.

Comments: 8 pages, IEEE Signal Processing Society's ASILOMAR Conference on Signals, Systems and Computers

arXiv:2106.07087 [pdf, other]

Koios: A Deep Learning Benchmark Suite for FPGA Architecture and CAD Research

Authors: Aman Arora, Andrew Boutros, Daniel Rauch, Aishwarya Rajen, Aatman Borda, Seyed Alireza Damghani, Samidh Mehta, Sangram Kate, Pragnesh Patel, Kenneth B. Kent, Vaughn Betz, Lizy K. John

Abstract: With the prevalence of deep learning (DL) in many applications, researchers are investigating different ways of optimizing FPGA architecture and CAD to achieve better quality-of-results (QoR) on DL-based workloads. In this optimization process, benchmark circuits are an essential component; the QoR achieved on a set of benchmarks is the main driver for architecture and CAD design choices. However,… ▽ More With the prevalence of deep learning (DL) in many applications, researchers are investigating different ways of optimizing FPGA architecture and CAD to achieve better quality-of-results (QoR) on DL-based workloads. In this optimization process, benchmark circuits are an essential component; the QoR achieved on a set of benchmarks is the main driver for architecture and CAD design choices. However, current academic benchmark suites are inadequate, as they do not capture any designs from the DL domain. This work presents a new suite of DL acceleration benchmark circuits for FPGA architecture and CAD research, called Koios. This suite of 19 circuits covers a wide variety of accelerated neural networks, design sizes, implementation styles, abstraction levels, and numerical precisions. These designs are larger, more data parallel, more heterogeneous, more deeply pipelined, and utilize more FPGA architectural features compared to existing open-source benchmarks. This enables researchers to pin-point architectural inefficiencies for this class of workloads and optimize CAD tools on more realistic benchmarks that stress the CAD algorithms in different ways. In this paper, we describe the designs in our benchmark suite, present results of running them through the Verilog-to-Routing (VTR) flow using a recent FPGA architecture model, and identify key insights from the resulting metrics. On average, our benchmarks have 3.7x more netlist primitives, 1.8x and 4.7x higher DSP and BRAM densities, and 1.7x higher frequency with 1.9x more near-critical paths compared to the widely-used VTR suite. Finally, we present two example case studies showing how architectural exploration for DL-optimized FPGAs can be performed using our new benchmark suite. △ Less

Submitted 13 June, 2021; originally announced June 2021.

arXiv:2012.05181 [pdf, ps, other]

Virtual-Link: A Scalable Multi-Producer, Multi-Consumer Message Queue Architecture for Cross-Core Communication

Authors: Qinzhe Wu, Jonathan Beard, Ashen Ekanayake, Andreas Gerstlauer, Lizy K. John

Abstract: Cross-core communication is increasingly a bottleneck as the number of processing elements increase per system-on-chip. Typical hardware solutions to cross-core communication are often inflexible; while software solutions are flexible, they have performance scaling limitations. A key problem, as we will show, is that of shared state in software-based message queue mechanisms. This paper proposes V… ▽ More Cross-core communication is increasingly a bottleneck as the number of processing elements increase per system-on-chip. Typical hardware solutions to cross-core communication are often inflexible; while software solutions are flexible, they have performance scaling limitations. A key problem, as we will show, is that of shared state in software-based message queue mechanisms. This paper proposes Virtual-Link (VL), a novel light-weight communication mechanism with hardware support to facilitate M:N lock-free data movement. VL reduces the amount of coherent shared state, which is a bottleneck for many approaches, to zero. VL provides further latency benefit by keeping data on the fast path (i.e., within the on-chip interconnect). VL enables directed cache-injection (stashing) between PEs on the coherence bus, reducing the latency for core-to-core communication. VL is particularly effective for fine-grain tasks on streaming data. Evaluation on a full system simulator with 7 benchmarks shows that VL achieves a 2.09x speedup over state-of-the-art software-based communication mechanisms, while reducing memory traffic by 61%. △ Less

Submitted 19 January, 2021; v1 submitted 9 December, 2020; originally announced December 2020.

arXiv:2008.07361 [pdf]

How little data do we need for patient-level prediction?

Authors: Luis H. John, Jan A. Kors, Jenna M. Reps, Patrick B. Ryan, Peter R. Rijnbeek

Abstract: Objective: Provide guidance on sample size considerations for developing predictive models by empirically establishing the adequate sample size, which balances the competing objectives of improving model performance and reducing model complexity as well as computational requirements. Materials and Methods: We empirically assess the effect of sample size on prediction performance and model comple… ▽ More Objective: Provide guidance on sample size considerations for developing predictive models by empirically establishing the adequate sample size, which balances the competing objectives of improving model performance and reducing model complexity as well as computational requirements. Materials and Methods: We empirically assess the effect of sample size on prediction performance and model complexity by generating learning curves for 81 prediction problems in three large observational health databases, requiring training of 17,248 prediction models. The adequate sample size was defined as the sample size for which the performance of a model equalled the maximum model performance minus a small threshold value. Results: The adequate sample size achieves a median reduction of the number of observations between 9.5% and 78.5% for threshold values between 0.001 and 0.02. The median reduction of the number of predictors in the models at the adequate sample size varied between 8.6% and 68.3%, respectively. Discussion: Based on our results a conservative, yet significant, reduction in sample size and model complexity can be estimated for future prediction work. Though, if a researcher is willing to generate a learning curve a much larger reduction of the model complexity may be possible as suggested by a large outcome-dependent variability. Conclusion: Our results suggest that in most cases only a fraction of the available data was sufficient to produce a model close to the performance of one developed on the full data set, but with a substantially reduced model complexity. △ Less

Submitted 14 August, 2020; originally announced August 2020.

arXiv:1908.09207 [pdf, ps, other]

Demystifying the MLPerf Benchmark Suite

Authors: Snehil Verma, Qinzhe Wu, Bagus Hanindhito, Gunjan Jha, Eugene B. John, Ramesh Radhakrishnan, Lizy K. John

Abstract: MLPerf, an emerging machine learning benchmark suite strives to cover a broad range of applications of machine learning. We present a study on its characteristics and how the MLPerf benchmarks differ from some of the previous deep learning benchmarks like DAWNBench and DeepBench. We find that application benchmarks such as MLPerf (although rich in kernels) exhibit different features compared to ke… ▽ More MLPerf, an emerging machine learning benchmark suite strives to cover a broad range of applications of machine learning. We present a study on its characteristics and how the MLPerf benchmarks differ from some of the previous deep learning benchmarks like DAWNBench and DeepBench. We find that application benchmarks such as MLPerf (although rich in kernels) exhibit different features compared to kernel benchmarks such as DeepBench. MLPerf benchmark suite contains a diverse set of models which allows unveiling various bottlenecks in the system. Based on our findings, dedicated low latency interconnect between GPUs in multi-GPU systems is required for optimal distributed deep learning training. We also observe variation in scaling efficiency across the MLPerf models. The variation exhibited by the different models highlight the importance of smart scheduling strategies for multi-GPU training. Another observation is that CPU utilization increases with increase in number of GPUs used for training. Corroborating prior work we also observe and quantify improvements possible by compiler optimizations, mixed-precision training and use of Tensor Cores. △ Less

Submitted 24 August, 2019; originally announced August 2019.

arXiv:1805.12305 [pdf, other]

Start Late or Finish Early: A Distributed Graph Processing System with Redundancy Reduction

Authors: Shuang Song, Xu Liu, Qinzhe Wu, Andreas Gerstlauer, Tao Li, Lizy K. John

Abstract: Graph processing systems are important in the big data domain. However, processing graphs in parallel often introduces redundant computations in existing algorithms and models. Prior work has proposed techniques to optimize redundancies for the out-of-core graph systems, rather than the distributed graph systems. In this paper, we study various state-of-the-art distributed graph systems and observ… ▽ More Graph processing systems are important in the big data domain. However, processing graphs in parallel often introduces redundant computations in existing algorithms and models. Prior work has proposed techniques to optimize redundancies for the out-of-core graph systems, rather than the distributed graph systems. In this paper, we study various state-of-the-art distributed graph systems and observe root causes for these pervasively existing redundancies. To reduce redundancies without sacrificing parallelism, we further propose SLFE, a distributed graph processing system, designed with the principle of "start late or finish early". SLFE employs a novel preprocessing stage to obtain a graph's topological knowledge with negligible overhead. SLFE's redundancy-aware vertex-centric computation model can then utilize such knowledge to reduce the redundant computations at runtime. SLFE also provides a set of APIs to improve the programmability. Our experiments on an 8-node high-performance cluster show that SLFE outperforms all well-known distributed graph processing systems on real-world graphs (yielding up to 74.8x speedup). SLFE's redundancy-reduction schemes are generally applicable to other vertex-centric graph processing systems. △ Less

Submitted 30 May, 2018; originally announced May 2018.

Comments: 11 pages, 10 figures

arXiv:1511.02134 [pdf, other]

A quantitative performance analysis for Stokes solvers at the extreme scale

Authors: Björn Gmeiner, Markus Huber, Lorenz John, Ulrich Rüde, Barbara Wohlmuth

Abstract: This article presents a systematic quantitative performance analysis for large finite element computations on extreme scale computing systems. Three parallel iterative solvers for the Stokes system, discretized by low order tetrahedral elements, are compared with respect to their numerical efficiency and their scalability running on up to $786\,432$ parallel threads. A genuine multigrid method for… ▽ More This article presents a systematic quantitative performance analysis for large finite element computations on extreme scale computing systems. Three parallel iterative solvers for the Stokes system, discretized by low order tetrahedral elements, are compared with respect to their numerical efficiency and their scalability running on up to $786\,432$ parallel threads. A genuine multigrid method for the saddle point system using an Uzawa-type smoother provides the best overall performance with respect to memory consumption and time-to-solution. The largest system solved on a Blue Gene/Q system has more than ten trillion ($1.1 \cdot 10 ^{13}$) unknowns and requires about 13 minutes compute time. Despite the matrix free and highly optimized implementation, the memory requirement for the solution vector and the auxiliary vectors is about 200 TByte. Brandt's notion of "textbook multigrid efficiency" is employed to study the algorithmic performance of iterative solvers. A recent extension of this paradigm to "parallel textbook multigrid efficiency" makes it possible to assess also the efficiency of parallel iterative solvers for a given hardware architecture in absolute terms. The efficiency of the method is demonstrated for simulating incompressible fluid flow in a pipe filled with spherical obstacles. △ Less

Submitted 6 November, 2015; originally announced November 2015.

MSC Class: 65N55; 65Y05; 68Q25

arXiv:1504.02205 [pdf, other]

BigDataBench-MT: A Benchmark Tool for Generating Realistic Mixed Data Center Workloads

Authors: Rui Han, Shulin Zhan, Chenrong Shao, Junwei Wang, Lizy K. John, Jiangtao Xu, Gang Lu, Lei Wang

Abstract: Long-running service workloads (e.g. web search engine) and short-term data analysis workloads (e.g. Hadoop MapReduce jobs) co-locate in today's data centers. Developing realistic benchmarks to reflect such practical scenario of mixed workload is a key problem to produce trustworthy results when evaluating and comparing data center systems. This requires using actual workloads as well as guarantee… ▽ More Long-running service workloads (e.g. web search engine) and short-term data analysis workloads (e.g. Hadoop MapReduce jobs) co-locate in today's data centers. Developing realistic benchmarks to reflect such practical scenario of mixed workload is a key problem to produce trustworthy results when evaluating and comparing data center systems. This requires using actual workloads as well as guaranteeing their submissions to follow patterns hidden in real-world traces. However, existing benchmarks either generate actual workloads based on probability models, or replay real-world workload traces using basic I/O operations. To fill this gap, we propose a benchmark tool that is a first step towards generating a mix of actual service and data analysis workloads on the basis of real workload traces. Our tool includes a combiner that enables the replaying of actual workloads according to the workload traces, and a multi-tenant generator that flexibly scales the workloads up and down according to users' requirements. Based on this, our demo illustrates the workload customization and generation process using a visual interface. The proposed tool, called BigDataBench-MT, is a multi-tenant version of our comprehensive benchmark suite BigDataBench and it is publicly available from http://prof.ict.ac.cn/BigDataBench/multi-tenancyversion/. △ Less

Submitted 4 December, 2015; v1 submitted 9 April, 2015; originally announced April 2015.

Comments: 12 pages, 5 figures

arXiv:1306.1572 [pdf, other]

Algorithms for detecting dependencies and rigid subsystems for CAD

Authors: James Farre, Helena Kleinschmidt, Jessica Sidman, Audrey Lee-St. John, Stephanie Stark, Louis Theran, Xilin Yu

Abstract: Geometric constraint systems underly popular Computer Aided Design soft- ware. Automated approaches for detecting dependencies in a design are critical for developing robust solvers and providing informative user feedback, and we provide algorithms for two types of dependencies. First, we give a pebble game algorithm for detecting generic dependencies. Then, we focus on identifying the "special po… ▽ More Geometric constraint systems underly popular Computer Aided Design soft- ware. Automated approaches for detecting dependencies in a design are critical for developing robust solvers and providing informative user feedback, and we provide algorithms for two types of dependencies. First, we give a pebble game algorithm for detecting generic dependencies. Then, we focus on identifying the "special positions" of a design in which generically independent constraints become dependent. We present combinatorial algorithms for identifying subgraphs associated to factors of a particular polynomial, whose vanishing indicates a special position and resulting dependency. Further factoring in the Grassmann- Cayley algebra may allow a geometric interpretation giving conditions (e.g., "these two lines being parallel cause a dependency") determining the special position. △ Less

Submitted 1 October, 2015; v1 submitted 6 June, 2013; originally announced June 2013.

Comments: 37 pages, 14 figures (v2 is an expanded version of an AGD'14 abstract based on v1)

arXiv:1210.0451 [pdf, other]

Combinatorics and the Rigidity of CAD Systems

Authors: Audrey Lee-St. John, Jessica Sidman

Abstract: We study the rigidity of body-and-cad frameworks which capture the majority of the geometric constraints used in 3D mechanical engineering CAD software. We present a combinatorial characterization of the generic minimal rigidity of a subset of body-and-cad frameworks in which we treat 20 of the 21 body-and-cad constraints, omitting only point-point coincidences. While the handful of classical comb… ▽ More We study the rigidity of body-and-cad frameworks which capture the majority of the geometric constraints used in 3D mechanical engineering CAD software. We present a combinatorial characterization of the generic minimal rigidity of a subset of body-and-cad frameworks in which we treat 20 of the 21 body-and-cad constraints, omitting only point-point coincidences. While the handful of classical combinatorial characterizations of rigidity focus on distance constraints between points, this is the first result simultaneously addressing coincidence, angular, and distance constraints. Our result is stated in terms of the partitioning of a graph into edge-disjoint spanning trees. This combinatorial approach provides the theoretical basis for the development of deterministic algorithms (that will not depend on numerical methods) for analyzing the rigidity of body-and-cad frameworks. △ Less

Submitted 17 October, 2012; v1 submitted 1 October, 2012; originally announced October 2012.

Comments: 17 pages, 7 figures, version to appear in Symposium on Solid and Physical Modeling '12 and associated special issue of Computer Aided Design

MSC Class: 68R10; 05C50 ACM Class: G.2.1; J.6; I.3.5

arXiv:1108.5250 [pdf]

doi 10.1109/IEMBS.2011.6091552

Single-trial EEG Discrimination between Wrist and Finger Movement Imagery and Execution in a Sensorimotor BCI

Authors: A. K. Mohamed, T. Marwala, L. R. John

Abstract: A brain-computer interface (BCI) may be used to control a prosthetic or orthotic hand using neural activity from the brain. The core of this sensorimotor BCI lies in the interpretation of the neural information extracted from electroencephalogram (EEG). It is desired to improve on the interpretation of EEG to allow people with neuromuscular disorders to perform daily activities. This paper investi… ▽ More A brain-computer interface (BCI) may be used to control a prosthetic or orthotic hand using neural activity from the brain. The core of this sensorimotor BCI lies in the interpretation of the neural information extracted from electroencephalogram (EEG). It is desired to improve on the interpretation of EEG to allow people with neuromuscular disorders to perform daily activities. This paper investigates the possibility of discriminating between the EEG associated with wrist and finger movements. The EEG was recorded from test subjects as they executed and imagined five essential hand movements using both hands. Independent component analysis (ICA) and time-frequency techniques were used to extract spectral features based on event-related (de)synchronisation (ERD/ERS), while the Bhattacharyya distance (BD) was used for feature reduction. Mahalanobis distance (MD) clustering and artificial neural networks (ANN) were used as classifiers and obtained average accuracies of 65 % and 71 % respectively. This shows that EEG discrimination between wrist and finger movements is possible. The research introduces a new combination of motor tasks to BCI research. △ Less

Submitted 26 August, 2011; originally announced August 2011.

Comments: 33rd Annual International IEEE EMBS Conference 2011

arXiv:1006.1126 [pdf, other]

Body-and-cad Geometric Constraint Systems

Authors: Kirk Haller, Audrey Lee-St. John, Meera Sitharam, Ileana Streinu, Neil White

Abstract: Motivated by constraint-based CAD software, we develop the foundation for the rigidity theory of a very general model: the body-and-cad structure, composed of rigid bodies in 3D constrained by pairwise coincidence, angular and distance constraints. We identify 21 relevant geometric constraints and develop the corresponding infinitesimal rigidity theory for these structures. The classical body-and-… ▽ More Motivated by constraint-based CAD software, we develop the foundation for the rigidity theory of a very general model: the body-and-cad structure, composed of rigid bodies in 3D constrained by pairwise coincidence, angular and distance constraints. We identify 21 relevant geometric constraints and develop the corresponding infinitesimal rigidity theory for these structures. The classical body-and-bar rigidity model can be viewed as a body-and-cad structure that uses only one constraint from this new class. As a consequence, we identify a new, necessary, but not sufficient, counting condition for minimal rigidity of body-and-cad structures: nested sparsity. This is a slight generalization of the well-known sparsity condition of Maxwell. △ Less

Submitted 6 June, 2010; originally announced June 2010.

Comments: 33 pages, to appear in Computational Geometry: Theory and Applications (an abbreviated version appeared in: 24th Annual ACM Symposium on Applied Computing, Technical Track on Geometric Constraints and Reasoning GCR'09, Honolulu, HI, 2009)

MSC Class: 68R10; 05C85; 05C50 ACM Class: I.3.5; J.6; G.2.1

arXiv:0807.1765 [pdf]

doi 10.1007/978-3-642-03354-4_7

Archer: A Community Distributed Computing Infrastructure for Computer Architecture Research and Education

Authors: Renato Figueiredo, P. Oscar Boykin, Jose A. B. Fortes, Tao Li, Jie-Kwon Peir, David Wolinsky, Lizy John, David Kaeli, David Lilja, Sally McKee, Gokhan Memik, Alain Roy, Gary Tyson

Abstract: This paper introduces Archer, a community-based computing resource for computer architecture research and education. The Archer infrastructure integrates virtualization and batch scheduling middleware to deliver high-throughput computing resources aggregated from resources distributed across wide-area networks and owned by different participating entities in a seamless manner. The paper discusse… ▽ More This paper introduces Archer, a community-based computing resource for computer architecture research and education. The Archer infrastructure integrates virtualization and batch scheduling middleware to deliver high-throughput computing resources aggregated from resources distributed across wide-area networks and owned by different participating entities in a seamless manner. The paper discusses the motivations leading to the design of Archer, describes its core middleware components, and presents an analysis of the functionality and performance of a prototype wide-area deployment running a representative computer architecture simulation workload. △ Less

Submitted 10 July, 2008; originally announced July 2008.

Comments: 11 pages, 2 figures. Describes the Archer project, http://archer-project.org

ACM Class: C.0; I.6.3; C.2.4

Showing 1–20 of 20 results for author: John, L