subscribe to arXiv mailings

TDRAM: Tag-enhanced DRAM for Efficient Caching

Authors: Maryam Babaie, Ayaz Akram, Wendy Elsasser, Brent Haukness, Michael Miller, Taeksang Song, Thomas Vogelsang, Steven Woo, Jason Lowe-Power

Abstract: As SRAM-based caches are hitting a scaling wall, manufacturers are integrating DRAM-based caches into system designs to continue increasing cache sizes. While DRAM caches can improve the performance of memory systems, existing DRAM cache designs suffer from high miss penalties, wasted data movement, and interference between misses and demand requests. In this paper, we propose TDRAM, a novel DRAM… ▽ More As SRAM-based caches are hitting a scaling wall, manufacturers are integrating DRAM-based caches into system designs to continue increasing cache sizes. While DRAM caches can improve the performance of memory systems, existing DRAM cache designs suffer from high miss penalties, wasted data movement, and interference between misses and demand requests. In this paper, we propose TDRAM, a novel DRAM microarchitecture tailored for caching. TDRAM enhances HBM3 by adding a set of small low-latency mats to store tags and metadata on the same die as the data mats. These mats enable fast parallel tag and data access, on-DRAM-die tag comparison, and conditional data response based on comparison result (reducing wasted data transfers) akin to SRAM caches mechanism. TDRAM further optimizes the hit and miss latencies by performing opportunistic early tag probing. Moreover, TDRAM introduces a flush buffer to store conflicting dirty data on write misses, eliminating turnaround delays on data bus. We evaluate TDRAM using a full-system simulator and a set of HPC workloads with large memory footprints showing TDRAM provides at least 2.6$\times$ faster tag check, 1.2$\times$ speedup, and 21% less energy consumption, compared to the state-of-the-art commercial and research designs. △ Less

Submitted 22 April, 2024; originally announced April 2024.

arXiv:2404.03155 [pdf, other]

TEGRA -- Scaling Up Terascale Graph Processing with Disaggregated Computing

Authors: William Shaddix, Mahyar Samani, Marjan Fariborz, S. J. Ben Yoo, Jason Lowe-Power, Venkatesh Akella

Abstract: Graphs are essential for representing relationships in various domains, driving modern AI applications such as graph analytics and neural networks across science, engineering, cybersecurity, transportation, and economics. However, the size of modern graphs are rapidly expanding, posing challenges for traditional CPUs and GPUs in meeting real-time processing demands. As a result, hardware accelerat… ▽ More Graphs are essential for representing relationships in various domains, driving modern AI applications such as graph analytics and neural networks across science, engineering, cybersecurity, transportation, and economics. However, the size of modern graphs are rapidly expanding, posing challenges for traditional CPUs and GPUs in meeting real-time processing demands. As a result, hardware accelerators for graph processing have been proposed. However, the largest graphs that can be handled by these systems is still modest often targeting Twitter graph(1.4B edges approximately). This paper aims to address this limitation by developing a graph accelerator capable of terascale graph processing. Scale out architectures, architectures where nodes are replicated to expand to larger datasets, are natural for handling larger graphs. We argue that this approach is not appropriate for very large-scale graphs because it leads to under utilization of both memory resources and compute resources. Additionally, vertex and edge processing have different access patterns. Communication overheads also pose further challenges in designing scalable architectures. To overcome these issues, this paper proposes TEGRA, a scale-up architecture for terascale graph processing. TEGRA leverages a composable computing system with disaggregated resources and a communication architecture inspired by Active Messages. By employing direct communication between cores and optimizing memory interconnect utilization, TEGRA effectively reduces communication overhead and improves resource utilization, therefore enabling efficient processing of terascale graphs. △ Less

Submitted 3 April, 2024; originally announced April 2024.

Comments: Presented at the 3rd Workshop on Heterogeneous Composable and Disaggregated Systems (HCDS 2024)

arXiv:2307.00143 [pdf, other]

Centauri: Practical Rowhammer Fingerprinting

Authors: Hari Venugopalan, Kaustav Goswami, Zainul Abi Din, Jason Lowe-Power, Samuel T. King, Zubair Shafiq

Abstract: Fingerprinters leverage the heterogeneity in hardware and software configurations to extract a device fingerprint. Fingerprinting countermeasures attempt to normalize these attributes such that they present a uniform fingerprint across different devices or present different fingerprints for the same device each time. We present Centauri, a Rowhammer fingerprinting approach that can build a unique… ▽ More Fingerprinters leverage the heterogeneity in hardware and software configurations to extract a device fingerprint. Fingerprinting countermeasures attempt to normalize these attributes such that they present a uniform fingerprint across different devices or present different fingerprints for the same device each time. We present Centauri, a Rowhammer fingerprinting approach that can build a unique and stable fingerprints even across devices with homogeneous or normalized/obfuscated hardware and software configurations. To this end, Centauri leverages the process variation in the underlying manufacturing process that gives rise to unique distributions of Rowhammer-induced bit flips across different DRAM modules. Centauri's design and implementation is able to overcome memory allocation constrains without requiring root privileges. Our evaluation on a test bed of about one hundred DRAM modules shows that system achieves 99.91% fingerprinting accuracy. Centauri's fingerprints are also stable with daily experiments over a period of 10 days revealing no loss in fingerprinting accuracy. We show that Centauri is efficient, taking as little as 9.92 seconds to extract a fingerprint. Centauri is the first practical Rowhammer fingerprinting approach that is able to extract unique and stable fingerprints efficiently and at-scale. △ Less

Submitted 30 June, 2023; originally announced July 2023.

arXiv:2303.13029 [pdf, other]

Enabling Design Space Exploration of DRAM Caches in Emerging Memory Systems

Authors: Maryam Babaie, Ayaz Akram, Jason Lowe-Power

Abstract: The increasing growth of applications' memory capacity and performance demands has led the CPU vendors to deploy heterogeneous memory systems either within a single system or via disaggregation. For instance, systems like Intel's Knights Landing and Sapphire Rapids can be configured to use high bandwidth memory as a cache to main memory. While there is significant research investigating the design… ▽ More The increasing growth of applications' memory capacity and performance demands has led the CPU vendors to deploy heterogeneous memory systems either within a single system or via disaggregation. For instance, systems like Intel's Knights Landing and Sapphire Rapids can be configured to use high bandwidth memory as a cache to main memory. While there is significant research investigating the designs of DRAM caches, there has been little research investigating DRAM caches from a full system point of view, because there is not a suitable model available to the community to accurately study largescale systems with DRAM caches at a cycle-level. In this work we describe a new cycle-level DRAM cache model in the gem5 simulator which can be used for heterogeneous and disaggregated systems. We believe this model enables the community to perform a design space exploration for future generation of memory systems supporting DRAM caches. △ Less

Submitted 23 March, 2023; originally announced March 2023.

arXiv:2303.13026 [pdf, other]

A Cycle-level Unified DRAM Cache Controller Model for 3DXPoint Memory Systems in gem5

Authors: Maryam Babaie, Ayaz Akram, Jason Lowe-Power

Abstract: To accommodate the growing memory footprints of today's applications, CPU vendors have employed large DRAM caches, backed by large non-volatile memories like Intel Optane (e.g., Intel's Cascade Lake). The existing computer architecture simulators do not provide support to model and evaluate systems which use DRAM devices as a cache to the non-volatile main memory. In this work, we present a cycle-… ▽ More To accommodate the growing memory footprints of today's applications, CPU vendors have employed large DRAM caches, backed by large non-volatile memories like Intel Optane (e.g., Intel's Cascade Lake). The existing computer architecture simulators do not provide support to model and evaluate systems which use DRAM devices as a cache to the non-volatile main memory. In this work, we present a cycle-level DRAM cache model which is integrated with gem5. This model leverages the flexibility of gem5's memory devices models and full system support to enable exploration of many different DRAM cache designs. We demonstrate the usefulness of this new tool by exploring the design space of a DRAM cache controller through several case studies including the impact of scheduling policies, required buffering, combining different memory technologies (e.g., HBM, DDR3/4/5, 3DXPoint, High latency) as the cache and main memory, and the effect of wear-leveling when DRAM cache is backed by NVM main memory. We also perform experiments with real workloads in full-system simulations to validate the proposed model and show the sensitivity of these workloads to the DRAM cache sizes. △ Less

Submitted 23 March, 2023; originally announced March 2023.

arXiv:2012.04105 [pdf, other]

The Tribes of Machine Learning and the Realm of Computer Architecture

Authors: Ayaz Akram, Jason Lowe-Power

Abstract: Machine learning techniques have influenced the field of computer architecture like many other fields. This paper studies how the fundamental machine learning techniques can be applied towards computer architecture problems. We also provide a detailed survey of computer architecture research that employs different machine learning methods. Finally, we present some future opportunities and the outs… ▽ More Machine learning techniques have influenced the field of computer architecture like many other fields. This paper studies how the fundamental machine learning techniques can be applied towards computer architecture problems. We also provide a detailed survey of computer architecture research that employs different machine learning methods. Finally, we present some future opportunities and the outstanding challenges that need to be overcome to exploit full potential of machine learning for computer architecture. △ Less

Submitted 7 December, 2020; originally announced December 2020.

arXiv:2010.13216 [pdf, other]

Performance Analysis of Scientific Computing Workloads on Trusted Execution Environments

Authors: Ayaz Akram, Anna Giannakou, Venkatesh Akella, Jason Lowe-Power, Sean Peisert

Abstract: Scientific computing sometimes involves computation on sensitive data. Depending on the data and the execution environment, the HPC (high-performance computing) user or data provider may require confidentiality and/or integrity guarantees. To study the applicability of hardware-based trusted execution environments (TEEs) to enable secure scientific computing, we deeply analyze the performance impa… ▽ More Scientific computing sometimes involves computation on sensitive data. Depending on the data and the execution environment, the HPC (high-performance computing) user or data provider may require confidentiality and/or integrity guarantees. To study the applicability of hardware-based trusted execution environments (TEEs) to enable secure scientific computing, we deeply analyze the performance impact of AMD SEV and Intel SGX for diverse HPC benchmarks including traditional scientific computing, machine learning, graph analytics, and emerging scientific computing workloads. We observe three main findings: 1) SEV requires careful memory placement on large scale NUMA machines (1$\times$$-$3.4$\times$ slowdown without and 1$\times$$-$1.15$\times$ slowdown with NUMA aware placement), 2) virtualization$-$a prerequisite for SEV$-$results in performance degradation for workloads with irregular memory accesses and large working sets (1$\times$$-$4$\times$ slowdown compared to native execution for graph applications) and 3) SGX is inappropriate for HPC given its limited secure memory size and inflexible programming model (1.2$\times$$-$126$\times$ slowdown over unsecure execution). Finally, we discuss forthcoming new TEE designs and their potential impact on scientific computing. △ Less

Submitted 25 October, 2020; originally announced October 2020.

arXiv:2007.03152 [pdf, other]

The gem5 Simulator: Version 20.0+

Authors: Jason Lowe-Power, Abdul Mutaal Ahmad, Ayaz Akram, Mohammad Alian, Rico Amslinger, Matteo Andreozzi, Adrià Armejach, Nils Asmussen, Brad Beckmann, Srikant Bharadwaj, Gabe Black, Gedare Bloom, Bobby R. Bruce, Daniel Rodrigues Carvalho, Jeronimo Castrillon, Lizhong Chen, Nicolas Derumigny, Stephan Diestelhorst, Wendy Elsasser, Carlos Escuin, Marjan Fariborz, Amin Farmahini-Farahani, Pouya Fotouhi, Ryan Gambord, Jayneel Gandhi , et al. (53 additional authors not shown)

Abstract: The open-source and community-supported gem5 simulator is one of the most popular tools for computer architecture research. This simulation infrastructure allows researchers to model modern computer hardware at the cycle level, and it has enough fidelity to boot unmodified Linux-based operating systems and run full applications for multiple architectures including x86, Arm, and RISC-V. The gem5 si… ▽ More The open-source and community-supported gem5 simulator is one of the most popular tools for computer architecture research. This simulation infrastructure allows researchers to model modern computer hardware at the cycle level, and it has enough fidelity to boot unmodified Linux-based operating systems and run full applications for multiple architectures including x86, Arm, and RISC-V. The gem5 simulator has been under active development over the last nine years since the original gem5 release. In this time, there have been over 7500 commits to the codebase from over 250 unique contributors which have improved the simulator by adding new features, fixing bugs, and increasing the code quality. In this paper, we give and overview of gem5's usage and features, describe the current state of the gem5 simulator, and enumerate the major changes since the initial release of gem5. We also discuss how the gem5 simulator has transitioned to a formal governance model to enable continued improvement and community support for the next 20 years of computer architecture research. △ Less

Submitted 29 September, 2020; v1 submitted 6 July, 2020; originally announced July 2020.

Comments: Source, comments, and feedback: https://github.com/darchr/gem5-20-paper

arXiv:1608.07485 [pdf, ps, other]

When to use 3D Die-Stacked Memory for Bandwidth-Constrained Big Data Workloads

Authors: Jason Lowe-Power, Mark D. Hill, David A. Wood

Abstract: Response time requirements for big data processing systems are shrinking. To meet this strict response time requirement, many big data systems store all or most of their data in main memory to reduce the access latency. Main memory capacities have grown, and systems with 2 TB of main memory capacity available today. However, the rate at which processors can access this data--the memory bandwidth--… ▽ More Response time requirements for big data processing systems are shrinking. To meet this strict response time requirement, many big data systems store all or most of their data in main memory to reduce the access latency. Main memory capacities have grown, and systems with 2 TB of main memory capacity available today. However, the rate at which processors can access this data--the memory bandwidth--has not grown at the same rate. In fact, some of these big-memory systems can access less than 10% of their main memory capacity in one second (billions of processor cycles). 3D die-stacking is one promising solution to this bandwidth problem, and industry is investing significantly in 3D die-stacking. We use a simple back-of-the-envelope-style model to characterize if and when the 3D die-stacked architecture is more cost-effective than current architectures for in-memory big data workloads. We find that die-stacking has much higher performance than current systems (up to 256x lower response times), and it does not require expensive memory over provisioning to meet real-time (10 ms) response time service-level agreements. However, the power requirements of the die-stacked systems are significantly higher (up to 50x) than current systems, and its memory capacity is lower in many cases. Even in this limited case study, we find 3D die-stacking is not a panacea. Today, die-stacking is the most cost-effective solution for strict SLAs and by reducing the power of the compute chip and increasing memory densities die-stacking can be cost-effective under other constraints in the future. △ Less

Submitted 26 August, 2016; originally announced August 2016.

Comments: Originally presented The Seventh workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware (BPOE-7). http://www.bafst.com/events/asplos16/bpoe7/

Showing 1–9 of 9 results for author: Lowe-Power, J