
Towards Data-center Level Carbon Modeling and Optimization for Deep Learning Inference

Authors: Shixin Ji, Zhuoping Yang, Xingzhen Chen, Jingtong Hu, Yiyu Shi, Alex K. Jones, Peipei Zhou

Abstract: Recently, the increasing need for computing resources has led to the prosperity of data centers, which poses challenges to the environmental impacts and calls for improvements in data center provisioning strategies. In this work, we show a comprehensive analysis based on profiling a variety of deep-learning inference applications on different generations of GPU servers. Our analysis reveals several critical factors which can largely affect the design space of provisioning strategies including the hardware embodied cost estimation, application-specific features, and the distribution of carbon cost each year, which prior works have omitted. Based on the observations, we further present a first-order modeling and optimization tool for data center provisioning and scheduling and highlight the importance of environmental impacts from data center management.

Submitted 7 March, 2024; originally announced March 2024.

Comments: 12 pages, 9 figures

arXiv:2401.16694 [pdf, other]

EdgeOL: Efficient in-situ Online Learning on Edge Devices

Authors: Sheng Li, Geng Yuan, Yawen Wu, Yue Dai, Chao Wu, Alex K. Jones, Jingtong Hu, Yanzhi Wang, Xulong Tang

Abstract: Emerging applications, such as robot-assisted eldercare and object recognition, generally employ deep learning neural networks (DNNs) and naturally require: i) handling streaming-in inference requests and ii) adapting to possible deployment scenario changes. Online model fine-tuning is widely adopted to satisfy these needs. However, an inappropriate fine-tuning scheme could involve significant energy consumption, making it challenging to deploy on edge devices. In this paper, we propose EdgeOL, an edge online learning framework that optimizes inference accuracy, fine-tuning execution time, and energy efficiency through both inter-tuning and intra-tuning optimizations. Experimental results show that, on average, EdgeOL reduces overall fine-tuning execution time by 64%, energy consumption by 52%, and improves average inference accuracy by 1.75% over the immediate online learning strategy.

Submitted 15 March, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

arXiv:2401.10417 [pdf, other]

doi 10.1145/3626202.3637569

SSR: Spatial Sequential Hybrid Architecture for Latency Throughput Tradeoff in Transformer Acceleration

Authors: Jinming Zhuang, Zhuoping Yang, Shixin Ji, Heng Huang, Alex K. Jones, Jingtong Hu, Yiyu Shi, Peipei Zhou

Abstract: With the increase in the computation intensity of the chip, the mismatch between computation layer shapes and the available computation resource significantly limits the utilization of the chip. Driven by this observation, prior works discuss spatial accelerators or dataflow architecture to maximize the throughput. However, using spatial accelerators could potentially increase the execution latency. In this work, we first systematically investigate two execution models: (1) sequentially (temporally) launch one monolithic accelerator, and (2) spatially launch multiple accelerators. From the observations, we find that there is a latency throughput tradeoff between these two execution models, and combining these two strategies together can give us a more efficient latency throughput Pareto front. To achieve this, we propose spatial sequential architecture (SSR) and SSR design automation framework to explore both strategies together when deploying deep learning inference. We use the 7nm AMD Versal ACAP VCK190 board to implement SSR accelerators for four end-to-end transformer-based deep learning models. SSR achieves average throughput gains of 2.53x, 35.71x, and 14.20x under different batch sizes compared to the 8nm Nvidia GPU A10G, 16nm AMD FPGAs ZCU102, and U250. The average energy efficiency gains are 8.51x, 6.75x, and 21.22x, respectively.
Compared with the sequential-only solution and spatial-only solution on VCK190, our spatial-sequential-hybrid solutions achieve higher throughput under the same latency requirement and lower latency under the same throughput requirement. We also use SSR analytical models to demonstrate how to use SSR to optimize solutions on other computing platforms, e.g., 14nm Intel Stratix 10 NX.

Submitted 18 February, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

Journal ref: 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA '24)

arXiv:2401.06270 [pdf, other]

SCARIF: Towards Carbon Modeling of Cloud Servers with Accelerators

Authors: Shixin Ji, Zhuoping Yang, Xingzhen Chen, Stephen Cahoon, Jingtong Hu, Yiyu Shi, Alex K. Jones, Peipei Zhou

Abstract: Embodied carbon has been widely reported as a significant component in the full system lifecycle of various computing systems' greenhouse gas emissions. Many efforts have been undertaken to quantify the elements that comprise this embodied carbon, from tools that evaluate semiconductor manufacturing to those that can quantify different elements of the computing system from commercial and academic sources. However, these tools cannot easily reproduce results reported by server vendors' product carbon reports and the accuracy can vary substantially due to various assumptions. Furthermore, attempts to determine greenhouse gas contributions using bottom-up methodologies often do not agree with system-level studies and are hard to rectify. Nonetheless, given there is a need to consider all contributions to greenhouse gas emissions in datacenters, we propose SCARIF, the Server Carbon including Accelerator Reporter with Intelligence-based Formulation tool. SCARIF has three main contributions: (1) We first collect reported carbon cost data from server vendors and design statistical models to predict the embodied carbon cost so that users can get the embodied carbon cost for their server configurations. (2) We provide embodied carbon cost if users configure servers with accelerators including GPUs and FPGAs. (3) By using case studies, we show that certain design choices of data center management might flip based on the insights and observations from using SCARIF. Thus, SCARIF provides an opportunity for large-scale datacenter and hyperscaler design.
We release SCARIF as an open-source tool at https://github.com/arc-research-lab/SCARIF.

Submitted 22 May, 2024; v1 submitted 11 January, 2024; originally announced January 2024.

Comments: 6 pages; 6 figures; 3 tables. Accepted by ISVLSI '24

arXiv:2312.02991 [pdf, other]

REFRESH FPGAs: Sustainable FPGA Chiplet Architectures

Authors: Peipei Zhou, Jinming Zhuang, Stephen Cahoon, Yue Tang, Zhuoping Yang, Xingzhen Chen, Yiyu Shi, Jingtong Hu, Alex K. Jones

Abstract: There is a growing call for greater amounts of increasingly agile computational power for edge and cloud infrastructure to serve the computationally complex needs of ubiquitous computing devices. Thus, an important challenge is addressing the holistic environmental impacts of these next-generation computing systems. To accomplish this, a life-cycle view of sustainability for computing advancements is necessary to reduce environmental impacts such as greenhouse warming gas emissions from these computing choices. Unfortunately, decadal efforts to address operational energy efficiency in computing devices have ignored and in some cases exacerbated embodied impacts from manufacturing these edge and cloud systems, particularly their integrated circuits. During this time FPGA architectures have not changed dramatically except to increase in size. Given this context, we propose REFRESH FPGAs to build new FPGA devices and architectures from recently retired FPGA dies using 2.5D integration. Building REFRESH FPGAs requires creative architectures that leverage existing chiplet pins with an inexpensive-to-manufacture interposer coupled with creative design automation. In this paper, we discuss how REFRESH FPGAs can leverage industry trends for renewable energy integration into data centers while providing an overall improvement for sustainability and amortizing their significant embodied cost investment over a much longer "first" lifetime.

Submitted 27 November, 2023; originally announced December 2023.

arXiv:2309.12275 [pdf, other]

AIM: Accelerating Arbitrary-precision Integer Multiplication on Heterogeneous Reconfigurable Computing Platform Versal ACAP

Authors: Zhuoping Yang, Jinming Zhuang, Jiaqi Yin, Cunxi Yu, Alex K. Jones, Peipei Zhou

Abstract: Arbitrary-precision integer multiplication is the core kernel of many applications in simulation, cryptography, etc. Existing acceleration of arbitrary-precision integer multiplication includes CPUs, GPUs, FPGAs, and ASICs. Among these accelerators, FPGAs promise to provide both good energy efficiency and flexibility. Surprisingly, in our implementations, FPGA has the lowest energy efficiency, i.e., 0.29x of the CPU and 0.17x of the GPU with the same generation fabrication. Therefore, key questions arise: Where do the energy efficiency gains of CPUs and GPUs come from? Can reconfigurable computing do better? If so, how can that be achieved? We identify that the biggest energy efficiency gains of the CPUs and GPUs come from the dedicated vector units. FPGA uses DSPs and lookup tables to compose the needed computation, which incurs overhead when compared to using vector units directly. New reconfigurable computing, e.g., 'FPGA+vector units', is a novel and feasible solution to improve energy efficiency. In this paper, we propose to map arbitrary-precision integer multiplication onto such a heterogeneous platform, i.e., AMD/Xilinx Versal ACAP architecture. Designing on Versal ACAP incurs several challenges and we propose AIM: Arbitrary-precision Integer Multiplication on Versal ACAP to automate and optimize the design. AIM framework includes design space exploration and AIM automatic code generation to facilitate the system design and verification.
We deploy the AIM framework on three different applications, including large integer multiplication (LIM), RSA, and Mandelbrot, on the AMD/Xilinx Versal ACAP VCK190 evaluation board. Our experimental results show that AIM achieves up to 12.6x and 2.1x energy efficiency gains over the Intel Xeon Ice Lake 6346 CPU and NVidia A5000 GPU, respectively, which makes reconfigurable computing the most energy-efficient platform compared with CPUs and GPUs.

Submitted 21 September, 2023; originally announced September 2023.

arXiv:2207.01209 [pdf, other]

Sustainable AI Processing at the Edge

Authors: Sébastien Ollivier, Sheng Li, Yue Tang, Chayanika Chaudhuri, Peipei Zhou, Xulong Tang, Jingtong Hu, Alex K. Jones

Abstract: Edge computing is a popular target for accelerating machine learning algorithms supporting mobile devices without requiring the communication latencies to handle them in the cloud. Edge deployments of machine learning primarily consider traditional concerns such as SWaP constraints (Size, Weight, and Power) for their installations. However, such metrics are not entirely sufficient to consider environmental impacts from computing given the significant contributions from embodied energy and carbon. In this paper we explore the tradeoffs of convolutional neural network acceleration engines for both inference and on-line training. In particular, we explore the use of processing-in-memory (PIM) approaches, mobile GPU accelerators, and recently released FPGAs, and compare them with novel Racetrack memory PIM. Replacing PIM-enabled DDR3 with Racetrack memory PIM can recover its embodied energy as quickly as 1 year. For high activity ratios, mobile GPUs can be more sustainable but have higher embodied energy to overcome compared to PIM-enabled Racetrack memory.

Submitted 4 July, 2022; originally announced July 2022.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2205.12494 [pdf, other]

A Multi-domain Magneto Tunnel Junction for Racetrack Nanowire Strips

Authors: Prayash Dutta, Albert Lee, Kang L. Wang, Alex K. Jones, Sanjukta Bhanja

Abstract: Domain-wall memory (DWM) has SRAM class access performance, low energy, high endurance, high density, and CMOS compatibility. Recently, shift reliability and processing-using-memory (PuM) proposals developed a need to count the number of parallel or anti-parallel domains in a portion of the DWM nanowire. In this paper we propose a multi-domain magneto-tunnel junction (MTJ) that can detect different resistance levels as a function of the number of parallel or anti-parallel domains. Using detailed micromagnetic simulation with LLG, we demonstrate the multi-domain MTJ, study the benefit of its macro-size on resilience to process variation and present a macro-model for scaling the size of the multi-domain MTJ. Our results indicate scalability to seven domains while maintaining a 16.3mV sense margin.

Submitted 25 May, 2022; originally announced May 2022.

Comments: This paper is under review for possible publication by the IEEE

arXiv:2205.02046 [pdf, other]

doi 10.1109/LCA.2022.3194263

DNA Pre-alignment Filter using Processing Near Racetrack Memory

Authors: Fazal Hameed, Asif Ali Khan, Sebastien Ollivier, Alex K. Jones, Jeronimo Castrillon

Abstract: Recent DNA pre-alignment filter designs employ DRAM for storing the reference genome and its associated meta-data. However, DRAM incurs increasingly high background and refresh energy consumption as devices scale. To overcome this problem, this paper explores a design with racetrack memory (RTM)--an emerging non-volatile memory that promises higher storage density, faster access latency, and lower energy consumption. Multi-bit storage cells in RTM are inherently sequential and thus require data placement strategies to mitigate the performance and energy impacts of shifting during data accesses. We propose a near-memory pre-alignment filter with a novel data mapping and several shift reduction strategies designed explicitly for RTM. On a set of four input genomes from the 1000 Genome Project, our approach improves performance and energy efficiency by 68% and 52%, respectively, compared to the state-of-the-art proposed DRAM-based architecture.

Submitted 4 May, 2022; originally announced May 2022.

Report number: Volume 21, Issue 2

Journal ref: IEEE Computer Architecture Letters 2022

arXiv:2204.13788 [pdf, other]

FPIRM: Floating-point Processing in Racetrack Memories

Authors: Sébastien Ollivier, Xinyi Zhang, Yue Tang, Chayanika Choudhuri, Jingtong Hu, Alex K. Jones

Abstract: Convolutional neural networks (CNN) have become a ubiquitous algorithm with growing applications in mobile and edge settings. We describe a compute-in-memory (CIM) technique called FPIRM using Racetrack Memory (RM) to accelerate CNNs for edge systems. Using transverse read, a technique that can determine the number of '1's in multiple adjacent domains, FPIRM can efficiently implement multi-operand bulk-bitwise and addition computations, and two-operand multiplication. We discuss how FPIRM can implement both variable precision integer and floating point arithmetic. This allows both CNN inference and on-device training without expensive data movement to the cloud. Based on these functions we demonstrate implementation of several CNNs with back propagation using RM CIM and compare these to state-of-the-art implementations of CIM inference and training in Field-Programmable Gate Arrays. During training, FPIRM improves efficiency by 2$\times$, reducing energy consumption by at least 27% and increasing throughput by at least 18% compared with FPGA.

Submitted 1 August, 2022; v1 submitted 28 April, 2022; originally announced April 2022.

Comments: This paper is accepted to the IEEE Micro Magazine with the title "POD-RACING: Bulk-Bitwise to Floating-point Compute In Racetrack Memory for Machine Learning at the Edge"

arXiv:2203.08303 [pdf, other]

doi 10.1109/TCSII.2022.3161594

Pinning Fault Mode Modeling for DWM Shifting

Authors: Kawsher Roxy, Stephen Longofono, Sebastien Olliver, Sanjukta Bhanja, Alex K. Jones

Abstract: Extreme scaling for purposes of achieving higher density and lower energy continues to increase the probability of memory faults. For domain wall (DW) memories, misalignment faults arise when aligning domains with access points. A previously understudied type of shifting fault, a pinning fault, may occur due to non-uniform pinning potential distribution caused by notches with fabrication imperfections. This non-uniformity can pin a wall during current-induced DW motion. This paper provides a model of geometric variations in the width, depth, and curvature of a notch, their impacts on the critical shift current, and a study of the resulting impact on fault rates of DW memory systems. An increase in the effective critical shift current due to 5% variation predicts a pinning fault rate on the order of $10^{-8}$ per shift, which results in a mean-time-to-failure of circa 2s for a DW memory system.

Submitted 15 March, 2022; originally announced March 2022.

Comments: IEEE Transactions on Circuits and Systems--II, 2022

arXiv:2112.12692 [pdf, other]

doi 10.1109/TNANO.2022.3158889

XDWM: A 2D Domain Wall Memory

Authors: Arifa Hoque, Alex K. Jones, Sanjukta Bhanja

Abstract: Domain-Wall Memory (DWM) structures typically bundle nanowires shifted together for parallel access. Ironically, this organization does not allow the natural shifting of DWM to realize logical shifting within data elements. We describe a novel 2-D DWM cross-point (X-Cell) that allows two individual nanowires placed orthogonally to share the X-Cell. Each nanowire can operate independently while sharing the value at the X-Cell. Using X-Cells, we propose an orthogonal nanowire in the Y dimension overlaid on a bundle of X dimension nanowires for a cross-DWM or XDWM. We demonstrate that the bundle shifts correctly in the X-Direction, and that data can be logically shifted in the Y-direction providing novel data movement and supporting processing-in-memory. We conducted studies on the requirements for physical cell dimensions and shift currents for XDWM. Due to the non-standard domain, our micro-magnetic studies demonstrate that XDWM introduces a shift current penalty of 6.25% while shifting happens in one nanowire compared to a standard nanowire. We also demonstrate correct shifting using nanowire bundles in both the X- and Y- dimensions. Using magnetic simulation to derive the values for SPICE simulation we show the maximum leakage current between nanowires when shifting the bundle together is $\le3$% indicating that sneak paths are not problematic for XDWM.

Submitted 23 December, 2021; originally announced December 2021.

Comments: in IEEE Transactions on Nanotechnology

Journal ref: IEEE Transactions on Nanotechnology

arXiv:2112.01658 [pdf, other]

Virtual Coset Coding for Encrypted Non-Volatile Memories with Multi-Level Cells

Authors: Stephen Longofono, Seyed Mohammad Seyedzadeh, Alex K. Jones

Abstract: PCM is a popular backing memory for DRAM main memory in tiered memory systems. PCM has asymmetric access energy; writes dominate reads. MLC asymmetry can vary by an order of magnitude. Many schemes have been developed to take advantage of the asymmetric patterns of 0s and 1s in the data to reduce write energy. Because the memory is non-volatile, data can be recovered via physical attack or across system reboot cycles. To protect information stored in PCM against these attacks requires encryption. Unfortunately, most encryption algorithms scramble 0s and 1s in the data, effectively removing any patterns and negatively impacting schemes that leverage data bias and similarity to reduce write energy. In this paper, we introduce Virtual Coset Coding (VCC) as a workload-independent approach that reduces costly symbol transitions for storing encrypted data. VCC is based on two ideas. First, using coset encoding with random coset candidates, it is possible to effectively reduce the frequency of costly bit/symbol transitions when writing encrypted data. Second, a small set of random substrings can be used to achieve the same encoding efficiency as a large number of random coset candidates, but at a much lower encoding/decoding cost. Additionally, we demonstrate how VCC can be leveraged for energy reduction in combination with fault-mitigation and fault-tolerance to dramatically increase the lifetimes of endurance-limited NVMs, such as PCM. We evaluate the design of VCC and demonstrate that it can be implemented on-chip with only a nominal area overhead.
VCC reduces dynamic energy by 22-28% while maintaining the same performance. Using our multi-objective optimization approach achieves at least a 36% improvement in lifetime over the state-of-the-art and at least a 50% improvement in lifetime vs. an unencoded memory, while maintaining its energy savings and system performance.

Submitted 2 December, 2021; originally announced December 2021.

Comments: Preprint: Accepted to HPCA 2022

arXiv:2111.02246 [pdf, other]

doi 10.1145/3524071

Brain-inspired Cognition in Next Generation Racetrack Memories

Authors: Asif Ali Khan, Sebastien Ollivier, Stephen Longofono, Gerald Hempel, Jeronimo Castrillon, Alex K. Jones

Abstract: Hyperdimensional computing (HDC) is an emerging computational framework inspired by the brain that operates on vectors with thousands of dimensions to emulate cognition. Unlike conventional computational frameworks that operate on numbers, HDC, like the brain, uses high dimensional random vectors and is capable of one-shot learning. HDC is based on a well-defined set of arithmetic operations and is highly error-resilient. The core operations of HDC manipulate HD vectors in bulk bit-wise fashion, offering many opportunities to leverage parallelism. Unfortunately, on conventional von Neumann architectures, the continuous movement of HD vectors among the processor and the memory can make the cognition task prohibitively slow and energy-intensive. Hardware accelerators only marginally improve related metrics. On the contrary, only partial implementation of an HDC framework inside memory, using emerging memristive devices, has reported considerable performance/energy gains. This paper presents an architecture based on racetrack memory (RTM) to conduct and accelerate the entire HDC framework within the memory. The proposed solution requires minimal additional CMOS circuitry and uses a read operation across multiple domains in RTMs called transverse read (TR) to realize exclusive-or (XOR) and addition operations. To minimize the overhead of the CMOS circuitry, we propose an RTM nanowires-based counting mechanism that leverages the TR operation and the standard RTM operations.
With language recognition as the use case, the proposed design demonstrates 7.8x and 5.3x reductions in the overall runtime and energy consumption, respectively, compared to the FPGA design. Compared to the state-of-the-art in-memory implementation, the proposed HDC system reduces the energy consumption by 8.6x.

Submitted 15 March, 2022; v1 submitted 3 November, 2021; originally announced November 2021.

Comments: Preprint, accepted for publication, ACM Transactions on Embedded Computing Systems. ACM Trans. Embed. Comput. Syst. (March 2022)

arXiv:2108.01202 [pdf, other]

PIRM: Processing In Racetrack Memories

Authors: Sebastien Ollivier, Stephen Longofono, Prayash Dutta, Jingtong Hu, Sanjukta Bhanja, Alex K. Jones

Abstract: The growth in data needs of modern applications has created significant challenges for modern systems leading to a "memory wall." Spintronic Domain Wall Memory (DWM), related to Spin-Transfer Torque Memory (STT-MRAM), provides near-SRAM read/write performance, energy savings and nonvolatility, potential for extremely high storage density, and does not have significant endurance limitations. However, DWM's benefits cannot address data access latency and throughput limitations of memory bus bandwidth. We propose PIRM, a DWM-based in-memory computing solution that leverages the properties of DWM nanowires and allows them to serve as polymorphic gates. While normally DWM is accessed by applying spin polarized currents orthogonal to the nanowire at access points to read individual bits, transverse access along the DWM nanowire allows the differentiation of the aggregate resistance of multiple bits in the nanowire, akin to a multilevel cell. PIRM leverages this transverse reading to directly provide bulk-bitwise logic of multiple adjacent operands in the nanowire, simultaneously. Based on this in-memory logic, PIRM provides a technique to conduct multi-operand addition and two operand multiplication using transverse access. PIRM provides a 1.6x speedup compared to the leading DRAM PIM technique for query applications that leverage bulk bitwise operations. Compared to the leading PIM technique for DWM, PIRM improves performance by 6.9x, 2.3x and energy by 5.5x, 3.4x for 8-bit addition and multiplication, respectively.
For arithmetic heavy benchmarks, PIRM reduces access latency by 2.1x, while decreasing energy consumption by 25.2x for a reasonable 10% area overhead versus non-PIM DWM.

Submitted 1 August, 2022; v1 submitted 2 August, 2021; originally announced August 2021.

Comments: This paper is accepted to the IEEE/ACM Symposium on Microarchitecture, October 2022 under the title "CORUSCANT: Fast Efficient Processing-in-Racetrack Memories"

arXiv:2005.01588 [pdf]

Workshops on Extreme Scale Design Automation (ESDA) Challenges and Opportunities for 2025 and Beyond

Authors: R. Iris Bahar, Alex K. Jones, Srinivas Katkoori, Patrick H. Madden, Diana Marculescu, Igor L. Markov

Abstract: Integrated circuits and electronic systems, as well as design technologies, are evolving at a great rate -- both quantitatively and qualitatively. Major developments include new interconnects and switching devices with atomic-scale uncertainty, the depth and scale of on-chip integration, electronic system-level integration, the increasing significance of software, as well as more effective means of design entry, compilation, algorithmic optimization, numerical simulation, pre- and post-silicon design validation, and chip test. Application targets and key markets are also shifting substantially from desktop CPUs to mobile platforms to an Internet-of-Things infrastructure. In light of these changes in electronic design contexts and given EDA's significant dependence on such context, the EDA community must adapt to these changes and focus on the opportunities for research and commercial success. The CCC workshop series on Extreme-Scale Design Automation, organized with the support of ACM SIGDA, studied challenges faced by the EDA community as well as new and exciting opportunities currently available. This document represents a summary of the findings from these meetings.

Submitted 4 May, 2020; originally announced May 2020.

Comments: A Computing Community Consortium (CCC) workshop report, 32 pages

Report number: ccc2014report_1

arXiv:1806.02498 [pdf, other]

Mitigating Wordline Crosstalk using Adaptive Trees of Counters

Authors: Seyed Mohammad Seyedzadeh, Alex K. Jones, Rami Melhem

Abstract: High access frequency of certain rows in the DRAM may cause data loss in cells of physically adjacent rows due to crosstalk. The malicious exploit of this crosstalk by repeatedly accessing a row to induce this effect is known as row hammering. Additionally, inadvertent row hammering may also occur due to the natural weighted nature of applications' access patterns. In this paper, we analyze the efficiency of existing approaches for mitigating wordline crosstalk and demonstrate that they have been conservatively designed. Given the unbalanced nature of DRAM accesses, a small group of dynamically allocated counters in banks can deterministically detect hot rows and mitigate crosstalk. Based on our findings, we propose a Counter-based Adaptive Tree (CAT) approach to mitigate wordline crosstalk using adaptive trees of counters to guide appropriate refreshing of vulnerable rows. The key idea is to tune the distribution of the counters to the rows in a bank based on the memory reference patterns. In contrast to deterministic solutions, CAT utilizes fewer counters, making it practically feasible to be implemented on-chip. Compared to existing probabilistic approaches, CAT more precisely refreshes rows vulnerable to crosstalk based on their access frequency. Experimental results on workloads from four benchmark suites show that CAT reduces the Crosstalk Mitigation Refresh Power Overhead in quad-core systems to 7%, which is an improvement over the 21% and 18% incurred in the leading deterministic and probabilistic approaches, respectively. Moreover, CAT incurs very low performance overhead (0.5%). Hardware synthesis evaluation shows that CAT can be implemented on-chip with only a nominal area overhead.

Submitted 6 June, 2018; originally announced June 2018.

Comments: 12 pages

arXiv:1711.08572 [pdf, other]

Enabling Fine-Grain Restricted Coset Coding Through Word-Level Compression for PCM

Authors: Seyed Mohammad Seyedzadeh, Alex K. Jones, Rami Melhem

Abstract: Phase change memory (PCM) has recently emerged as a promising technology to meet the fast-growing demand for large capacity memory in computer systems, replacing DRAM that is impeded by physical limitations. Multi-level cell (MLC) PCM offers high density with low per-byte fabrication cost. However, despite many advantages, such as scalability and low leakage, the energy for programming intermediate states is considerably larger than programming single-level cell PCM. In this paper, we study encoding techniques to reduce write energy for MLC PCM when the encoding granularity is lowered below the typical cache line size. We observe that encoding data blocks at small granularity to reduce write energy actually increases the write energy because of the auxiliary encoding bits. We mitigate this adverse effect by 1) designing suitable codeword mappings that use fewer auxiliary bits and 2) proposing a new Word-Level Compression (WLC) which compresses more than 91% of the memory lines and provides enough room to store the auxiliary data using a novel restricted coset encoding applied at small data block granularities. Experimental results show that the proposed encoding at 16-bit data granularity reduces the write energy by 39%, on average, versus the leading encoding approach for write energy reduction. Furthermore, it improves endurance by 20% and is more reliable than the leading approach. Hardware synthesis evaluation shows that the proposed encoding can be implemented on-chip with only a nominal area overhead.

Submitted 22 November, 2017; originally announced November 2017.

Comments: 12 pages

arXiv:1710.08940 [pdf, other]

A Variable Length Coding Framework for Cost Function Reduction in Non-Volatile Memory Systems

Authors: Seyed Mohammad Seyedzadeh, Alex K. Jones, Rami Melhem

Abstract: Variable length coding for Non-Volatile Memory (NVM) technologies is a promising method to improve memory capacity and system performance through compressing memory blocks. However, compression techniques used to improve capacity or bandwidth utilization do not take into consideration the asymmetric costs of writing 1's and 0's in NVMs. Taking into account this asymmetry, we propose a variable length encoding framework that reduces the cost of writing data into NVM. Our experimental results on 12 workloads of the SPEC CPU2006 benchmark suite show that, when the cost asymmetry is 1:2, the proposed framework is capable of reducing the NVM programming cost by up to 24% more than leading compression approaches and by 12.5% more than the flip-and-write approach which selects between the data and its complement based on the programming cost.

Submitted 24 October, 2017; originally announced October 2017.

Comments: NVMW2017

Showing 1–19 of 19 results for author: Jones, A K