Unraveling codes: fast, robust, beyond-bound error correction for DRAM
Authors:
Mike Hamburg,
Eric Linstadt,
Danny Moore,
Thomas Vogelsang
Abstract:
Generalized Reed-Solomon (RS) codes are a common choice for efficient, reliable error correction in memory and communications systems. These codes add $2t$ extra parity symbols to a block of memory, and can efficiently and reliably correct up to $t$ symbol errors in that block. Decoding is possible beyond this bound, but it is imperfectly reliable and often computationally expensive. Beyond-bound…
▽ More
Generalized Reed-Solomon (RS) codes are a common choice for efficient, reliable error correction in memory and communications systems. These codes add $2t$ extra parity symbols to a block of memory, and can efficiently and reliably correct up to $t$ symbol errors in that block. Decoding is possible beyond this bound, but it is imperfectly reliable and often computationally expensive. Beyond-bound decoding is an important problem to solve for error-correcting Dynamic Random Access Memory (DRAM). These memories are often designed so that each access touches two extra memory devices, so that a failure in any one device can be corrected. But system architectures increasingly require DRAM to store metadata in addition to user data. When the metadata replaces parity data, a single-device failure is then beyond-bound. An error-correction system can either protect each access with a single RS code, or divide it into several segments protected with a shorter code, usually in an Interleaved Reed-Solomon (IRS) configuration. The full-block RS approach is more reliable, both at correcting errors and at preventing silent data corruption (SDC). The IRS option is faster, and is especially efficient at beyond-bound correction of single- or double-device failures. Here we describe a new family of "unraveling" Reed-Solomon codes that bridges the gap between these options. Our codes are full-block generalized RS codes, but they can also be decoded using an IRS decoder. As a result, they combine the speed and beyond-bound correction capabilities of interleaved codes with the robustness of full-block codes, including the ability of the latter to reliably correct failures across multiple devices. We show that unraveling codes are an especially good fit for high-reliability DRAM error correction.
△ Less
Submitted 27 May, 2024; v1 submitted 19 January, 2024;
originally announced January 2024.
RAMPART: RowHammer Mitigation and Repair for Server Memory Systems
Authors:
Steven C. Woo,
Wendy Elsasser,
Mike Hamburg,
Eric Linstadt,
Michael R. Miller,
Taeksang Song,
James Tringali
Abstract:
RowHammer attacks are a growing security and reliability concern for DRAMs and computer systems as they can induce many bit errors that overwhelm error detection and correction capabilities. System-level solutions are needed as process technology and circuit improvements alone are unlikely to provide complete protection against RowHammer attacks in the future. This paper introduces RAMPART, a nove…
▽ More
RowHammer attacks are a growing security and reliability concern for DRAMs and computer systems as they can induce many bit errors that overwhelm error detection and correction capabilities. System-level solutions are needed as process technology and circuit improvements alone are unlikely to provide complete protection against RowHammer attacks in the future. This paper introduces RAMPART, a novel approach to mitigating RowHammer attacks and improving server memory system reliability by remapping addresses in each DRAM in a way that confines RowHammer bit flips to a single device for any victim row address. When RAMPART is paired with Single Device Data Correction (SDDC) and patrol scrub, error detection and correction methods in use today, the system can detect and correct bit flips from a successful attack, allowing the memory system to heal itself. RAMPART is compatible with DDR5 RowHammer mitigation features, as well as a wide variety of algorithmic and probabilistic tracking methods. We also introduce BRC-VL, a variation of DDR5 Bounded Refresh Configuration (BRC) that improves system performance by reducing mitigation overhead and show that it works well with probabilistic sampling methods to combat traditional and victim-focused mitigation attacks like Half-Double. The combination of RAMPART, SDDC, and scrubbing enables stronger RowHammer resistance by correcting bit flips from one successful attack. Uncorrectable errors are much less likely, requiring two successful attacks before the memory system is scrubbed.
△ Less
Submitted 25 October, 2023;
originally announced October 2023.