-
Wavefront Threading Enables Effective High-Level Synthesis
Authors:
Blake Pelton,
Adam Sapek,
Ken Eguro,
Daniel Lo,
Alessandro Forin,
Matt Humphrey,
Jinwen Xi,
David Cox,
Rajas Karandikar,
Johannes de Fine Licht,
Evgeny Babin,
Adrian Caulfield,
Doug Burger
Abstract:
Digital systems are growing in importance and computing hardware is growing more heterogeneous. Hardware design, however, remains laborious and expensive, in part due to the limitations of conventional hardware description languages (HDLs) like VHDL and Verilog. A longstanding research goal has been programming hardware like software, with high-level languages that can generate efficient hardware…
▽ More
Digital systems are growing in importance and computing hardware is growing more heterogeneous. Hardware design, however, remains laborious and expensive, in part due to the limitations of conventional hardware description languages (HDLs) like VHDL and Verilog. A longstanding research goal has been programming hardware like software, with high-level languages that can generate efficient hardware designs. This paper describes Kanagawa, a language that takes a new approach to combine the programmer productivity benefits of traditional High-Level Synthesis (HLS) approaches with the expressibility and hardware efficiency of Register-Transfer Level (RTL) design. The language's concise syntax, matched with a hardware design-friendly execution model, permits a relatively simple toolchain to map high-level code into efficient hardware implementations.
△ Less
Submitted 10 June, 2024; v1 submitted 29 May, 2024;
originally announced May 2024.
-
SMaRTT-REPS: Sender-based Marked Rapidly-adapting Trimmed & Timed Transport with Recycled Entropies
Authors:
Tommaso Bonato,
Abdul Kabbani,
Daniele De Sensi,
Rong Pan,
Yanfang Le,
Costin Raiciu,
Mark Handley,
Timo Schneider,
Nils Blach,
Ahmad Ghalayini,
Daniel Alves,
Michael Papamichael,
Adrian Caulfield,
Torsten Hoefler
Abstract:
With the rapid growth of machine learning (ML) workloads in datacenters, existing congestion control (CC) algorithms fail to deliver the required performance at scale. ML traffic is bursty and bulk-synchronous and thus requires quick reaction and strong fairness. We show that existing CC algorithms that use delay as a main signal react too slowly and are not always fair. We design SMaRTT, a simple…
▽ More
With the rapid growth of machine learning (ML) workloads in datacenters, existing congestion control (CC) algorithms fail to deliver the required performance at scale. ML traffic is bursty and bulk-synchronous and thus requires quick reaction and strong fairness. We show that existing CC algorithms that use delay as a main signal react too slowly and are not always fair. We design SMaRTT, a simple sender-based CC algorithm that combines delay, ECN, and optional packet trimming for fast and precise window adjustments. At the core of SMaRTT lies the novel QuickAdapt algorithm that accurately estimates the bandwidth at the receiver. We show how to combine SMaRTT with a new per-packet traffic load-balancing algorithm called REPS to effectively reroute packets around congested hotspots as well as flaky or failing links. Our evaluation shows that SMaRTT alone outperforms EQDS, Swift, BBR, and MPRDMA by up to 50% on modern datacenter networks.
△ Less
Submitted 27 April, 2024; v1 submitted 2 April, 2024;
originally announced April 2024.
-
Penalties and Rewards for Fair Learning in Paired Kidney Exchange Programs
Authors:
Margarida Carvalho,
Alison Caulfield,
Yi Lin,
Adrian Vetta
Abstract:
A kidney exchange program, also called a kidney paired donation program, can be viewed as a repeated, dynamic trading and allocation mechanism. This suggests that a dynamic algorithm for transplant exchange selection may have superior performance in comparison to the repeated use of a static algorithm. We confirm this hypothesis using a full scale simulation of the Canadian Kidney Paired Donation…
▽ More
A kidney exchange program, also called a kidney paired donation program, can be viewed as a repeated, dynamic trading and allocation mechanism. This suggests that a dynamic algorithm for transplant exchange selection may have superior performance in comparison to the repeated use of a static algorithm. We confirm this hypothesis using a full scale simulation of the Canadian Kidney Paired Donation Program: learning algorithms, that attempt to learn optimal patient-donor weights in advance via dynamic simulations, do lead to improved outcomes. Specifically, our learning algorithms, designed with the objective of fairness (that is, equity in terms of transplant accessibility across cPRA groups), also lead to an increased number of transplants and shorter average waiting times. Indeed, our highest performing learning algorithm improves egalitarian fairness by 10% whilst also increasing the number of transplants by 6% and decreasing waiting times by 24%. However, our main result is much more surprising. We find that the most critical factor in determining the performance of a kidney exchange program is not the judicious assignment of positive weights (rewards) to patient-donor pairs. Rather, the key factor in increasing the number of transplants, decreasing waiting times and improving group fairness is the judicious assignment of a negative weight (penalty) to the small number of non-directed donors in the kidney exchange program.
△ Less
Submitted 23 September, 2023;
originally announced September 2023.
-
DiCA: A Hardware-Software Co-Design for Differential Checkpointing in Intermittently Powered Devices
Authors:
Antonio Joia Neto,
Adam Caulfield,
Chistabelle Alvares,
Ivan De Oliveira Nunes
Abstract:
Intermittently powered devices rely on opportunistic energy-harvesting to function, leading to recurrent power interruptions. This paper introduces DiCA, a proposal for a hardware/software co-design to create differential check-points in intermittent devices. DiCA leverages an affordable hardware module that simplifies the check-pointing process, reducing the check-point generation time and energy…
▽ More
Intermittently powered devices rely on opportunistic energy-harvesting to function, leading to recurrent power interruptions. This paper introduces DiCA, a proposal for a hardware/software co-design to create differential check-points in intermittent devices. DiCA leverages an affordable hardware module that simplifies the check-pointing process, reducing the check-point generation time and energy consumption. This hardware module continuously monitors volatile memory, efficiently tracking modifications and determining optimal check-point times. To minimize energy waste, the module dynamically estimates the energy required to create and store the check-point based on tracked memory modifications, triggering the check-pointing routine optimally via a nonmaskable interrupt. Experimental results show the cost-effectiveness and energy efficiency of DiCA, enabling extended application activity cycles in intermittently powered embedded devices.
△ Less
Submitted 25 August, 2023; v1 submitted 24 August, 2023;
originally announced August 2023.
-
ACFA: Secure Runtime Auditing & Guaranteed Device Healing via Active Control Flow Attestation
Authors:
Adam Caulfield,
Norrathep Rattanavipanon,
Ivan De Oliveira Nunes
Abstract:
Low-end embedded devices are increasingly used in various smart applications and spaces. They are implemented under strict cost and energy budgets, using microcontroller units (MCUs) that lack security features available in general-purpose processors. In this context, Remote Attestation (RA) was proposed as an inexpensive security service to enable a verifier (Vrf) to remotely detect illegal modif…
▽ More
Low-end embedded devices are increasingly used in various smart applications and spaces. They are implemented under strict cost and energy budgets, using microcontroller units (MCUs) that lack security features available in general-purpose processors. In this context, Remote Attestation (RA) was proposed as an inexpensive security service to enable a verifier (Vrf) to remotely detect illegal modifications to a software binary installed on a low-end prover MCU (Prv). Since attacks that hijack the software's control flow can evade RA, Control Flow Attestation (CFA) augments RA with information about the exact order in which instructions in the binary are executed, enabling detection of control flow attacks. We observe that current CFA architectures can not guarantee that Vrf ever receives control flow reports in case of attacks. In turn, while they support exploit detection, they provide no means to pinpoint the exploit origin. Furthermore, existing CFA requires either binary instrumentation, incurring significant runtime overhead and code size increase, or relatively expensive hardware support, such as hash engines. In addition, current techniques are neither continuous (only meant to attest self-contained operations) nor active (offer no secure means to remotely remediate detected compromises). To jointly address these challenges, we propose ACFA: a hybrid (hardware/software) architecture for Active CFA. ACFA enables continuous monitoring of all control flow transfers in the MCU and does not require binary instrumentation. It also leverages the recently proposed concept of Active Roots-of-Trust to enable secure auditing of vulnerability sources and guaranteed remediation when a compromise is detected. We provide an open-source reference implementation of ACFA on top of a commodity low-end MCU (TI MSP430) and evaluate it to demonstrate its security and cost-effectiveness.
△ Less
Submitted 19 October, 2023; v1 submitted 28 March, 2023;
originally announced March 2023.
-
ASAP: Reconciling Asynchronous Real-Time Operations and Proofs of Execution in Simple Embedded Systems
Authors:
Adam Caulfield,
Norrathep Rattanavipanon,
Ivan De Oliveira Nunes
Abstract:
Embedded devices are increasingly ubiquitous and their importance is hard to overestimate. While they often support safety-critical functions (e.g., in medical devices and sensor-alarm combinations), they are usually implemented under strict cost/energy budgets, using low-end microcontroller units (MCUs) that lack sophisticated security mechanisms. Motivated by this issue, recent work developed ar…
▽ More
Embedded devices are increasingly ubiquitous and their importance is hard to overestimate. While they often support safety-critical functions (e.g., in medical devices and sensor-alarm combinations), they are usually implemented under strict cost/energy budgets, using low-end microcontroller units (MCUs) that lack sophisticated security mechanisms. Motivated by this issue, recent work developed architectures capable of generating Proofs of Execution (PoX) for the correct/expected software in potentially compromised low-end MCUs. In practice, this capability can be leveraged to provide "integrity from birth" to sensor data, by binding the sensed results/outputs to an unforgeable cryptographic proof of execution of the expected sensing process. Despite this significant progress, current PoX schemes for low-end MCUs ignore the real-time needs of many applications. In particular, security of current PoX schemes precludes any interrupts during the execution being proved. We argue that lack of asynchronous capabilities (i.e., interrupts within PoX) can obscure PoX usefulness, as several applications require processing real-time and asynchronous events. To bridge this gap, we propose, implement, and evaluate an Architecture for Secure Asynchronous Processing in PoX (ASAP). ASAP is secure under full software compromise, enables asynchronous PoX, and incurs less hardware overhead than prior work.
△ Less
Submitted 6 June, 2022;
originally announced June 2022.