subscribe to arXiv mailings

Scalable and RISC-V Programmable Near-Memory Computing Architectures for Edge Nodes

Authors: Michele Caon, Clément Choné, Pasquale Davide Schiavone, Alexandre Levisse, Guido Masera, Maurizio Martina, David Atienza

Abstract: The widespread adoption of data-centric algorithms, particularly Artificial Intelligence (AI) and Machine Learning (ML), has exposed the limitations of centralized processing infrastructures, driving a shift towards edge computing. This necessitates stringent constraints on energy efficiency, which traditional von Neumann architectures struggle to meet. The Compute-In-Memory (CIM) paradigm has eme… ▽ More The widespread adoption of data-centric algorithms, particularly Artificial Intelligence (AI) and Machine Learning (ML), has exposed the limitations of centralized processing infrastructures, driving a shift towards edge computing. This necessitates stringent constraints on energy efficiency, which traditional von Neumann architectures struggle to meet. The Compute-In-Memory (CIM) paradigm has emerged as a superior candidate due to its efficient exploitation of available memory bandwidth. However, existing CIM solutions require high implementation effort and lack flexibility from a software integration standpoint. This work proposes a novel, software-friendly, general-purpose, and low-integration-effort Near-Memory Computing (NMC) approach, paving the way for the adoption of CIM-based systems in the next generation of edge computing nodes. Two architectural variants, NM-Caesar and NM-Carus, are proposed and characterized to target different trade-offs in area efficiency, performance, and flexibility, covering a wide range of embedded microcontrollers. Post-layout simulations show up to $25.8\times$ and $50.0\times$ lower execution time and $23.2\times$ and $33.1\times$ higher energy efficiency at the system level, respectively, compared to executing the same tasks on a state-of-the-art RISC-V CPU (RV32IMC). NM-Carus achieves a peak energy efficiency of $306.7$ GOPS/W in 8-bit matrix multiplications, surpassing recent state-of-the-art in- and near-memory circuits. △ Less

Submitted 20 June, 2024; originally announced June 2024.

Comments: 14 pages, 12 figures, submitted to IEEE Transactions on Emerging Topics in Computing

arXiv:2402.09780 [pdf, other]

TinyCL: An Efficient Hardware Architecture for Continual Learning on Autonomous Systems

Authors: Eugenio Ressa, Alberto Marchisio, Maurizio Martina, Guido Masera, Muhammad Shafique

Abstract: The Continuous Learning (CL) paradigm consists of continuously evolving the parameters of the Deep Neural Network (DNN) model to progressively learn to perform new tasks without reducing the performance on previous tasks, i.e., avoiding the so-called catastrophic forgetting. However, the DNN parameter update in CL-based autonomous systems is extremely resource-hungry. The existing DNN accelerators… ▽ More The Continuous Learning (CL) paradigm consists of continuously evolving the parameters of the Deep Neural Network (DNN) model to progressively learn to perform new tasks without reducing the performance on previous tasks, i.e., avoiding the so-called catastrophic forgetting. However, the DNN parameter update in CL-based autonomous systems is extremely resource-hungry. The existing DNN accelerators cannot be directly employed in CL because they only support the execution of the forward propagation. Only a few prior architectures execute the backpropagation and weight update, but they lack the control and management for CL. Towards this, we design a hardware architecture, TinyCL, to perform CL on resource-constrained autonomous systems. It consists of a processing unit that executes both forward and backward propagation, and a control unit that manages memory-based CL workload. To minimize the memory accesses, the sliding window of the convolutional layer moves in a snake-like fashion. Moreover, the Multiply-and-Accumulate units can be reconfigured at runtime to execute different operations. As per our knowledge, our proposed TinyCL represents the first hardware accelerator that executes CL on autonomous systems. We synthesize the complete TinyCL architecture in a 65 nm CMOS technology node with the conventional ASIC design flow. It executes 1 epoch of training on a Conv + ReLU + Dense model on the CIFAR10 dataset in 1.76 s, while 1 training epoch of the same model using an Nvidia Tesla P100 GPU takes 103 s, thus achieving a 58 x speedup, consuming 86 mW in a 4.74 mm2 die. △ Less

Submitted 15 February, 2024; originally announced February 2024.

arXiv:2304.04995 [pdf, other]

doi 10.5281/zenodo.7824870

Custom Memory Design for Logic-in-Memory: Drawbacks and Improvements over Conventional Memories

Authors: Fabrizio Ottati, Giovanna Turvani, Marco Vacca, Guido Masera

Abstract: The speed of modern digital systems is severely limited by memory latency (the ``Memory Wall'' problem). Data exchange between Logic and Memory is also responsible for a large part of the system energy consumption. Logic--In--Memory (LiM) represents an attractive solution to this problem. By performing part of the computations directly inside the memory the system speed can be improved while reduc… ▽ More The speed of modern digital systems is severely limited by memory latency (the ``Memory Wall'' problem). Data exchange between Logic and Memory is also responsible for a large part of the system energy consumption. Logic--In--Memory (LiM) represents an attractive solution to this problem. By performing part of the computations directly inside the memory the system speed can be improved while reducing its energy consumption. LiM solutions that offer the major boost in performance are based on the modification of the memory cell. However, what is the cost of such modifications? How do these impact the memory array performance? In this work, this question is addressed by analysing a LiM memory array implementing an algorithm for the maximum/minimum value computation. The memory array is designed at physical level using the FreePDK $\SI{45}{\nano\meter}$ CMOS process, with three memory cell variants, and its performance is compared to SRAM and CAM memories. Results highlight that read and write operations performance is worsened but in--memory operations result to be very efficient: a 55.26\% reduction in the energy--delay product is measured for the AND operation with respect to the SRAM read one; therefore, the LiM approach represents a very promising solution for low--density and high--performance memories. △ Less

Submitted 11 April, 2023; originally announced April 2023.

arXiv:2304.03986 [pdf, other]

SwiftTron: An Efficient Hardware Accelerator for Quantized Transformers

Authors: Alberto Marchisio, Davide Dura, Maurizio Capra, Maurizio Martina, Guido Masera, Muhammad Shafique

Abstract: Transformers' compute-intensive operations pose enormous challenges for their deployment in resource-constrained EdgeAI / tinyML devices. As an established neural network compression technique, quantization reduces the hardware computational and memory resources. In particular, fixed-point quantization is desirable to ease the computations using lightweight blocks, like adders and multipliers, of… ▽ More Transformers' compute-intensive operations pose enormous challenges for their deployment in resource-constrained EdgeAI / tinyML devices. As an established neural network compression technique, quantization reduces the hardware computational and memory resources. In particular, fixed-point quantization is desirable to ease the computations using lightweight blocks, like adders and multipliers, of the underlying hardware. However, deploying fully-quantized Transformers on existing general-purpose hardware, generic AI accelerators, or specialized architectures for Transformers with floating-point units might be infeasible and/or inefficient. Towards this, we propose SwiftTron, an efficient specialized hardware accelerator designed for Quantized Transformers. SwiftTron supports the execution of different types of Transformers' operations (like Attention, Softmax, GELU, and Layer Normalization) and accounts for diverse scaling factors to perform correct computations. We synthesize the complete SwiftTron architecture in a $65$ nm CMOS technology with the ASIC design flow. Our Accelerator executes the RoBERTa-base model in 1.83 ns, while consuming 33.64 mW power, and occupying an area of 273 mm^2. To ease the reproducibility, the RTL of our SwiftTron architecture is released at https://github.com/albertomarchisio/SwiftTron. △ Less

Submitted 25 April, 2023; v1 submitted 8 April, 2023; originally announced April 2023.

Comments: To appear at the 2023 International Joint Conference on Neural Networks (IJCNN), Queensland, Australia, June 2023

arXiv:2208.02253 [pdf, other]

LaneSNNs: Spiking Neural Networks for Lane Detection on the Loihi Neuromorphic Processor

Authors: Alberto Viale, Alberto Marchisio, Maurizio Martina, Guido Masera, Muhammad Shafique

Abstract: Autonomous Driving (AD) related features represent important elements for the next generation of mobile robots and autonomous vehicles focused on increasingly intelligent, autonomous, and interconnected systems. The applications involving the use of these features must provide, by definition, real-time decisions, and this property is key to avoid catastrophic accidents. Moreover, all the decision… ▽ More Autonomous Driving (AD) related features represent important elements for the next generation of mobile robots and autonomous vehicles focused on increasingly intelligent, autonomous, and interconnected systems. The applications involving the use of these features must provide, by definition, real-time decisions, and this property is key to avoid catastrophic accidents. Moreover, all the decision processes must require low power consumption, to increase the lifetime and autonomy of battery-driven systems. These challenges can be addressed through efficient implementations of Spiking Neural Networks (SNNs) on Neuromorphic Chips and the use of event-based cameras instead of traditional frame-based cameras. In this paper, we present a new SNN-based approach, called LaneSNN, for detecting the lanes marked on the streets using the event-based camera input. We develop four novel SNN models characterized by low complexity and fast response, and train them using an offline supervised learning rule. Afterward, we implement and map the learned SNNs models onto the Intel Loihi Neuromorphic Research Chip. For the loss function, we develop a novel method based on the linear composition of Weighted binary Cross Entropy (WCE) and Mean Squared Error (MSE) measures. Our experimental results show a maximum Intersection over Union (IoU) measure of about 0.62 and very low power consumption of about 1 W. The best IoU is achieved with an SNN implementation that occupies only 36 neurocores on the Loihi processor while providing a low latency of less than 8 ms to recognize an image, thereby enabling real-time performance. The IoU measures provided by our networks are comparable with the state-of-the-art, but at a much low power consumption of 1 W. △ Less

Submitted 3 August, 2022; originally announced August 2022.

Comments: To appear at the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2022)

arXiv:2208.00331 [pdf, other]

CoNLoCNN: Exploiting Correlation and Non-Uniform Quantization for Energy-Efficient Low-precision Deep Convolutional Neural Networks

Authors: Muhammad Abdullah Hanif, Giuseppe Maria Sarda, Alberto Marchisio, Guido Masera, Maurizio Martina, Muhammad Shafique

Abstract: In today's era of smart cyber-physical systems, Deep Neural Networks (DNNs) have become ubiquitous due to their state-of-the-art performance in complex real-world applications. The high computational complexity of these networks, which translates to increased energy consumption, is the foremost obstacle towards deploying large DNNs in resource-constrained systems. Fixed-Point (FP) implementations… ▽ More In today's era of smart cyber-physical systems, Deep Neural Networks (DNNs) have become ubiquitous due to their state-of-the-art performance in complex real-world applications. The high computational complexity of these networks, which translates to increased energy consumption, is the foremost obstacle towards deploying large DNNs in resource-constrained systems. Fixed-Point (FP) implementations achieved through post-training quantization are commonly used to curtail the energy consumption of these networks. However, the uniform quantization intervals in FP restrict the bit-width of data structures to large values due to the need to represent most of the numbers with sufficient resolution and avoid high quantization errors. In this paper, we leverage the key insight that (in most of the scenarios) DNN weights and activations are mostly concentrated near zero and only a few of them have large magnitudes. We propose CoNLoCNN, a framework to enable energy-efficient low-precision deep convolutional neural network inference by exploiting: (1) non-uniform quantization of weights enabling simplification of complex multiplication operations; and (2) correlation between activation values enabling partial compensation of quantization errors at low cost without any run-time overheads. To significantly benefit from non-uniform quantization, we also propose a novel data representation format, Encoded Low-Precision Binary Signed Digit, to compress the bit-width of weights while ensuring direct use of the encoded weight for processing using a novel multiply-and-accumulate (MAC) unit design. △ Less

Submitted 30 July, 2022; originally announced August 2022.

Comments: 8 pages, 15 figures, 2 tables

arXiv:2206.10200 [pdf, other]

Enabling Capsule Networks at the Edge through Approximate Softmax and Squash Operations

Authors: Alberto Marchisio, Beatrice Bussolino, Edoardo Salvati, Maurizio Martina, Guido Masera, Muhammad Shafique

Abstract: Complex Deep Neural Networks such as Capsule Networks (CapsNets) exhibit high learning capabilities at the cost of compute-intensive operations. To enable their deployment on edge devices, we propose to leverage approximate computing for designing approximate variants of the complex operations like softmax and squash. In our experiments, we evaluate tradeoffs between area, power consumption, and c… ▽ More Complex Deep Neural Networks such as Capsule Networks (CapsNets) exhibit high learning capabilities at the cost of compute-intensive operations. To enable their deployment on edge devices, we propose to leverage approximate computing for designing approximate variants of the complex operations like softmax and squash. In our experiments, we evaluate tradeoffs between area, power consumption, and critical path delay of the designs implemented with the ASIC design flow, and the accuracy of the quantized CapsNets, compared to the exact functions. △ Less

Submitted 21 June, 2022; originally announced June 2022.

Comments: To appear at the ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED), August 2022, Boston, MA, USA

arXiv:2109.00533 [pdf, other]

R-SNN: An Analysis and Design Methodology for Robustifying Spiking Neural Networks against Adversarial Attacks through Noise Filters for Dynamic Vision Sensors

Authors: Alberto Marchisio, Giacomo Pira, Maurizio Martina, Guido Masera, Muhammad Shafique

Abstract: Spiking Neural Networks (SNNs) aim at providing energy-efficient learning capabilities when implemented on neuromorphic chips with event-based Dynamic Vision Sensors (DVS). This paper studies the robustness of SNNs against adversarial attacks on such DVS-based systems, and proposes R-SNN, a novel methodology for robustifying SNNs through efficient DVS-noise filtering. We are the first to generate… ▽ More Spiking Neural Networks (SNNs) aim at providing energy-efficient learning capabilities when implemented on neuromorphic chips with event-based Dynamic Vision Sensors (DVS). This paper studies the robustness of SNNs against adversarial attacks on such DVS-based systems, and proposes R-SNN, a novel methodology for robustifying SNNs through efficient DVS-noise filtering. We are the first to generate adversarial attacks on DVS signals (i.e., frames of events in the spatio-temporal domain) and to apply noise filters for DVS sensors in the quest for defending against adversarial attacks. Our results show that the noise filters effectively prevent the SNNs from being fooled. The SNNs in our experiments provide more than 90% accuracy on the DVS-Gesture and NMNIST datasets under different adversarial threat models. △ Less

Submitted 1 September, 2021; originally announced September 2021.

Comments: To appear at the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2021). arXiv admin note: text overlap with arXiv:2107.00415

arXiv:2107.00415 [pdf, other]

DVS-Attacks: Adversarial Attacks on Dynamic Vision Sensors for Spiking Neural Networks

Authors: Alberto Marchisio, Giacomo Pira, Maurizio Martina, Guido Masera, Muhammad Shafique

Abstract: Spiking Neural Networks (SNNs), despite being energy-efficient when implemented on neuromorphic hardware and coupled with event-based Dynamic Vision Sensors (DVS), are vulnerable to security threats, such as adversarial attacks, i.e., small perturbations added to the input for inducing a misclassification. Toward this, we propose DVS-Attacks, a set of stealthy yet efficient adversarial attack meth… ▽ More Spiking Neural Networks (SNNs), despite being energy-efficient when implemented on neuromorphic hardware and coupled with event-based Dynamic Vision Sensors (DVS), are vulnerable to security threats, such as adversarial attacks, i.e., small perturbations added to the input for inducing a misclassification. Toward this, we propose DVS-Attacks, a set of stealthy yet efficient adversarial attack methodologies targeted to perturb the event sequences that compose the input of the SNNs. First, we show that noise filters for DVS can be used as defense mechanisms against adversarial attacks. Afterwards, we implement several attacks and test them in the presence of two types of noise filters for DVS cameras. The experimental results show that the filters can only partially defend the SNNs against our proposed DVS-Attacks. Using the best settings for the noise filters, our proposed Mask Filter-Aware Dash Attack reduces the accuracy by more than 20% on the DVS-Gesture dataset and by more than 65% on the MNIST dataset, compared to the original clean frames. The source code of all the proposed DVS-Attacks and noise filters is released at https://github.com/albertomarchisio/DVS-Attacks. △ Less

Submitted 1 July, 2021; originally announced July 2021.

Comments: Accepted for publication at IJCNN 2021

arXiv:2107.00401 [pdf, other]

CarSNN: An Efficient Spiking Neural Network for Event-Based Autonomous Cars on the Loihi Neuromorphic Research Processor

Authors: Alberto Viale, Alberto Marchisio, Maurizio Martina, Guido Masera, Muhammad Shafique

Abstract: Autonomous Driving (AD) related features provide new forms of mobility that are also beneficial for other kind of intelligent and autonomous systems like robots, smart transportation, and smart industries. For these applications, the decisions need to be made fast and in real-time. Moreover, in the quest for electric mobility, this task must follow low power policy, without affecting much the auto… ▽ More Autonomous Driving (AD) related features provide new forms of mobility that are also beneficial for other kind of intelligent and autonomous systems like robots, smart transportation, and smart industries. For these applications, the decisions need to be made fast and in real-time. Moreover, in the quest for electric mobility, this task must follow low power policy, without affecting much the autonomy of the mean of transport or the robot. These two challenges can be tackled using the emerging Spiking Neural Networks (SNNs). When deployed on a specialized neuromorphic hardware, SNNs can achieve high performance with low latency and low power consumption. In this paper, we use an SNN connected to an event-based camera for facing one of the key problems for AD, i.e., the classification between cars and other objects. To consume less power than traditional frame-based cameras, we use a Dynamic Vision Sensor (DVS). The experiments are made following an offline supervised learning rule, followed by mapping the learnt SNN model on the Intel Loihi Neuromorphic Research Chip. Our best experiment achieves an accuracy on offline implementation of 86%, that drops to 83% when it is ported onto the Loihi Chip. The Neuromorphic Hardware implementation has maximum 0.72 ms of latency for every sample, and consumes only 310 mW. To the best of our knowledge, this work is the first implementation of an event-based car classifier on a Neuromorphic Chip. △ Less

Submitted 1 July, 2021; originally announced July 2021.

Comments: Accepted for publication at IJCNN 2021

arXiv:2012.11233 [pdf, other]

doi 10.1109/ACCESS.2020.3039858

Hardware and Software Optimizations for Accelerating Deep Neural Networks: Survey of Current Trends, Challenges, and the Road Ahead

Authors: Maurizio Capra, Beatrice Bussolino, Alberto Marchisio, Guido Masera, Maurizio Martina, Muhammad Shafique

Abstract: Currently, Machine Learning (ML) is becoming ubiquitous in everyday life. Deep Learning (DL) is already present in many applications ranging from computer vision for medicine to autonomous driving of modern cars as well as other sectors in security, healthcare, and finance. However, to achieve impressive performance, these algorithms employ very deep networks, requiring a significant computational… ▽ More Currently, Machine Learning (ML) is becoming ubiquitous in everyday life. Deep Learning (DL) is already present in many applications ranging from computer vision for medicine to autonomous driving of modern cars as well as other sectors in security, healthcare, and finance. However, to achieve impressive performance, these algorithms employ very deep networks, requiring a significant computational power, both during the training and inference time. A single inference of a DL model may require billions of multiply-and-accumulated operations, making the DL extremely compute- and energy-hungry. In a scenario where several sophisticated algorithms need to be executed with limited energy and low latency, the need for cost-effective hardware platforms capable of implementing energy-efficient DL execution arises. This paper first introduces the key properties of two brain-inspired models like Deep Neural Network (DNN), and Spiking Neural Network (SNN), and then analyzes techniques to produce efficient and high-performance designs. This work summarizes and compares the works for four leading platforms for the execution of algorithms such as CPU, GPU, FPGA and ASIC describing the main solutions of the state-of-the-art, giving much prominence to the last two solutions since they offer greater design flexibility and bear the potential of high energy-efficiency, especially for the inference process. In addition to hardware solutions, this paper discusses some of the important security issues that these DNN and SNN models may have during their execution, and offers a comprehensive section on benchmarking, explaining how to assess the quality of different networks and hardware systems designed for them. △ Less

Submitted 21 December, 2020; originally announced December 2020.

Comments: Accepted for publication in IEEE Access

arXiv:2004.07116 [pdf, other]

doi 10.1109/DAC18072.2020.9218746

Q-CapsNets: A Specialized Framework for Quantizing Capsule Networks

Authors: Alberto Marchisio, Beatrice Bussolino, Alessio Colucci, Maurizio Martina, Guido Masera, Muhammad Shafique

Abstract: Capsule Networks (CapsNets), recently proposed by the Google Brain team, have superior learning capabilities in machine learning tasks, like image classification, compared to the traditional CNNs. However, CapsNets require extremely intense computations and are difficult to be deployed in their original form at the resource-constrained edge devices. This paper makes the first attempt to quantize C… ▽ More Capsule Networks (CapsNets), recently proposed by the Google Brain team, have superior learning capabilities in machine learning tasks, like image classification, compared to the traditional CNNs. However, CapsNets require extremely intense computations and are difficult to be deployed in their original form at the resource-constrained edge devices. This paper makes the first attempt to quantize CapsNet models, to enable their efficient edge implementations, by developing a specialized quantization framework for CapsNets. We evaluate our framework for several benchmarks. On a deep CapsNet model for the CIFAR10 dataset, the framework reduces the memory footprint by 6.2x, with only 0.15% accuracy loss. We will open-source our framework at https://git.io/JvDIF in August 2020. △ Less

Submitted 17 April, 2020; v1 submitted 15 April, 2020; originally announced April 2020.

Comments: Accepted for publication at Design Automation Conference 2020 (DAC 2020)

arXiv:1905.10142 [pdf, other]

doi 10.1109/IJCNN48605.2020.9207533

FasTrCaps: An Integrated Framework for Fast yet Accurate Training of Capsule Networks

Authors: Alberto Marchisio, Beatrice Bussolino, Alessio Colucci, Muhammad Abdullah Hanif, Maurizio Martina, Guido Masera, Muhammad Shafique

Abstract: Recently, Capsule Networks (CapsNets) have shown improved performance compared to the traditional Convolutional Neural Networks (CNNs), by encoding and preserving spatial relationships between the detected features in a better way. This is achieved through the so-called Capsules (i.e., groups of neurons) that encode both the instantiation probability and the spatial information. However, one of th… ▽ More Recently, Capsule Networks (CapsNets) have shown improved performance compared to the traditional Convolutional Neural Networks (CNNs), by encoding and preserving spatial relationships between the detected features in a better way. This is achieved through the so-called Capsules (i.e., groups of neurons) that encode both the instantiation probability and the spatial information. However, one of the major hurdles in the wide adoption of CapsNets is their gigantic training time, which is primarily due to the relatively higher complexity of their new constituting elements that are different from CNNs. In this paper, we implement different optimizations in the training loop of the CapsNets, and investigate how these optimizations affect their training speed and the accuracy. Towards this, we propose a novel framework FasTrCaps that integrates multiple lightweight optimizations and a novel learning rate policy called WarmAdaBatch (that jointly performs warm restarts and adaptive batch size), and steers them in an appropriate way to provide high training-loop speedup at minimal accuracy loss. We also propose weight sharing for capsule layers. The goal is to reduce the hardware requirements of CapsNets by removing unused/redundant connections and capsules, while keeping high accuracy through tests of different learning rate policies and batch sizes. We demonstrate that one of the solutions generated by the FasTrCaps framework can achieve 58.6% reduction in the training time, while preserving the accuracy (even 0.12% accuracy improvement for the MNIST dataset), compared to the CapsNet by Google Brain. The Pareto-optimal solutions generated by FasTrCaps can be leveraged to realize trade-offs between training time and achieved accuracy. We have open-sourced our framework on https://github.com/Alexei95/FasTrCaps. △ Less

Submitted 18 May, 2020; v1 submitted 24 May, 2019; originally announced May 2019.

Comments: Accepted for publication at the 2020 International Joint Conference on Neural Networks (IJCNN)

arXiv:1802.00580 [pdf, ps, other]

A Multi-Kernel Multi-Code Polar Decoder Architecture

Authors: Gabriele Coppolino, Carlo Condo, Guido Masera, Warren J. Gross

Abstract: Polar codes have received increasing attention in the past decade, and have been selected for the next generation of wireless communication standard. Most research on polar codes has focused on codes constructed from a $2\times2$ polarization matrix, called binary kernel: codes constructed from binary kernels have code lengths that are bound to powers of $2$. A few recent works have proposed const… ▽ More Polar codes have received increasing attention in the past decade, and have been selected for the next generation of wireless communication standard. Most research on polar codes has focused on codes constructed from a $2\times2$ polarization matrix, called binary kernel: codes constructed from binary kernels have code lengths that are bound to powers of $2$. A few recent works have proposed construction methods based on multiple kernels of different dimensions, not only binary ones, allowing code lengths different from powers of $2$. In this work, we design and implement the first multi-kernel successive cancellation polar code decoder in literature. It can decode any code constructed with binary and ternary kernels: the architecture, sized for a maximum code length $N_{max}$, is fully flexible in terms of code length, code rate and kernel sequence. The decoder can achieve frequency of more than $1$ GHz in $65$ nm CMOS technology, and a throughput of $615$ Mb/s. The area occupation ranges between $0.11$ mm$^2$ for $N_{max}=256$ and $2.01$ mm$^2$ for $N_{max}=4096$. Implementation results show an unprecedented degree of flexibility: with $N_{max}=4096$, up to $55$ code lengths can be decoded with the same hardware, along with any kernel sequence and code rate. △ Less

Submitted 2 February, 2018; originally announced February 2018.

arXiv:1301.1465

A joint communication and application simulator for NoC-based SoCs

Authors: Carlo Condo, Amer Baghdadi, Guido Masera

Abstract: NoCs have become a widespread paradigm in the system-on-chip design world, not only for multi-purpose SoCs, but also for application-specific ICs. The common approach in the NoC design world is to separate the design of the interconnection from the design of the processing elements: this is well suited for a large number of developments, but the need for joint application and NoC design is not unc… ▽ More NoCs have become a widespread paradigm in the system-on-chip design world, not only for multi-purpose SoCs, but also for application-specific ICs. The common approach in the NoC design world is to separate the design of the interconnection from the design of the processing elements: this is well suited for a large number of developments, but the need for joint application and NoC design is not uncommon, especially in the application specific case. The correlation between processing and communication tasks can be strong, and separate or trace-based simulations fall often short of the desired precision. In this work, the OMNET++ based JANoCS simulator is presented: concurrent simulation of processing and communication allow cycle-accurate evaluation of the system. Two cases of study are presented, showing both the need for joint simulations and the effectiveness of JANoCS. △ Less

Submitted 31 May, 2013; v1 submitted 8 January, 2013; originally announced January 2013.

Comments: Withdrawn, due to extended and revised version being published

arXiv:1105.2624 [pdf, other]

A Flexible LDPC code decoder with a Network on Chip as underlying interconnect architecture

Authors: Carlo Condo, Guido Masera

Abstract: LDPC (Low Density Parity Check) codes are among the most powerful and widely adopted modern error correcting codes. The iterative decoding algorithms required for these codes involve high computational complexity and high processing throughput is achieved by allocating a sufficient number of processing elements (PEs). Supporting multiple heterogeneous LDPC codes on a parallel decoder poses serious… ▽ More LDPC (Low Density Parity Check) codes are among the most powerful and widely adopted modern error correcting codes. The iterative decoding algorithms required for these codes involve high computational complexity and high processing throughput is achieved by allocating a sufficient number of processing elements (PEs). Supporting multiple heterogeneous LDPC codes on a parallel decoder poses serious problems in the design of the interconnect structure for such PEs. The aim of this work is to explore the feasibility of NoC (Network on Chip) based decoders, where full flexibility in terms of supported LDPC codes is obtained resorting to an NoC to connect PEs. NoC based LDPC decoders have been previously considered unfeasible because of the cost overhead associated to packet management and routing. On the contrary, the designed NoC adopts a low complexity routing, which introduces a very limited cost overhead with respect to architectures dedicated to specific classes of codes. Moreover the paper proposes an efficient configuration technique, which allows for fast on--the--fly switching among different codes. The decoder architecture is scalable and VLSI synthesis results are presented for several cases of study, including the whole set of WiMAX LDPC codes, WiFi codes and DVB-S2 standard. △ Less

Submitted 13 May, 2011; originally announced May 2011.

arXiv:1105.1014 [pdf, ps, other]

Improving Network-on-Chip-based turbo decoder architectures

Authors: Maurizio Martina, Guido Masera

Abstract: In this work novel results concerning Network-on-Chip-based turbo decoder architectures are presented. Stemming from previous publications, this work concentrates first on improving the throughput by exploiting adaptive-bandwidth reduction techniques. This technique shows in the best case an improvement of more than 60 Mb/s. Moreover, it is known that double-binary turbo decoders require higher ar… ▽ More In this work novel results concerning Network-on-Chip-based turbo decoder architectures are presented. Stemming from previous publications, this work concentrates first on improving the throughput by exploiting adaptive-bandwidth reduction techniques. This technique shows in the best case an improvement of more than 60 Mb/s. Moreover, it is known that double-binary turbo decoders require higher area than binary ones. This characteristic has the negative effect of increasing the data width of the network nodes. Thus, the second contribution of this work is to reduce the network complexity to support doublebinary codes, by exploiting bit-level and pseudo-floating-point representation of the extrinsic information. These two techniques allow for an area reduction of up to more than the 40% with a performance degradation of about 0.2 dB. △ Less

Submitted 5 May, 2011; originally announced May 2011.

arXiv:1006.4030 [pdf]

A Novel VLSI Architecture of Fixed-complexity Sphere Decoder

Authors: Bin Wu, Guido Masera

Abstract: Fixed-complexity Sphere Decoder (FSD) is a recently proposed technique for Multiple-Input Multiple-Output (MIMO) detection. It has several outstanding features such as constant throughput and large potential parallelism, which makes it suitable for efficient VLSI implementation. However, to our best knowledge, no VLSI implementation of FSD has been reported in the literature, although some FPGA pr… ▽ More Fixed-complexity Sphere Decoder (FSD) is a recently proposed technique for Multiple-Input Multiple-Output (MIMO) detection. It has several outstanding features such as constant throughput and large potential parallelism, which makes it suitable for efficient VLSI implementation. However, to our best knowledge, no VLSI implementation of FSD has been reported in the literature, although some FPGA prototypes of FSD with pipeline architecture have been developed. These solutions achieve very high throughput but at very high cost of hardware resources, making them impractical in real applications. In this paper, we present a novel four-nodes-per-cycle parallel architecture of FSD, with a breadth-first processing that allows for short critical path. The implementation achieves a throughput of 213.3 Mbps at 400 MHz clock frequency, at a cost of 0.18 mm2 Silicon area on 0.13μm CMOS technology. The proposed solution is much more economical compared with the existing FPGA implementations, and very suitable for practicl applications because of its balanced performance and hardware-complexity; moreover it has the flexibility to be expanded into an eight-nodes-per-cycle version in order to double the throughput. △ Less

Submitted 21 June, 2010; originally announced June 2010.

Comments: 8 pages, this paper has been accepted by the conference DSD 2010

arXiv:1001.4694 [pdf]

VLSI Architectures for WIMAX Channel Decoders

Authors: Maurizio Martina, Guido Masera

Abstract: This chapter describes the main architectures proposed in the literature to implement the channel decoders required by the WiMax standard, namely convolutional codes, turbo codes (both block and convolutional) and LDPC. Then it shows a complete design of a convolutional turbo code encoder/decoder system for WiMax. This chapter describes the main architectures proposed in the literature to implement the channel decoders required by the WiMax standard, namely convolutional codes, turbo codes (both block and convolutional) and LDPC. Then it shows a complete design of a convolutional turbo code encoder/decoder system for WiMax. △ Less

Submitted 26 January, 2010; originally announced January 2010.

Comments: To appear in the book "WIMAX, New Developments", M. Upena, D. Dalal, Y. Kosta (Ed.), ISBN978-953-7619-53-4

arXiv:0909.1876 [pdf, ps, other]

doi 10.1109/TCSI.2010.2046257

Turbo NOC: a framework for the design of Network On Chip based turbo decoder architectures

Authors: Maurizio Martina, Guido Masera

Abstract: This work proposes a general framework for the design and simulation of network on chip based turbo decoder architectures. Several parameters in the design space are investigated, namely the network topology, the parallelism degree, the rate at which messages are sent by processing nodes over the network and the routing strategy. The main results of this analysis are: i) the most suited topologi… ▽ More This work proposes a general framework for the design and simulation of network on chip based turbo decoder architectures. Several parameters in the design space are investigated, namely the network topology, the parallelism degree, the rate at which messages are sent by processing nodes over the network and the routing strategy. The main results of this analysis are: i) the most suited topologies to achieve high throughput with a limited complexity overhead are generalized de-Bruijn and generalized Kautz topologies; ii) depending on the throughput requirements different parallelism degrees, message injection rates and routing algorithms can be used to minimize the network area overhead. △ Less

Submitted 10 September, 2009; originally announced September 2009.

Comments: submitted to IEEE Trans. on Circuits and Systems I (submission date 27 may 2009)

arXiv:0711.2383 [pdf, ps, other]

Decoding the Golden Code: a VLSI design

Authors: Barbara Cerato, Guido Masera, Emanuele Viterbo

Abstract: The recently proposed Golden code is an optimal space-time block code for 2 X 2 multiple-input multiple-output (MIMO) systems. The aim of this work is the design of a VLSI decoder for a MIMO system coded with the Golden code. The architecture is based on a rearrangement of the sphere decoding algorithm that achieves maximum-likelihood (ML) decoding performance. Compared to other approaces, the p… ▽ More The recently proposed Golden code is an optimal space-time block code for 2 X 2 multiple-input multiple-output (MIMO) systems. The aim of this work is the design of a VLSI decoder for a MIMO system coded with the Golden code. The architecture is based on a rearrangement of the sphere decoding algorithm that achieves maximum-likelihood (ML) decoding performance. Compared to other approaces, the proposed solution exhibits an inherent flexibility in terms of modulation schemes QAM modulation size and this makes our architecture particularly suitable for adaptive modulation schemes. △ Less

Submitted 15 November, 2007; originally announced November 2007.

Comments: 25 pages, 10 figures

ACM Class: B.7.1

arXiv:0710.4840 [pdf]

Testing Logic Cores using a BIST P1500 Compliant Approach: A Case of Study

Authors: P. Bernardi, G. Masera, F. Quaglio, M. Sonza Reorda

Abstract: In this paper we describe how we applied a BIST-based approach to the test of a logic core to be included in System-on-a-chip (SoC) environments. The approach advantages are the ability to protect the core IP, the simple test interface (thanks also to the adoption of the P1500 standard), the possibility to run the test at-speed, the reduced test time, and the good diagnostic capabilities. The pa… ▽ More In this paper we describe how we applied a BIST-based approach to the test of a logic core to be included in System-on-a-chip (SoC) environments. The approach advantages are the ability to protect the core IP, the simple test interface (thanks also to the adoption of the P1500 standard), the possibility to run the test at-speed, the reduced test time, and the good diagnostic capabilities. The paper reports figures about the achieved fault coverage, the required area overhead, and the performance slowdown, and compares the figures with those for alternative approaches, such as those based on full scan and sequential ATPG. △ Less

Submitted 25 October, 2007; originally announced October 2007.

Comments: Submitted on behalf of EDAA (http://www.edaa.com/)

Journal ref: Dans Design, Automation and Test in Europe | Designers'Forum - DATE'05, Munich : Allemagne (2005)

Showing 1–22 of 22 results for author: Masera, G