Showing 1–17 of 17 results for author: Borghesi, A

  1. arXiv:2308.15481  [pdf, other]

    cs.DC cs.AI cs.LG

    Online Job Failure Prediction in an HPC System

    Authors: Francesco Antici, Andrea Borghesi, Zeynep Kiziltan

    Abstract: Modern High Performance Computing (HPC) systems are complex machines, with major impacts on economy and society. Along with their computational capability, their energy consumption is also steadily rising, representing a critical issue given the ongoing environmental and energy crisis. Therefore, developing strategies to optimize HPC system management has paramount importance, both to guarante…

    Submitted 30 June, 2023; originally announced August 2023.

  2. Design of an energy aware petaflops class high performance cluster based on power architecture

    Authors: W. A. Ahmad, A. Bartolini, F. Beneventi, L. Benini, A. Borghesi, M. Cicala, P. Forestieri, C. Gianfreda, D. Gregori, A. Libri, F. Spiga, S. Tinti

    Abstract: In this paper we present D.A.V.I.D.E. (Development for an Added Value Infrastructure Designed in Europe), an innovative and energy efficient High Performance Computing cluster designed by E4 Computer Engineering for PRACE (Partnership for Advanced Computing in Europe). D.A.V.I.D.E. is built using best-in-class components (IBM's POWER8-NVLink CPUs, NVIDIA TESLA P100 GPUs, Mellanox InfiniBand EDR 10…

    Submitted 11 July, 2023; originally announced July 2023.

  3. arXiv:2208.13169  [pdf, other]

    cs.LG cs.AI

    RUAD: unsupervised anomaly detection in HPC systems

    Authors: Martin Molan, Andrea Borghesi, Daniele Cesarini, Luca Benini, Andrea Bartolini

    Abstract: The increasing complexity of modern high-performance computing (HPC) systems necessitates the introduction of automated and data-driven methodologies to support system administrators' effort toward increasing the system's availability. Anomaly detection is an integral part of improving the availability as it eases the system administrator's burden and reduces the time between an anomaly and its re…

    Submitted 28 August, 2022; originally announced August 2022.

    MSC Class: 68T07 (Primary) 68U01; 68T01 (Secondary) ACM Class: I.2; I.2.6

  4. arXiv:2205.10157  [pdf, ps, other]

    cs.LG

    Machine Learning for Combinatorial Optimisation of Partially-Specified Problems: Regret Minimisation as a Unifying Lens

    Authors: Stefano Teso, Laurens Bliek, Andrea Borghesi, Michele Lombardi, Neil Yorke-Smith, Tias Guns, Andrea Passerini

    Abstract: It is increasingly common to solve combinatorial optimisation problems that are partially-specified. We survey the case where the objective function or the relations between variables are not known or are only partially specified. The challenge is to learn them from available data, while taking into account a set of hard constraints that a solution must satisfy, and that solving the optimisation p…

    Submitted 20 May, 2022; originally announced May 2022.

  5. arXiv:2103.02346  [pdf, ps, other]

    cs.LG

    Deep Learning for Virus-Spreading Forecasting: a Brief Survey

    Authors: Federico Baldo, Lorenzo Dall'Olio, Mattia Ceccarelli, Riccardo Scheda, Michele Lombardi, Andrea Borghesi, Stefano Diciotti, Michela Milano

    Abstract: The advent of the coronavirus pandemic has sparked the interest in predictive models capable of forecasting virus-spreading, especially for boosting and supporting decision-making processes. In this paper, we will outline the main Deep Learning approaches aimed at predicting the spreading of a disease in space and time. The aim is to show the emerging trends in this area of research and provide a…

    Submitted 3 March, 2021; originally announced March 2021.

  6. A Machine Learning Approach to Online Fault Classification in HPC Systems

    Authors: Alessio Netti, Zeynep Kiziltan, Ozalp Babaoglu, Alina Sirbu, Andrea Bartolini, Andrea Borghesi

    Abstract: As High-Performance Computing (HPC) systems strive towards the exascale goal, failure rates both at the hardware and software levels will increase significantly. Thus, detecting and classifying faults in HPC systems as they occur and initiating corrective actions before they can transform into failures becomes essential for continued operation. Central to this objective is fault injection, which i…

    Submitted 27 July, 2020; originally announced July 2020.

    Comments: arXiv admin note: text overlap with arXiv:1807.10056, arXiv:1810.11208

    Journal ref: Future Generation Computer Systems, Volume 110, September 2020, Pages 1009-1022

  7. arXiv:2006.04603  [pdf, other]

    eess.IV cs.CV cs.LG

    BS-Net: learning COVID-19 pneumonia severity on a large Chest X-Ray dataset

    Authors: Alberto Signoroni, Mattia Savardi, Sergio Benini, Nicola Adami, Riccardo Leonardi, Paolo Gibellini, Filippo Vaccher, Marco Ravanelli, Andrea Borghesi, Roberto Maroldi, Davide Farina

    Abstract: In this work we design an end-to-end deep learning architecture for predicting, on Chest X-ray images (CXR), a multi-regional score conveying the degree of lung compromise in COVID-19 patients. Such semi-quantitative scoring system, namely Brixia score, is applied in serial monitoring of such patients, showing significant prognostic value, in one of the hospitals that experienced one of the highe…

    Submitted 3 April, 2021; v1 submitted 8 June, 2020; originally announced June 2020.

    Comments: 28 pages, 11 figures, preprint of accepted paper to Medical Image Analysis, Project page with Code and Dataset Available at https://brixia.github.io/

    MSC Class: 68T45 ACM Class: I.2.10; I.5; I.4; J.3

  8. arXiv:2005.10691  [pdf, ps, other]

    cs.LG cs.AI stat.ML

    Improving Deep Learning Models via Constraint-Based Domain Knowledge: a Brief Survey

    Authors: Andrea Borghesi, Federico Baldo, Michela Milano

    Abstract: Deep Learning (DL) models proved themselves to perform extremely well on a wide variety of learning tasks, as they can learn useful patterns from large data sets. However, purely data-driven models might struggle when very difficult functions need to be learned or when there is not enough available training data. Fortunately, in many domains prior information can be retrieved and used to boost the…

    Submitted 19 May, 2020; originally announced May 2020.

  9. arXiv:2005.10674  [pdf, ps, other]

    cs.LG cs.AI stat.ML

    An Analysis of Regularized Approaches for Constrained Machine Learning

    Authors: Michele Lombardi, Federico Baldo, Andrea Borghesi, Michela Milano

    Abstract: Regularization-based approaches for injecting constraints in Machine Learning (ML) were introduced to improve a predictive model via expert knowledge. We tackle the issue of finding the right balance between the loss (the accuracy of the learner) and the regularization term (the degree of constraint satisfaction). The key result of this paper is the formal demonstration that this type of approach…

    Submitted 20 May, 2020; originally announced May 2020.

  10. Combining Learning and Optimization for Transprecision Computing

    Authors: Andrea Borghesi, Giuseppe Tagliavini, Michele Lombardi, Luca Benini, Michela Milano

    Abstract: The growing demands of the worldwide IT infrastructure stress the need for reduced power consumption, which is addressed in so-called transprecision computing by improving energy efficiency at the expense of precision. For example, reducing the number of bits for some floating-point operations leads to higher efficiency, but also to a non-linear decrease of the computation accuracy. Depending on t…

    Submitted 24 February, 2020; originally announced February 2020.

    Journal ref: Proceedings of the 17th ACM International Conference on Computing Frontiers, May 2020, Pages 10-18

  11. Injective Domain Knowledge in Neural Networks for Transprecision Computing

    Authors: Andrea Borghesi, Federico Baldo, Michele Lombardi, Michela Milano

    Abstract: Machine Learning (ML) models are very effective in many learning tasks, due to the capability to extract meaningful information from large data sets. Nevertheless, there are learning problems that cannot be easily solved relying on pure data, e.g. scarce data or very complex functions to be approximated. Fortunately, in many contexts domain knowledge is explicitly available and can be used to trai…

    Submitted 24 February, 2020; originally announced February 2020.

    Journal ref: Nicosia G. et al. (eds) Machine Learning, Optimization, and Data Science. LOD 2020. Lecture Notes in Computer Science, vol 12565. Springer, Cham

  12. arXiv:1909.12684  [pdf, other]

    cs.DC

    COUNTDOWN Slack: a Run-time Library to Reduce Energy Footprint in Large-scale MPI Applications

    Authors: Daniele Cesarini, Andrea Bartolini, Andrea Borghesi, Carlo Cavazzoni, Mathieu Luisier, Luca Benini

    Abstract: The power consumption of supercomputers is a major challenge for system owners, users, and society. It limits the capacity of system installations, it requires large cooling infrastructures, and it is the cause of a large carbon footprint. Reducing power during application execution without changing the application source code or increasing time-to-completion is highly desirable in real-life high-…

    Submitted 27 September, 2019; originally announced September 2019.

    Comments: 13 pages, 4 figures, 3 tables

  13. Online Anomaly Detection in HPC Systems

    Authors: Andrea Borghesi, Antonio Libri, Luca Benini, Andrea Bartolini

    Abstract: Reliability is a cumbersome problem in High Performance Computing Systems and Data Centers evolution. During operation, several types of fault conditions or anomalies can arise, ranging from malfunctioning hardware to improper configurations or imperfect software. Currently, system administrators and final users have to discover them manually. Clearly this approach does not scale to large scale super…

    Submitted 22 February, 2019; originally announced February 2019.

    Comments: Preprint of paper submitted to and accepted at the AICAS2019 Conference (1st IEEE International Conference on Artificial Intelligence Circuits and Systems)

    Journal ref: 2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS), Hsinchu, Taiwan, 2019, pp. 229-233

  14. Anomaly Detection using Autoencoders in High Performance Computing Systems

    Authors: Andrea Borghesi, Andrea Bartolini, Michele Lombardi, Michela Milano, Luca Benini

    Abstract: Anomaly detection in supercomputers is a very difficult problem due to the big scale of the systems and the high number of components. The current state of the art for automated anomaly detection employs Machine Learning methods or statistical regression models in a supervised fashion, meaning that the detection tool is trained to distinguish among a fixed set of behaviour classes (healthy and unh…

    Submitted 13 November, 2018; originally announced November 2018.

    Comments: 9 pages, 3 figures

    Journal ref: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pages 9428-9433, 2019

  15. arXiv:1810.11208  [pdf, other]

    cs.DC

    Online Fault Classification in HPC Systems through Machine Learning

    Authors: Alessio Netti, Zeynep Kiziltan, Ozalp Babaoglu, Alina Sirbu, Andrea Bartolini, Andrea Borghesi

    Abstract: As High-Performance Computing (HPC) systems strive towards the exascale goal, studies suggest that they will experience excessive failure rates. For this reason, detecting and classifying faults in HPC systems as they occur and initiating corrective actions before they can transform into failures will be essential for continued operation. In this paper, we propose a fault classification method for…

    Submitted 11 July, 2019; v1 submitted 26 October, 2018; originally announced October 2018.

    Comments: Accepted for publication at the Euro-Par 2019 conference

  16. arXiv:1807.10056  [pdf, other]

    cs.DC

    FINJ: A Fault Injection Tool for HPC Systems

    Authors: Alessio Netti, Zeynep Kiziltan, Ozalp Babaoglu, Alina Sirbu, Andrea Bartolini, Andrea Borghesi

    Abstract: We present FINJ, a high-level fault injection tool for High-Performance Computing (HPC) systems, with a focus on the management of complex experiments. FINJ provides support for custom workloads and allows generation of anomalous conditions through the use of fault-triggering executable programs. FINJ can also be integrated seamlessly with most other lower-level fault injection tools, allowing use…

    Submitted 1 September, 2018; v1 submitted 26 July, 2018; originally announced July 2018.

    Comments: To be presented at the 11th Resilience Workshop in the 2018 Euro-Par conference

  17. Pricing Schemes for Energy-Efficient HPC Systems: Design and Exploration

    Authors: Andrea Borghesi, Andrea Bartolini, Michela Milano, Luca Benini

    Abstract: Energy efficiency is of paramount importance for the sustainability of HPC systems. Energy consumption limits the peak performance of supercomputers and accounts for a large share of total cost of ownership. Consequently, system owners and final users have started exploring mechanisms to trade off performance for power consumption, for example through frequency and voltage scaling. However, only…

    Submitted 13 June, 2018; originally announced June 2018.

    Journal ref: The International Journal of High Performance Computing Applications. Volume: 33 issue: 4, page(s): 716-734, 2019