subscribe to arXiv mailings

doi 10.1007/s00521-023-09356-5

U-DIADS-Bib: a full and few-shot pixel-precise dataset for document layout analysis of ancient manuscripts

Authors: Silvia Zottin, Axel De Nardin, Emanuela Colombi, Claudio Piciarelli, Filippo Pavan, Gian Luca Foresti

Abstract: Document Layout Analysis, which is the task of identifying different semantic regions inside of a document page, is a subject of great interest for both computer scientists and humanities scholars as it represents a fundamental step towards further analysis tasks for the former and a powerful tool to improve and facilitate the study of the documents for the latter. However, many of the works curre… ▽ More Document Layout Analysis, which is the task of identifying different semantic regions inside of a document page, is a subject of great interest for both computer scientists and humanities scholars as it represents a fundamental step towards further analysis tasks for the former and a powerful tool to improve and facilitate the study of the documents for the latter. However, many of the works currently present in the literature, especially when it comes to the available datasets, fail to meet the needs of both worlds and, in particular, tend to lean towards the needs and common practices of the computer science side, leading to resources that are not representative of the humanities real needs. For this reason, the present paper introduces U-DIADS-Bib, a novel, pixel-precise, non-overlapping and noiseless document layout analysis dataset developed in close collaboration between specialists in the fields of computer vision and humanities. Furthermore, we propose a novel, computer-aided, segmentation pipeline in order to alleviate the burden represented by the time-consuming process of manual annotation, necessary for the generation of the ground truth segmentation maps. Finally, we present a standardized few-shot version of the dataset (U-DIADS-BibFS), with the aim of encouraging the development of models and solutions able to address this task with as few samples as possible, which would allow for more effective use in a real-world scenario, where collecting a large number of segmentations is not always feasible. △ Less

Submitted 16 January, 2024; originally announced January 2024.

Comments: Neural Comput & Applic (2024)

arXiv:2308.00155 [pdf, other]

Federated Learning for Data and Model Heterogeneity in Medical Imaging

Authors: Hussain Ahmad Madni, Rao Muhammad Umer, Gian Luca Foresti

Abstract: Federated Learning (FL) is an evolving machine learning method in which multiple clients participate in collaborative learning without sharing their data with each other and the central server. In real-world applications such as hospitals and industries, FL counters the challenges of data heterogeneity and model heterogeneity as an inevitable part of the collaborative training. More specifically,… ▽ More Federated Learning (FL) is an evolving machine learning method in which multiple clients participate in collaborative learning without sharing their data with each other and the central server. In real-world applications such as hospitals and industries, FL counters the challenges of data heterogeneity and model heterogeneity as an inevitable part of the collaborative training. More specifically, different organizations, such as hospitals, have their own private data and customized models for local training. To the best of our knowledge, the existing methods do not effectively address both problems of model heterogeneity and data heterogeneity in FL. In this paper, we exploit the data and model heterogeneity simultaneously, and propose a method, MDH-FL (Exploiting Model and Data Heterogeneity in FL) to solve such problems to enhance the efficiency of the global model in FL. We use knowledge distillation and a symmetric loss to minimize the heterogeneity and its impact on the model performance. Knowledge distillation is used to solve the problem of model heterogeneity, and symmetric loss tackles with the data and label heterogeneity. We evaluate our method on the medical datasets to conform the real-world scenario of hospitals, and compare with the existing methods. The experimental results demonstrate the superiority of the proposed approach over the other existing methods. △ Less

Submitted 31 July, 2023; originally announced August 2023.

Comments: Published in ICIAP2023 Workshop on Federated Learning in Medical Imaging and Vision

arXiv:2210.15570 [pdf, other]

doi 10.1109/WACV56688.2023.00367

Efficient few-shot learning for pixel-precise handwritten document layout analysis

Authors: Axel De Nardin, Silvia Zottin, Matteo Paier, Gian Luca Foresti, Emanuela Colombi, Claudio Piciarelli

Abstract: Layout analysis is a task of uttermost importance in ancient handwritten document analysis and represents a fundamental step toward the simplification of subsequent tasks such as optical character recognition and automatic transcription. However, many of the approaches adopted to solve this problem rely on a fully supervised learning paradigm. While these systems achieve very good performance on t… ▽ More Layout analysis is a task of uttermost importance in ancient handwritten document analysis and represents a fundamental step toward the simplification of subsequent tasks such as optical character recognition and automatic transcription. However, many of the approaches adopted to solve this problem rely on a fully supervised learning paradigm. While these systems achieve very good performance on this task, the drawback is that pixel-precise text labeling of the entire training set is a very time-consuming process, which makes this type of information rarely available in a real-world scenario. In the present paper, we address this problem by proposing an efficient few-shot learning framework that achieves performances comparable to current state-of-the-art fully supervised methods on the publicly available DIVA-HisDB dataset. △ Less

Submitted 27 October, 2022; originally announced October 2022.

Comments: Accepted for publication at IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023

arXiv:2210.15540 [pdf, other]

doi 10.1142/S0129065722500307

Masked Transformer for image Anomaly Localization

Authors: Axel De Nardin, Pankaj Mishra, Gian Luca Foresti, Claudio Piciarelli

Abstract: Image anomaly detection consists in detecting images or image portions that are visually different from the majority of the samples in a dataset. The task is of practical importance for various real-life applications like biomedical image analysis, visual inspection in industrial production, banking, traffic management, etc. Most of the current deep learning approaches rely on image reconstruction… ▽ More Image anomaly detection consists in detecting images or image portions that are visually different from the majority of the samples in a dataset. The task is of practical importance for various real-life applications like biomedical image analysis, visual inspection in industrial production, banking, traffic management, etc. Most of the current deep learning approaches rely on image reconstruction: the input image is projected in some latent space and then reconstructed, assuming that the network (mostly trained on normal data) will not be able to reconstruct the anomalous portions. However, this assumption does not always hold. We thus propose a new model based on the Vision Transformer architecture with patch masking: the input image is split in several patches, and each patch is reconstructed only from the surrounding data, thus ignoring the potentially anomalous information contained in the patch itself. We then show that multi-resolution patches and their collective embeddings provide a large improvement in the model's performance compared to the exclusive use of the traditional square patches. The proposed model has been tested on popular anomaly detection datasets such as MVTec and head CT and achieved good results when compared to other state-of-the-art approaches. △ Less

Submitted 27 October, 2022; originally announced October 2022.

Journal ref: Int J Neural Syst. 2022;32(7):2250030

arXiv:2203.14031 [pdf, other]

Medicinal Boxes Recognition on a Deep Transfer Learning Augmented Reality Mobile Application

Authors: Danilo Avola, Luigi Cinque, Alessio Fagioli, Gian Luca Foresti, Marco Raoul Marini, Alessio Mecca, Daniele Pannone

Abstract: Taking medicines is a fundamental aspect to cure illnesses. However, studies have shown that it can be hard for patients to remember the correct posology. More aggravating, a wrong dosage generally causes the disease to worsen. Although, all relevant instructions for a medicine are summarized in the corresponding patient information leaflet, the latter is generally difficult to navigate and unders… ▽ More Taking medicines is a fundamental aspect to cure illnesses. However, studies have shown that it can be hard for patients to remember the correct posology. More aggravating, a wrong dosage generally causes the disease to worsen. Although, all relevant instructions for a medicine are summarized in the corresponding patient information leaflet, the latter is generally difficult to navigate and understand. To address this problem and help patients with their medication, in this paper we introduce an augmented reality mobile application that can present to the user important details on the framed medicine. In particular, the app implements an inference engine based on a deep neural network, i.e., a densenet, fine-tuned to recognize a medicinal from its package. Subsequently, relevant information, such as posology or a simplified leaflet, is overlaid on the camera feed to help a patient when taking a medicine. Extensive experiments to select the best hyperparameters were performed on a dataset specifically collected to address this task; ultimately obtaining up to 91.30\% accuracy as well as real-time capabilities. △ Less

Submitted 26 March, 2022; originally announced March 2022.

Comments: 12 pages, 7 figures

arXiv:2203.10009 [pdf, other]

Analyzing EEG Data with Machine and Deep Learning: A Benchmark

Authors: Danilo Avola, Marco Cascio, Luigi Cinque, Alessio Fagioli, Gian Luca Foresti, Marco Raoul Marini, Daniele Pannone

Abstract: Nowadays, machine and deep learning techniques are widely used in different areas, ranging from economics to biology. In general, these techniques can be used in two ways: trying to adapt well-known models and architectures to the available data, or designing custom architectures. In both cases, to speed up the research process, it is useful to know which type of models work best for a specific pr… ▽ More Nowadays, machine and deep learning techniques are widely used in different areas, ranging from economics to biology. In general, these techniques can be used in two ways: trying to adapt well-known models and architectures to the available data, or designing custom architectures. In both cases, to speed up the research process, it is useful to know which type of models work best for a specific problem and/or data type. By focusing on EEG signal analysis, and for the first time in literature, in this paper a benchmark of machine and deep learning for EEG signal classification is proposed. For our experiments we used the four most widespread models, i.e., multilayer perceptron, convolutional neural network, long short-term memory, and gated recurrent unit, highlighting which one can be a good starting point for developing EEG classification models. △ Less

Submitted 18 March, 2022; originally announced March 2022.

Comments: conference, 11 pages, 5 figures

arXiv:2203.05864 [pdf, other]

doi 10.1142/S0129065722500150

Human Silhouette and Skeleton Video Synthesis through Wi-Fi signals

Authors: Danilo Avola, Marco Cascio, Luigi Cinque, Alessio Fagioli, Gian Luca Foresti

Abstract: The increasing availability of wireless access points (APs) is leading towards human sensing applications based on Wi-Fi signals as support or alternative tools to the widespread visual sensors, where the signals enable to address well-known vision-related problems such as illumination changes or occlusions. Indeed, using image synthesis techniques to translate radio frequencies to the visible spe… ▽ More The increasing availability of wireless access points (APs) is leading towards human sensing applications based on Wi-Fi signals as support or alternative tools to the widespread visual sensors, where the signals enable to address well-known vision-related problems such as illumination changes or occlusions. Indeed, using image synthesis techniques to translate radio frequencies to the visible spectrum can become essential to obtain otherwise unavailable visual data. This domain-to-domain translation is feasible because both objects and people affect electromagnetic waves, causing radio and optical frequencies variations. In literature, models capable of inferring radio-to-visual features mappings have gained momentum in the last few years since frequency changes can be observed in the radio domain through the channel state information (CSI) of Wi-Fi APs, enabling signal-based feature extraction, e.g., amplitude. On this account, this paper presents a novel two-branch generative neural network that effectively maps radio data into visual features, following a teacher-student design that exploits a cross-modality supervision strategy. The latter conditions signal-based features in the visual domain to completely replace visual data. Once trained, the proposed method synthesizes human silhouette and skeleton videos using exclusively Wi-Fi signals. The approach is evaluated on publicly available data, where it obtains remarkable results for both silhouette and skeleton videos generation, demonstrating the effectiveness of the proposed cross-modality supervision strategy. △ Less

Submitted 11 March, 2022; originally announced March 2022.

Journal ref: International Journal of Neural Systems, 2022, 2250015

arXiv:2110.02776 [pdf, other]

doi 10.1016/j.neunet.2022.06.030

SIRe-Networks: Convolutional Neural Networks Architectural Extension for Information Preservation via Skip/Residual Connections and Interlaced Auto-Encoders

Authors: Danilo Avola, Luigi Cinque, Alessio Fagioli, Gian Luca Foresti

Abstract: Improving existing neural network architectures can involve several design choices such as manipulating the loss functions, employing a diverse learning strategy, exploiting gradient evolution at training time, optimizing the network hyper-parameters, or increasing the architecture depth. The latter approach is a straightforward solution, since it directly enhances the representation capabilities… ▽ More Improving existing neural network architectures can involve several design choices such as manipulating the loss functions, employing a diverse learning strategy, exploiting gradient evolution at training time, optimizing the network hyper-parameters, or increasing the architecture depth. The latter approach is a straightforward solution, since it directly enhances the representation capabilities of a network; however, the increased depth generally incurs in the well-known vanishing gradient problem. In this paper, borrowing from different methods addressing this issue, we introduce an interlaced multi-task learning strategy, defined SIRe, to reduce the vanishing gradient in relation to the object classification task. The presented methodology directly improves a convolutional neural network (CNN) by preserving information from the input image through interlaced auto-encoders (AEs), and further refines the base network architecture by means of skip and residual connections. To validate the presented methodology, a simple CNN and various implementations of famous networks are extended via the SIRe strategy and extensively tested on five collections, i.e., MNIST, Fashion-MNIST, CIFAR-10, CIFAR-100, and Caltech-256; where the SIRe-extended architectures achieve significantly increased performances across all models and datasets, thus confirming the presented approach effectiveness. △ Less

Submitted 26 October, 2022; v1 submitted 6 October, 2021; originally announced October 2021.

Journal ref: Neural Networks 153 (2022): 386-398

arXiv:2109.13879 [pdf, other]

doi 10.1016/j.patcog.2022.108762

3D Hand Pose and Shape Estimation from RGB Images for Keypoint-Based Hand Gesture Recognition

Authors: Danilo Avola, Luigi Cinque, Alessio Fagioli, Gian Luca Foresti, Adriano Fragomeni, Daniele Pannone

Abstract: Estimating the 3D pose of a hand from a 2D image is a well-studied problem and a requirement for several real-life applications such as virtual reality, augmented reality, and hand gesture recognition. Currently, reasonable estimations can be computed from single RGB images, especially when a multi-task learning approach is used to force the system to consider the shape of the hand when its pose i… ▽ More Estimating the 3D pose of a hand from a 2D image is a well-studied problem and a requirement for several real-life applications such as virtual reality, augmented reality, and hand gesture recognition. Currently, reasonable estimations can be computed from single RGB images, especially when a multi-task learning approach is used to force the system to consider the shape of the hand when its pose is determined. However, depending on the method used to represent the hand, the performance can drop considerably in real-life tasks, suggesting that stable descriptions are required to achieve satisfactory results. In this paper, we present a keypoint-based end-to-end framework for 3D hand and pose estimation and successfully apply it to the task of hand gesture recognition as a study case. Specifically, after a pre-processing step in which the images are normalized, the proposed pipeline uses a multi-task semantic feature extractor generating 2D heatmaps and hand silhouettes from RGB images, a viewpoint encoder to predict the hand and camera view parameters, a stable hand estimator to produce the 3D hand pose and shape, and a loss function to guide all of the components jointly during the learning phase. Tests were performed on a 3D pose and shape estimation benchmark dataset to assess the proposed framework, which obtained state-of-the-art performance. Our system was also evaluated on two hand-gesture recognition benchmark datasets and significantly outperformed other keypoint-based approaches, indicating that it is an effective solution that is able to generate stable 3D estimates for hand pose and shape. △ Less

Submitted 9 May, 2022; v1 submitted 28 September, 2021; originally announced September 2021.

arXiv:2107.00362 [pdf, other]

doi 10.1049/iet-cvi.2019.0963

Drone swarm patrolling with uneven coverage requirements

Authors: Claudio Piciarelli, Gian Luca Foresti

Abstract: Swarms of drones are being more and more used in many practical scenarios, such as surveillance, environmental monitoring, search and rescue in hardly-accessible areas, etc.. While a single drone can be guided by a human operator, the deployment of a swarm of multiple drones requires proper algorithms for automatic task-oriented control. In this paper, we focus on visual coverage optimization with… ▽ More Swarms of drones are being more and more used in many practical scenarios, such as surveillance, environmental monitoring, search and rescue in hardly-accessible areas, etc.. While a single drone can be guided by a human operator, the deployment of a swarm of multiple drones requires proper algorithms for automatic task-oriented control. In this paper, we focus on visual coverage optimization with drone-mounted camera sensors. In particular, we consider the specific case in which the coverage requirements are uneven, meaning that different parts of the environment have different coverage priorities. We model these coverage requirements with relevance maps and propose a deep reinforcement learning algorithm to guide the swarm. The paper first defines a proper learning model for a single drone, and then extends it to the case of multiple drones both with greedy and cooperative strategies. Experimental results show the performance of the proposed method, also compared with a standard patrolling algorithm. △ Less

Submitted 1 July, 2021; originally announced July 2021.

Comments: This paper has been published on IET Computer Vision. Please cite it accordingly (see journal reference below)

Journal ref: IET Computer Vision, 14: 452-461 (2020)

arXiv:2104.10036 [pdf, other]

doi 10.1109/ISIE45552.2021.9576231

VT-ADL: A Vision Transformer Network for Image Anomaly Detection and Localization

Authors: Pankaj Mishra, Riccardo Verk, Daniele Fornasier, Claudio Piciarelli, Gian Luca Foresti

Abstract: We present a transformer-based image anomaly detection and localization network. Our proposed model is a combination of a reconstruction-based approach and patch embedding. The use of transformer networks helps to preserve the spatial information of the embedded patches, which are later processed by a Gaussian mixture density network to localize the anomalous areas. In addition, we also publish BT… ▽ More We present a transformer-based image anomaly detection and localization network. Our proposed model is a combination of a reconstruction-based approach and patch embedding. The use of transformer networks helps to preserve the spatial information of the embedded patches, which are later processed by a Gaussian mixture density network to localize the anomalous areas. In addition, we also publish BTAD, a real-world industrial anomaly dataset. Our results are compared with other state-of-the-art algorithms using publicly available datasets like MNIST and MVTec. △ Less

Submitted 20 April, 2021; originally announced April 2021.

Comments: 6 Pages, 4 images, conference published paper

Report number: KD-003638

Journal ref: IEEE 30th International Symposium on Industrial Electronics (ISIE), 2021

arXiv:2012.02478 [pdf, other]

Is It a Plausible Colour? UCapsNet for Image Colourisation

Authors: Rita Pucci, Christian Micheloni, Gian Luca Foresti, Niki Martinel

Abstract: Human beings can imagine the colours of a grayscale image with no particular effort thanks to their ability of semantic feature extraction. Can an autonomous system achieve that? Can it hallucinate plausible and vibrant colours? This is the colourisation problem. Different from existing works relying on convolutional neural network models pre-trained with supervision, we cast such colourisation pr… ▽ More Human beings can imagine the colours of a grayscale image with no particular effort thanks to their ability of semantic feature extraction. Can an autonomous system achieve that? Can it hallucinate plausible and vibrant colours? This is the colourisation problem. Different from existing works relying on convolutional neural network models pre-trained with supervision, we cast such colourisation problem as a self-supervised learning task. We tackle the problem with the introduction of a novel architecture based on Capsules trained following the adversarial learning paradigm. Capsule networks are able to extract a semantic representation of the entities in the image but loose details about their spatial information, which is important for colourising a grayscale image. Thus our UCapsNet structure comes with an encoding phase that extracts entities through capsules and spatial details through convolutional neural networks. A decoding phase merges the entity features with the spatial features to hallucinate a plausible colour version of the input datum. Results on the ImageNet benchmark show that our approach is able to generate more vibrant and plausible colours than exiting solutions and achieves superior performance than models pre-trained with supervision. △ Less

Submitted 4 December, 2020; originally announced December 2020.

arXiv:2011.06288 [pdf]

Image Anomaly Detection by Aggregating Deep Pyramidal Representations

Authors: Pankaj Mishra, Claudio Piciarelli, Gian Luca Foresti

Abstract: Anomaly detection consists in identifying, within a dataset, those samples that significantly differ from the majority of the data, representing the normal class. It has many practical applications, e.g. ranging from defective product detection in industrial systems to medical imaging. This paper focuses on image anomaly detection using a deep neural network with multiple pyramid levels to analyze… ▽ More Anomaly detection consists in identifying, within a dataset, those samples that significantly differ from the majority of the data, representing the normal class. It has many practical applications, e.g. ranging from defective product detection in industrial systems to medical imaging. This paper focuses on image anomaly detection using a deep neural network with multiple pyramid levels to analyze the image features at different scales. We propose a network based on encoding-decoding scheme, using a standard convolutional autoencoders, trained on normal data only in order to build a model of normality. Anomalies can be detected by the inability of the network to reconstruct its input. Experimental results show a good accuracy on MNIST, FMNIST and the recent MVTec Anomaly Detection dataset △ Less

Submitted 12 November, 2020; originally announced November 2020.

Comments: Published in First International Conference of Industrial Machine Learning ICPR2020

arXiv:2009.04809 [pdf, other]

Deep Iterative Residual Convolutional Network for Single Image Super-Resolution

Authors: Rao Muhammad Umer, Gian Luca Foresti, Christian Micheloni

Abstract: Deep convolutional neural networks (CNNs) have recently achieved great success for single image super-resolution (SISR) task due to their powerful feature representation capabilities. The most recent deep learning based SISR methods focus on designing deeper / wider models to learn the non-linear mapping between low-resolution (LR) inputs and high-resolution (HR) outputs. These existing SR methods… ▽ More Deep convolutional neural networks (CNNs) have recently achieved great success for single image super-resolution (SISR) task due to their powerful feature representation capabilities. The most recent deep learning based SISR methods focus on designing deeper / wider models to learn the non-linear mapping between low-resolution (LR) inputs and high-resolution (HR) outputs. These existing SR methods do not take into account the image observation (physical) model and thus require a large number of network's trainable parameters with a great volume of training data. To address these issues, we propose a deep Iterative Super-Resolution Residual Convolutional Network (ISRResCNet) that exploits the powerful image regularization and large-scale optimization techniques by training the deep network in an iterative manner with a residual learning approach. Extensive experimental results on various super-resolution benchmarks demonstrate that our method with a few trainable parameters improves the results for different scaling factors in comparison with the state-of-art methods. △ Less

Submitted 7 September, 2020; originally announced September 2020.

Comments: To be appeared in proceedings of the 25th IEEE International Conference on Pattern Recognition (ICPR). arXiv admin note: text overlap with arXiv:2005.00953, arXiv:2009.03693

arXiv:2005.00953 [pdf, other]

Deep Generative Adversarial Residual Convolutional Networks for Real-World Super-Resolution

Authors: Rao Muhammad Umer, Gian Luca Foresti, Christian Micheloni

Abstract: Most current deep learning based single image super-resolution (SISR) methods focus on designing deeper / wider models to learn the non-linear mapping between low-resolution (LR) inputs and the high-resolution (HR) outputs from a large number of paired (LR/HR) training data. They usually take as assumption that the LR image is a bicubic down-sampled version of the HR image. However, such degradati… ▽ More Most current deep learning based single image super-resolution (SISR) methods focus on designing deeper / wider models to learn the non-linear mapping between low-resolution (LR) inputs and the high-resolution (HR) outputs from a large number of paired (LR/HR) training data. They usually take as assumption that the LR image is a bicubic down-sampled version of the HR image. However, such degradation process is not available in real-world settings i.e. inherent sensor noise, stochastic noise, compression artifacts, possible mismatch between image degradation process and camera device. It reduces significantly the performance of current SISR methods due to real-world image corruptions. To address these problems, we propose a deep Super-Resolution Residual Convolutional Generative Adversarial Network (SRResCGAN) to follow the real-world degradation settings by adversarial training the model with pixel-wise supervision in the HR domain from its generated LR counterpart. The proposed network exploits the residual learning by minimizing the energy-based objective function with powerful image regularization and convex optimization techniques. We demonstrate our proposed approach in quantitative and qualitative experiments that generalize robustly to real input and it is easy to deploy for other down-scaling operators and mobile/embedded devices. △ Less

Submitted 2 May, 2020; originally announced May 2020.

arXiv:2004.06154 [pdf, other]

An Efficient UAV-based Artificial Intelligence Framework for Real-Time Visual Tasks

Authors: Enkhtogtokh Togootogtokh, Christian Micheloni, Gian Luca Foresti, Niki Martinel

Abstract: Modern Unmanned Aerial Vehicles equipped with state of the art artificial intelligence (AI) technologies are opening to a wide plethora of novel and interesting applications. While this field received a strong impact from the recent AI breakthroughs, most of the provided solutions either entirely rely on commercial software or provide a weak integration interface which denies the development of ad… ▽ More Modern Unmanned Aerial Vehicles equipped with state of the art artificial intelligence (AI) technologies are opening to a wide plethora of novel and interesting applications. While this field received a strong impact from the recent AI breakthroughs, most of the provided solutions either entirely rely on commercial software or provide a weak integration interface which denies the development of additional techniques. This leads us to propose a novel and efficient framework for the UAV-AI joint technology. Intelligent UAV systems encounter complex challenges to be tackled without human control. One of these complex challenges is to be able to carry out computer vision tasks in real-time use cases. In this paper we focus on this challenge and introduce a multi-layer AI (MLAI) framework to allow easy integration of ad-hoc visual-based AI applications. To show its features and its advantages, we implemented and evaluated different modern visual-based deep learning models for object detection, target tracking and target handover. △ Less

Submitted 13 April, 2020; originally announced April 2020.

arXiv:1910.04856 [pdf, other]

Video-Based Convolutional Attention for Person Re-Identification

Authors: Marco Zamprogno, Marco Passon, Niki Martinel, Giuseppe Serra, Giuseppe Lancioni, Christian Micheloni, Carlo Tasso, Gian Luca Foresti

Abstract: In this paper we consider the problem of video-based person re-identification, which is the task of associating videos of the same person captured by different and non-overlapping cameras. We propose a Siamese framework in which video frames of the person to re-identify and of the candidate one are processed by two identical networks which produce a similarity score. We introduce an attention mech… ▽ More In this paper we consider the problem of video-based person re-identification, which is the task of associating videos of the same person captured by different and non-overlapping cameras. We propose a Siamese framework in which video frames of the person to re-identify and of the candidate one are processed by two identical networks which produce a similarity score. We introduce an attention mechanisms to capture the relevant information both at frame level (spatial information) and at video level (temporal information given by the importance of a specific frame within the sequence). One of the novelties of our approach is given by a joint concurrent processing of both frame and video levels, providing in such a way a very simple architecture. Despite this fact, our approach achieves better performance than the state-of-the-art on the challenging iLIDS-VID dataset. △ Less

Submitted 26 September, 2019; originally announced October 2019.

Comments: 11 pages, 2 figures. Accepted by ICIAP2019, 20th International Conference on IMAGE ANALYSIS AND PROCESSING, Trento, Italy, 9-13 September, 2019

arXiv:1909.08487 [pdf, other]

doi 10.1109/ICCVW.2019.00282

Visual Tracking by means of Deep Reinforcement Learning and an Expert Demonstrator

Authors: Matteo Dunnhofer, Niki Martinel, Gian Luca Foresti, Christian Micheloni

Abstract: In the last decade many different algorithms have been proposed to track a generic object in videos. Their execution on recent large-scale video datasets can produce a great amount of various tracking behaviours. New trends in Reinforcement Learning showed that demonstrations of an expert agent can be efficiently used to speed-up the process of policy learning. Taking inspiration from such works a… ▽ More In the last decade many different algorithms have been proposed to track a generic object in videos. Their execution on recent large-scale video datasets can produce a great amount of various tracking behaviours. New trends in Reinforcement Learning showed that demonstrations of an expert agent can be efficiently used to speed-up the process of policy learning. Taking inspiration from such works and from the recent applications of Reinforcement Learning to visual tracking, we propose two novel trackers, A3CT, which exploits demonstrations of a state-of-the-art tracker to learn an effective tracking policy, and A3CTD, that takes advantage of the same expert tracker to correct its behaviour during tracking. Through an extensive experimental validation on the GOT-10k, OTB-100, LaSOT, UAV123 and VOT benchmarks, we show that the proposed trackers achieve state-of-the-art performance while running in real-time. △ Less

Submitted 18 September, 2019; originally announced September 2019.

Comments: in 2019 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW) - VOT2019 Challenge Workshop

arXiv:1909.03748 [pdf, other]

doi 10.1145/3349801.3349823

Deep Super-Resolution Network for Single Image Super-Resolution with Realistic Degradations

Authors: Rao Muhammad Umer, Gian Luca Foresti, Christian Micheloni

Abstract: Single Image Super-Resolution (SISR) aims to generate a high-resolution (HR) image of a given low-resolution (LR) image. The most of existing convolutional neural network (CNN) based SISR methods usually take an assumption that a LR image is only bicubicly down-sampled version of an HR image. However, the true degradation (i.e. the LR image is a bicubicly downsampled, blurred and noisy version of… ▽ More Single Image Super-Resolution (SISR) aims to generate a high-resolution (HR) image of a given low-resolution (LR) image. The most of existing convolutional neural network (CNN) based SISR methods usually take an assumption that a LR image is only bicubicly down-sampled version of an HR image. However, the true degradation (i.e. the LR image is a bicubicly downsampled, blurred and noisy version of an HR image) of a LR image goes beyond the widely used bicubic assumption, which makes the SISR problem highly ill-posed nature of inverse problems. To address this issue, we propose a deep SISR network that works for blur kernels of different sizes, and different noise levels in an unified residual CNN-based denoiser network, which significantly improves a practical CNN-based super-resolver for real applications. Extensive experimental results on synthetic LR datasets and real images demonstrate that our proposed method not only can produce better results on more realistic degradation but also computational efficient to practical SISR applications. △ Less

Submitted 9 September, 2019; originally announced September 2019.

Comments: 7 pages

Journal ref: 13th International Conference on Distributed Smart Cameras (ICDSC 2019)

arXiv:1909.02755 [pdf]

Image anomaly detection with capsule networks and imbalanced datasets

Authors: Claudio Piciarelli, Pankaj Mishra, Gian Luca Foresti

Abstract: Image anomaly detection consists in finding images with anomalous, unusual patterns with respect to a set of normal data. Anomaly detection can be applied to several fields and has numerous practical applications, e.g. in industrial inspection, medical imaging, security enforcement, etc.. However, anomaly detection techniques often still rely on traditional approaches such as one-class Support Vec… ▽ More Image anomaly detection consists in finding images with anomalous, unusual patterns with respect to a set of normal data. Anomaly detection can be applied to several fields and has numerous practical applications, e.g. in industrial inspection, medical imaging, security enforcement, etc.. However, anomaly detection techniques often still rely on traditional approaches such as one-class Support Vector Machines, while the topic has not been fully developed yet in the context of modern deep learning approaches. In this paper, we propose an image anomaly detection system based on capsule networks under the assumption that anomalous data are available for training but their amount is scarce. △ Less

Submitted 6 September, 2019; originally announced September 2019.

Comments: Published in conference ICIAP 2019

Journal ref: [978-3-030-30641-0, ICIAP 2019, Part I, LNCS 11751, paper approval (489497_1_En, Chapter 23)]

arXiv:1907.09945 [pdf, other]

doi 10.1109/TAFFC.2020.3003816

Deep Temporal Analysis for Non-Acted Body Affect Recognition

Authors: Danilo Avola, Luigi Cinque, Alessio Fagioli, Gian Luca Foresti, Cristiano Massaroni

Abstract: Affective computing is a field of great interest in many computer vision applications, including video surveillance, behaviour analysis, and human-robot interaction. Most of the existing literature has addressed this field by analysing different sets of face features. However, in the last decade, several studies have shown how body movements can play a key role even in emotion recognition. The maj… ▽ More Affective computing is a field of great interest in many computer vision applications, including video surveillance, behaviour analysis, and human-robot interaction. Most of the existing literature has addressed this field by analysing different sets of face features. However, in the last decade, several studies have shown how body movements can play a key role even in emotion recognition. The majority of these experiments on the body are performed by trained actors whose aim is to simulate emotional reactions. These unnatural expressions differ from the more challenging genuine emotions, thus invalidating the obtained results. In this paper, a solution for basic non-acted emotion recognition based on 3D skeleton and Deep Neural Networks (DNNs) is provided. The proposed work introduces three majors contributions. First, unlike the current state-of-the-art in non-acted body affect recognition, where only static or global body features are considered, in this work also temporal local movements performed by subjects in each frame are examined. Second, an original set of global and time-dependent features for body movement description is provided. Third, to the best of out knowledge, this is the first attempt to use deep learning methods for non-acted body affect recognition. Due to the novelty of the topic, only the UCLIC dataset is currently considered the benchmark for comparative tests. On the latter, the proposed method outperforms all the competitors. △ Less

Submitted 23 July, 2019; originally announced July 2019.

Journal ref: IEEE Transactions on Affective Computing 2020

arXiv:1812.01521 [pdf, other]

Localization and Tracking of an Acoustic Source using a Diagonal Unloading Beamforming and a Kalman Filter

Authors: Daniele Salvati, Carlo Drioli, Gian Luca Foresti

Abstract: We present the signal processing framework and some results for the IEEE AASP challenge on acoustic source localization and tracking (LOCATA). The system is designed for the direction of arrival (DOA) estimation in single-source scenarios. The proposed framework consists of four main building blocks: pre-processing, voice activity detection (VAD), localization, tracking. The signal pre-processing… ▽ More We present the signal processing framework and some results for the IEEE AASP challenge on acoustic source localization and tracking (LOCATA). The system is designed for the direction of arrival (DOA) estimation in single-source scenarios. The proposed framework consists of four main building blocks: pre-processing, voice activity detection (VAD), localization, tracking. The signal pre-processing pipeline includes the short-time Fourier transform (STFT) of the multichannel input captured by the array and the cross power spectral density (CPSD) matrices estimation. The VAD is calculated with a trace-based threshold of the CPSD matrices. The localization is then computed using our recently proposed diagonal unloading (DU) beamforming, which has low-complexity and high resolution. The DOA estimation is finally smoothed with a Kalman filer (KF). Experimental results on the LOCATA development dataset are reported in terms of the root mean square error (RMSE) for a 7-microphone linear array, the 12-microphone pseudo-spherical array integrated in a prototype head for a humanoid robot, and the 32-microphone spherical array. △ Less

Submitted 4 December, 2018; originally announced December 2018.

Comments: In Proceedings of the LOCATA Challenge Workshop - a satellite event of IWAENC 2018 (arXiv:1811.08482)

Report number: LOCATAchallenge/2018/07

arXiv:1803.10435 [pdf, other]

doi 10.1109/TMM.2018.2856094

Exploiting Recurrent Neural Networks and Leap Motion Controller for Sign Language and Semaphoric Gesture Recognition

Authors: Danilo Avola, Marco Bernardi, Luigi Cinque, Gian Luca Foresti, Cristiano Massaroni

Abstract: In human interactions, hands are a powerful way of expressing information that, in some cases, can be used as a valid substitute for voice, as it happens in Sign Language. Hand gesture recognition has always been an interesting topic in the areas of computer vision and multimedia. These gestures can be represented as sets of feature vectors that change over time. Recurrent Neural Networks (RNNs) a… ▽ More In human interactions, hands are a powerful way of expressing information that, in some cases, can be used as a valid substitute for voice, as it happens in Sign Language. Hand gesture recognition has always been an interesting topic in the areas of computer vision and multimedia. These gestures can be represented as sets of feature vectors that change over time. Recurrent Neural Networks (RNNs) are suited to analyse this type of sets thanks to their ability to model the long term contextual information of temporal sequences. In this paper, a RNN is trained by using as features the angles formed by the finger bones of human hands. The selected features, acquired by a Leap Motion Controller (LMC) sensor, have been chosen because the majority of human gestures produce joint movements that generate truly characteristic corners. A challenging subset composed by a large number of gestures defined by the American Sign Language (ASL) is used to test the proposed solution and the effectiveness of the selected angles. Moreover, the proposed method has been compared to other state of the art works on the SHREC dataset, thus demonstrating its superiority in hand gesture recognition accuracy. △ Less

Submitted 28 March, 2018; originally announced March 2018.

Journal ref: IEEE Transactions on Multimedia 21 (2019) 234-245

arXiv:1707.09173 [pdf, other]

Group Re-Identification via Unsupervised Transfer of Sparse Features Encoding

Authors: Giuseppe Lisanti, Niki Martinel, Alberto Del Bimbo, Gian Luca Foresti

Abstract: Person re-identification is best known as the problem of associating a single person that is observed from one or more disjoint cameras. The existing literature has mainly addressed such an issue, neglecting the fact that people usually move in groups, like in crowded scenarios. We believe that the additional information carried by neighboring individuals provides a relevant visual context that ca… ▽ More Person re-identification is best known as the problem of associating a single person that is observed from one or more disjoint cameras. The existing literature has mainly addressed such an issue, neglecting the fact that people usually move in groups, like in crowded scenarios. We believe that the additional information carried by neighboring individuals provides a relevant visual context that can be exploited to obtain a more robust match of single persons within the group. Despite this, re-identifying groups of people compound the common single person re-identification problems by introducing changes in the relative position of persons within the group and severe self-occlusions. In this paper, we propose a solution for group re-identification that grounds on transferring knowledge from single person re-identification to group re-identification by exploiting sparse dictionary learning. First, a dictionary of sparse atoms is learned using patches extracted from single person images. Then, the learned dictionary is exploited to obtain a sparsity-driven residual group representation, which is finally matched to perform the re-identification. Extensive experiments on the i-LIDS groups and two newly collected datasets show that the proposed solution outperforms state-of-the-art approaches. △ Less

Submitted 28 July, 2017; originally announced July 2017.

Comments: This paper has been accepted for publication at ICCV 2017

arXiv:1704.01426 [pdf, other]

doi 10.1109/TSMC.2018.2804766

The UMCD Dataset

Authors: Danilo Avola, Gian Luca Foresti, Niki Martinel, Daniele Pannone, Claudio Piciarelli

Abstract: In recent years, the technological improvements of low-cost small-scale Unmanned Aerial Vehicles (UAVs) are promoting an ever-increasing use of them in different tasks. In particular, the use of small-scale UAVs is useful in all these low-altitude tasks in which common UAVs cannot be adopted, such as recurrent comprehensive view of wide environments, frequent monitoring of military areas, real-tim… ▽ More In recent years, the technological improvements of low-cost small-scale Unmanned Aerial Vehicles (UAVs) are promoting an ever-increasing use of them in different tasks. In particular, the use of small-scale UAVs is useful in all these low-altitude tasks in which common UAVs cannot be adopted, such as recurrent comprehensive view of wide environments, frequent monitoring of military areas, real-time classification of static and moving entities (e.g., people, cars, etc.). These tasks can be supported by mosaicking and change detection algorithms achieved at low-altitude. Currently, public datasets for testing these algorithms are not available. This paper presents the UMCD dataset, the first collection of geo-referenced video sequences acquired at low-altitude for mosaicking and change detection purposes. Five reference scenarios are also reported. △ Less

Submitted 5 April, 2017; originally announced April 2017.

Comments: 3 pages, 5 figures

Journal ref: IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2018

arXiv:1612.06543 [pdf, other]

Wide-Slice Residual Networks for Food Recognition

Authors: Niki Martinel, Gian Luca Foresti, Christian Micheloni

Abstract: Food diary applications represent a tantalizing market. Such applications, based on image food recognition, opened to new challenges for computer vision and pattern recognition algorithms. Recent works in the field are focusing either on hand-crafted representations or on learning these by exploiting deep neural networks. Despite the success of such a last family of works, these generally exploit… ▽ More Food diary applications represent a tantalizing market. Such applications, based on image food recognition, opened to new challenges for computer vision and pattern recognition algorithms. Recent works in the field are focusing either on hand-crafted representations or on learning these by exploiting deep neural networks. Despite the success of such a last family of works, these generally exploit off-the shelf deep architectures to classify food dishes. Thus, the architectures are not cast to the specific problem. We believe that better results can be obtained if the deep architecture is defined with respect to an analysis of the food composition. Following such an intuition, this work introduces a new deep scheme that is designed to handle the food structure. Specifically, inspired by the recent success of residual deep network, we exploit such a learning scheme and introduce a slice convolution block to capture the vertical food layers. Outputs of the deep residual blocks are combined with the sliced convolution to produce the classification score for specific food categories. To evaluate our proposed architecture we have conducted experimental results on three benchmark datasets. Results demonstrate that our solution shows better performance with respect to existing approaches (e.g., a top-1 accuracy of 90.27% on the Food-101 challenging dataset). △ Less

Submitted 20 December, 2016; originally announced December 2016.

arXiv:1605.00810 [pdf, other]

doi 10.1109/TASLP.2017.2789321

Diagonal Unloading Beamforming for Source Localization

Authors: Daniele Salvati, Carlo Drioli, Gian Luca Foresti

Abstract: In sensor array beamforming methods, a class of algorithms commonly used to estimate the position of a radiating source, the diagonal loading of the beamformer covariance matrix is generally used to improve computational accuracy and localization robustness. This paper proposes a diagonal unloading (DU) method which extends the conventional response power beamforming method by imposing an addition… ▽ More In sensor array beamforming methods, a class of algorithms commonly used to estimate the position of a radiating source, the diagonal loading of the beamformer covariance matrix is generally used to improve computational accuracy and localization robustness. This paper proposes a diagonal unloading (DU) method which extends the conventional response power beamforming method by imposing an additional constraint to the covariance matrix of the array output vector. The regularization is obtained by subtracting a given amount of white noise from the main diagonal of the covariance matrix. Specifically, the DU beamformer aims at subtracting the signal subspace from the noisy signal space and it is computed by constraining the regularized covariance matrix to be negative definite. It is hence a data-dependent covariance matrix conditioning method. We show how to calculate precisely the unloading parameter, and we present an eigenvalue analysis for comparing the proposed DU beamforming, the minimum variance distortionless response (MVDR) filter and the multiple signal classification (MUSIC) method. Theoretical analysis and experiments with acoustic sources demonstrate that the DU beamformer localization performance is comparable to that of MVDR and MUSIC. Since the DU beamformer computational cost is comparable to that of a conventional beamformer, the proposed method can be attractive in array processing due to its simplicity, effectiveness and computational efficiency. △ Less

Submitted 3 May, 2016; originally announced May 2016.

Journal ref: IEEE/ACM Transactions on Audio, Speech and Language Processing, Volume 25, Issue 3, Pages 609-622 (2018)

arXiv:1512.03261 [pdf, other]

doi 10.1121/1.4974289

Exploiting a Geometrically Sampled Grid in the SRP-PHAT for Localization Improvement and Power Response Sensitivity Analysis

Authors: Daniele Salvati, Carlo Drioli, Gian Luca Foresti

Abstract: The steered response power phase transform (SRP-PHAT) is a beamformer method very attractive in acoustic localization applications due to its robustness in reverberant environments. This paper presents a spatial grid design procedure, called the geometrically sampled grid (GSG), which aims at computing the spatial grid by taking into account the discrete sampling of time difference of arrival (TDO… ▽ More The steered response power phase transform (SRP-PHAT) is a beamformer method very attractive in acoustic localization applications due to its robustness in reverberant environments. This paper presents a spatial grid design procedure, called the geometrically sampled grid (GSG), which aims at computing the spatial grid by taking into account the discrete sampling of time difference of arrival (TDOA) functions and the desired spatial resolution. A new SRP-PHAT localization algorithm based on the GSG method is also introduced. The proposed method exploits the intersections of the discrete hyperboloids representing the TDOA information domain of the sensor array, and projects the whole TDOA information on the space search grid. The GSG method thus allows to design the sampled spatial grid which represents the best search grid for a given sensor array, it allows to perform a sensitivity analysis of the array and to characterize its spatial localization accuracy, and it may assist the system designer in the reconfiguration of the array. Experimental results using both simulated data and real recordings show that the localization accuracy is substantially improved both for high and for low spatial resolution, and that it is closely related to the proposed power response sensitivity measure. △ Less

Submitted 7 March, 2018; v1 submitted 10 December, 2015; originally announced December 2015.

Journal ref: Journal of the Acoustical Society of America, Volume 141, Issue 1, Pages 586-601 (2017)

Showing 1–28 of 28 results for author: Foresti, G L