Showing 1–49 of 49 results for author: Baraldi, L

  1. arXiv:2405.13127  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    Towards Retrieval-Augmented Architectures for Image Captioning

    Authors: Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Alessandro Nicolosi, Rita Cucchiara

    Abstract: The objective of image captioning models is to bridge the gap between the visual and linguistic modalities by generating natural language descriptions that accurately reflect the content of input images. In recent years, researchers have leveraged deep learning-based models and made advances in the extraction of visual features and the design of multimodal connections to tackle this task. This wor…

    Submitted 21 May, 2024; originally announced May 2024.

    Comments: ACM Transactions on Multimedia Computing, Communications and Applications (2024)

  2. arXiv:2404.15406  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs

    Authors: Davide Caffagni, Federico Cocchi, Nicholas Moratelli, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Multimodal LLMs are the natural evolution of LLMs, and enlarge their capabilities so as to work beyond the pure textual modality. As research is being carried out to design novel architectures and vision-and-language adapters, in this paper we concentrate on endowing such models with the capability of answering questions that require external knowledge. Our approach, termed Wiki-LLaVA, aims at int…

    Submitted 22 May, 2024; v1 submitted 23 April, 2024; originally announced April 2024.

    Comments: CVPR 2024 Workshop on What is Next in Multimodal Foundation Models

  3. arXiv:2404.10054  [pdf, other]

    cs.CV cs.AI cs.CL cs.RO

    AIGeN: An Adversarial Approach for Instruction Generation in VLN

    Authors: Niyati Rawal, Roberto Bigazzi, Lorenzo Baraldi, Rita Cucchiara

    Abstract: In the last few years, the research interest in Vision-and-Language Navigation (VLN) has grown significantly. VLN is a challenging task that involves an agent following human instructions and navigating in a previously unknown environment to reach a specified goal. Recent work in literature focuses on different ways to augment the available datasets of instructions for improving navigation perform…

    Submitted 15 April, 2024; originally announced April 2024.

    Comments: Accepted to 7th Multimodal Learning and Applications Workshop (MULA 2024) at the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024

  4. arXiv:2404.06542  [pdf, other]

    cs.CV

    Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation

    Authors: Luca Barsellotti, Roberto Amoroso, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Open-vocabulary semantic segmentation aims at segmenting arbitrary categories expressed in textual form. Previous works have trained over large amounts of image-caption pairs to enforce pixel-level multimodal alignments. However, captions provide global information about the semantics of a given image but lack direct localization of individual concepts. Further, training on large-scale datasets in…

    Submitted 9 April, 2024; originally announced April 2024.

    Comments: CVPR 2024. Project page: https://aimagelab.github.io/freeda/

  5. arXiv:2403.07076  [pdf, other]

    cs.RO cs.AI cs.CV

    Mapping High-level Semantic Regions in Indoor Environments without Object Recognition

    Authors: Roberto Bigazzi, Lorenzo Baraldi, Shreyas Kousik, Rita Cucchiara, Marco Pavone

    Abstract: Robots require a semantic understanding of their surroundings to operate in an efficient and explainable way in human environments. In the literature, there has been an extensive focus on object labeling and exhaustive scene graph generation; less effort has been focused on the task of purely identifying and mapping large semantic regions. The present work proposes a method for semantic region map…

    Submitted 11 March, 2024; originally announced March 2024.

    Comments: Accepted by IEEE International Conference on Robotics and Automation (ICRA 2024)

  6. arXiv:2402.12451  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    The Revolution of Multimodal Large Language Models: A Survey

    Authors: Davide Caffagni, Federico Cocchi, Luca Barsellotti, Nicholas Moratelli, Sara Sarto, Lorenzo Baraldi, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara

    Abstract: Connecting text and visual modalities plays an essential role in generative intelligence. For this reason, inspired by the success of large language models, significant research efforts are being devoted to the development of Multimodal Large Language Models (MLLMs). These models can seamlessly integrate visual and textual modalities, while providing a dialogue-based interface and instruction-foll…

    Submitted 6 June, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

    Comments: ACL 2024 (Findings)

  7. arXiv:2311.16254  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models

    Authors: Samuele Poppi, Tobia Poppi, Federico Cocchi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Large-scale vision-and-language models, such as CLIP, are typically trained on web-scale data, which can introduce inappropriate content and lead to the development of unsafe and biased behavior. This, in turn, hampers their applicability in sensitive and trustworthy contexts and could raise significant concerns in their adoption. Our research introduces a novel approach to enhancing the safety of…

    Submitted 12 April, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

  8. arXiv:2308.12383  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning

    Authors: Manuele Barraco, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Image captioning, like many tasks involving vision and language, currently relies on Transformer-based architectures for extracting the semantics in an image and translating it into linguistically coherent descriptions. Although successful, the attention operator only considers a weighted summation of projections of the current input sample, therefore ignoring the relevant semantic information whi…

    Submitted 23 August, 2023; originally announced August 2023.

    Comments: ICCV 2023

  9. arXiv:2307.09416  [pdf, other]

    cs.CV cs.CL

    Let's ViCE! Mimicking Human Cognitive Behavior in Image Generation Evaluation

    Authors: Federico Betti, Jacopo Staiano, Lorenzo Baraldi, Lorenzo Baraldi, Rita Cucchiara, Nicu Sebe

    Abstract: Research in Image Generation has recently made significant progress, particularly boosted by the introduction of Vision-Language models which are able to produce high-quality visual content based on textual inputs. Despite ongoing advancements in terms of generation quality and realism, no methodical frameworks have been defined yet to quantitatively measure the quality of the generated content an…

    Submitted 19 July, 2023; v1 submitted 18 July, 2023; originally announced July 2023.

    Comments: Accepted as oral at ACM MultiMedia 2023 (Brave New Ideas track)

  10. arXiv:2306.07346  [pdf, other]

    cs.CV cs.AI cs.MM

    Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training

    Authors: Lorenzo Baraldi, Roberto Amoroso, Marcella Cornia, Lorenzo Baraldi, Andrea Pilzer, Rita Cucchiara

    Abstract: The use of self-supervised pre-training has emerged as a promising approach to enhance the performance of visual tasks such as image classification. In this context, recent approaches have employed the Masked Image Modeling paradigm, which pre-trains a backbone by reconstructing visual tokens associated with randomly masked image patches. This masking approach, however, introduces noise into the i…

    Submitted 12 June, 2023; originally announced June 2023.

  11. arXiv:2304.02049  [pdf, other]

    cs.CV cs.AI cs.LG

    Multi-Class Unlearning for Image Classification via Weight Filtering

    Authors: Samuele Poppi, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Machine Unlearning is an emerging paradigm for selectively removing the impact of training datapoints from a network. Unlike existing methods that target a limited subset or a single class, our framework unlearns all classes in a single round. We achieve this by modulating the network's components using memory matrices, enabling the network to demonstrate selective unlearning behavior for any clas…

    Submitted 8 June, 2024; v1 submitted 4 April, 2023; originally announced April 2023.

    Comments: IEEE Intelligent Systems (2024)

  12. arXiv:2304.01842  [pdf, other]

    cs.CV

    Evaluating Synthetic Pre-Training for Handwriting Processing Tasks

    Authors: Vittorio Pippi, Silvia Cascianelli, Lorenzo Baraldi, Rita Cucchiara

    Abstract: In this work, we explore massive pre-training on synthetic word images for enhancing the performance on four benchmark downstream handwriting analysis tasks. To this end, we build a large synthetic dataset of word images rendered in several handwriting fonts, which offers a complete supervision signal. We use it to train a simple convolutional neural network (ConvNet) with a fully supervised objec…

    Submitted 4 April, 2023; originally announced April 2023.

  13. arXiv:2304.00500  [pdf, other]

    cs.CV cs.AI cs.MM

    Parents and Children: Distinguishing Multimodal DeepFakes from Natural Images

    Authors: Roberto Amoroso, Davide Morelli, Marcella Cornia, Lorenzo Baraldi, Alberto Del Bimbo, Rita Cucchiara

    Abstract: Recent advancements in diffusion models have enabled the generation of realistic deepfakes from textual prompts in natural language. While these models have numerous benefits across various sectors, they have also raised concerns about the potential misuse of fake images and cast new pressures on fake image detection. In this work, we pioneer a systematic study on deepfake detection generated by s…

    Submitted 21 May, 2024; v1 submitted 2 April, 2023; originally announced April 2023.

    Comments: ACM Transactions on Multimedia Computing, Communications and Applications (2024)

  14. arXiv:2303.12112  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation

    Authors: Sara Sarto, Manuele Barraco, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: The CLIP model has been recently proven to be very effective for a variety of cross-modal tasks, including the evaluation of captions generated from vision-and-language architectures. In this paper, we propose a new recipe for a contrastive-based evaluation metric for image captioning, namely Positive-Augmented Contrastive learning Score (PAC-S), that in a novel way unifies the learning of a contr…

    Submitted 20 July, 2023; v1 submitted 21 March, 2023; originally announced March 2023.

    Comments: CVPR 2023 (highlight paper)

  15. arXiv:2301.07150  [pdf, other]

    cs.RO cs.AI cs.CL cs.CV

    Embodied Agents for Efficient Exploration and Smart Scene Description

    Authors: Roberto Bigazzi, Marcella Cornia, Silvia Cascianelli, Lorenzo Baraldi, Rita Cucchiara

    Abstract: The development of embodied agents that can communicate with humans in natural language has gained increasing interest over the last years, as it facilitates the diffusion of robotic platforms in human-populated environments. As a step towards this objective, in this work, we tackle a setting for visual navigation in which an autonomous agent needs to explore and map an unseen indoor environment w…

    Submitted 17 January, 2023; originally announced January 2023.

    Comments: Accepted by IEEE International Conference on Robotics and Automation (ICRA 2023)

  16. arXiv:2208.08109  [pdf, other]

    cs.CV

    Boosting Modern and Historical Handwritten Text Recognition with Deformable Convolutions

    Authors: Silvia Cascianelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Handwritten Text Recognition (HTR) in free-layout pages is a challenging image understanding task that can provide a relevant boost to the digitization of handwritten documents and reuse of their content. The task becomes even more challenging when dealing with historical documents due to the variability of the writing style and degradation of the page quality. State-of-the-art HTR approaches typi…

    Submitted 17 August, 2022; originally announced August 2022.

    Journal ref: International Journal on Document Analysis and Recognition (IJDAR), 2022, 1-11

  17. arXiv:2208.07682  [pdf, other]

    cs.CV cs.DL

    The LAM Dataset: A Novel Benchmark for Line-Level Handwritten Text Recognition

    Authors: Silvia Cascianelli, Vittorio Pippi, Martin Maarand, Marcella Cornia, Lorenzo Baraldi, Christopher Kermorvant, Rita Cucchiara

    Abstract: Handwritten Text Recognition (HTR) is an open problem at the intersection of Computer Vision and Natural Language Processing. The main challenges, when dealing with historical manuscripts, are due to the preservation of the paper support, the variability of the handwriting -- even of the same author over a wide time-span -- and the scarcity of data from ancient, poorly represented languages. With…

    Submitted 16 August, 2022; originally announced August 2022.

    Comments: Accepted at ICPR 2022

  18. arXiv:2207.14757  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval

    Authors: Nicola Messina, Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Fabrizio Falchi, Giuseppe Amato, Rita Cucchiara

    Abstract: Image-text matching is gaining a leading role among tasks involving the joint understanding of vision and language. In literature, this task is often used as a pre-training objective to forge architectures able to jointly deal with images and texts. Nonetheless, it has a direct downstream application: cross-modal retrieval, which consists in finding images related to a given query text or vice-ver…

    Submitted 29 July, 2022; originally announced July 2022.

    Comments: CBMI 2022

  19. arXiv:2207.13162  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    Retrieval-Augmented Transformer for Image Captioning

    Authors: Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Image captioning models aim at connecting Vision and Language by providing natural language descriptions of input images. In the past few years, the task has been tackled by learning parametric models and proposing visual feature extraction advancements or by modeling better multi-modal connections. In this paper, we investigate the development of an image captioning approach with a kNN memory, wi…

    Submitted 22 August, 2022; v1 submitted 26 July, 2022; originally announced July 2022.

    Comments: CBMI 2022

  20. Embodied Navigation at the Art Gallery

    Authors: Roberto Bigazzi, Federico Landi, Silvia Cascianelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Embodied agents, trained to explore and navigate indoor photorealistic environments, have achieved impressive results on standard datasets and benchmarks. So far, experiments and evaluations have involved domestic and working scenes like offices, flats, and houses. In this paper, we build and release a new 3D space with unique characteristics: the one of a complete art museum. We name this environ…

    Submitted 19 April, 2022; originally announced April 2022.

    Comments: Accepted by 21st International Conference on Image Analysis and Processing (ICIAP 2021)

  21. Spot the Difference: A Novel Task for Embodied Agents in Changing Environments

    Authors: Federico Landi, Roberto Bigazzi, Marcella Cornia, Silvia Cascianelli, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Embodied AI is a recent research area that aims at creating intelligent agents that can move and operate inside an environment. Existing approaches in this field demand the agents to act in completely new and unexplored scenes. However, this setting is far from realistic use cases that instead require executing multiple tasks in the same environment. Even if the environment changes over time, the…

    Submitted 18 April, 2022; originally announced April 2022.

    Comments: Accepted by 26th International Conference on Pattern Recognition (ICPR 2022)

  22. arXiv:2202.10492  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    CaMEL: Mean Teacher Learning for Image Captioning

    Authors: Manuele Barraco, Matteo Stefanini, Marcella Cornia, Silvia Cascianelli, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Describing images in natural language is a fundamental step towards the automatic modeling of connections between the visual and textual modalities. In this paper we present CaMEL, a novel Transformer-based architecture for image captioning. Our proposed approach leverages the interaction of two interconnected language models that learn from each other during the training phase. The interplay betw…

    Submitted 21 February, 2022; originally announced February 2022.

  23. arXiv:2111.12727  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets

    Authors: Marcella Cornia, Lorenzo Baraldi, Giuseppe Fiameni, Rita Cucchiara

    Abstract: This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources, containing both human-annotated and web-collected captions. Large-scale datasets with noisy image-text pairs, indeed, provide a sub-optimal source of supervision because of their low-quality descriptive style, while human-annotated datasets are cleaner but smaller in scale. To…

    Submitted 30 November, 2023; v1 submitted 24 November, 2021; originally announced November 2021.

    Comments: Accepted to IJCV

  24. arXiv:2109.08521  [pdf, other]

    cs.RO cs.AI cs.CV

    Focus on Impact: Indoor Exploration with Intrinsic Motivation

    Authors: Roberto Bigazzi, Federico Landi, Silvia Cascianelli, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara

    Abstract: Exploration of indoor environments has recently experienced a significant interest, also thanks to the introduction of deep neural agents built in a hierarchical fashion and trained with Deep Reinforcement Learning (DRL) on simulated environments. Current state-of-the-art methods employ a dense extrinsic reward that requires the complete a priori knowledge of the layout of the training environment…

    Submitted 4 February, 2022; v1 submitted 14 September, 2021; originally announced September 2021.

    Comments: Published in IEEE Robotics and Automation Letters. To appear in ICRA 2022

    Journal ref: IEEE Robotics and Automation Letters (Volume: 7, Issue: 2, April 2022)

  25. arXiv:2109.00020  [pdf, other]

    cs.LG cs.CL cs.CV cs.NE

    Working Memory Connections for LSTM

    Authors: Federico Landi, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara

    Abstract: Recurrent Neural Networks with Long Short-Term Memory (LSTM) make use of gating mechanisms to mitigate exploding and vanishing gradients when learning long-term dependencies. For this reason, LSTMs and other gated RNNs are widely adopted, being the standard de facto for many sequence modeling tasks. Although the memory cell inside the LSTM contains essential information, it is not allowed to influ…

    Submitted 31 August, 2021; originally announced September 2021.

    Comments: Accepted for publication in Neural Networks

  26. arXiv:2107.06912  [pdf, other]

    cs.CV cs.CL

    From Show to Tell: A Survey on Deep Learning-based Image Captioning

    Authors: Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Silvia Cascianelli, Giuseppe Fiameni, Rita Cucchiara

    Abstract: Connecting Vision and Language plays an essential role in Generative Intelligence. For this reason, large research efforts have been devoted to image captioning, i.e. describing images with syntactically and semantically meaningful sentences. Starting from 2015 the task has generally been addressed with pipelines composed of a visual encoder and a language model for text generation. During these y…

    Submitted 30 November, 2021; v1 submitted 14 July, 2021; originally announced July 2021.

  27. Learning to Select: A Fully Attentive Approach for Novel Object Captioning

    Authors: Marco Cagrandi, Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Image captioning models have lately shown impressive results when applied to standard datasets. Switching to real-life scenarios, however, constitutes a challenge due to the larger variety of visual concepts which are not covered in existing training sets. For this reason, novel object captioning (NOC) has recently emerged as a paradigm to test captioning models on objects which are unseen during…

    Submitted 2 June, 2021; originally announced June 2021.

    Comments: ICMR 2021

  28. Out of the Box: Embodied Navigation in the Real World

    Authors: Roberto Bigazzi, Federico Landi, Marcella Cornia, Silvia Cascianelli, Lorenzo Baraldi, Rita Cucchiara

    Abstract: The research field of Embodied AI has witnessed substantial progress in visual navigation and exploration thanks to powerful simulating platforms and the availability of 3D data of indoor and photorealistic environments. These two factors have opened the doors to a new generation of intelligent agents capable of achieving nearly perfect PointGoal Navigation. However, such architectures are commonl…

    Submitted 12 May, 2021; originally announced May 2021.

  29. arXiv:2104.10252  [pdf, other]

    cs.CV

    Revisiting The Evaluation of Class Activation Mapping for Explainability: A Novel Metric and Experimental Analysis

    Authors: Samuele Poppi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: As the request for deep learning solutions increases, the need for explainability is even more fundamental. In this setting, particular attention has been given to visualization techniques, that try to attribute the right relevance to each input pixel with respect to the output of the network. In this paper, we focus on Class Activation Mapping (CAM) approaches, which provide an effective visualiz…

    Submitted 20 April, 2021; originally announced April 2021.

    Comments: CVPR 2021 Workshop on Responsible Computer Vision

  30. arXiv:2102.07624  [pdf, other]

    cs.CV

    RMS-Net: Regression and Masking for Soccer Event Spotting

    Authors: Matteo Tomei, Lorenzo Baraldi, Simone Calderara, Simone Bronzin, Rita Cucchiara

    Abstract: The recently proposed action spotting task consists in finding the exact timestamp in which an event occurs. This task fits particularly well for soccer videos, where events correspond to salient actions strictly defined by soccer rules (a goal occurs when the ball crosses the goal line). In this paper, we devise a lightweight and modular network for action spotting, which can simultaneously predi…

    Submitted 15 February, 2021; originally announced February 2021.

  31. arXiv:2007.10243  [pdf, other]

    cs.CV

    Inter-Homines: Distance-Based Risk Estimation for Human Safety

    Authors: Matteo Fabbri, Fabio Lanzi, Riccardo Gasparini, Simone Calderara, Lorenzo Baraldi, Rita Cucchiara

    Abstract: In this document, we report our proposal for modeling the risk of possible contagiousity in a given area monitored by RGB cameras where people freely move and interact. Our system, called Inter-Homines, evaluates in real-time the contagion risk in a monitored area by analyzing video streams: it is able to locate people in 3D space, calculate interpersonal distances and predict risk levels by build…

    Submitted 20 July, 2020; originally announced July 2020.

  32. arXiv:2007.07268  [pdf, other]

    cs.CV cs.AI cs.CL cs.RO

    Explore and Explain: Self-supervised Navigation and Recounting

    Authors: Roberto Bigazzi, Federico Landi, Marcella Cornia, Silvia Cascianelli, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Embodied AI has been recently gaining attention as it aims to foster the development of autonomous and intelligent agents. In this paper, we devise a novel embodied setting in which an agent needs to explore a previously unknown environment while recounting what it sees during the path. In this context, the agent needs to navigate the environment driven by an exploration goal, select proper moment…

    Submitted 14 July, 2020; originally announced July 2020.

    Comments: ICPR 2020

  33. arXiv:2004.13073  [pdf, other]

    cs.CV cs.CL cs.LG

    A Novel Attention-based Aggregation Function to Combine Vision and Language

    Authors: Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: The joint understanding of vision and language has been recently gaining a lot of attention in both the Computer Vision and Natural Language Processing communities, with the emergence of tasks such as image captioning, image-text matching, and visual question answering. As both images and text can be encoded as sets or sequences of elements -- like regions and words -- proper reduction functions a…

    Submitted 13 July, 2020; v1 submitted 27 April, 2020; originally announced April 2020.

    Comments: ICPR 2020

  34. arXiv:1912.08226  [pdf, other]

    cs.CV cs.CL

    Meshed-Memory Transformer for Image Captioning

    Authors: Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Transformer-based architectures represent the state of the art in sequence modeling tasks like machine translation and language understanding. Their applicability to multi-modal contexts like image captioning, however, is still largely under-explored. With the aim of filling this gap, we present M$^2$ - a Meshed Transformer with Memory for Image Captioning. The architecture improves both the image…

    Submitted 20 March, 2020; v1 submitted 17 December, 2019; originally announced December 2019.

    Comments: CVPR 2020

  35. Video action detection by learning graph-based spatio-temporal interactions

    Authors: Matteo Tomei, Lorenzo Baraldi, Simone Calderara, Simone Bronzin, Rita Cucchiara

    Abstract: Action Detection is a complex task that aims to detect and classify human actions in video clips. Typically, it has been addressed by processing fine-grained features extracted from a video classification backbone. Recently, thanks to the robustness of object and people detectors, a deeper focus has been added on relationship modelling. Following this line, we propose a graph-based framework to le…

    Submitted 1 March, 2021; v1 submitted 9 December, 2019; originally announced December 2019.

    Comments: This is the authors version of an article accepted for publication in Computer Vision and Image Understanding (CVIU), available online February 2021

    Journal ref: Computer Vision and Image Understanding (CVIU), 2021

  36. arXiv:1911.12377  [pdf, other]

    cs.CV cs.CL cs.LG

    Multimodal Attention Networks for Low-Level Vision-and-Language Navigation

    Authors: Federico Landi, Lorenzo Baraldi, Marcella Cornia, Massimiliano Corsini, Rita Cucchiara

    Abstract: Vision-and-Language Navigation (VLN) is a challenging task in which an agent needs to follow a language-specified path to reach a target destination. The goal gets even harder as the actions available to the agent get simpler and move towards low-level, atomic interactions with the environment. This setting takes the name of low-level VLN. In this paper, we strive for the creation of an agent able…

    Submitted 30 July, 2021; v1 submitted 27 November, 2019; originally announced November 2019.

    Comments: Computer Vision and Image Understanding (CVIU)

  37. arXiv:1910.02974  [pdf, other]

    cs.CV cs.CL cs.RO

    SMArT: Training Shallow Memory-aware Transformers for Robotic Explainability

    Authors: Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: The ability to generate natural language explanations conditioned on the visual perception is a crucial step towards autonomous agents which can explain themselves and communicate with humans. While the research efforts in image and video captioning are giving promising results, this is often done at the expense of the computational requirements of the approaches, limiting their applicability to r…

    Submitted 9 March, 2020; v1 submitted 7 October, 2019; originally announced October 2019.

    Comments: ICRA 2020

  38. arXiv:1907.02985  [pdf, other]

    cs.CV

    Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters

    Authors: Federico Landi, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara

    Abstract: In Vision-and-Language Navigation (VLN), an embodied agent needs to reach a target destination with the only guidance of a natural language instruction. To explore the environment and progress towards the target location, the agent must perform a series of low-level actions, such as rotate, before stepping ahead. In this paper, we propose to exploit dynamic convolutional filters to encode the visu…

    Submitted 25 September, 2019; v1 submitted 5 July, 2019; originally announced July 2019.

    Comments: BMVC 2019 (Oral). Code is available at https://github.com/aimagelab/DynamicConv-agent

  39. arXiv:1903.01930  [pdf, other]

    cs.LG cs.AI cs.DC stat.ML

    A Deep Learning based approach to VM behavior identification in cloud systems

    Authors: Matteo Stefanini, Riccardo Lancellotti, Lorenzo Baraldi, Simone Calderara

    Abstract: Cloud computing data centers are growing in size and complexity to the point where monitoring and management of the infrastructure become a challenge due to scalability issues. A possible approach to cope with the size of such data centers is to identify VMs exhibiting a similar behavior. Existing literature demonstrated that clustering together VMs that show a similar behavior may improve the sca…

    Submitted 5 March, 2019; originally announced March 2019.

    Comments: Accepted at CLOSER2019

  40. M-VAD Names: a Dataset for Video Captioning with Naming

    Authors: Stefano Pini, Marcella Cornia, Federico Bolelli, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Current movie captioning architectures are not capable of mentioning characters with their proper name, replacing them with a generic "someone" tag. The lack of movie description datasets with characters' visual annotations surely plays a relevant role in this shortage. Recently, we proposed to extend the M-VAD dataset by introducing such information. In this paper, we present an improved version…

    Submitted 4 March, 2019; originally announced March 2019.

    Comments: Source Code: https://github.com/aimagelab/mvad-names-dataset - Video Demo: https://youtu.be/dOvtAXbOOH4

    Journal ref: Multimedia Tools and Applications (2018)

  41. arXiv:1811.10666  [pdf, other]

    cs.CV

    Art2Real: Unfolding the Reality of Artworks via Semantically-Aware Image-to-Image Translation

    Authors: Matteo Tomei, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: The applicability of computer vision to real paintings and artworks has been rarely investigated, even though a vast heritage would greatly benefit from techniques which can understand and process data from the artistic domain. This is partially due to the small amount of annotated artistic data, which is not even comparable to that of natural images captured by cameras. In this paper, we propose…

    Submitted 17 May, 2019; v1 submitted 26 November, 2018; originally announced November 2018.

    Comments: CVPR 2019

  42. arXiv:1811.10652  [pdf, other]

    cs.CV cs.CL

    Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions

    Authors: Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

    Abstract: Current captioning approaches can describe images using black-box architectures whose behavior is hardly controllable and explainable from the exterior. As an image can be described in infinite ways depending on the goal and the context at hand, a higher degree of controllability is needed to apply captioning algorithms in complex scenarios. In this paper, we introduce a novel framework for image…

    Submitted 9 May, 2019; v1 submitted 26 November, 2018; originally announced November 2018.

    Comments: CVPR 2019

  43. arXiv:1706.08474  [pdf, other]

    cs.CV

    Paying More Attention to Saliency: Image Captioning with Saliency and Context Attention

    Authors: Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, Rita Cucchiara

    Abstract: Image captioning has been recently gaining a lot of attention thanks to the impressive achievements shown by deep captioning architectures, which combine Convolutional Neural Networks to extract image representations, and Recurrent Neural Networks to generate the corresponding captions. At the same time, a significant research effort has been dedicated to the development of saliency prediction mod…

    Submitted 21 May, 2018; v1 submitted 26 June, 2017; originally announced June 2017.

    Comments: ACM Transactions on Multimedia Computing, Communications and Applications, Vol. 14, No. 2, Article 48

  44. Predicting Human Eye Fixations via an LSTM-based Saliency Attentive Model

    Authors: Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, Rita Cucchiara

    Abstract: Data-driven saliency has recently gained a lot of attention thanks to the use of Convolutional Neural Networks for predicting gaze fixations. In this paper we go beyond standard approaches to saliency prediction, in which gaze maps are computed with a feed-forward network, and present a novel model which can predict accurate saliency maps by incorporating neural attentive mechanisms. The core of o…

    Submitted 9 July, 2018; v1 submitted 29 November, 2016; originally announced November 2016.

    Comments: IEEE Transactions on Image Processing 2018

  45. Hierarchical Boundary-Aware Neural Encoder for Video Captioning

    Authors: Lorenzo Baraldi, Costantino Grana, Rita Cucchiara

    Abstract: The use of Recurrent Neural Networks for video captioning has recently gained a lot of attention, since they can be used both to encode the input video and to generate the corresponding description. In this paper, we present a recurrent video encoding scheme which can discover and leverage the hierarchical structure of the video. Unlike the classical encoder-decoder approach, in which a video is e…

    Submitted 10 April, 2017; v1 submitted 28 November, 2016; originally announced November 2016.

    Comments: CVPR 2017

  46. arXiv:1610.01376  [pdf, other]

    cs.CV

    Recognizing and Presenting the Storytelling Video Structure with Deep Multimodal Networks

    Authors: Lorenzo Baraldi, Costantino Grana, Rita Cucchiara

    Abstract: This paper presents a novel approach for temporal and semantic segmentation of edited videos into meaningful segments, from the point of view of the storytelling structure. The objective is to decompose a long video into more manageable sequences, which can in turn be used to retrieve the most significant parts of it given a textual query and to provide an effective summarization. Previous video d…

    Submitted 10 November, 2016; v1 submitted 5 October, 2016; originally announced October 2016.

  47. arXiv:1609.01064  [pdf, other]

    cs.CV

    A Deep Multi-Level Network for Saliency Prediction

    Authors: Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, Rita Cucchiara

    Abstract: This paper presents a novel deep architecture for saliency prediction. Current state of the art models for saliency prediction employ Fully Convolutional networks that perform a non-linear combination of features extracted from the last convolutional layer to predict saliency maps. We propose an architecture which, instead, combines features extracted at different levels of a Convolutional Neural…

    Submitted 18 July, 2017; v1 submitted 5 September, 2016; originally announced September 2016.

    Comments: International Conference on Pattern Recognition (ICPR), 2016

  48. arXiv:1604.02546  [pdf, other]

    cs.CV cs.IR cs.MM

    Scene-driven Retrieval in Edited Videos using Aesthetic and Semantic Deep Features

    Authors: Lorenzo Baraldi, Costantino Grana, Rita Cucchiara

    Abstract: This paper presents a novel retrieval pipeline for video collections, which aims to retrieve the most significant parts of an edited video for a given query, and represent them with thumbnails which are at the same time semantically meaningful and aesthetically remarkable. Videos are first segmented into coherent and story-telling scenes, then a retrieval algorithm based on deep learning is propos…

    Submitted 9 April, 2016; originally announced April 2016.

    Comments: ICMR 2016

  49. A Deep Siamese Network for Scene Detection in Broadcast Videos

    Authors: Lorenzo Baraldi, Costantino Grana, Rita Cucchiara

    Abstract: We present a model that automatically divides broadcast videos into coherent scenes by learning a distance measure between shots. Experiments are performed to demonstrate the effectiveness of our approach by comparing our algorithm against recent proposals for automatic scene segmentation. We also propose an improved performance measure that aims to reduce the gap between numerical evaluation and…

    Submitted 29 October, 2015; originally announced October 2015.

    Comments: ACM Multimedia 2015