subscribe to arXiv mailings

arXiv:2406.19859 [pdf, other]

MetaDesigner: Advancing Artistic Typography through AI-Driven, User-Centric, and Multilingual WordArt Synthesis

Authors: Jun-Yan He, Zhi-Qi Cheng, Chenyang Li, Jingdong Sun, Qi He, Wangmeng Xiang, Hanyuan Chen, Jin-Peng Lan, Xianhui Lin, Kang Zhu, Bin Luo, Yifeng Geng, Xuansong Xie, Alexander G. Hauptmann

Abstract: MetaDesigner revolutionizes artistic typography synthesis by leveraging the strengths of Large Language Models (LLMs) to drive a design paradigm centered around user engagement. At the core of this framework lies a multi-agent system comprising the Pipeline, Glyph, and Texture agents, which collectively enable the creation of customized WordArt, ranging from semantic enhancements to the imposition… ▽ More MetaDesigner revolutionizes artistic typography synthesis by leveraging the strengths of Large Language Models (LLMs) to drive a design paradigm centered around user engagement. At the core of this framework lies a multi-agent system comprising the Pipeline, Glyph, and Texture agents, which collectively enable the creation of customized WordArt, ranging from semantic enhancements to the imposition of complex textures. MetaDesigner incorporates a comprehensive feedback mechanism that harnesses insights from multimodal models and user evaluations to refine and enhance the design process iteratively. Through this feedback loop, the system adeptly tunes hyperparameters to align with user-defined stylistic and thematic preferences, generating WordArt that not only meets but exceeds user expectations of visual appeal and contextual relevance. Empirical validations highlight MetaDesigner's capability to effectively serve diverse WordArt applications, consistently producing aesthetically appealing and context-sensitive results. △ Less

Submitted 4 July, 2024; v1 submitted 28 June, 2024; originally announced June 2024.

Comments: 18 pages, 16 figures, Project: https://modelscope.cn/studios/WordArt/WordArt

arXiv:2406.19236 [pdf, other]

Human-Aware Vision-and-Language Navigation: Bridging Simulation to Reality with Dynamic Human Interactions

Authors: Minghan Li, Heng Li, Zhi-Qi Cheng, Yifei Dong, Yuxuan Zhou, Jun-Yan He, Qi Dai, Teruko Mitamura, Alexander G. Hauptmann

Abstract: Vision-and-Language Navigation (VLN) aims to develop embodied agents that navigate based on human instructions. However, current VLN frameworks often rely on static environments and optimal expert supervision, limiting their real-world applicability. To address this, we introduce Human-Aware Vision-and-Language Navigation (HA-VLN), extending traditional VLN by incorporating dynamic human activitie… ▽ More Vision-and-Language Navigation (VLN) aims to develop embodied agents that navigate based on human instructions. However, current VLN frameworks often rely on static environments and optimal expert supervision, limiting their real-world applicability. To address this, we introduce Human-Aware Vision-and-Language Navigation (HA-VLN), extending traditional VLN by incorporating dynamic human activities and relaxing key assumptions. We propose the Human-Aware 3D (HA3D) simulator, which combines dynamic human activities with the Matterport3D dataset, and the Human-Aware Room-to-Room (HA-R2R) dataset, extending R2R with human activity descriptions. To tackle HA-VLN challenges, we present the Expert-Supervised Cross-Modal (VLN-CM) and Non-Expert-Supervised Decision Transformer (VLN-DT) agents, utilizing cross-modal fusion and diverse training strategies for effective navigation in dynamic human environments. A comprehensive evaluation, including metrics considering human activities, and systematic analysis of HA-VLN's unique challenges, underscores the need for further research to enhance HA-VLN agents' real-world robustness and adaptability. Ultimately, this work provides benchmarks and insights for future research on embodied AI and Sim2Real transfer, paving the way for more realistic and applicable VLN systems in human-populated environments. △ Less

Submitted 4 July, 2024; v1 submitted 27 June, 2024; originally announced June 2024.

Comments: 30 pages, 18 figures, Project Page: https://lpercc.github.io/HA3D_simulator/

arXiv:2406.11161 [pdf, other]

Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning

Authors: Zebang Cheng, Zhi-Qi Cheng, Jun-Yan He, Jingdong Sun, Kai Wang, Yuxiang Lin, Zheng Lian, Xiaojiang Peng, Alexander Hauptmann

Abstract: Accurate emotion perception is crucial for various applications, including human-computer interaction, education, and counseling. However, traditional single-modality approaches often fail to capture the complexity of real-world emotional expressions, which are inherently multimodal. Moreover, existing Multimodal Large Language Models (MLLMs) face challenges in integrating audio and recognizing su… ▽ More Accurate emotion perception is crucial for various applications, including human-computer interaction, education, and counseling. However, traditional single-modality approaches often fail to capture the complexity of real-world emotional expressions, which are inherently multimodal. Moreover, existing Multimodal Large Language Models (MLLMs) face challenges in integrating audio and recognizing subtle facial micro-expressions. To address this, we introduce the MERR dataset, containing 28,618 coarse-grained and 4,487 fine-grained annotated samples across diverse emotional categories. This dataset enables models to learn from varied scenarios and generalize to real-world applications. Furthermore, we propose Emotion-LLaMA, a model that seamlessly integrates audio, visual, and textual inputs through emotion-specific encoders. By aligning features into a shared space and employing a modified LLaMA model with instruction tuning, Emotion-LLaMA significantly enhances both emotional recognition and reasoning capabilities. Extensive evaluations show Emotion-LLaMA outperforms other MLLMs, achieving top scores in Clue Overlap (7.83) and Label Overlap (6.25) on EMER, an F1 score of 0.9036 on MER2023 challenge, and the highest UAR (45.59) and WAR (59.37) in zero-shot evaluations on DFEW dataset. △ Less

Submitted 16 June, 2024; originally announced June 2024.

Comments: 37 pages, 12 figures, Project: https://github.com/ZebangCheng/Emotion-LLaMA, Demo: https://huggingface.co/spaces/ZebangCheng/Emotion-LLaMA

arXiv:2405.16213 [pdf, other]

Learning Visual-Semantic Subspace Representations for Propositional Reasoning

Authors: Gabriel Moreira, Alexander Hauptmann, Manuel Marques, João Paulo Costeira

Abstract: Learning representations that capture rich semantic relationships and accommodate propositional calculus poses a significant challenge. Existing approaches are either contrastive, lacking theoretical guarantees, or fall short in effectively representing the partial orders inherent to rich visual-semantic hierarchies. In this paper, we propose a novel approach for learning visual representations th… ▽ More Learning representations that capture rich semantic relationships and accommodate propositional calculus poses a significant challenge. Existing approaches are either contrastive, lacking theoretical guarantees, or fall short in effectively representing the partial orders inherent to rich visual-semantic hierarchies. In this paper, we propose a novel approach for learning visual representations that not only conform to a specified semantic structure but also facilitate probabilistic propositional reasoning. Our approach is based on a new nuclear norm-based loss. We show that its minimum encodes the spectral geometry of the semantics in a subspace lattice, where logical propositions can be represented by projection operators. △ Less

Submitted 25 May, 2024; originally announced May 2024.

arXiv:2405.10952 [pdf, other]

VICAN: Very Efficient Calibration Algorithm for Large Camera Networks

Authors: Gabriel Moreira, Manuel Marques, João Paulo Costeira, Alexander Hauptmann

Abstract: The precise estimation of camera poses within large camera networks is a foundational problem in computer vision and robotics, with broad applications spanning autonomous navigation, surveillance, and augmented reality. In this paper, we introduce a novel methodology that extends state-of-the-art Pose Graph Optimization (PGO) techniques. Departing from the conventional PGO paradigm, which primaril… ▽ More The precise estimation of camera poses within large camera networks is a foundational problem in computer vision and robotics, with broad applications spanning autonomous navigation, surveillance, and augmented reality. In this paper, we introduce a novel methodology that extends state-of-the-art Pose Graph Optimization (PGO) techniques. Departing from the conventional PGO paradigm, which primarily relies on camera-camera edges, our approach centers on the introduction of a dynamic element - any rigid object free to move in the scene - whose pose can be reliably inferred from a single image. Specifically, we consider the bipartite graph encompassing cameras, object poses evolving dynamically, and camera-object relative transformations at each time step. This shift not only offers a solution to the challenges encountered in directly estimating relative poses between cameras, particularly in adverse environments, but also leverages the inclusion of numerous object poses to ameliorate and integrate errors, resulting in accurate camera pose estimates. Though our framework retains compatibility with traditional PGO solvers, its efficacy benefits from a custom-tailored optimization scheme. To this end, we introduce an iterative primal-dual algorithm, capable of handling large graphs. Empirical benchmarks, conducted on a new dataset of simulated indoor environments, substantiate the efficacy and efficiency of our approach. △ Less

Submitted 25 March, 2024; originally announced May 2024.

Comments: To appear at the IEEE International Conference on Robotics and Automation (ICRA), 2024

arXiv:2404.18398 [pdf, other]

MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis

Authors: Xiang Li, Zhi-Qi Cheng, Jun-Yan He, Xiaojiang Peng, Alexander G. Hauptmann

Abstract: Emotional Text-to-Speech (E-TTS) synthesis has gained significant attention in recent years due to its potential to enhance human-computer interaction. However, current E-TTS approaches often struggle to capture the complexity of human emotions, primarily relying on oversimplified emotional labels or single-modality inputs. To address these limitations, we propose the Multimodal Emotional Text-to-… ▽ More Emotional Text-to-Speech (E-TTS) synthesis has gained significant attention in recent years due to its potential to enhance human-computer interaction. However, current E-TTS approaches often struggle to capture the complexity of human emotions, primarily relying on oversimplified emotional labels or single-modality inputs. To address these limitations, we propose the Multimodal Emotional Text-to-Speech System (MM-TTS), a unified framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech. MM-TTS consists of two key components: (1) the Emotion Prompt Alignment Module (EP-Align), which employs contrastive learning to align emotional features across text, audio, and visual modalities, ensuring a coherent fusion of multimodal information; and (2) the Emotion Embedding-Induced TTS (EMI-TTS), which integrates the aligned emotional embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions. Extensive evaluations across diverse datasets demonstrate the superior performance of MM-TTS compared to traditional E-TTS models. Objective metrics, including Word Error Rate (WER) and Character Error Rate (CER), show significant improvements on ESD dataset, with MM-TTS achieving scores of 7.35% and 3.07%, respectively. Subjective assessments further validate that MM-TTS generates speech with emotional fidelity and naturalness comparable to human speech. Our code and pre-trained models are publicly available at https://anonymous.4open.science/r/MMTTS-D214 △ Less

Submitted 28 April, 2024; originally announced April 2024.

arXiv:2404.01258 [pdf, other]

Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward

Authors: Ruohong Zhang, Liangke Gui, Zhiqing Sun, Yihao Feng, Keyang Xu, Yuanhan Zhang, Di Fu, Chunyuan Li, Alexander Hauptmann, Yonatan Bisk, Yiming Yang

Abstract: Preference modeling techniques, such as direct preference optimization (DPO), has shown effective in enhancing the generalization abilities of large language model (LLM). However, in tasks involving video instruction-following, providing informative feedback, especially for detecting hallucinations in generated responses, remains a significant challenge. Previous studies have explored using large… ▽ More Preference modeling techniques, such as direct preference optimization (DPO), has shown effective in enhancing the generalization abilities of large language model (LLM). However, in tasks involving video instruction-following, providing informative feedback, especially for detecting hallucinations in generated responses, remains a significant challenge. Previous studies have explored using large large multimodal models (LMMs) as reward models to guide preference modeling, but their ability to accurately assess the factuality of generated responses compared to corresponding videos has not been conclusively established. This paper introduces a novel framework that utilizes detailed video captions as a proxy of video content, enabling language models to incorporate this information as supporting evidence for scoring video Question Answering (QA) predictions. Our approach demonstrates robust alignment with OpenAI GPT-4V model's reward mechanism, which directly takes video frames as input. Furthermore, we show that applying this tailored reward through DPO significantly improves the performance of video LMMs on video QA tasks. △ Less

Submitted 2 April, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

arXiv:2403.16242 [pdf, other]

Adversarially Masked Video Consistency for Unsupervised Domain Adaptation

Authors: Xiaoyu Zhu, Junwei Liang, Po-Yao Huang, Alex Hauptmann

Abstract: We study the problem of unsupervised domain adaptation for egocentric videos. We propose a transformer-based model to learn class-discriminative and domain-invariant feature representations. It consists of two novel designs. The first module is called Generative Adversarial Domain Alignment Network with the aim of learning domain-invariant representations. It simultaneously learns a mask generator… ▽ More We study the problem of unsupervised domain adaptation for egocentric videos. We propose a transformer-based model to learn class-discriminative and domain-invariant feature representations. It consists of two novel designs. The first module is called Generative Adversarial Domain Alignment Network with the aim of learning domain-invariant representations. It simultaneously learns a mask generator and a domain-invariant encoder in an adversarial way. The domain-invariant encoder is trained to minimize the distance between the source and target domain. The masking generator, conversely, aims at producing challenging masks by maximizing the domain distance. The second is a Masked Consistency Learning module to learn class-discriminative representations. It enforces the prediction consistency between the masked target videos and their full forms. To better evaluate the effectiveness of domain adaptation methods, we construct a more challenging benchmark for egocentric videos, U-Ego4D. Our method achieves state-of-the-art performance on the Epic-Kitchen and the proposed U-Ego4D benchmark. △ Less

Submitted 24 March, 2024; originally announced March 2024.

arXiv:2311.12528 [pdf, other]

Inverse Problems with Learned Forward Operators

Authors: Simon Arridge, Andreas Hauptmann, Yury Korolev

Abstract: Solving inverse problems requires the knowledge of the forward operator, but accurate models can be computationally expensive and hence cheaper variants that do not compromise the reconstruction quality are desired. This chapter reviews reconstruction methods in inverse problems with learned forward operators that follow two different paradigms. The first one is completely agnostic to the forward… ▽ More Solving inverse problems requires the knowledge of the forward operator, but accurate models can be computationally expensive and hence cheaper variants that do not compromise the reconstruction quality are desired. This chapter reviews reconstruction methods in inverse problems with learned forward operators that follow two different paradigms. The first one is completely agnostic to the forward operator and learns its restriction to the subspace spanned by the training data. The framework of regularisation by projection is then used to find a reconstruction. The second one uses a simplified model of the physics of the measurement process and only relies on the training data to learn a model correction. We present the theory of these two approaches and compare them numerically. A common theme emerges: both methods require, or at least benefit from, training data not only for the forward operator, but also for its adjoint. △ Less

Submitted 18 March, 2024; v1 submitted 21 November, 2023; originally announced November 2023.

MSC Class: 65J22; 47A52; 35R30; 74J25

arXiv:2311.01723 [pdf, other]

Towards Calibrated Robust Fine-Tuning of Vision-Language Models

Authors: Changdae Oh, Hyesu Lim, Mijoo Kim, Dongyoon Han, Sangdoo Yun, Jaegul Choo, Alexander Hauptmann, Zhi-Qi Cheng, Kyungwoo Song

Abstract: Improving out-of-distribution (OOD) generalization through in-distribution (ID) adaptation is a primary goal of robust fine-tuning methods beyond the naive fine-tuning approach. However, despite decent OOD generalization performance from recent robust fine-tuning methods, OOD confidence calibration for reliable machine learning has not been fully addressed. This work proposes a robust fine-tuning… ▽ More Improving out-of-distribution (OOD) generalization through in-distribution (ID) adaptation is a primary goal of robust fine-tuning methods beyond the naive fine-tuning approach. However, despite decent OOD generalization performance from recent robust fine-tuning methods, OOD confidence calibration for reliable machine learning has not been fully addressed. This work proposes a robust fine-tuning method that improves both OOD accuracy and calibration error in Vision Language Models (VLMs). Firstly, we show that both types of errors have a shared upper bound consisting of two terms of ID data: 1) calibration error and 2) the smallest singular value of the input covariance matrix. Based on this insight, we design a novel framework that conducts fine-tuning with a constrained multimodal contrastive loss enforcing a larger smallest singular value, which is further aided by the self-distillation of a moving averaged model to achieve well-calibrated prediction. Starting from an empirical validation of our theoretical statements, we provide extensive experimental results on ImageNet distribution shift benchmarks that demonstrate the effectiveness of our method. △ Less

Submitted 27 May, 2024; v1 submitted 3 November, 2023; originally announced November 2023.

Comments: Presented at the NeurIPS 2023 Workshop on Distribution Shifts (DistShift)

arXiv:2310.18636 [pdf, other]

Electrical Impedance Tomography: A Fair Comparative Study on Deep Learning and Analytic-based Approaches

Authors: Derick Nganyu Tanyu, Jianfeng Ning, Andreas Hauptmann, Bangti Jin, Peter Maass

Abstract: Electrical Impedance Tomography (EIT) is a powerful imaging technique with diverse applications, e.g., medical diagnosis, industrial monitoring, and environmental studies. The EIT inverse problem is about inferring the internal conductivity distribution of an object from measurements taken on its boundary. It is severely ill-posed, necessitating advanced computational methods for accurate image re… ▽ More Electrical Impedance Tomography (EIT) is a powerful imaging technique with diverse applications, e.g., medical diagnosis, industrial monitoring, and environmental studies. The EIT inverse problem is about inferring the internal conductivity distribution of an object from measurements taken on its boundary. It is severely ill-posed, necessitating advanced computational methods for accurate image reconstructions. Recent years have witnessed significant progress, driven by innovations in analytic-based approaches and deep learning. This review explores techniques for solving the EIT inverse problem, focusing on the interplay between contemporary deep learning-based strategies and classical analytic-based methods. Four state-of-the-art deep learning algorithms are rigorously examined, harnessing the representational capabilities of deep neural networks to reconstruct intricate conductivity distributions. In parallel, two analytic-based methods, rooted in mathematical formulations and regularisation techniques, are dissected for their strengths and limitations. These methodologies are evaluated through various numerical experiments, encompassing diverse scenarios that reflect real-world complexities. A suite of performance metrics is employed to assess the efficacy of these methods. These metrics collectively provide a nuanced understanding of the methods' ability to capture essential features and delineate complex conductivity patterns. One novel feature of the study is the incorporation of variable conductivity scenarios, introducing a level of heterogeneity that mimics textured inclusions. This departure from uniform conductivity assumptions mimics realistic scenarios where tissues or materials exhibit spatially varying electrical properties. Exploring how each method responds to such variable conductivity scenarios opens avenues for understanding their robustness and adaptability. △ Less

Submitted 28 October, 2023; originally announced October 2023.

arXiv:2310.05737 [pdf, other]

Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

Authors: Lijun Yu, José Lezama, Nitesh B. Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, Alexander G. Hauptmann, Boqing Gong, Ming-Hsuan Yang, Irfan Essa, David A. Ross, Lu Jiang

Abstract: While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual tokenizer that maps pixel-space inputs to discrete tokens appropriate for LLM learning. In this paper, we introduce MAGVIT-v2, a video tokenizer… ▽ More While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual tokenizer that maps pixel-space inputs to discrete tokens appropriate for LLM learning. In this paper, we introduce MAGVIT-v2, a video tokenizer designed to generate concise and expressive tokens for both videos and images using a common token vocabulary. Equipped with this new tokenizer, we show that LLMs outperform diffusion models on standard image and video generation benchmarks including ImageNet and Kinetics. In addition, we demonstrate that our tokenizer surpasses the previously top-performing video tokenizer on two more tasks: (1) video compression comparable to the next-generation video codec (VCC) according to human evaluations, and (2) learning effective representations for action recognition tasks. △ Less

Submitted 29 March, 2024; v1 submitted 9 October, 2023; originally announced October 2023.

Comments: ICLR 2024

arXiv:2309.10013 [pdf, other]

Hyperbolic vs Euclidean Embeddings in Few-Shot Learning: Two Sides of the Same Coin

Authors: Gabriel Moreira, Manuel Marques, João Paulo Costeira, Alexander Hauptmann

Abstract: Recent research in representation learning has shown that hierarchical data lends itself to low-dimensional and highly informative representations in hyperbolic space. However, even if hyperbolic embeddings have gathered attention in image recognition, their optimization is prone to numerical hurdles. Further, it remains unclear which applications stand to benefit the most from the implicit bias i… ▽ More Recent research in representation learning has shown that hierarchical data lends itself to low-dimensional and highly informative representations in hyperbolic space. However, even if hyperbolic embeddings have gathered attention in image recognition, their optimization is prone to numerical hurdles. Further, it remains unclear which applications stand to benefit the most from the implicit bias imposed by hyperbolicity, when compared to traditional Euclidean features. In this paper, we focus on prototypical hyperbolic neural networks. In particular, the tendency of hyperbolic embeddings to converge to the boundary of the Poincaré ball in high dimensions and the effect this has on few-shot classification. We show that the best few-shot results are attained for hyperbolic embeddings at a common hyperbolic radius. In contrast to prior benchmark results, we demonstrate that better performance can be achieved by a fixed-radius encoder equipped with the Euclidean metric, regardless of the embedding dimension. △ Less

Submitted 18 September, 2023; originally announced September 2023.

Comments: Accepted for WACV 2024

arXiv:2307.09441 [pdf, other]

Convergent regularization in inverse problems and linear plug-and-play denoisers

Authors: Andreas Hauptmann, Subhadip Mukherjee, Carola-Bibiane Schönlieb, Ferdia Sherry

Abstract: Plug-and-play (PnP) denoising is a popular iterative framework for solving imaging inverse problems using off-the-shelf image denoisers. Their empirical success has motivated a line of research that seeks to understand the convergence of PnP iterates under various assumptions on the denoiser. While a significant amount of research has gone into establishing the convergence of the PnP iteration for… ▽ More Plug-and-play (PnP) denoising is a popular iterative framework for solving imaging inverse problems using off-the-shelf image denoisers. Their empirical success has motivated a line of research that seeks to understand the convergence of PnP iterates under various assumptions on the denoiser. While a significant amount of research has gone into establishing the convergence of the PnP iteration for different regularity conditions on the denoisers, not much is known about the asymptotic properties of the converged solution as the noise level in the measurement tends to zero, i.e., whether PnP methods are provably convergent regularization schemes under reasonable assumptions on the denoiser. This paper serves two purposes: first, we provide an overview of the classical regularization theory in inverse problems and survey a few notable recent data-driven methods that are provably convergent regularization schemes. We then continue to discuss PnP algorithms and their established convergence guarantees. Subsequently, we consider PnP algorithms with linear denoisers and propose a novel spectral filtering technique to control the strength of regularization arising from the denoiser. Further, by relating the implicit regularization of the denoiser to an explicit regularization functional, we rigorously show that PnP with linear denoisers leads to a convergent regularization scheme. More specifically, we prove that in the limit as the noise vanishes, the PnP reconstruction converges to the minimizer of a regularization potential subject to the solution satisfying the noiseless operator equation. The theoretical analysis is corroborated by numerical experiments for the classical inverse problem of tomographic image reconstruction. △ Less

Submitted 18 July, 2023; originally announced July 2023.

arXiv:2306.17842 [pdf, other]

SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs

Authors: Lijun Yu, Yong Cheng, Zhiruo Wang, Vivek Kumar, Wolfgang Macherey, Yanping Huang, David A. Ross, Irfan Essa, Yonatan Bisk, Ming-Hsuan Yang, Kevin Murphy, Alexander G. Hauptmann, Lu Jiang

Abstract: In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos. SPAE converts between raw pixels and interpretable lexical tokens (or words) extracted from the LLM's vocabulary. The resulting tokens capture both the semantic meaning and the fine-grained details n… ▽ More In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos. SPAE converts between raw pixels and interpretable lexical tokens (or words) extracted from the LLM's vocabulary. The resulting tokens capture both the semantic meaning and the fine-grained details needed for visual reconstruction, effectively translating the visual content into a language comprehensible to the LLM, and empowering it to perform a wide array of multimodal tasks. Our approach is validated through in-context learning experiments with frozen PaLM 2 and GPT 3.5 on a diverse set of image understanding and generation tasks. Our method marks the first successful attempt to enable a frozen LLM to generate image content while surpassing state-of-the-art performance in image understanding tasks, under the same setting, by over 25%. △ Less

Submitted 28 October, 2023; v1 submitted 30 June, 2023; originally announced June 2023.

Comments: NeurIPS 2023 spotlight

arXiv:2306.08937 [pdf, other]

DocumentNet: Bridging the Data Gap in Document Pre-Training

Authors: Lijun Yu, Jin Miao, Xiaoyu Sun, Jiayi Chen, Alexander G. Hauptmann, Hanjun Dai, Wei Wei

Abstract: Document understanding tasks, in particular, Visually-rich Document Entity Retrieval (VDER), have gained significant attention in recent years thanks to their broad applications in enterprise AI. However, publicly available data have been scarce for these tasks due to strict privacy constraints and high annotation costs. To make things worse, the non-overlapping entity spaces from different datase… ▽ More Document understanding tasks, in particular, Visually-rich Document Entity Retrieval (VDER), have gained significant attention in recent years thanks to their broad applications in enterprise AI. However, publicly available data have been scarce for these tasks due to strict privacy constraints and high annotation costs. To make things worse, the non-overlapping entity spaces from different datasets hinder the knowledge transfer between document types. In this paper, we propose a method to collect massive-scale and weakly labeled data from the web to benefit the training of VDER models. The collected dataset, named DocumentNet, does not depend on specific document types or entity sets, making it universally applicable to all VDER tasks. The current DocumentNet consists of 30M documents spanning nearly 400 document types organized in a four-level ontology. Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training for both classic and few-shot learning settings. With the recent emergence of large language models (LLMs), DocumentNet provides a large data source to extend their multi-modal capabilities for VDER. △ Less

Submitted 26 October, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

Comments: EMNLP 2023

arXiv:2305.05020 [pdf, other]

Domain independent post-processing with graph U-nets: Applications to Electrical Impedance Tomographic Imaging

Authors: William Herzberg, Andreas Hauptmann, Sarah J. Hamilton

Abstract: Reconstruction of tomographic images from boundary measurements requires flexibility with respect to target domains. For instance, when the system equations are modeled by partial differential equations the reconstruction is usually done on finite element (FE) meshes, allowing for flexible geometries. Thus, any processing of the obtained reconstructions should be ideally done on the FE mesh as wel… ▽ More Reconstruction of tomographic images from boundary measurements requires flexibility with respect to target domains. For instance, when the system equations are modeled by partial differential equations the reconstruction is usually done on finite element (FE) meshes, allowing for flexible geometries. Thus, any processing of the obtained reconstructions should be ideally done on the FE mesh as well. For this purpose, we extend the hugely successful U-Net architecture that is limited to rectangular pixel or voxel domains to an equivalent that works flexibly on FE meshes. To achieve this, the FE mesh is converted into a graph and we formulate a graph U-Net with a new cluster pooling and unpooling on the graph that mimics the classic neighborhood based max-pooling. We demonstrate effectiveness and flexibility of the graph U-Net for improving reconstructions from electrical impedance tomographic (EIT) measurements, a nonlinear and highly ill-posed inverse problem. The performance is evaluated for simulated data and from three measurement devices with different measurement geometries and instrumentations. We successfully show that such networks can be trained with a simple two-dimensional simulated training set and generalize to very different domains, including measurements from a three-dimensional device and subsequent 3D reconstructions. △ Less

Submitted 8 May, 2023; originally announced May 2023.

Comments: 12 pages, 11 figures

arXiv:2304.02173 [pdf, other]

ChartReader: A Unified Framework for Chart Derendering and Comprehension without Heuristic Rules

Authors: Zhi-Qi Cheng, Qi Dai, Siyao Li, Jingdong Sun, Teruko Mitamura, Alexander G. Hauptmann

Abstract: Charts are a powerful tool for visually conveying complex data, but their comprehension poses a challenge due to the diverse chart types and intricate components. Existing chart comprehension methods suffer from either heuristic rules or an over-reliance on OCR systems, resulting in suboptimal performance. To address these issues, we present ChartReader, a unified framework that seamlessly integra… ▽ More Charts are a powerful tool for visually conveying complex data, but their comprehension poses a challenge due to the diverse chart types and intricate components. Existing chart comprehension methods suffer from either heuristic rules or an over-reliance on OCR systems, resulting in suboptimal performance. To address these issues, we present ChartReader, a unified framework that seamlessly integrates chart derendering and comprehension tasks. Our approach includes a transformer-based chart component detection module and an extended pre-trained vision-language model for chart-to-X tasks. By learning the rules of charts automatically from annotated datasets, our approach eliminates the need for manual rule-making, reducing effort and enhancing accuracy.~We also introduce a data variable replacement technique and extend the input and position embeddings of the pre-trained model for cross-task training. We evaluate ChartReader on Chart-to-Table, ChartQA, and Chart-to-Text tasks, demonstrating its superiority over existing methods. Our proposed framework can significantly reduce the manual effort involved in chart analysis, providing a step towards a universal chart understanding model. Moreover, our approach offers opportunities for plug-and-play integration with mainstream LLMs such as T5 and TaPas, extending their capability to chart comprehension tasks. The code is available at https://github.com/zhiqic/ChartReader. △ Less

Submitted 4 April, 2023; originally announced April 2023.

arXiv:2304.01963 [pdf, other]

Model-corrected learned primal-dual models for fast limited-view photoacoustic tomography

Authors: Andreas Hauptmann, Jenni Poimala

Abstract: Learned iterative reconstructions hold great promise to accelerate tomographic imaging with empirical robustness to model perturbations. Nevertheless, an adoption for photoacoustic tomography is hindered by the need to repeatedly evaluate the computational expensive forward model. Computational feasibility can be obtained by the use of fast approximate models, but a need to compensate model errors… ▽ More Learned iterative reconstructions hold great promise to accelerate tomographic imaging with empirical robustness to model perturbations. Nevertheless, an adoption for photoacoustic tomography is hindered by the need to repeatedly evaluate the computational expensive forward model. Computational feasibility can be obtained by the use of fast approximate models, but a need to compensate model errors arises. In this work we advance the methodological and theoretical basis for model corrections in learned image reconstructions by embedding the model correction in a learned primal-dual framework. Here, the model correction is jointly learned in data space coupled with a learned updating operator in image space within an unrolled end-to-end learned iterative reconstruction approach. The proposed formulation allows an extension to a primal-dual deep equilibrium model providing fixed-point convergence as well as reduced memory requirements for training. We provide theoretical and empirical insights into the proposed models with numerical validation in a realistic 2D limited-view setting. The model-corrected learned primal-dual methods show excellent reconstruction quality with fast inference times and thus providing a methodological basis for real-time capable and scalable iterative reconstructions in photoacoustic tomography. △ Less

Submitted 4 April, 2023; originally announced April 2023.

arXiv:2303.18177 [pdf, other]

STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition

Authors: Xiaoyu Zhu, Po-Yao Huang, Junwei Liang, Celso M. de Melo, Alexander Hauptmann

Abstract: We study the problem of human action recognition using motion capture (MoCap) sequences. Unlike existing techniques that take multiple manual steps to derive standardized skeleton representations as model input, we propose a novel Spatial-Temporal Mesh Transformer (STMT) to directly model the mesh sequences. The model uses a hierarchical transformer with intra-frame off-set attention and inter-fra… ▽ More We study the problem of human action recognition using motion capture (MoCap) sequences. Unlike existing techniques that take multiple manual steps to derive standardized skeleton representations as model input, we propose a novel Spatial-Temporal Mesh Transformer (STMT) to directly model the mesh sequences. The model uses a hierarchical transformer with intra-frame off-set attention and inter-frame self-attention. The attention mechanism allows the model to freely attend between any two vertex patches to learn non-local relationships in the spatial-temporal domain. Masked vertex modeling and future frame prediction are used as two self-supervised tasks to fully activate the bi-directional and auto-regressive attention in our hierarchical transformer. The proposed method achieves state-of-the-art performance compared to skeleton-based and point-cloud-based models on common MoCap benchmarks. Code is available at https://github.com/zgzxy001/STMT. △ Less

Submitted 31 March, 2023; originally announced March 2023.

Comments: CVPR 2023

arXiv:2212.05199 [pdf, other]

MAGVIT: Masked Generative Video Transformer

Authors: Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G. Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, Lu Jiang

Abstract: We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model. We introduce a 3D tokenizer to quantize a video into spatial-temporal visual tokens and propose an embedding method for masked video token modeling to facilitate multi-task learning. We conduct extensive experiments to demonstrate the quality, efficiency, and flexibility of MA… ▽ More We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model. We introduce a 3D tokenizer to quantize a video into spatial-temporal visual tokens and propose an embedding method for masked video token modeling to facilitate multi-task learning. We conduct extensive experiments to demonstrate the quality, efficiency, and flexibility of MAGVIT. Our experiments show that (i) MAGVIT performs favorably against state-of-the-art approaches and establishes the best-published FVD on three video generation benchmarks, including the challenging Kinetics-600. (ii) MAGVIT outperforms existing methods in inference time by two orders of magnitude against diffusion models and by 60x against autoregressive models. (iii) A single MAGVIT model supports ten diverse generation tasks and generalizes across videos from different visual domains. The source code and trained models will be released to the public at https://magvit.cs.cmu.edu. △ Less

Submitted 4 April, 2023; v1 submitted 9 December, 2022; originally announced December 2022.

Comments: CVPR 2023 highlight

arXiv:2211.01159 [pdf, other]

Unsupervised denoising for sparse multi-spectral computed tomography

Authors: Satu I. Inkinen, Mikael A. K. Brix, Miika T. Nieminen, Simon Arridge, Andreas Hauptmann

Abstract: Multi-energy computed tomography (CT) with photon counting detectors (PCDs) enables spectral imaging as PCDs can assign the incoming photons to specific energy channels. However, PCDs with many spectral channels drastically increase the computational complexity of the CT reconstruction, and bespoke reconstruction algorithms need fine-tuning to varying noise statistics. \rev{Especially if many proj… ▽ More Multi-energy computed tomography (CT) with photon counting detectors (PCDs) enables spectral imaging as PCDs can assign the incoming photons to specific energy channels. However, PCDs with many spectral channels drastically increase the computational complexity of the CT reconstruction, and bespoke reconstruction algorithms need fine-tuning to varying noise statistics. \rev{Especially if many projections are taken, a large amount of data has to be collected and stored. Sparse view CT is one solution for data reduction. However, these issues are especially exacerbated when sparse imaging scenarios are encountered due to a significant reduction in photon counts.} In this work, we investigate the suitability of learning-based improvements to the challenging task of obtaining high-quality reconstructions from sparse measurements for a 64-channel PCD-CT. In particular, to overcome missing reference data for the training procedure, we propose an unsupervised denoising and artefact removal approach by exploiting different filter functions in the reconstruction and an explicit coupling of spectral channels with the nuclear norm. Performance is assessed on both simulated synthetic data and the openly available experimental Multi-Spectral Imaging via Computed Tomography (MUSIC) dataset. We compared the quality of our unsupervised method to iterative total nuclear variation regularized reconstructions and a supervised denoiser trained with reference data. We show that improved reconstruction quality can be achieved with flexibility on noise statistics and effective suppression of streaking artefacts when using unsupervised denoising with spectral coupling. △ Less

Submitted 2 November, 2022; originally announced November 2022.

arXiv:2208.08965 [pdf, other]

doi 10.1145/3503161.3547943

GSRFormer: Grounded Situation Recognition Transformer with Alternate Semantic Attention Refinement

Authors: Zhi-Qi Cheng, Qi Dai, Siyao Li, Teruko Mitamura, Alexander G. Hauptmann

Abstract: Grounded Situation Recognition (GSR) aims to generate structured semantic summaries of images for "human-like" event understanding. Specifically, GSR task not only detects the salient activity verb (e.g. buying), but also predicts all corresponding semantic roles (e.g. agent and goods). Inspired by object detection and image captioning tasks, existing methods typically employ a two-stage framework… ▽ More Grounded Situation Recognition (GSR) aims to generate structured semantic summaries of images for "human-like" event understanding. Specifically, GSR task not only detects the salient activity verb (e.g. buying), but also predicts all corresponding semantic roles (e.g. agent and goods). Inspired by object detection and image captioning tasks, existing methods typically employ a two-stage framework: 1) detect the activity verb, and then 2) predict semantic roles based on the detected verb. Obviously, this illogical framework constitutes a huge obstacle to semantic understanding. First, pre-detecting verbs solely without semantic roles inevitably fails to distinguish many similar daily activities (e.g., offering and giving, buying and selling). Second, predicting semantic roles in a closed auto-regressive manner can hardly exploit the semantic relations among the verb and roles. To this end, in this paper we propose a novel two-stage framework that focuses on utilizing such bidirectional relations within verbs and roles. In the first stage, instead of pre-detecting the verb, we postpone the detection step and assume a pseudo label, where an intermediate representation for each corresponding semantic role is learned from images. In the second stage, we exploit transformer layers to unearth the potential semantic relations within both verbs and semantic roles. With the help of a set of support images, an alternate learning scheme is designed to simultaneously optimize the results: update the verb using nouns corresponding to the image, and update nouns using verbs from support images. Extensive experimental results on challenging SWiG benchmarks show that our renovated framework outperforms other state-of-the-art methods under various metrics. △ Less

Submitted 28 November, 2022; v1 submitted 18 August, 2022; originally announced August 2022.

Comments: ACM Multimedia 2022 (Oral), Code: https://github.com/zhiqic/GSRFormer

arXiv:2206.09595 [pdf, other]

Reconstruction and segmentation from sparse sequential X-ray measurements of wood logs

Authors: Sebastian Springer, Aldo Glielmo, Angelina Senchukova, Tomi Kauppi, Jarkko Suuronen, Lassi Roininen, Heikki Haario, Andreas Hauptmann

Abstract: In industrial applications, it is common to scan objects on a moving conveyor belt. If slice-wise 2D computed tomography (CT) measurements of the moving object are obtained we call it a sequential scanning geometry. In this case, each slice on its own does not carry sufficient information to reconstruct a useful tomographic image. Thus, here we propose the use of a Dimension reduced Kalman Filter… ▽ More In industrial applications, it is common to scan objects on a moving conveyor belt. If slice-wise 2D computed tomography (CT) measurements of the moving object are obtained we call it a sequential scanning geometry. In this case, each slice on its own does not carry sufficient information to reconstruct a useful tomographic image. Thus, here we propose the use of a Dimension reduced Kalman Filter to accumulate information between slices and allow for sufficiently accurate reconstructions for further assessment of the object. Additionally, we propose to use an unsupervised clustering approach known as Density Peak Advanced, to perform a segmentation and spot density anomalies in the internal structure of the reconstructed objects. We evaluate the method in a proof of concept study for the application of wood log scanning for the industrial sawing process, where the goal is to spot anomalies within the wood log to allow for optimal sawing patterns. Reconstruction and segmentation quality are evaluated from experimental measurement data for various scenarios of severely undersampled X-measurements. Results show clearly that an improvement in reconstruction quality can be obtained by employing the Dimension reduced Kalman Filter allowing to robustly obtain the segmented logs. △ Less

Submitted 9 November, 2023; v1 submitted 20 June, 2022; originally announced June 2022.

arXiv:2206.05431 [pdf, other]

Learned reconstruction methods with convergence guarantees

Authors: Subhadip Mukherjee, Andreas Hauptmann, Ozan Öktem, Marcelo Pereyra, Carola-Bibiane Schönlieb

Abstract: In recent years, deep learning has achieved remarkable empirical success for image reconstruction. This has catalyzed an ongoing quest for precise characterization of correctness and reliability of data-driven methods in critical use-cases, for instance in medical imaging. Notwithstanding the excellent performance and efficacy of deep learning-based methods, concerns have been raised regarding the… ▽ More In recent years, deep learning has achieved remarkable empirical success for image reconstruction. This has catalyzed an ongoing quest for precise characterization of correctness and reliability of data-driven methods in critical use-cases, for instance in medical imaging. Notwithstanding the excellent performance and efficacy of deep learning-based methods, concerns have been raised regarding their stability, or lack thereof, with serious practical implications. Significant advances have been made in recent years to unravel the inner workings of data-driven image recovery methods, challenging their widely perceived black-box nature. In this article, we will specify relevant notions of convergence for data-driven image reconstruction, which will form the basis of a survey of learned methods with mathematically rigorous reconstruction guarantees. An example that is highlighted is the role of ICNN, offering the possibility to combine the power of deep learning with classical convex regularization theory for devising methods that are provably convergent. This survey article is aimed at both methodological researchers seeking to advance the frontiers of our understanding of data-driven image reconstruction methods as well as practitioners, by providing an accessible description of useful convergence concepts and by placing some of the existing empirical practices on a solid mathematical foundation. △ Less

Submitted 14 September, 2022; v1 submitted 11 June, 2022; originally announced June 2022.

arXiv:2206.05253 [pdf, other]

Rethinking Spatial Invariance of Convolutional Networks for Object Counting

Authors: Zhi-Qi Cheng, Qi Dai, Hong Li, JingKuan Song, Xiao Wu, Alexander G. Hauptmann

Abstract: Previous work generally believes that improving the spatial invariance of convolutional networks is the key to object counting. However, after verifying several mainstream counting networks, we surprisingly found too strict pixel-level spatial invariance would cause overfit noise in the density map generation. In this paper, we try to use locally connected Gaussian kernels to replace the original… ▽ More Previous work generally believes that improving the spatial invariance of convolutional networks is the key to object counting. However, after verifying several mainstream counting networks, we surprisingly found too strict pixel-level spatial invariance would cause overfit noise in the density map generation. In this paper, we try to use locally connected Gaussian kernels to replace the original convolution filter to estimate the spatial position in the density map. The purpose of this is to allow the feature extraction process to potentially stimulate the density map generation process to overcome the annotation noise. Inspired by previous work, we propose a low-rank approximation accompanied with translation invariance to favorably implement the approximation of massive Gaussian convolution. Our work points a new direction for follow-up research, which should investigate how to properly relax the overly strict pixel-level spatial invariance for object counting. We evaluate our methods on 4 mainstream object counting networks (i.e., MCNN, CSRNet, SANet, and ResNet-50). Extensive experiments were conducted on 7 popular benchmarks for 3 applications (i.e., crowd, vehicle, and plant counting). Experimental results show that our methods significantly outperform other state-of-the-art methods and achieve promising learning of the spatial position of objects. △ Less

Submitted 18 August, 2022; v1 submitted 10 June, 2022; originally announced June 2022.

Comments: Accepted to CVPR 2022, Code: https://github.com/zhiqic/Rethinking-Counting

arXiv:2205.09256 [pdf, other]

Training Vision-Language Transformers from Captions

Authors: Liangke Gui, Yingshan Chang, Qiuyuan Huang, Subhojit Som, Alex Hauptmann, Jianfeng Gao, Yonatan Bisk

Abstract: Vision-Language Transformers can be learned without low-level human labels (e.g. class labels, bounding boxes, etc). Existing work, whether explicitly utilizing bounding boxes or patches, assumes that the visual backbone must first be trained on ImageNet class prediction before being integrated into a multimodal linguistic pipeline. We show that this is not necessary and introduce a new model Visi… ▽ More Vision-Language Transformers can be learned without low-level human labels (e.g. class labels, bounding boxes, etc). Existing work, whether explicitly utilizing bounding boxes or patches, assumes that the visual backbone must first be trained on ImageNet class prediction before being integrated into a multimodal linguistic pipeline. We show that this is not necessary and introduce a new model Vision-Language from Captions (VLC) built on top of Masked Auto-Encoders that does not require this supervision. In fact, in a head-to-head comparison between ViLT, the current state-of-the-art patch-based vision-language transformer which is pretrained with supervised object classification, and our model, VLC, we find that our approach 1. outperforms ViLT on standard benchmarks, 2. provides more interpretable and intuitive patch visualizations, and 3. is competitive with many larger models that utilize ROIs trained on annotated bounding-boxes. △ Less

Submitted 14 June, 2023; v1 submitted 18 May, 2022; originally announced May 2022.

arXiv:2201.05290 [pdf, other]

Argus++: Robust Real-time Activity Detection for Unconstrained Video Streams with Overlapping Cube Proposals

Authors: Lijun Yu, Yijun Qian, Wenhe Liu, Alexander G. Hauptmann

Abstract: Activity detection is one of the attractive computer vision tasks to exploit the video streams captured by widely installed cameras. Although achieving impressive performance, conventional activity detection algorithms are usually designed under certain constraints, such as using trimmed and/or object-centered video clips as inputs. Therefore, they failed to deal with the multi-scale multi-instanc… ▽ More Activity detection is one of the attractive computer vision tasks to exploit the video streams captured by widely installed cameras. Although achieving impressive performance, conventional activity detection algorithms are usually designed under certain constraints, such as using trimmed and/or object-centered video clips as inputs. Therefore, they failed to deal with the multi-scale multi-instance cases in real-world unconstrained video streams, which are untrimmed and have large field-of-views. Real-time requirements for streaming analysis also mark brute force expansion of them unfeasible. To overcome these issues, we propose Argus++, a robust real-time activity detection system for analyzing unconstrained video streams. The design of Argus++ introduces overlapping spatio-temporal cubes as an intermediate concept of activity proposals to ensure coverage and completeness of activity detection through over-sampling. The overall system is optimized for real-time processing on standalone consumer-level hardware. Extensive experiments on different surveillance and driving scenarios demonstrated its superior performance in a series of activity detection benchmarks, including CVPR ActivityNet ActEV 2021, NIST ActEV SDL UF/KF, TRECVID ActEV 2020/2021, and ICCV ROAD 2021. △ Less

Submitted 13 January, 2022; originally announced January 2022.

arXiv:2112.08614 [pdf, other]

KAT: A Knowledge Augmented Transformer for Vision-and-Language

Authors: Liangke Gui, Borui Wang, Qiuyuan Huang, Alex Hauptmann, Yonatan Bisk, Jianfeng Gao

Abstract: The primary focus of recent work with largescale transformers has been on optimizing the amount of information packed into the model's parameters. In this work, we ask a different question: Can multimodal transformers leverage explicit knowledge in their reasoning? Existing, primarily unimodal, methods have explored approaches under the paradigm of knowledge retrieval followed by answer prediction… ▽ More The primary focus of recent work with largescale transformers has been on optimizing the amount of information packed into the model's parameters. In this work, we ask a different question: Can multimodal transformers leverage explicit knowledge in their reasoning? Existing, primarily unimodal, methods have explored approaches under the paradigm of knowledge retrieval followed by answer prediction, but leave open questions about the quality and relevance of the retrieved knowledge used, and how the reasoning processes over implicit and explicit knowledge should be integrated. To address these challenges, we propose a novel model - Knowledge Augmented Transformer (KAT) - which achieves a strong state-of-the-art result (+6 points absolute) on the open-domain multimodal task of OK-VQA. Our approach integrates implicit and explicit knowledge in an end to end encoder-decoder architecture, while still jointly reasoning over both knowledge sources during answer generation. An additional benefit of explicit knowledge integration is seen in improved interpretability of model predictions in our analysis. △ Less

Submitted 5 May, 2022; v1 submitted 15 December, 2021; originally announced December 2021.

Comments: Accepted by NAACL 2022

arXiv:2111.11926 [pdf, other]

doi 10.1109/TCI.2022.3233188

An Educated Warm Start For Deep Image Prior-Based Micro CT Reconstruction

Authors: Riccardo Barbano, Johannes Leuschner, Maximilian Schmidt, Alexander Denker, Andreas Hauptmann, Peter Maaß, Bangti Jin

Abstract: Deep image prior (DIP) was recently introduced as an effective unsupervised approach for image restoration tasks. DIP represents the image to be recovered as the output of a deep convolutional neural network, and learns the network's parameters such that the output matches the corrupted observation. Despite its impressive reconstructive properties, the approach is slow when compared to supervisedl… ▽ More Deep image prior (DIP) was recently introduced as an effective unsupervised approach for image restoration tasks. DIP represents the image to be recovered as the output of a deep convolutional neural network, and learns the network's parameters such that the output matches the corrupted observation. Despite its impressive reconstructive properties, the approach is slow when compared to supervisedly learned, or traditional reconstruction techniques. To address the computational challenge, we bestow DIP with a two-stage learning paradigm: (i) perform a supervised pretraining of the network on a simulated dataset; (ii) fine-tune the network's parameters to adapt to the target reconstruction task. We provide a thorough empirical analysis to shed insights into the impacts of pretraining in the context of image reconstruction. We showcase that pretraining considerably speeds up and stabilizes the subsequent reconstruction task from real-measured 2D and 3D micro computed tomography data of biological specimens. The code and additional experimental materials are available at https://educateddip.github.io/docs.educated_deep_image_prior/. △ Less

Submitted 8 February, 2023; v1 submitted 23 November, 2021; originally announced November 2021.

Journal ref: in IEEE Transactions on Computational Imaging, vol. 8, pp. 1210-1222, 2022

arXiv:2111.09631 [pdf, other]

doi 10.1109/TUFFC.2022.3162097

Neural Network Kalman filtering for 3D object tracking from linear array ultrasound data

Authors: Arttu Arjas, Erwin J. Alles, Efthymios Maneas, Simon Arridge, Adrien Desjardins, Mikko J. Sillanpää, Andreas Hauptmann

Abstract: Many interventional surgical procedures rely on medical imaging to visualise and track instruments. Such imaging methods not only need to be real-time capable, but also provide accurate and robust positional information. In ultrasound applications, typically only two-dimensional data from a linear array are available, and as such obtaining accurate positional estimation in three dimensions is non-… ▽ More Many interventional surgical procedures rely on medical imaging to visualise and track instruments. Such imaging methods not only need to be real-time capable, but also provide accurate and robust positional information. In ultrasound applications, typically only two-dimensional data from a linear array are available, and as such obtaining accurate positional estimation in three dimensions is non-trivial. In this work, we first train a neural network, using realistic synthetic training data, to estimate the out-of-plane offset of an object with the associated axial aberration in the reconstructed ultrasound image. The obtained estimate is then combined with a Kalman filtering approach that utilises positioning estimates obtained in previous time-frames to improve localisation robustness and reduce the impact of measurement noise. The accuracy of the proposed method is evaluated using simulations, and its practical applicability is demonstrated on experimental data obtained using a novel optical ultrasound imaging setup. Accurate and robust positional information is provided in real-time. Axial and lateral coordinates for out-of-plane objects are estimated with a mean error of 0.1mm for simulated data and a mean error of 0.2mm for experimental data. Three-dimensional localisation is most accurate for elevational distances larger than 1mm, with a maximum distance of 6mm considered for a 25mm aperture. △ Less

Submitted 15 June, 2022; v1 submitted 18 November, 2021; originally announced November 2021.

Comments: 13 pages, 8 figures

arXiv:2107.02572 [pdf, other]

doi 10.1088/1361-6420/ac8a91

Unsupervised Knowledge-Transfer for Learned Image Reconstruction

Authors: Riccardo Barbano, Zeljko Kereta, Andreas Hauptmann, Simon R. Arridge, Bangti Jin

Abstract: Deep learning-based image reconstruction approaches have demonstrated impressive empirical performance in many imaging modalities. These approaches usually require a large amount of high-quality paired training data, which is often not available in medical imaging. To circumvent this issue we develop a novel unsupervised knowledge-transfer paradigm for learned reconstruction within a Bayesian fram… ▽ More Deep learning-based image reconstruction approaches have demonstrated impressive empirical performance in many imaging modalities. These approaches usually require a large amount of high-quality paired training data, which is often not available in medical imaging. To circumvent this issue we develop a novel unsupervised knowledge-transfer paradigm for learned reconstruction within a Bayesian framework. The proposed approach learns a reconstruction network in two phases. The first phase trains a reconstruction network with a set of ordered pairs comprising of ground truth images of ellipses and the corresponding simulated measurement data. The second phase fine-tunes the pretrained network to more realistic measurement data without supervision. By construction, the framework is capable of delivering predictive uncertainty information over the reconstructed image. We present extensive experimental results on low-dose and sparse-view computed tomography showing that the approach is competitive with several state-of-the-art supervised and unsupervised reconstruction techniques. Moreover, for test data distributed differently from the training data, the proposed framework can significantly improve reconstruction quality not only visually, but also quantitatively in terms of PSNR and SSIM, when compared with learned methods trained on the synthetic dataset only. △ Less

Submitted 21 July, 2022; v1 submitted 6 July, 2021; originally announced July 2021.

arXiv:2105.01605 [pdf, other]

Person Search Challenges and Solutions: A Survey

Authors: Xiangtan Lin, Pengzhen Ren, Yun Xiao, Xiaojun Chang, Alex Hauptmann

Abstract: Person search has drawn increasing attention due to its real-world applications and research significance. Person search aims to find a probe person in a gallery of scene images with a wide range of applications, such as criminals search, multicamera tracking, missing person search, etc. Early person search works focused on image-based person search, which uses person image as the search query. Te… ▽ More Person search has drawn increasing attention due to its real-world applications and research significance. Person search aims to find a probe person in a gallery of scene images with a wide range of applications, such as criminals search, multicamera tracking, missing person search, etc. Early person search works focused on image-based person search, which uses person image as the search query. Text-based person search is another major person search category that uses free-form natural language as the search query. Person search is challenging, and corresponding solutions are diverse and complex. Therefore, systematic surveys on this topic are essential. This paper surveyed the recent works on image-based and text-based person search from the perspective of challenges and solutions. Specifically, we provide a brief analysis of highly influential person search methods considering the three significant challenges: the discriminative person features, the query-person gap, and the detection-identification inconsistency. We summarise and compare evaluation results. Finally, we discuss open issues and some promising future research directions. △ Less

Submitted 1 May, 2021; originally announced May 2021.

Comments: 8 pages; Accepted by IJCAI 2021 Survey Track

arXiv:2105.00379 [pdf, other]

Subspace Representation Learning for Few-shot Image Classification

Authors: Ting-Yao Hu, Zhi-Qi Cheng, Alexander G. Hauptmann

Abstract: In this paper, we propose a subspace representation learning (SRL) framework to tackle few-shot image classification tasks. It exploits a subspace in local CNN feature space to represent an image, and measures the similarity between two images according to a weighted subspace distance (WSD). When K images are available for each class, we develop two types of template subspaces to aggregate K-shot… ▽ More In this paper, we propose a subspace representation learning (SRL) framework to tackle few-shot image classification tasks. It exploits a subspace in local CNN feature space to represent an image, and measures the similarity between two images according to a weighted subspace distance (WSD). When K images are available for each class, we develop two types of template subspaces to aggregate K-shot information: the prototypical subspace (PS) and the discriminative subspace (DS). Based on the SRL framework, we extend metric learning based techniques from vector to subspace representation. While most previous works adopted global vector representation, using subspace representation can effectively preserve the spatial structure, and diversity within an image. We demonstrate the effectiveness of the SRL framework on three public benchmark datasets: MiniImageNet, TieredImageNet and Caltech-UCSD Birds-200-2011 (CUB), and the experimental results illustrate competitive/superior performance of our method compared to the previous state-of-the-art. △ Less

Submitted 4 May, 2021; v1 submitted 1 May, 2021; originally announced May 2021.

arXiv:2104.01111 [pdf, other]

doi 10.1109/TPAMI.2021.3137605

A Comprehensive Survey of Scene Graphs: Generation and Application

Authors: Xiaojun Chang, Pengzhen Ren, Pengfei Xu, Zhihui Li, Xiaojiang Chen, Alex Hauptmann

Abstract: Scene graph is a structured representation of a scene that can clearly express the objects, attributes, and relationships between objects in the scene. As computer vision technology continues to develop, people are no longer satisfied with simply detecting and recognizing objects in images; instead, people look forward to a higher level of understanding and reasoning about visual scenes. For examp… ▽ More Scene graph is a structured representation of a scene that can clearly express the objects, attributes, and relationships between objects in the scene. As computer vision technology continues to develop, people are no longer satisfied with simply detecting and recognizing objects in images; instead, people look forward to a higher level of understanding and reasoning about visual scenes. For example, given an image, we want to not only detect and recognize objects in the image, but also know the relationship between objects (visual relationship detection), and generate a text description (image captioning) based on the image content. Alternatively, we might want the machine to tell us what the little girl in the image is doing (Visual Question Answering (VQA)), or even remove the dog from the image and find similar images (image editing and retrieval), etc. These tasks require a higher level of understanding and reasoning for image vision tasks. The scene graph is just such a powerful tool for scene understanding. Therefore, scene graphs have attracted the attention of a large number of researchers, and related research is often cross-modal, complex, and rapidly developing. However, no relatively systematic survey of scene graphs exists at present. To this end, this survey conducts a comprehensive investigation of the current scene graph research. More specifically, we first summarized the general definition of the scene graph, then conducted a comprehensive and systematic discussion on the generation method of the scene graph (SGG) and the SGG with the aid of prior knowledge. We then investigated the main applications of scene graphs and summarized the most commonly used datasets. Finally, we provide some insights into the future development of scene graphs. We believe this will be a very helpful foundation for future research on scene graphs. △ Less

Submitted 6 January, 2022; v1 submitted 17 March, 2021; originally announced April 2021.

Comments: 25 pages

Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021

arXiv:2103.15138 [pdf, other]

Graph Convolutional Networks for Model-Based Learning in Nonlinear Inverse Problems

Authors: William Herzberg, Daniel B. Rowe, Andreas Hauptmann, Sarah J. Hamilton

Abstract: The majority of model-based learned image reconstruction methods in medical imaging have been limited to uniform domains, such as pixelated images. If the underlying model is solved on nonuniform meshes, arising from a finite element method typical for nonlinear inverse problems, interpolation and embeddings are needed. To overcome this, we present a flexible framework to extend model-based learni… ▽ More The majority of model-based learned image reconstruction methods in medical imaging have been limited to uniform domains, such as pixelated images. If the underlying model is solved on nonuniform meshes, arising from a finite element method typical for nonlinear inverse problems, interpolation and embeddings are needed. To overcome this, we present a flexible framework to extend model-based learning directly to nonuniform meshes, by interpreting the mesh as a graph and formulating our network architectures using graph convolutional neural networks. This gives rise to the proposed iterative Graph Convolutional Newton-type Method (GCNM), which includes the forward model in the solution of the inverse problem, while all updates are directly computed by the network on the problem specific mesh. We present results for Electrical Impedance Tomography, a severely ill-posed nonlinear inverse problem that is frequently solved via optimization-based methods, where the forward problem is solved by finite element methods. Results for absolute EIT imaging are compared to standard iterative methods as well as a graph residual network. We show that the GCNM has strong generalizability to different domain shapes and meshes, out of distribution data as well as experimental data, from purely simulated training data and without transfer training. △ Less

Submitted 8 July, 2021; v1 submitted 28 March, 2021; originally announced March 2021.

Comments: 9 figures, 5 tables

arXiv:2103.08849 [pdf, other]

Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models

Authors: Po-Yao Huang, Mandela Patrick, Junjie Hu, Graham Neubig, Florian Metze, Alexander Hauptmann

Abstract: This paper studies zero-shot cross-lingual transfer of vision-language models. Specifically, we focus on multilingual text-to-video search and propose a Transformer-based model that learns contextualized multilingual multimodal embeddings. Under a zero-shot setting, we empirically demonstrate that performance degrades significantly when we query the multilingual text-video model with non-English s… ▽ More This paper studies zero-shot cross-lingual transfer of vision-language models. Specifically, we focus on multilingual text-to-video search and propose a Transformer-based model that learns contextualized multilingual multimodal embeddings. Under a zero-shot setting, we empirically demonstrate that performance degrades significantly when we query the multilingual text-video model with non-English sentences. To address this problem, we introduce a multilingual multimodal pre-training strategy, and collect a new multilingual instructional video dataset (MultiHowTo100M) for pre-training. Experiments on VTT show that our method significantly improves video search in non-English languages without additional annotations. Furthermore, when multilingual annotations are available, our method outperforms recent baselines by a large margin in multilingual text-to-video search on VTT and VATEX; as well as in multilingual text-to-image search on Multi30K. Our model and Multi-HowTo100M is available at http://github.com/berniebear/Multi-HT100M. △ Less

Submitted 14 April, 2021; v1 submitted 16 March, 2021; originally announced March 2021.

Comments: accepted by NAACL 2021

arXiv:2102.10033 [pdf, other]

Pose Guided Person Image Generation with Hidden p-Norm Regression

Authors: Ting-Yao Hu, Alexander G. Hauptmann

Abstract: In this paper, we propose a novel approach to solve the pose guided person image generation task. We assume that the relation between pose and appearance information can be described by a simple matrix operation in hidden space. Based on this assumption, our method estimates a pose-invariant feature matrix for each identity, and uses it to predict the target appearance conditioned on the target po… ▽ More In this paper, we propose a novel approach to solve the pose guided person image generation task. We assume that the relation between pose and appearance information can be described by a simple matrix operation in hidden space. Based on this assumption, our method estimates a pose-invariant feature matrix for each identity, and uses it to predict the target appearance conditioned on the target pose. The estimation process is formulated as a p-norm regression problem in hidden space. By utilizing the differentiation of the solution of this regression problem, the parameters of the whole framework can be trained in an end-to-end manner. While most previous works are only applicable to the supervised training and single-shot generation scenario, our method can be easily adapted to unsupervised training and multi-shot generation. Extensive experiments on the challenging Market-1501 dataset show that our method yields competitive performance in all the aforementioned variant scenarios. △ Less

Submitted 19 February, 2021; originally announced February 2021.

Journal ref: ICIP 2021

arXiv:2012.07676 [pdf, other]

doi 10.1109/LSP.2021.3063622

An efficient Quasi-Newton method for nonlinear inverse problems via learned singular values

Authors: Danny Smyl, Tyler N. Tallman, Dong Liu, Andreas Hauptmann

Abstract: Solving complex optimization problems in engineering and the physical sciences requires repetitive computation of multi-dimensional function derivatives. Commonly, this requires computationally-demanding numerical differentiation such as perturbation techniques, which ultimately limits the use for time-sensitive applications. In particular, in nonlinear inverse problems Gauss-Newton methods are us… ▽ More Solving complex optimization problems in engineering and the physical sciences requires repetitive computation of multi-dimensional function derivatives. Commonly, this requires computationally-demanding numerical differentiation such as perturbation techniques, which ultimately limits the use for time-sensitive applications. In particular, in nonlinear inverse problems Gauss-Newton methods are used that require iterative updates to be computed from the Jacobian. Computationally more efficient alternatives are Quasi-Newton methods, where the repeated computation of the Jacobian is replaced by an approximate update. Here we present a highly efficient data-driven Quasi-Newton method applicable to nonlinear inverse problems. We achieve this, by using the singular value decomposition and learning a mapping from model outputs to the singular values to compute the updated Jacobian. This enables a speed-up expected of Quasi-Newton methods without accumulating roundoff errors, enabling time-critical applications and allowing for flexible incorporation of prior knowledge necessary to solve ill-posed problems. We present results for the highly non-linear inverse problem of electrical impedance tomography with experimental data. △ Less

Submitted 1 March, 2021; v1 submitted 14 December, 2020; originally announced December 2020.

arXiv:2012.05303 [pdf]

Machine Learning in Magnetic Resonance Imaging: Image Reconstruction

Authors: Javier Montalt-Tordera, Vivek Muthurangu, Andreas Hauptmann, Jennifer Anne Steeden

Abstract: Magnetic Resonance Imaging (MRI) plays a vital role in diagnosis, management and monitoring of many diseases. However, it is an inherently slow imaging technique. Over the last 20 years, parallel imaging, temporal encoding and compressed sensing have enabled substantial speed-ups in the acquisition of MRI data, by accurately recovering missing lines of k-space data. However, clinical uptake of vas… ▽ More Magnetic Resonance Imaging (MRI) plays a vital role in diagnosis, management and monitoring of many diseases. However, it is an inherently slow imaging technique. Over the last 20 years, parallel imaging, temporal encoding and compressed sensing have enabled substantial speed-ups in the acquisition of MRI data, by accurately recovering missing lines of k-space data. However, clinical uptake of vastly accelerated acquisitions has been limited, in particular in compressed sensing, due to the time-consuming nature of the reconstructions and unnatural looking images. Following the success of machine learning in a wide range of imaging tasks, there has been a recent explosion in the use of machine learning in the field of MRI image reconstruction. A wide range of approaches have been proposed, which can be applied in k-space and/or image-space. Promising results have been demonstrated from a range of methods, enabling natural looking images and rapid computation. In this review article we summarize the current machine learning approaches used in MRI reconstruction, discuss their drawbacks, clinical applications, and current trends. △ Less

Submitted 9 December, 2020; originally announced December 2020.

Comments: 34 pages, 3 figures, 1 table. review article

arXiv:2012.02426 [pdf, other]

Spatial-Temporal Alignment Network for Action Recognition and Detection

Authors: Junwei Liang, Liangliang Cao, Xuehan Xiong, Ting Yu, Alexander Hauptmann

Abstract: This paper studies how to introduce viewpoint-invariant feature representations that can help action recognition and detection. Although we have witnessed great progress of action recognition in the past decade, it remains challenging yet interesting how to efficiently model the geometric variations in large scale datasets. This paper proposes a novel Spatial-Temporal Alignment Network (STAN) that… ▽ More This paper studies how to introduce viewpoint-invariant feature representations that can help action recognition and detection. Although we have witnessed great progress of action recognition in the past decade, it remains challenging yet interesting how to efficiently model the geometric variations in large scale datasets. This paper proposes a novel Spatial-Temporal Alignment Network (STAN) that aims to learn geometric invariant representations for action recognition and action detection. The STAN model is very light-weighted and generic, which could be plugged into existing action recognition models like ResNet3D and the SlowFast with a very low extra computational cost. We test our STAN model extensively on AVA, Kinetics-400, AVA-Kinetics, Charades, and Charades-Ego datasets. The experimental results show that the STAN model can consistently improve the state of the arts in both action detection and action recognition tasks. We will release our data, models and code. △ Less

Submitted 4 December, 2020; originally announced December 2020.

arXiv:2011.08413 [pdf, other]

Quantifying Sources of Uncertainty in Deep Learning-Based Image Reconstruction

Authors: Riccardo Barbano, Željko Kereta, Chen Zhang, Andreas Hauptmann, Simon Arridge, Bangti Jin

Abstract: Image reconstruction methods based on deep neural networks have shown outstanding performance, equalling or exceeding the state-of-the-art results of conventional approaches, but often do not provide uncertainty information about the reconstruction. In this work we propose a scalable and efficient framework to simultaneously quantify aleatoric and epistemic uncertainties in learned iterative image… ▽ More Image reconstruction methods based on deep neural networks have shown outstanding performance, equalling or exceeding the state-of-the-art results of conventional approaches, but often do not provide uncertainty information about the reconstruction. In this work we propose a scalable and efficient framework to simultaneously quantify aleatoric and epistemic uncertainties in learned iterative image reconstruction. We build on a Bayesian deep gradient descent method for quantifying epistemic uncertainty, and incorporate the heteroscedastic variance of the noise to account for the aleatoric uncertainty. We show that our method exhibits competitive performance against conventional benchmarks for computed tomography with both sparse view and limited angle data. The estimated uncertainty captures the variability in the reconstructions, caused by the restricted measurement model, and by missing information, due to the limited angle geometry. △ Less

Submitted 29 November, 2020; v1 submitted 16 November, 2020; originally announced November 2020.

Journal ref: NeurIPS 2020 Workshop on Deep Learning and Inverse Problems

arXiv:2011.00681 [pdf, other]

Event-Related Bias Removal for Real-time Disaster Events

Authors: Evangelia Spiliopoulou, Salvador Medina Maza, Eduard Hovy, Alexander Hauptmann

Abstract: Social media has become an important tool to share information about crisis events such as natural disasters and mass attacks. Detecting actionable posts that contain useful information requires rapid analysis of huge volume of data in real-time. This poses a complex problem due to the large amount of posts that do not contain any actionable information. Furthermore, the classification of informat… ▽ More Social media has become an important tool to share information about crisis events such as natural disasters and mass attacks. Detecting actionable posts that contain useful information requires rapid analysis of huge volume of data in real-time. This poses a complex problem due to the large amount of posts that do not contain any actionable information. Furthermore, the classification of information in real-time systems requires training on out-of-domain data, as we do not have any data from a new emerging crisis. Prior work focuses on models pre-trained on similar event types. However, those models capture unnecessary event-specific biases, like the location of the event, which affect the generalizability and performance of the classifiers on new unseen data from an emerging new event. In our work, we train an adversarial neural model to remove latent event-specific biases and improve the performance on tweet importance classification. △ Less

Submitted 1 November, 2020; originally announced November 2020.

Comments: To appear in EMNLP Findings 2020

arXiv:2011.00147 [pdf, other]

Pixel-Level Cycle Association: A New Perspective for Domain Adaptive Semantic Segmentation

Authors: Guoliang Kang, Yunchao Wei, Yi Yang, Yueting Zhuang, Alexander G. Hauptmann

Abstract: Domain adaptive semantic segmentation aims to train a model performing satisfactory pixel-level predictions on the target with only out-of-domain (source) annotations. The conventional solution to this task is to minimize the discrepancy between source and target to enable effective knowledge transfer. Previous domain discrepancy minimization methods are mainly based on the adversarial training. T… ▽ More Domain adaptive semantic segmentation aims to train a model performing satisfactory pixel-level predictions on the target with only out-of-domain (source) annotations. The conventional solution to this task is to minimize the discrepancy between source and target to enable effective knowledge transfer. Previous domain discrepancy minimization methods are mainly based on the adversarial training. They tend to consider the domain discrepancy globally, which ignore the pixel-wise relationships and are less discriminative. In this paper, we propose to build the pixel-level cycle association between source and target pixel pairs and contrastively strengthen their connections to diminish the domain gap and make the features more discriminative. To the best of our knowledge, this is a new perspective for tackling such a challenging task. Experiment results on two representative domain adaptation benchmarks, i.e. GTAV $\rightarrow$ Cityscapes and SYNTHIA $\rightarrow$ Cityscapes, verify the effectiveness of our proposed method and demonstrate that our method performs favorably against previous state-of-the-arts. Our method can be trained end-to-end in one stage and introduces no additional parameters, which is expected to serve as a general framework and help ease future research in domain adaptive semantic segmentation. Code is available at https://github.com/kgl-prml/Pixel- Level-Cycle-Association. △ Less

Submitted 30 October, 2020; originally announced November 2020.

Comments: Accepted by NeurIPS 2020 (oral). Code: https://github.com/kgl-prml/Pixel- Level-Cycle-Association

arXiv:2010.06847 [pdf, other]

doi 10.1098/rsta.2020.0194

Fusing electrical and elasticity imaging

Authors: Andreas Hauptmann, Danny Smyl

Abstract: Electrical and elasticity imaging are promising modalities for a suite of different applications including medical tomography, non-destructive testing, and structural health monitoring. These emerging modalities are capable of providing remote, non-invasive, and low cost opportunities. Unfortunately, both modalities are severely ill-posed nonlinear inverse problems, susceptive to noise and modelli… ▽ More Electrical and elasticity imaging are promising modalities for a suite of different applications including medical tomography, non-destructive testing, and structural health monitoring. These emerging modalities are capable of providing remote, non-invasive, and low cost opportunities. Unfortunately, both modalities are severely ill-posed nonlinear inverse problems, susceptive to noise and modelling errors. Nevertheless, the ability to incorporate complimentary data sets obtained simultaneously offers mutually-beneficial information. By fusing electrical and elastic modalities as a joint problem we are afforded the possibility to stabilise the inversion process via the utilisation of auxiliary information from both modalities as well as joint structural operators. In this study, we will discuss a possible approach to combine electrical and elasticity imaging in a joint reconstruction problem giving rise to novel multi-modality applications for use in both medical and structural engineering. △ Less

Submitted 17 May, 2021; v1 submitted 14 October, 2020; originally announced October 2020.

Journal ref: Philosophical Transactions of the Royal Society A, 379(2200), 20200194 (2021)

arXiv:2010.02824 [pdf, other]

Support-set bottlenecks for video-text representation learning

Authors: Mandela Patrick, Po-Yao Huang, Yuki Asano, Florian Metze, Alexander Hauptmann, João Henriques, Andrea Vedaldi

Abstract: The dominant paradigm for learning video-text representations -- noise contrastive learning -- increases the similarity of the representations of pairs of samples that are known to be related, such as text and video from the same sample, and pushes away the representations of all other pairs. We posit that this last behaviour is too strict, enforcing dissimilar representations even for samples tha… ▽ More The dominant paradigm for learning video-text representations -- noise contrastive learning -- increases the similarity of the representations of pairs of samples that are known to be related, such as text and video from the same sample, and pushes away the representations of all other pairs. We posit that this last behaviour is too strict, enforcing dissimilar representations even for samples that are semantically-related -- for example, visually similar videos or ones that share the same depicted action. In this paper, we propose a novel method that alleviates this by leveraging a generative model to naturally push these related samples together: each sample's caption must be reconstructed as a weighted combination of other support samples' visual representations. This simple idea ensures that representations are not overly-specialized to individual samples, are reusable across the dataset, and results in representations that explicitly encode semantics shared between samples, unlike noise contrastive learning. Our proposed method outperforms others by a large margin on MSR-VTT, VATEX and ActivityNet, and MSVD for video-to-text and text-to-video retrieval. △ Less

Submitted 14 January, 2021; v1 submitted 6 October, 2020; originally announced October 2020.

Comments: Accepted as spotlight paper at the International Conference on Learning Representations (ICLR) 2021

arXiv:2009.07608 [pdf, other]

doi 10.1117/1.JBO.25.11.112903

Deep Learning in Photoacoustic Tomography: Current approaches and future directions

Authors: Andreas Hauptmann, Ben Cox

Abstract: Biomedical photoacoustic tomography, which can provide high resolution 3D soft tissue images based on the optical absorption, has advanced to the stage at which translation from the laboratory to clinical settings is becoming possible. The need for rapid image formation and the practical restrictions on data acquisition that arise from the constraints of a clinical workflow are presenting new imag… ▽ More Biomedical photoacoustic tomography, which can provide high resolution 3D soft tissue images based on the optical absorption, has advanced to the stage at which translation from the laboratory to clinical settings is becoming possible. The need for rapid image formation and the practical restrictions on data acquisition that arise from the constraints of a clinical workflow are presenting new image reconstruction challenges. There are many classical approaches to image reconstruction, but ameliorating the effects of incomplete or imperfect data through the incorporation of accurate priors is challenging and leads to slow algorithms. Recently, the application of Deep Learning, or deep neural networks, to this problem has received a great deal of attention. This paper reviews the literature on learned image reconstruction, summarising the current trends, and explains how these new approaches fit within, and to some extent have arisen from, a framework that encompasses classical reconstruction methods. In particular, it shows how these new techniques can be understood from a Bayesian perspective, providing useful insights. The paper also provides a concise tutorial demonstration of three prototypical approaches to learned image reconstruction. The code and data sets for these demonstrations are available to researchers. It is anticipated that it is in in vivo applications - where data may be sparse, fast imaging critical and priors difficult to construct by hand - that Deep Learning will have the most impact. With this in mind, the paper concludes with some indications of possible future research directions. △ Less

Submitted 16 September, 2020; originally announced September 2020.

Journal ref: J. of Biomedical Optics, 25(11), 112903 (2020)

arXiv:2008.04722 [pdf, other]

Robust Long-Term Object Tracking via Improved Discriminative Model Prediction

Authors: Seokeon Choi, Junhyun Lee, Yunsung Lee, Alexander Hauptmann

Abstract: We propose an improved discriminative model prediction method for robust long-term tracking based on a pre-trained short-term tracker. The baseline pre-trained short-term tracker is SuperDiMP which combines the bounding-box regressor of PrDiMP with the standard DiMP classifier. Our tracker RLT-DiMP improves SuperDiMP in the following three aspects: (1) Uncertainty reduction using random erasing: T… ▽ More We propose an improved discriminative model prediction method for robust long-term tracking based on a pre-trained short-term tracker. The baseline pre-trained short-term tracker is SuperDiMP which combines the bounding-box regressor of PrDiMP with the standard DiMP classifier. Our tracker RLT-DiMP improves SuperDiMP in the following three aspects: (1) Uncertainty reduction using random erasing: To make our model robust, we exploit an agreement from multiple images after erasing random small rectangular areas as a certainty. And then, we correct the tracking state of our model accordingly. (2) Random search with spatio-temporal constraints: we propose a robust random search method with a score penalty applied to prevent the problem of sudden detection at a distance. (3) Background augmentation for more discriminative feature learning: We augment various backgrounds that are not included in the search area to train a more robust model in the background clutter. In experiments on the VOT-LT2020 benchmark dataset, the proposed method achieves comparable performance to the state-of-the-art long-term trackers. The source code is available at: https://github.com/bismex/RLT-DIMP. △ Less

Submitted 25 August, 2020; v1 submitted 11 August, 2020; originally announced August 2020.

Comments: Accepted to ECCV 2020 Workshop

arXiv:2007.15683 [pdf, other]

From A Glance to "Gotcha": Interactive Facial Image Retrieval with Progressive Relevance Feedback

Authors: Xinru Yang, Haozhi Qi, Mingyang Li, Alexander Hauptmann

Abstract: Facial image retrieval plays a significant role in forensic investigations where an untrained witness tries to identify a suspect from a massive pool of images. However, due to the difficulties in describing human facial appearances verbally and directly, people naturally tend to depict by referring to well-known existing images and comparing specific areas of faces with them and it is also challe… ▽ More Facial image retrieval plays a significant role in forensic investigations where an untrained witness tries to identify a suspect from a massive pool of images. However, due to the difficulties in describing human facial appearances verbally and directly, people naturally tend to depict by referring to well-known existing images and comparing specific areas of faces with them and it is also challenging to provide complete comparison at each time. Therefore, we propose an end-to-end framework to retrieve facial images with relevance feedback progressively provided by the witness, enabling an exploitation of history information during multiple rounds and an interactive and iterative approach to retrieving the mental image. With no need of any extra annotations, our model can be applied at the cost of a little response effort. We experiment on \texttt{CelebA} and evaluate the performance by ranking percentile and achieve 99\% under the best setting. Since this topic remains little explored to the best of our knowledge, we hope our work can serve as a stepping stone for further research. △ Less

Submitted 30 July, 2020; originally announced July 2020.

Journal ref: The SIGIR 2020 Workshop on Applied Interactive Information Systems

arXiv:2007.14745 [pdf, other]

On the unreasonable effectiveness of CNNs

Authors: Andreas Hauptmann, Jonas Adler

Abstract: Deep learning methods using convolutional neural networks (CNN) have been successfully applied to virtually all imaging problems, and particularly in image reconstruction tasks with ill-posed and complicated imaging models. In an attempt to put upper bounds on the capability of baseline CNNs for solving image-to-image problems we applied a widely used standard off-the-shelf network architecture (U… ▽ More Deep learning methods using convolutional neural networks (CNN) have been successfully applied to virtually all imaging problems, and particularly in image reconstruction tasks with ill-posed and complicated imaging models. In an attempt to put upper bounds on the capability of baseline CNNs for solving image-to-image problems we applied a widely used standard off-the-shelf network architecture (U-Net) to the "inverse problem" of XOR decryption from noisy data and show acceptable results. △ Less

Submitted 29 July, 2020; originally announced July 2020.

Showing 1–50 of 115 results for author: Hauptmann, A