MATE: Meet At The Embedding -- Connecting Images with Long Texts

Authors: Young Kyun Jang, Junmo Kang, Yong Jae Lee, Donghyun Kim

Abstract: While advancements in Vision Language Models (VLMs) have significantly improved the alignment of visual and textual data, these models primarily focus on aligning images with short descriptive captions. This focus limits their ability to handle complex text interactions, particularly with longer texts such as lengthy captions or documents, which have not been extensively explored yet. In this paper, we introduce Meet At The Embedding (MATE), a novel approach that combines the capabilities of VLMs with Large Language Models (LLMs) to overcome this challenge without the need for additional image-long text pairs. Specifically, we replace the text encoder of the VLM with a pretrained LLM-based encoder that excels in understanding long texts. To bridge the gap between VLM and LLM, MATE incorporates a projection module that is trained in a multi-stage manner. It starts by aligning the embeddings from the VLM text encoder with those from the LLM using extensive text pairs. This module is then employed to seamlessly align image embeddings closely with LLM embeddings. We propose two new cross-modal retrieval benchmarks to assess the task of connecting images with long texts (lengthy captions / documents). Extensive experimental results demonstrate that MATE effectively connects images with long texts, uncovering diverse semantic relationships.

Submitted 26 June, 2024; originally announced July 2024.

arXiv:2407.03563 [pdf, other]

Learning Video Temporal Dynamics with Cross-Modal Attention for Robust Audio-Visual Speech Recognition

Authors: Sungnyun Kim, Kangwook Jang, Sangmin Bae, Hoirin Kim, Se-Young Yun

Abstract: Audio-visual speech recognition (AVSR) aims to transcribe human speech using both audio and video modalities. In practical environments with noise-corrupted audio, the role of video information becomes crucial. However, prior works have primarily focused on enhancing audio features in AVSR, overlooking the importance of video features. In this study, we strengthen the video features by learning three temporal dynamics in video data: context order, playback direction, and the speed of video frames. Cross-modal attention modules are introduced to enrich video features with audio information so that speech variability can be taken into account when training on the video temporal dynamics. Based on our approach, we achieve the state-of-the-art performance on the LRS2 and LRS3 AVSR benchmarks for the noise-dominant settings. Our approach excels in scenarios especially for babble and speech noise, indicating the ability to distinguish the speech signal that should be recognized from lip movements in the video modality. We support the validity of our methodology by offering the ablation experiments for the temporal dynamics losses and the cross-modal attention architecture design.

Submitted 3 July, 2024; originally announced July 2024.

arXiv:2406.16716 [pdf, other]

One-Class Learning with Adaptive Centroid Shift for Audio Deepfake Detection

Authors: Hyun Myung Kim, Kangwook Jang, Hoirin Kim

Abstract: As speech synthesis systems continue to make remarkable advances in recent years, the importance of robust deepfake detection systems that perform well in unseen systems has grown. In this paper, we propose a novel adaptive centroid shift (ACS) method that updates the centroid representation by continually shifting as the weighted average of bonafide representations. Our approach uses only bonafide samples to define their centroid, which can yield a specialized centroid for one-class learning. Integrating our ACS with one-class learning gathers bonafide representations into a single cluster, forming well-separated embeddings robust to unseen spoofing attacks. Our proposed method achieves an equal error rate (EER) of 2.19% on the ASVspoof 2021 deepfake dataset, outperforming all existing systems. Furthermore, the t-SNE visualization illustrates that our method effectively maps the bonafide embeddings into a single cluster and successfully disentangles the bonafide and spoof classes.

Submitted 24 June, 2024; originally announced June 2024.

Comments: Accepted by Interspeech 2024

arXiv:2406.01192 [pdf, other]

Sparsity-Agnostic Linear Bandits with Adaptive Adversaries

Authors: Tianyuan Jin, Kyoungseok Jang, Nicolò Cesa-Bianchi

Abstract: We study stochastic linear bandits where, in each round, the learner receives a set of actions (i.e., feature vectors), from which it chooses an element and obtains a stochastic reward. The expected reward is a fixed but unknown linear function of the chosen action. We study sparse regret bounds, that depend on the number $S$ of non-zero coefficients in the linear reward function. Previous works focused on the case where $S$ is known, or the action sets satisfy additional assumptions. In this work, we obtain the first sparse regret bounds that hold when $S$ is unknown and the action sets are adversarially generated. Our techniques combine online to confidence set conversions with a novel randomized model selection approach over a hierarchy of nested confidence sets. When $S$ is known, our analysis recovers state-of-the-art bounds for adversarial action sets. We also show that a variant of our approach, using Exp3 to dynamically select the confidence sets, can be used to improve the empirical performance of stochastic linear bandits while enjoying a regret bound with optimal dependence on the time horizon.

Submitted 3 June, 2024; originally announced June 2024.

Comments: 25 pages

arXiv:2406.00014 [pdf, other]

KU-DMIS at EHRSQL 2024:Generating SQL query via question templatization in EHR

Authors: Hajung Kim, Chanhwi Kim, Hoonick Lee, Kyochul Jang, Jiwoo Lee, Kyungjae Lee, Gangwoo Kim, Jaewoo Kang

Abstract: Transforming natural language questions into SQL queries is crucial for precise data retrieval from electronic health record (EHR) databases. A significant challenge in this process is detecting and rejecting unanswerable questions that request information beyond the database's scope or exceed the system's capabilities. In this paper, we introduce a novel text-to-SQL framework that robustly handles out-of-domain questions and verifies the generated queries with query execution. Our framework begins by standardizing the structure of questions into a templated format. We use a powerful large language model (LLM), fine-tuned GPT-3.5 with detailed prompts involving the table schemas of the EHR database system. Our experimental results demonstrate the effectiveness of our framework on the EHRSQL-2024 benchmark, a shared task in the ClinicalNLP workshop. Although a straightforward fine-tuning of GPT shows promising results on the development set, it struggled with the out-of-domain questions in the test set. With our framework, we improve our system's adaptability and achieve competitive performances in the official leaderboard of the EHRSQL-2024 challenge.

Submitted 19 June, 2024; v1 submitted 21 May, 2024; originally announced June 2024.

Comments: Published at ClinicalNLP workshop @ NAACL 2024

arXiv:2405.14726 [pdf, other]

Distilling Vision-Language Pretraining for Efficient Cross-Modal Retrieval

Authors: Young Kyun Jang, Donghyun Kim, Ser-nam Lim

Abstract: "Learning to hash" is a practical solution for efficient retrieval, offering fast search speed and low storage cost. It is widely applied in various applications, such as image-text cross-modal search. In this paper, we explore the potential of enhancing the performance of learning to hash with the proliferation of powerful large pre-trained models, such as Vision-Language Pre-training (VLP) models. We introduce a novel method named Distillation for Cross-Modal Quantization (DCMQ), which leverages the rich semantic knowledge of VLP models to improve hash representation learning. Specifically, we use the VLP as a 'teacher' to distill knowledge into a 'student' hashing model equipped with codebooks. This process involves the replacement of supervised labels, which are composed of multi-hot vectors and lack semantics, with the rich semantics of VLP. In the end, we apply a transformation termed Normalization with Paired Consistency (NPC) to achieve a discriminative target for distillation. Further, we introduce a new quantization method, Product Quantization with Gumbel (PQG) that promotes balanced codebook learning, thereby improving the retrieval performance. Extensive benchmark testing demonstrates that DCMQ consistently outperforms existing supervised cross-modal hashing approaches, showcasing its significant potential.

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2405.14715 [pdf, other]

Towards Cross-modal Backward-compatible Representation Learning for Vision-Language Models

Authors: Young Kyun Jang, Ser-nam Lim

Abstract: Modern retrieval systems often struggle with upgrading to new and more powerful models due to the incompatibility of embeddings between the old and new models. This necessitates a costly process known as backfilling, which involves re-computing the embeddings for a large number of data samples. In vision, Backward-compatible Training (BT) has been proposed to ensure that the new model aligns with the old model's embeddings. This paper extends the concept of vision-only BT to the field of cross-modal retrieval, marking the first attempt to address Cross-modal BT (XBT). Our goal is to achieve backward-compatibility between Vision-Language Pretraining (VLP) models, such as CLIP, for the cross-modal retrieval task. To address XBT challenges, we propose an efficient solution: a projection module that maps the new model's embeddings to those of the old model. This module, pretrained solely with text data, significantly reduces the number of image-text pairs required for XBT learning, and, once it is pretrained, it avoids using the old model during training. Furthermore, we utilize parameter-efficient training strategies that improve efficiency and preserve the off-the-shelf new model's knowledge by avoiding any modifications. Experimental results on cross-modal retrieval datasets demonstrate the effectiveness of XBT and its potential to enable backfill-free upgrades when a new VLP model emerges.

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2405.00571 [pdf, other]

Spherical Linear Interpolation and Text-Anchoring for Zero-shot Composed Image Retrieval

Authors: Young Kyun Jang, Dat Huynh, Ashish Shah, Wen-Kai Chen, Ser-Nam Lim

Abstract: Composed Image Retrieval (CIR) is a complex task that retrieves images using a query, which is configured with an image and a caption that describes desired modifications to that image. Supervised CIR approaches have shown strong performance, but their reliance on expensive manually-annotated datasets restricts their scalability and broader applicability. To address these issues, previous studies have proposed pseudo-word token-based Zero-Shot CIR (ZS-CIR) methods, which utilize a projection module to map images to word tokens. However, we conjecture that this approach has a downside: the projection module distorts the original image representation and confines the resulting composed embeddings to the text-side. In order to resolve this, we introduce a novel ZS-CIR method that uses Spherical Linear Interpolation (Slerp) to directly merge image and text representations by identifying an intermediate embedding of both. Furthermore, we introduce Text-Anchored-Tuning (TAT), a method that fine-tunes the image encoder while keeping the text encoder fixed. TAT closes the modality gap between images and text, making the Slerp process much more effective. Notably, the TAT method is not only efficient in terms of the scale of the training dataset and training time, but it also serves as an excellent initial checkpoint for training supervised CIR models, thereby highlighting its wider potential. The integration of the Slerp-based ZS-CIR with a TAT-tuned model enables our approach to deliver state-of-the-art retrieval performance across CIR benchmarks.

Submitted 1 May, 2024; originally announced May 2024.
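
Note: The Slerp step mentioned in the abstract above can be illustrated with the textbook formula for spherical linear interpolation between two unit vectors. The sketch below is only that general formula, not the authors' implementation; the function name `slerp`, the plain-list embeddings, and the mixing weight `alpha` are assumptions made for illustration.

```python
import math

def slerp(a, b, alpha):
    """Textbook spherical linear interpolation between two non-zero vectors.

    Both inputs are L2-normalized first. alpha=0 returns a, alpha=1 returns b.
    Hypothetical helper for illustration; not taken from the paper's code.
    """
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    a, b = normalize(a), normalize(b)
    # Clamp the cosine to [-1, 1] to guard against rounding error.
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    if dot > 1.0 - 1e-9:
        # Nearly parallel: fall back to normalized linear interpolation.
        return normalize([(1 - alpha) * x + alpha * y for x, y in zip(a, b)])
    if dot < -1.0 + 1e-9:
        # Antipodal vectors have no unique great-circle path.
        raise ValueError("slerp is undefined for antiparallel vectors")
    theta = math.acos(dot)
    s = math.sin(theta)
    wa = math.sin((1 - alpha) * theta) / s
    wb = math.sin(alpha * theta) / s
    return [wa * x + wb * y for x, y in zip(a, b)]

# Example: an "image" and a "text" embedding at right angles, merged halfway.
image_emb = [1.0, 0.0]
text_emb = [0.0, 1.0]
composed = slerp(image_emb, text_emb, 0.5)
```

Unlike a plain weighted average, the interpolated vector stays on the unit sphere, which matters when retrieval scores are cosine similarities.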

arXiv:2404.15516 [pdf, other]

Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval

Authors: Young Kyun Jang, Donghyun Kim, Zihang Meng, Dat Huynh, Ser-Nam Lim

Abstract: Composed Image Retrieval (CIR) is a task that retrieves images similar to a query, based on a provided textual modification. Current techniques rely on supervised learning for CIR models using labeled triplets of the reference image, text, target image. These specific triplets are not as commonly available as simple image-text pairs, limiting the widespread use of CIR and its scalability. On the other hand, zero-shot CIR can be relatively easily trained with image-caption pairs without considering the image-to-image relation, but this approach tends to yield lower accuracy. We propose a new semi-supervised CIR approach where we search for a reference and its related target images in auxiliary data and learn our large language model-based Visual Delta Generator (VDG) to generate text describing the visual difference (i.e., visual delta) between the two. VDG, equipped with fluent language knowledge and being model agnostic, can generate pseudo triplets to boost the performance of CIR models. Our approach significantly improves the existing supervised learning approaches and achieves state-of-the-art results on the CIR benchmarks.

Submitted 23 April, 2024; originally announced April 2024.

Comments: 15 pages

arXiv:2404.05726 [pdf, other]

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

Authors: Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, Ser-Nam Lim

Abstract: With the success of large language models (LLMs), integrating the vision model into LLMs to build vision-language foundation models has gained much more interest recently. However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoChat) can only take in a limited number of frames for short video understanding. In this study, we mainly focus on designing an efficient and effective model for long-term video understanding. Instead of trying to process more frames simultaneously like most existing work, we propose to process videos in an online manner and store past video information in a memory bank. This allows our model to reference historical video content for long-term analysis without exceeding LLMs' context length constraints or GPU memory limits. Our memory bank can be seamlessly integrated into current multimodal LLMs in an off-the-shelf manner. We conduct extensive experiments on various video understanding tasks, such as long-video understanding, video question answering, and video captioning, and our model can achieve state-of-the-art performances across multiple datasets. Code available at https://boheumd.github.io/MA-LMM/.

Submitted 24 April, 2024; v1 submitted 8 April, 2024; originally announced April 2024.

Comments: Accepted at CVPR 2024. Project Page https://boheumd.github.io/MA-LMM/

arXiv:2402.17050 [pdf, other]

Reinforcement Learning Based Oscillation Dampening: Scaling up Single-Agent RL algorithms to a 100 AV highway field operational test

Authors: Kathy Jang, Nathan Lichtlé, Eugene Vinitsky, Adit Shah, Matthew Bunting, Matthew Nice, Benedetto Piccoli, Benjamin Seibold, Daniel B. Work, Maria Laura Delle Monache, Jonathan Sprinkle, Jonathan W. Lee, Alexandre M. Bayen

Abstract: In this article, we explore the technical details of the reinforcement learning (RL) algorithms that were deployed in the largest field test of automated vehicles designed to smooth traffic flow in history as of 2023, uncovering the challenges and breakthroughs that come with developing RL controllers for automated vehicles. We delve into the fundamental concepts behind RL algorithms and their application in the context of self-driving cars, discussing the developmental process from simulation to deployment in detail, from designing simulators to reward function shaping. We present the results in both simulation and deployment, discussing the flow-smoothing benefits of the RL controller. From understanding the basics of Markov decision processes to exploring advanced techniques such as deep RL, our article offers a comprehensive overview and deep dive of the theoretical foundations and practical implementations driving this rapidly evolving field. We also showcase real-world case studies and alternative research projects that highlight the impact of RL controllers in revolutionizing autonomous driving. From tackling complex urban environments to dealing with unpredictable traffic scenarios, these intelligent controllers are pushing the boundaries of what automated vehicles can achieve. Furthermore, we examine the safety considerations and hardware-focused technical details surrounding deployment of RL controllers into automated vehicles. As these algorithms learn and evolve through interactions with the environment, ensuring their behavior aligns with safety standards becomes crucial. We explore the methodologies and frameworks being developed to address these challenges, emphasizing the importance of building reliable control systems for automated vehicles.

Submitted 14 May, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

arXiv:2402.11156 [pdf, other]

Efficient Low-Rank Matrix Estimation, Experimental Design, and Arm-Set-Dependent Low-Rank Bandits

Authors: Kyoungseok Jang, Chicheng Zhang, Kwang-Sung Jun

Abstract: We study low-rank matrix trace regression and the related problem of low-rank matrix bandits. Assuming access to the distribution of the covariates, we propose a novel low-rank matrix estimation method called LowPopArt and provide its recovery guarantee that depends on a novel quantity denoted by B(Q) that characterizes the hardness of the problem, where Q is the covariance matrix of the measurement distribution. We show that our method can provide tighter recovery guarantees than classical nuclear norm penalized least squares (Koltchinskii et al., 2011) in several problems. To perform efficient estimation with a limited number of measurements from an arbitrarily given measurement set A, we also propose a novel experimental design criterion that minimizes B(Q) with computational efficiency. We leverage our novel estimator and design of experiments to derive two low-rank linear bandit algorithms for general arm sets that enjoy improved regret upper bounds. This improves over previous works on low-rank bandits, which make somewhat restrictive assumptions that the arm set is the unit ball or that an efficient exploration distribution is given. To our knowledge, our experimental design criterion is the first one tailored to low-rank matrix estimation beyond the naive reduction to linear regression, which can be of independent interest.

Submitted 8 June, 2024; v1 submitted 16 February, 2024; originally announced February 2024.

arXiv:2402.10429 [pdf, ps, other]

Fixed Confidence Best Arm Identification in the Bayesian Setting

Authors: Kyoungseok Jang, Junpei Komiyama, Kazutoshi Yamazaki

Abstract: We consider the fixed-confidence best arm identification (FC-BAI) problem in the Bayesian setting. This problem aims to find the arm of the largest mean with a fixed confidence level when the bandit model has been sampled from the known prior. Most studies on the FC-BAI problem have been conducted in the frequentist setting, where the bandit model is predetermined before the game starts. We show that the traditional FC-BAI algorithms studied in the frequentist setting, such as track-and-stop and top-two algorithms, result in arbitrarily suboptimal performances in the Bayesian setting. We also obtain a lower bound of the expected number of samples in the Bayesian setting and introduce a variant of successive elimination that has a matching performance with the lower bound up to a logarithmic factor. Simulations verify the theoretical results.

Submitted 22 June, 2024; v1 submitted 15 February, 2024; originally announced February 2024.

arXiv:2402.09201 [pdf, ps, other]

Better-than-KL PAC-Bayes Bounds

Authors: Ilja Kuzborskij, Kwang-Sung Jun, Yulian Wu, Kyoungseok Jang, Francesco Orabona

Abstract: Let $f(\theta, X_1), \dots, f(\theta, X_n)$ be a sequence of random elements, where $f$ is a fixed scalar function, $X_1, \dots, X_n$ are independent random variables (data), and $\theta$ is a random parameter distributed according to some data-dependent posterior distribution $P_n$. In this paper, we consider the problem of proving concentration inequalities to estimate the mean of the sequence. An example of such a problem is the estimation of the generalization error of some predictor trained by a stochastic algorithm, such as a neural network where $f$ is a loss function. Classically, this problem is approached through a PAC-Bayes analysis where, in addition to the posterior, we choose a prior distribution which captures our belief about the inductive bias of the learning problem. Then, the key quantity in PAC-Bayes concentration bounds is a divergence that captures the complexity of the learning problem where the de facto standard choice is the KL divergence. However, the tightness of this choice has rarely been questioned. In this paper, we challenge the tightness of the KL-divergence-based bounds by showing that it is possible to achieve a strictly tighter bound. In particular, we demonstrate new high-probability PAC-Bayes bounds with a novel and better-than-KL divergence that is inspired by Zhang et al. (2022). Our proof is inspired by recent advances in regret analysis of gambling algorithms, and its use to derive concentration inequalities. Our result is first-of-its-kind in that existing PAC-Bayes bounds with non-KL divergences are not known to be strictly better than KL. Thus, we believe our work marks the first step towards identifying optimal rates of PAC-Bayes bounds.

Submitted 4 April, 2024; v1 submitted 14 February, 2024; originally announced February 2024.

arXiv:2401.09666 [pdf, other]

Traffic Smoothing Controllers for Autonomous Vehicles Using Deep Reinforcement Learning and Real-World Trajectory Data

Authors: Nathan Lichtlé, Kathy Jang, Adit Shah, Eugene Vinitsky, Jonathan W. Lee, Alexandre M. Bayen

Abstract: Designing traffic-smoothing cruise controllers that can be deployed onto autonomous vehicles is a key step towards improving traffic flow, reducing congestion, and enhancing fuel efficiency in mixed autonomy traffic. We bypass the common issue of having to carefully fine-tune a large traffic microsimulator by leveraging real-world trajectory data from the I-24 highway in Tennessee, replayed in a one-lane simulation. Using standard deep reinforcement learning methods, we train energy-reducing wave-smoothing policies. As an input to the agent, we observe the speed and distance of only the vehicle in front, which are local states readily available on most recent vehicles, as well as non-local observations about the downstream state of the traffic. We show that at a low 4% autonomous vehicle penetration rate, we achieve significant fuel savings of over 15% on trajectories exhibiting many stop-and-go waves. Finally, we analyze the smoothing effect of the controllers and demonstrate robustness to adding lane-changing into the simulation as well as the removal of downstream information.

Submitted 17 January, 2024; originally announced January 2024.

Comments: Accepted to be published as part of the 26th IEEE International Conference on Intelligent Transportation Systems (ITSC) 2023, Bilbao, Spain, September 24-28, 2023

arXiv:2312.11805 [pdf, other]

Gemini: A Family of Highly Capable Multimodal Models

Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee, et al. (1325 additional authors not shown)

Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.

Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

arXiv:2312.09040 [pdf, other]

STaR: Distilling Speech Temporal Relation for Lightweight Speech Self-Supervised Learning Models

Authors: Kangwook Jang, Sungnyun Kim, Hoirin Kim

Abstract: Despite the great performance of Transformer-based speech self-supervised learning (SSL) models, their large parameter size and computational cost make them unfavorable to utilize. In this study, we propose to compress the speech SSL models by distilling speech temporal relation (STaR). Unlike previous works that directly match the representation for each speech frame, STaR distillation transfers the temporal relation between speech frames, which is more suitable for a lightweight student with limited capacity. We explore three STaR distillation objectives and select the best combination as the final STaR loss. Our model distilled from HuBERT BASE achieves an overall score of 79.8 on the SUPERB benchmark, the best performance among models with up to 27 million parameters. We show that our method is applicable across different speech SSL models and maintains robust performance with further reduced parameters.

Submitted 25 April, 2024; v1 submitted 14 December, 2023; originally announced December 2023.

Comments: ICASSP 2024 Best Student Paper Awarded. Code URL: https://github.com/sungnyun/ARMHuBERT

arXiv:2312.03777 [pdf, other]

On the Robustness of Large Multimodal Models Against Image Adversarial Attacks

Authors: Xuanming Cui, Alejandro Aparcedo, Young Kyun Jang, Ser-Nam Lim

Abstract: Recent advances in instruction tuning have led to the development of State-of-the-Art Large Multimodal Models (LMMs). Given the novelty of these models, the impact of visual adversarial attacks on LMMs has not been thoroughly examined. We conduct a comprehensive study of the robustness of various LMMs against different adversarial attacks, evaluated across tasks including image classification, image captioning, and Visual Question Answering (VQA). We find that in general LMMs are not robust to visual adversarial inputs. However, our findings suggest that context provided to the model via prompts, such as questions in a QA pair, helps to mitigate the effects of visual adversarial inputs. Notably, the LMMs evaluated demonstrated remarkable resilience to such attacks on the ScienceQA task with only an 8.10% drop in performance compared to their visual counterparts which dropped 99.73%. We also propose a new approach to real-world image classification which we term query decomposition. By incorporating existence queries into our input prompt we observe diminished attack effectiveness and improvements in image classification accuracy. This research highlights a previously under-explored facet of LMM robustness and sets the stage for future work aimed at strengthening the resilience of multimodal systems in adversarial environments.

Submitted 8 December, 2023; v1 submitted 5 December, 2023; originally announced December 2023.

arXiv:2308.14815 [pdf, other]

Distributionally Robust Statistical Verification with Imprecise Neural Networks

Authors: Souradeep Dutta, Michele Caprio, Vivian Lin, Matthew Cleaveland, Kuk Jin Jang, Ivan Ruchkin, Oleg Sokolsky, Insup Lee

Abstract: A particularly challenging problem in AI safety is providing guarantees on the behavior of high-dimensional autonomous systems. Verification approaches centered around reachability analysis fail to scale, and purely statistical approaches are constrained by the distributional assumptions about the sampling process. Instead, we pose a distributionally robust version of the statistical verification problem for black-box systems, where our performance guarantees hold over a large family of distributions. This paper proposes a novel approach based on a combination of active learning, uncertainty quantification, and neural network verification. A central piece of our approach is an ensemble technique called Imprecise Neural Networks, which provides the uncertainty to guide active learning. The active learning uses an exhaustive neural-network verification tool, Sherlock, to collect samples. An evaluation on multiple physical simulators in the OpenAI Gym MuJoCo environments with reinforcement-learned controllers demonstrates that our approach can provide useful and scalable guarantees for high-dimensional systems.

Submitted 11 December, 2023; v1 submitted 28 August, 2023; originally announced August 2023.

arXiv:2307.06816 [pdf, other]

Data-driven Nonlinear Parametric Model Order Reduction Framework using Deep Hierarchical Variational Autoencoder

Authors: SiHun Lee, Sangmin Lee, Kijoo Jang, Haeseong Cho, SangJoon Shin

Abstract: A data-driven parametric model order reduction (MOR) method using a deep artificial neural network is proposed. The present network, which is the least-squares hierarchical variational autoencoder (LSH-VAE), is capable of performing nonlinear MOR for the parametric interpolation of a nonlinear dynamic system with a significant number of degrees of freedom. LSH-VAE exploits two major changes to the existing networks: a hierarchical deep structure and a hybrid weighted, probabilistic loss function. The enhancements result in significantly improved accuracy and stability compared against the conventional nonlinear MOR methods, autoencoder, and variational autoencoder. Upon LSH-VAE, a parametric MOR framework is presented based on the spherically linear interpolation of the latent manifold. The present framework is validated and evaluated on three nonlinear and multiphysics dynamic systems. First, the present framework is evaluated on the fluid-structure interaction benchmark problem to assess its efficiency and accuracy. Then, a highly nonlinear aeroelastic phenomenon, limit cycle oscillation, is analyzed. Finally, the present framework is applied to a three-dimensional fluid flow to demonstrate its capability of efficiently analyzing a significantly large number of degrees of freedom. The performance of LSH-VAE is emphasized by comparing its results against those of the widely used nonlinear MOR methods, convolutional autoencoder, and $β$-VAE. The present framework exhibits significantly enhanced accuracy compared to the conventional methods while still exhibiting a large speed-up factor.

Submitted 9 July, 2023; originally announced July 2023.

arXiv:2306.04662 [pdf, other]

Understanding Place Identity with Generative AI

Authors: Kee Moon Jang, Junda Chen, Yuhao Kang, Junghwan Kim, Jinhyung Lee, Fábio Duarte

Abstract: Researchers are constantly leveraging new forms of data with the goal of understanding how people perceive the built environment and build the collective place identity of cities. The latest advancements in generative artificial intelligence (AI) models have enabled the production of realistic representations learned from vast amounts of data. In this study, we aim to test the potential of generative AI as the source of textual and visual information in capturing the place identity of cities assessed by filtered descriptions and images. We asked questions on the place identity of a set of 31 global cities to two generative AI models, ChatGPT and DALL-E2. Since generative AI has raised ethical concerns regarding its trustworthiness, we performed cross-validation to examine whether the results show similar patterns to real urban settings. In particular, we compared the outputs with Wikipedia data for text and images searched from Google for images. Our results indicate that generative AI models have the potential to capture the collective image of cities that can make them distinguishable. This study is among the first attempts to explore the capabilities of generative AI in understanding human perceptions of the built environment. It contributes to urban design literature by discussing future research opportunities and potential limitations.

Submitted 6 June, 2023; originally announced June 2023.

Comments: 6 pages, 3 figures, GIScience 2023

arXiv:2305.11685 [pdf, other]

doi 10.21437/Interspeech.2023-1329

Recycle-and-Distill: Universal Compression Strategy for Transformer-based Speech SSL Models with Attention Map Reusing and Masking Distillation

Authors: Kangwook Jang, Sungnyun Kim, Se-Young Yun, Hoirin Kim

Abstract: Transformer-based speech self-supervised learning (SSL) models, such as HuBERT, show surprising performance in various speech processing tasks. However, the huge number of parameters in speech SSL models necessitates compression to a more compact model for wider usage in academia or small companies. In this study, we suggest reusing attention maps across the Transformer layers, so as to remove key and query parameters while retaining the number of layers. Furthermore, we propose a novel masking distillation strategy to improve the student model's speech representation quality. We extend the distillation loss to utilize both masked and unmasked speech frames to fully leverage the teacher model's high-quality representation. Our universal compression strategy yields a student model that achieves a phoneme error rate (PER) of 7.72% and a word error rate (WER) of 9.96% on the SUPERB benchmark.

Submitted 26 October, 2023; v1 submitted 19 May, 2023; originally announced May 2023.

Comments: Proceedings of Interspeech 2023. Code URL: https://github.com/sungnyun/ARMHuBERT

arXiv:2302.10341 [pdf, other]

DC4L: Distribution Shift Recovery via Data-Driven Control for Deep Learning Models

Authors: Vivian Lin, Kuk Jin Jang, Souradeep Dutta, Michele Caprio, Oleg Sokolsky, Insup Lee

Abstract: Deep neural networks have repeatedly been shown to be non-robust to the uncertainties of the real world, even to naturally occurring ones. A vast majority of current approaches have focused on data-augmentation methods to expand the range of perturbations that the classifier is exposed to while training. A relatively unexplored avenue that is equally promising involves sanitizing an image as a preprocessing step, depending on the nature of perturbation. In this paper, we propose to use control for learned models to recover from distribution shifts online. Specifically, our method applies a sequence of semantic-preserving transformations to bring the shifted data closer in distribution to the training set, as measured by the Wasserstein distance. Our approach is to 1) formulate the problem of distribution shift recovery as a Markov decision process, which we solve using reinforcement learning, 2) identify a minimum condition on the data for our method to be applied, which we check online using a binary classifier, and 3) employ dimensionality reduction through orthonormal projection to aid in our estimates of the Wasserstein distance. We provide theoretical evidence that orthonormal projection preserves characteristics of the data at the distributional level. We apply our distribution shift recovery approach to the ImageNet-C benchmark for distribution shifts, demonstrating an improvement in average accuracy of up to 14.21% across a variety of state-of-the-art ImageNet classifiers. We further show that our method generalizes to composites of shifts from the ImageNet-C benchmark, achieving improvements in average accuracy of up to 9.81%. Finally, we test our method on CIFAR-100-C and report improvements of up to 8.25%.

Submitted 15 May, 2024; v1 submitted 20 February, 2023; originally announced February 2023.

arXiv:2302.09656 [pdf, other]

Credal Bayesian Deep Learning

Authors: Michele Caprio, Souradeep Dutta, Kuk Jin Jang, Vivian Lin, Radoslav Ivanov, Oleg Sokolsky, Insup Lee

Abstract: Uncertainty quantification and robustness to distribution shifts are important goals in machine learning and artificial intelligence. Although Bayesian Neural Networks (BNNs) allow for uncertainty in the predictions to be assessed, different sources of uncertainty are indistinguishable. We present Credal Bayesian Deep Learning (CBDL). Heuristically, CBDL allows training an (uncountably) infinite ensemble of BNNs, using only finitely many elements. This is possible thanks to prior and likelihood finitely generated credal sets (FGCSs), a concept from the imprecise probability literature. Intuitively, convex combinations of a finite collection of prior-likelihood pairs are able to represent infinitely many such pairs. After training, CBDL outputs a set of posteriors on the parameters of the neural network. At inference time, this posterior set is used to derive a set of predictive distributions that is in turn utilized to distinguish between aleatoric and epistemic uncertainties, and to quantify them. The predictive set also produces either (i) a collection of outputs enjoying desirable probabilistic guarantees, or (ii) the single output that is deemed the best, that is, the one having the highest predictive lower probability -- another imprecise-probabilistic concept. CBDL is more robust than single BNNs to prior and likelihood misspecification, and to distribution shift. We show that CBDL is better at quantifying and disentangling different types of uncertainties than single BNNs, ensembles of BNNs, and Bayesian Model Averaging. In addition, we apply CBDL to two case studies to demonstrate its capabilities on downstream tasks: one, for motion prediction in autonomous driving scenarios, and two, to model blood glucose and insulin dynamics for artificial pancreas control. We show that CBDL performs better when compared to an ensemble of BNNs baseline.

Submitted 22 February, 2024; v1 submitted 19 February, 2023; originally announced February 2023.

MSC Class: Primary: 68T37; Secondary: 68T05; 68W25

arXiv:2302.05829 [pdf, other]

Tighter PAC-Bayes Bounds Through Coin-Betting

Authors: Kyoungseok Jang, Kwang-Sung Jun, Ilja Kuzborskij, Francesco Orabona

Abstract: We consider the problem of estimating the mean of a sequence of random elements $f(X_1, θ), \ldots, f(X_n, θ)$ where $f$ is a fixed scalar function, $S=(X_1, \ldots, X_n)$ are independent random variables, and $θ$ is a possibly $S$-dependent parameter. An example of such a problem would be to estimate the generalization error of a neural network trained on $n$ examples where $f$ is a loss function. Classically, this problem is approached through concentration inequalities holding uniformly over compact parameter sets of functions $f$, for example as in Rademacher or VC type analysis. However, in many problems, such inequalities often yield numerically vacuous estimates. Recently, the PAC-Bayes framework has been proposed as a better alternative for this class of problems for its ability to often give numerically non-vacuous bounds. In this paper, we show that we can do even better: we show how to refine the proof strategy of the PAC-Bayes bounds and achieve even tighter guarantees. Our approach is based on the coin-betting framework that derives the numerically tightest known time-uniform concentration inequalities from the regret guarantees of online gambling algorithms. In particular, we derive the first PAC-Bayes concentration inequality based on the coin-betting approach that holds simultaneously for all sample sizes. We demonstrate its tightness showing that by relaxing it we obtain a number of previous results in a closed form including Bernoulli-KL and empirical Bernstein inequalities. Finally, we propose an efficient algorithm to numerically calculate confidence sequences from our bound, which often generates nonvacuous confidence bounds even with one sample, unlike the state-of-the-art PAC-Bayes bounds.

Submitted 11 February, 2023; originally announced February 2023.

arXiv:2210.15345 [pdf, other]

PopArt: Efficient Sparse Regression and Experimental Design for Optimal Sparse Linear Bandits

Authors: Kyoungseok Jang, Chicheng Zhang, Kwang-Sung Jun

Abstract: In sparse linear bandits, a learning agent sequentially selects an action and receives reward feedback, and the reward function depends linearly on a few coordinates of the covariates of the actions. This has applications in many real-world sequential decision making problems. In this paper, we propose a simple and computationally efficient sparse linear estimation method called PopArt that enjoys a tighter $\ell_1$ recovery guarantee compared to Lasso (Tibshirani, 1996) in many problems. Our bound naturally motivates an experimental design criterion that is convex and thus computationally efficient to solve. Based on our novel estimator and design criterion, we derive sparse linear bandit algorithms that enjoy improved regret upper bounds upon the state of the art (Hao et al., 2020), especially w.r.t. the geometry of the given action set. Finally, we prove a matching lower bound for sparse linear bandits in the data-poor regime, which closes the gap between upper and lower bounds in prior work.

Submitted 17 November, 2023; v1 submitted 25 October, 2022; originally announced October 2022.

Comments: 10 pages, 1 figure, published in the 2022 Conference on Neural Information Processing Systems

arXiv:2207.00555 [pdf, other]

FitHuBERT: Going Thinner and Deeper for Knowledge Distillation of Speech Self-Supervised Learning

Authors: Yeonghyeon Lee, Kangwook Jang, Jahyun Goo, Youngmoon Jung, Hoirin Kim

Abstract: Large-scale speech self-supervised learning (SSL) has emerged as a main field of speech processing; however, the computational cost arising from its vast size creates a high entry barrier for academia. In addition, existing distillation techniques of speech SSL models compress the model by reducing layers, which induces performance degradation in linguistic pattern recognition tasks such as phoneme recognition (PR). In this paper, we propose FitHuBERT, which is thinner in dimension throughout almost all model components and deeper in layers compared to prior speech SSL distillation works. Moreover, we employ a time-reduction layer to speed up inference time and propose a method of hint-based distillation for less performance degradation. Our method reduces the model to 23.8% in size and 35.9% in inference time compared to HuBERT. Also, we achieve 12.1% word error rate and 13.3% phoneme error rate on the SUPERB benchmark, which is superior to prior work.

Submitted 1 July, 2022; originally announced July 2022.

Comments: Accepted to Interspeech 2022

arXiv:2112.08816 [pdf, other]

Deep Hash Distillation for Image Retrieval

Authors: Young Kyun Jang, Geonmo Gu, Byungsoo Ko, Isaac Kang, Nam Ik Cho

Abstract: In hash-based image retrieval systems, degraded or transformed inputs usually generate different codes from the original, deteriorating the retrieval accuracy. To mitigate this issue, data augmentation can be applied during training. However, even if augmented samples of an image are similar in real feature space, the quantization can scatter them far away in Hamming space. This results in representation discrepancies that can impede training and degrade performance. In this work, we propose a novel self-distilled hashing scheme to minimize the discrepancy while exploiting the potential of augmented data. By transferring the hash knowledge of the weakly-transformed samples to the strong ones, we make the hash code insensitive to various transformations. We also introduce hash proxy-based similarity learning and binary cross entropy-based quantization loss to provide fine quality hash codes. Ultimately, we construct a deep hashing framework that not only improves the existing deep hashing approaches, but also achieves the state-of-the-art retrieval results. Extensive experiments are conducted and confirm the effectiveness of our work.

Submitted 13 July, 2022; v1 submitted 16 December, 2021; originally announced December 2021.

Comments: ECCV2022

arXiv:2109.02244 [pdf, other]

Self-supervised Product Quantization for Deep Unsupervised Image Retrieval

Authors: Young Kyun Jang, Nam Ik Cho

Abstract: Supervised deep learning-based hash and vector quantization are enabling fast and large-scale image retrieval systems. By fully exploiting label annotations, they are achieving outstanding retrieval performances compared to the conventional methods. However, it is painstaking to assign labels precisely for a vast amount of training data, and also, the annotation process is error-prone. To tackle these issues, we propose the first deep unsupervised image retrieval method dubbed Self-supervised Product Quantization (SPQ) network, which is label-free and trained in a self-supervised manner. We design a Cross Quantized Contrastive learning strategy that jointly learns codewords and deep visual descriptors by comparing individually transformed images (views). Our method analyzes the image contents to extract descriptive features, allowing us to understand image representations for accurate retrieval. By conducting extensive experiments on benchmarks, we demonstrate that the proposed method yields state-of-the-art results even without supervised pretraining.

Submitted 12 January, 2022; v1 submitted 6 September, 2021; originally announced September 2021.

Comments: ICCV 2021

arXiv:2107.05025 [pdf, other]

Similarity Guided Deep Face Image Retrieval

Authors: Young Kyun Jang, Nam Ik Cho

Abstract: Face image retrieval, which searches for images of the same identity from the query input face image, is drawing more attention as the size of the image database increases rapidly. In order to conduct fast and accurate retrieval, compact hash code-based methods have been proposed, and recently, deep face image hashing methods with supervised classification training have shown outstanding performance. However, the classification-based scheme has a disadvantage in that it cannot incorporate complex similarities between face images into the hash code learning. In this paper, we attempt to improve the face image retrieval quality by proposing a Similarity Guided Hashing (SGH) method, which gently considers self and pairwise-similarity simultaneously. SGH employs various data augmentations designed to explore elaborate similarities between face images, solving both intra and inter identity-wise difficulties. Extensive experimental results on the protocols with existing benchmarks and an additionally proposed large scale higher resolution face image dataset demonstrate that our SGH delivers state-of-the-art retrieval performance.

Submitted 11 July, 2021; originally announced July 2021.

Comments: 10 pages, 9 figures

arXiv:2104.07198 [pdf, other]

Ultra-High Dimensional Sparse Representations with Binarization for Efficient Text Retrieval

Authors: Kyoung-Rok Jang, Junmo Kang, Giwon Hong, Sung-Hyon Myaeng, Joohee Park, Taewon Yoon, Heecheol Seo

Abstract: The semantic matching capabilities of neural information retrieval can ameliorate synonymy and polysemy problems of symbolic approaches. However, neural models' dense representations are more suitable for re-ranking, due to their inefficiency. Sparse representations, either in symbolic or latent form, are more efficient with an inverted index. Taking the merits of the sparse and dense representations, we propose an ultra-high dimensional (UHD) representation scheme equipped with directly controllable sparsity. UHD's large capacity and minimal noise and interference among the dimensions allow for binarized representations, which are highly efficient for storage and search. Also proposed is a bucketing method, where the embeddings from multiple layers of BERT are selected/merged to represent diverse linguistic aspects. We test our models with MS MARCO and TREC CAR, showing that our models outperform other sparse models.

Submitted 15 October, 2021; v1 submitted 14 April, 2021; originally announced April 2021.

Comments: To appear at EMNLP 2021

arXiv:2104.05554 [pdf]

On Analyzing Churn Prediction in Mobile Games

Authors: Kihoon Jang, Junwhan Kim, Byunggu Yu

Abstract: In subscription-based businesses, the churn rate refers to the percentage of customers who discontinue their subscriptions within a given time period. Particularly, in the mobile games industry, the churn rate is often pronounced due to the high competition and cost in customer acquisition; therefore, the process of minimizing the churn rate is crucial. This needs churn prediction, predicting users who will be churning within a given time period. Accurate churn prediction can enable the businesses to devise and engage strategic remediations to maintain a low churn rate. The paper presents our highly accurate churn prediction method. We designed this method to take into account each individual user's distinct usage period in churn prediction. As presented in the paper, this approach was able to achieve 96.6% churn prediction accuracy on a real game business. In addition, the paper shows that other existing churn prediction algorithms are improved in prediction accuracy when this method is applied.

Submitted 12 April, 2021; originally announced April 2021.

Comments: 8 pages, 10 figures, 2021 6th International Conference on Machine Learning Technologies

ACM Class: I.2.1

arXiv:2101.10404 [pdf, other]

Learning-'N-Flying: A Learning-based, Decentralized Mission Aware UAS Collision Avoidance Scheme

Authors: Alëna Rodionova, Yash Vardhan Pant, Connor Kurtz, Kuk Jang, Houssam Abbas, Rahul Mangharam

Abstract: Urban Air Mobility, the scenario where hundreds of manned and Unmanned Aircraft System (UAS) carry out a wide variety of missions (e.g. moving humans and goods within the city), is gaining acceptance as a transportation solution of the future. One of the key requirements for this to happen is safely managing the air traffic in these urban airspaces. Due to the expected density of the airspace, this requires fast autonomous solutions that can be deployed online. We propose Learning-'N-Flying (LNF), a multi-UAS Collision Avoidance (CA) framework. It is decentralized, works on-the-fly and allows autonomous UAS managed by different operators to safely carry out complex missions, represented using Signal Temporal Logic, in a shared airspace. We initially formulate the problem of predictive collision avoidance for two UAS as a mixed-integer linear program, and show that it is intractable to solve online. Instead, we first develop Learning-to-Fly (L2F) by combining: a) learning-based decision-making, and b) decentralized convex optimization-based control. LNF extends L2F to cases where there are more than two UAS on a collision path. Through extensive simulations, we show that our method can run online (computation time in the order of milliseconds), and under certain assumptions has failure rates of less than 1% in the worst-case, improving to near 0% in more relaxed operations. We show the applicability of our scheme to a wide variety of settings through multiple case studies.

Submitted 25 January, 2021; originally announced January 2021.

Comments: to be published in ACM Transactions on Cyber-Physical Systems. arXiv admin note: text overlap with arXiv:2006.13267

arXiv:2010.06900 [pdf]

Development of Open Informal Dataset Affecting Autonomous Driving

Authors: Yong-Gu Lee, Seong-Jae Lee, Sang-Jin Lee, Tae-Seung Baek, Dong-Whan Lee, Kyeong-Chan Jang, Ho-Jin Sohn, Jin-Soo Kim

Abstract: This document describes procedures and methods for collecting objects and unstructured dynamic data on the road for the development of object recognition technology for self-driving cars, and outlines the methods of collecting data, annotation data, object classifier criteria, and data processing methods. On-road object and unstructured dynamic data were collected in various environments, such as weather, time and traffic conditions, and additional reception calls for police and safety personnel were collected. Finally, 100,000 images of various objects existing on pedestrians and roads, 200,000 images of police and traffic safety personnel, 5,000 images of police and traffic safety personnel, and data sets consisting of 5,000 image data were collected and built.

Submitted 14 October, 2020; originally announced October 2020.

Comments: 26 pages, 16 figures

arXiv:2008.01825 [pdf, other]

Robust Reinforcement Learning using Adversarial Populations

Authors: Eugene Vinitsky, Yuqing Du, Kanaad Parvate, Kathy Jang, Pieter Abbeel, Alexandre Bayen

Abstract: Reinforcement Learning (RL) is an effective tool for controller design but can struggle with issues of robustness, failing catastrophically when the underlying system dynamics are perturbed. The Robust RL formulation tackles this by adding worst-case adversarial noise to the dynamics and constructing the noise distribution as the solution to a zero-sum minimax game. However, existing work on learning solutions to the Robust RL formulation has primarily focused on training a single RL agent against a single adversary. In this work, we demonstrate that using a single adversary does not consistently yield robustness to dynamics variations under standard parametrizations of the adversary; the resulting policy is highly exploitable by new adversaries. We propose a population-based augmentation to the Robust RL formulation in which we randomly initialize a population of adversaries and sample from the population uniformly during training. We empirically validate across robotics benchmarks that the use of an adversarial population results in a more robust policy that also improves out-of-distribution generalization. Finally, we demonstrate that this approach provides comparable robustness and generalization as domain randomization on these benchmarks while avoiding a ubiquitous domain randomization failure mode.

Submitted 22 September, 2020; v1 submitted 4 August, 2020; originally announced August 2020.

arXiv:2006.13267 [pdf, other]

Learning-to-Fly: Learning-based Collision Avoidance for Scalable Urban Air Mobility

Authors: Alëna Rodionova, Yash Vardhan Pant, Kuk Jang, Houssam Abbas, Rahul Mangharam

Abstract: With increasing urban population, there is global interest in Urban Air Mobility (UAM), where hundreds of autonomous Unmanned Aircraft Systems (UAS) execute missions in the airspace above cities. Unlike traditional human-in-the-loop air traffic management, UAM requires decentralized autonomous approaches that scale for an order of magnitude higher aircraft densities and are applicable to urban settings. We present Learning-to-Fly (L2F), a decentralized on-demand airborne collision avoidance framework for multiple UAS that allows them to independently plan and safely execute missions with spatial, temporal and reactive objectives expressed using Signal Temporal Logic. We formulate the problem of predictively avoiding collisions between two UAS without violating mission objectives as a Mixed Integer Linear Program (MILP). This, however, is intractable to solve online. Instead, we develop L2F, a two-stage collision avoidance method that consists of: 1) a learning-based decision-making scheme and 2) a distributed, linear programming-based UAS control algorithm. Through extensive simulations, we show the real-time applicability of our method which is $\approx\!6000\times$ faster than the MILP approach and can resolve $100\%$ of collisions when there is ample room to maneuver, and shows graceful degradation in performance otherwise. We also compare L2F to two other methods and demonstrate an implementation on quad-rotor robots.

Submitted 23 June, 2020; originally announced June 2020.

Comments: To be published in IEEE International Conference on Intelligent Transportation Systems (ITSC), 2020

arXiv:2003.11583 [pdf, other]

doi 10.1038/s41467-020-17919-6

Nanophotonic spin-glass for realization of a coherent Ising machine

Authors: Yoshitomo Okawachi, Mengjie Yu, Jae K. Jang, Xingchen Ji, Yun Zhao, Bok Young Kim, Michal Lipson, Alexander L. Gaeta

Abstract: The need for solving optimization problems is prevalent in a wide range of physical applications, including neuroscience, network design, biological systems, socio-economics, and chemical reactions. Many of these are classified as non-deterministic polynomial-time (NP) hard and thus become intractable to solve as the system scales to a large number of elements. Recent research advances in photonics have sparked interest in using a network of coupled degenerate optical parametric oscillators (DOPO's) to effectively find the ground state of the Ising Hamiltonian, which can be used to solve other combinatorial optimization problems through polynomial-time mapping. Here, using the nanophotonic silicon-nitride platform, we propose a network of on-chip spatial-multiplexed DOPO's for the realization of a photonic coherent Ising machine. We demonstrate the generation and coupling of two microresonator-based DOPO's on a single chip. Through a reconfigurable phase link, we achieve both in-phase and out-of-phase operation, which can be deterministically achieved at a fast regeneration speed of 400 kHz with a large phase tolerance. Our work provides the critical building blocks towards the realization of a chip-scale photonic Ising machine.

Submitted 25 March, 2020; originally announced March 2020.

Comments: 8 pages, 6 figures

arXiv:2002.11281 [pdf, other]

Generalized Product Quantization Network for Semi-supervised Image Retrieval

Authors: Young Kyun Jang, Nam Ik Cho

Abstract: Image retrieval methods that employ hashing or vector quantization have achieved great success by taking advantage of deep learning. However, these approaches do not meet expectations unless expensive label information is sufficient. To resolve this issue, we propose the first quantization-based semi-supervised image retrieval scheme: Generalized Product Quantization (GPQ) network. We design a novel metric learning strategy that preserves semantic similarity between labeled data, and employ an entropy regularization term to fully exploit the inherent potential of unlabeled data. Our solution increases the generalization capacity of the quantization network, which allows overcoming previous limitations in the retrieval community. Extensive experimental results demonstrate that GPQ yields state-of-the-art performance on large-scale real image benchmark datasets.

Submitted 11 June, 2020; v1 submitted 25 February, 2020; originally announced February 2020.

Comments: 10 pages, 10 figures, Computer Vision and Pattern Recognition (CVPR) 2020 accepted paper

arXiv:1812.06120 [pdf, other]

Simulation to Scaled City: Zero-Shot Policy Transfer for Traffic Control via Autonomous Vehicles

Authors: Kathy Jang, Eugene Vinitsky, Behdad Chalaki, Ben Remer, Logan Beaver, Andreas Malikopoulos, Alexandre Bayen

Abstract: Using deep reinforcement learning, we train control policies for autonomous vehicles leading a platoon of vehicles onto a roundabout. Using Flow, a library for deep reinforcement learning in micro-simulators, we train two policies, one policy with noise injected into the state and action space and one without any injected noise. In simulation, the autonomous vehicle learns an emergent metering behavior for both policies in which it slows to allow for smoother merging. We then directly transfer this policy without any tuning to the University of Delaware Scaled Smart City (UDSSC), a 1:25 scale testbed for connected and automated vehicles. We characterize the performance of both policies on the scaled city. We show that the noise-free policy winds up crashing and only occasionally metering. However, the noise-injected policy consistently performs the metering behavior and remains collision-free, suggesting that the noise helps with the zero-shot policy transfer. Additionally, the transferred, noise-injected policy leads to a 5% reduction of average travel time and a reduction of 22% in maximum travel time in the UDSSC. Videos of the controllers can be found at https://sites.google.com/view/iccps-policy-transfer.

Submitted 22 February, 2019; v1 submitted 14 December, 2018; originally announced December 2018.

Comments: To be published at the International Conference on Cyber Physical Systems (ICCPS) 2019. 10 pages, 9 figures

ACM Class: I.2.1; I.2.4; I.2.6; I.2.10; I.6.5

arXiv:1810.02186 [pdf, other]

OPERA: Reasoning about continuous common knowledge in asynchronous distributed systems

Authors: Sang-Min Choi, Jiho Park, Quan Nguyen, Andre Cronje, Kiyoung Jang, Hyunjoon Cheon, Yo-Sub Han, Byung-Ik Ahn

Abstract: This paper introduces a new family of consensus protocols, namely \emph{Lachesis-class} denoted by $\mathcal{L}$, for distributed networks with guaranteed Byzantine fault tolerance. Each Lachesis protocol $L$ in $\mathcal{L}$ has complete asynchrony, is leaderless, has no round robin, no proof-of-work, and has eventual consensus. The core concept of our technology is the \emph{OPERA chain}, generated by the Lachesis protocol. In the most general form, each node in Lachesis has a set of $k$ neighbours of most preference. When receiving transactions a node creates and shares an event block with all neighbours. Each event block is signed by the hashes of the creating node and its $k$ peers. The OPERA chain of the event blocks is a Directed Acyclic Graph (DAG); it guarantees practical Byzantine fault tolerance (pBFT). Our framework is then presented using Lamport timestamps and concurrent common knowledge. Further, we present an example of Lachesis consensus protocol $L_0$ of our framework. Our $L_0$ protocol can reach consensus upon 2/3 of all participants' agreement to an event block without any additional communication overhead. $L_0$ protocol relies on a cost function to identify $k$ peers and to generate the DAG-based OPERA chain. By creating a binary flag table that stores connection information and share information between blocks, Lachesis achieves consensus in fewer steps than pBFT protocol for consensus.

Submitted 4 October, 2018; originally announced October 2018.

arXiv:1610.04688 [pdf, other]

ExpressPass: End-to-End Credit-based Congestion Control for Datacenters

Authors: Inho Cho, Dongsu Han, Keon Jang

Abstract: As link speeds increase in datacenter networks, existing congestion control algorithms become less effective in providing fast convergence. TCP-based algorithms that probe for bandwidth take a long time to reach the fair-share and lead to long flow completion times. An ideal congestion control algorithm for datacenters must provide 1) zero data loss, 2) fast convergence, and 3) low buffer occupancy. However, these requirements present conflicting goals. For fast convergence, flows must ramp up quickly, but this risks packet losses and large queues. Thus, even the state-of-the-art algorithms, such as TIMELY and DCQCN, rely on link layer flow control (e.g., Priority-based Flow Control) to achieve zero loss. This paper presents a new approach, called ExpressPass, an end-to-end credit-based congestion control algorithm for datacenters. ExpressPass is inspired by credit-based flow control, but extends it to work end-to-end. The switches control the amount of credit packets by rate limiting and ensure data packets flow in the reverse direction without any loss. ExpressPass leverages this to ramp up aggressively. ExpressPass converges up to 80 times faster than DCTCP at 10Gbps link, and the gap increases as link speeds become faster. Our simulation with realistic workload shows that ExpressPass significantly reduces the flow completion time especially for small and medium size flows compared to DCTCP, HULL, and DX.

Submitted 15 October, 2016; originally announced October 2016.

arXiv:1411.3410 [pdf]

Person Re-identification Based on Color Histogram and Spatial Configuration of Dominant Color Regions

Authors: Kwangchol Jang, Sokmin Han, Insong Kim

Abstract: There is a requirement to determine whether a given person of interest has already been observed over a network of cameras in video surveillance systems. A human appearance obtained in one camera is usually different from the ones obtained in another camera due to differences in illumination, pose, viewpoint, and camera parameters. Being related to appearance-based approaches for person re-identification, we propose a novel method based on the dominant color histogram and spatial configuration of dominant color regions on human body parts. Dominant color histogram and spatial configuration of the dominant color regions based on dominant color descriptor (DCD) can be considered to be robust to illumination, pose, and viewpoint changes. The proposed method is evaluated using benchmark video datasets. Experimental results using the cumulative matching characteristic (CMC) curve demonstrate the effectiveness of our approach for person re-identification.

Submitted 12 November, 2014; originally announced November 2014.

Comments: 12 pages, 6 figures

arXiv:1011.3867 [pdf, ps, other]

doi 10.1109/GLOCOMW.2010.5700129

Interference Alignment Through User Cooperation for Two-cell MIMO Interfering Broadcast Channels

Authors: Wonjae Shin, Namyoon Lee, Jong-Bu Lim, Changyong Shin, Kyunghun Jang

Abstract: This paper focuses on two-cell multiple-input multiple-output (MIMO) Gaussian interfering broadcast channels (MIMO-IFBC) with $K$ cooperating users on the cell-boundary of each BS. It corresponds to a downlink scenario for cellular networks with two base stations (BSs), and $K$ users equipped with Wi-Fi interfaces enabling to cooperate among users on a peer-to-peer basis. In this scenario, we propose a novel interference alignment (IA) technique exploiting user cooperation. Our proposed algorithm obtains the achievable degrees of freedom (DoF) of 2K when each BS and user have $M=K+1$ transmit antennas and $N=K$ receive antennas, respectively. Furthermore, the algorithm requires only a small amount of channel feedback information with the aid of the user cooperation channels. The simulations demonstrate that not only are the analytical results valid, but the achievable DoF of our proposed algorithm also outperforms those of conventional techniques.

Submitted 16 November, 2010; originally announced November 2010.

Comments: This paper will appear in IEEE GLOBECOM 2010

Showing 1–43 of 43 results for author: Jang, K