-
Direct Preference Optimization for Suppressing Hallucinated Prior Exams in Radiology Report Generation
Authors:
Oishi Banerjee,
Hong-Yu Zhou,
Subathra Adithan,
Stephen Kwak,
Kay Wu,
Pranav Rajpurkar
Abstract:
Recent advances in generative vision-language models (VLMs) have exciting potential implications for AI in radiology, yet VLMs are also known to produce hallucinations, nonsensical text, and other unwanted behaviors that can waste clinicians' time and cause patient harm. Drawing on recent work on direct preference optimization (DPO), we propose a simple method for modifying the behavior of pretrained VLMs performing radiology report generation by suppressing unwanted types of generations. We apply our method to the prevention of hallucinations of prior exams, addressing a long-established problem behavior in models performing chest X-ray report generation. Across our experiments, we find that DPO fine-tuning achieves a 3.2-4.8x reduction in lines hallucinating prior exams while maintaining model performance on clinical accuracy metrics. Our work is, to the best of our knowledge, the first work to apply DPO to medical VLMs, providing a data- and compute- efficient way to suppress problem behaviors while maintaining overall clinical accuracy.
Submitted 14 June, 2024; v1 submitted 10 June, 2024;
originally announced June 2024.
-
Extreme Point Supervised Instance Segmentation
Authors:
Hyeonjun Lee,
Sehyun Hwang,
Suha Kwak
Abstract:
This paper introduces a novel approach to learning instance segmentation using extreme points, i.e., the topmost, leftmost, bottommost, and rightmost points, of each object. These points are readily available in the modern bounding box annotation process while offering strong clues for precise segmentation, and thus allow improved performance at the same annotation cost as box-supervised methods. Our work considers extreme points as a part of the true instance mask and propagates them to identify potential foreground and background points, which are all together used for training a pseudo label generator. Then pseudo labels given by the generator are in turn used for supervised learning of our final model. On three public benchmarks, our method significantly outperforms existing box-supervised methods, further narrowing the gap with its fully supervised counterpart. In particular, our model generates high-quality masks when a target object is separated into multiple parts, where previous box-supervised methods often fail.
Submitted 3 June, 2024; v1 submitted 31 May, 2024;
originally announced May 2024.
-
Distilling Diffusion Models into Conditional GANs
Authors:
Minguk Kang,
Richard Zhang,
Connelly Barnes,
Sylvain Paris,
Suha Kwak,
Jaesik Park,
Eli Shechtman,
Jun-Yan Zhu,
Taesung Park
Abstract:
We propose a method to distill a complex multistep diffusion model into a single-step conditional GAN student model, dramatically accelerating inference, while preserving image quality. Our approach interprets diffusion distillation as a paired image-to-image translation task, using noise-to-image pairs of the diffusion model's ODE trajectory. For efficient regression loss computation, we propose E-LatentLPIPS, a perceptual loss operating directly in diffusion model's latent space, utilizing an ensemble of augmentations. Furthermore, we adapt a diffusion model to construct a multi-scale discriminator with a text alignment loss to build an effective conditional GAN-based formulation. E-LatentLPIPS converges more efficiently than many existing distillation methods, even accounting for dataset construction costs. We demonstrate that our one-step generator outperforms cutting-edge one-step diffusion distillation models -- DMD, SDXL-Turbo, and SDXL-Lightning -- on the zero-shot COCO benchmark.
Submitted 13 June, 2024; v1 submitted 9 May, 2024;
originally announced May 2024.
-
Active Label Correction for Semantic Segmentation with Foundation Models
Authors:
Hoyoung Kim,
Sehyun Hwang,
Suha Kwak,
Jungseul Ok
Abstract:
Training and validating models for semantic segmentation require datasets with pixel-wise annotations, which are notoriously labor-intensive. Although useful priors such as foundation models or crowdsourced datasets are available, they are error-prone. We hence propose an effective framework of active label correction (ALC) based on a design of correction query to rectify pseudo labels of pixels, which in turn is more annotator-friendly than the standard one inquiring to classify a pixel directly according to our theoretical analysis and user study. Specifically, leveraging foundation models providing useful zero-shot predictions on pseudo labels and superpixels, our method comprises two key techniques: (i) an annotator-friendly design of correction query with the pseudo labels, and (ii) an acquisition function looking ahead label expansions based on the superpixels. Experimental results on PASCAL, Cityscapes, and Kvasir-SEG datasets demonstrate the effectiveness of our ALC framework, outperforming prior methods for active semantic segmentation and label correction. Notably, utilizing our method, we obtained a revised dataset of PASCAL by rectifying errors in 2.6 million pixels in PASCAL dataset.
Submitted 4 June, 2024; v1 submitted 16 March, 2024;
originally announced March 2024.
-
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Authors:
Gemini Team,
Petko Georgiev,
Ving Ian Lei,
Ryan Burnell,
Libin Bai,
Anmol Gulati,
Garrett Tanzer,
Damien Vincent,
Zhufeng Pan,
Shibo Wang,
Soroosh Mariooryad,
Yifan Ding,
Xinyang Geng,
Fred Alcober,
Roy Frostig,
Mark Omernick,
Lexi Walker,
Cosmin Paduraru,
Christina Sorokin,
Andrea Tacchetti,
Colin Gaffney,
Samira Daruki,
Olcan Sercinoglu,
Zach Gleicher,
Juliette Love
, et al. (1092 additional authors not shown)
Abstract:
In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.
Submitted 14 June, 2024; v1 submitted 8 March, 2024;
originally announced March 2024.
-
Improving Diffusion Models for Virtual Try-on
Authors:
Yisol Choi,
Sangkyung Kwak,
Kyungmin Lee,
Hyungwon Choi,
Jinwoo Shin
Abstract:
This paper considers image-based virtual try-on, which renders an image of a person wearing a curated garment, given a pair of images depicting the person and the garment, respectively. Previous works adapt existing exemplar-based inpainting diffusion models for virtual try-on to improve the naturalness of the generated visuals compared to other methods (e.g., GAN-based), but they fail to preserve the identity of the garments. To overcome this limitation, we propose a novel diffusion model that improves garment fidelity and generates authentic virtual try-on images. Our method, coined IDM-VTON, uses two different modules to encode the semantics of garment image; given the base UNet of the diffusion model, 1) the high-level semantics extracted from a visual encoder are fused to the cross-attention layer, and then 2) the low-level features extracted from parallel UNet are fused to the self-attention layer. In addition, we provide detailed textual prompts for both garment and person images to enhance the authenticity of the generated visuals. Finally, we present a customization method using a pair of person-garment images, which significantly improves fidelity and authenticity. Our experimental results show that our method outperforms previous approaches (both diffusion-based and GAN-based) in preserving garment details and generating authentic virtual try-on images, both qualitatively and quantitatively. Furthermore, the proposed customization method demonstrates its effectiveness in a real-world scenario. More visualizations are available in our project page: https://idm-vton.github.io
Submitted 19 March, 2024; v1 submitted 8 March, 2024;
originally announced March 2024.
-
Direct Consistency Optimization for Compositional Text-to-Image Personalization
Authors:
Kyungmin Lee,
Sangkyung Kwak,
Kihyuk Sohn,
Jinwoo Shin
Abstract:
Text-to-image (T2I) diffusion models, when fine-tuned on a few personal images, are able to generate visuals with a high degree of consistency. However, they still lack in synthesizing images of different scenarios or styles that are possible in the original pretrained models. To address this, we propose to fine-tune the T2I model by maximizing consistency to reference images, while penalizing the deviation from the pretrained model. We devise a novel training objective for T2I diffusion models that minimally fine-tunes the pretrained model to achieve consistency. Our method, dubbed Direct Consistency Optimization, is as simple as regular diffusion loss, while significantly enhancing the compositionality of personalized T2I models. Also, our approach induces a new sampling method that controls the tradeoff between image fidelity and prompt fidelity. Lastly, we emphasize the necessity of using a comprehensive caption for reference images to further enhance the image-text alignment. We show the efficacy of the proposed method on the T2I personalization for subject, style, or both. In particular, our method results in a superior Pareto frontier to the baselines. Generated examples and codes are in our project page (https://dco-t2i.github.io/).
Submitted 19 February, 2024;
originally announced February 2024.
-
A Korean Legal Judgment Prediction Dataset for Insurance Disputes
Authors:
Alice Saebom Kwak,
Cheonkam Jeong,
Ji Weon Lim,
Byeongcheol Min
Abstract:
This paper introduces a Korean legal judgment prediction (LJP) dataset for insurance disputes. Successful LJP models on insurance disputes can benefit insurance companies and their customers. It can save both sides' time and money by allowing them to predict how the result would come out if they proceed to the dispute mediation process. As is often the case with low-resource languages, there is a limitation on the amount of data available for this specific task. To mitigate this issue, we investigate how one can achieve a good performance despite the limitation in data. In our experiment, we demonstrate that Sentence Transformer Fine-tuning (SetFit, Tunstall et al., 2022) is a good alternative to standard fine-tuning when training data are limited. The models fine-tuned with the SetFit approach on our data show similar performance to the Korean LJP benchmark models (Hwang et al., 2022) despite the much smaller data size.
Submitted 26 January, 2024;
originally announced January 2024.
-
Gemini: A Family of Highly Capable Multimodal Models
Authors:
Gemini Team,
Rohan Anil,
Sebastian Borgeaud,
Jean-Baptiste Alayrac,
Jiahui Yu,
Radu Soricut,
Johan Schalkwyk,
Andrew M. Dai,
Anja Hauth,
Katie Millican,
David Silver,
Melvin Johnson,
Ioannis Antonoglou,
Julian Schrittwieser,
Amelia Glaese,
Jilin Chen,
Emily Pitler,
Timothy Lillicrap,
Angeliki Lazaridou,
Orhan Firat,
James Molloy,
Michael Isard,
Paul R. Barham,
Tom Hennigan,
Benjamin Lee
, et al. (1325 additional authors not shown)
Abstract:
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.
Submitted 17 June, 2024; v1 submitted 18 December, 2023;
originally announced December 2023.
-
Activity Grammars for Temporal Action Segmentation
Authors:
Dayoung Gong,
Joonseok Lee,
Deunsol Jung,
Suha Kwak,
Minsu Cho
Abstract:
Sequence prediction on temporal data requires the ability to understand compositional structures of multi-level semantics beyond individual and contextual properties. The task of temporal action segmentation, which aims at translating an untrimmed activity video into a sequence of action segments, remains challenging for this reason. This paper addresses the problem by introducing an effective activity grammar to guide neural predictions for temporal action segmentation. We propose a novel grammar induction algorithm that extracts a powerful context-free grammar from action sequence data. We also develop an efficient generalized parser that transforms frame-level probability distributions into a reliable sequence of actions according to the induced grammar with recursive rules. Our approach can be combined with any neural network for temporal action segmentation to enhance the sequence prediction and discover its compositional structure. Experimental results demonstrate that our method significantly improves temporal action segmentation in terms of both performance and interpretability on two standard benchmarks, Breakfast and 50 Salads.
Submitted 7 December, 2023;
originally announced December 2023.
-
Towards More Practical Group Activity Detection: A New Benchmark and Model
Authors:
Dongkeun Kim,
Youngkil Song,
Minsu Cho,
Suha Kwak
Abstract:
Group activity detection (GAD) is the task of identifying members of each group and classifying the activity of the group at the same time in a video. While GAD has been studied recently, there is still much room for improvement in both dataset and methodology due to their limited capability to address practical GAD scenarios. To resolve these issues, we first present a new dataset, dubbed Café. Unlike existing datasets, Café is constructed primarily for GAD and presents more practical evaluation scenarios and metrics, as well as being large-scale and providing rich annotations. Along with the dataset, we propose a new GAD model that deals with an unknown number of groups and latent group members efficiently and effectively. We evaluated our model on three datasets including Café, where it outperformed previous work in terms of both accuracy and inference speed. Both our dataset and code base will be open to the public to promote future research on GAD.
Submitted 5 December, 2023;
originally announced December 2023.
-
Style-Aware Radiology Report Generation with RadGraph and Few-Shot Prompting
Authors:
Benjamin Yan,
Ruochen Liu,
David E. Kuo,
Subathra Adithan,
Eduardo Pontes Reis,
Stephen Kwak,
Vasantha Kumar Venugopal,
Chloe P. O'Connell,
Agustina Saenz,
Pranav Rajpurkar,
Michael Moor
Abstract:
Automatically generated reports from medical images promise to improve the workflow of radiologists. Existing methods consider an image-to-report modeling task by directly generating a fully-fledged report from an image. However, this conflates the content of the report (e.g., findings and their attributes) with its style (e.g., format and choice of words), which can lead to clinically inaccurate reports. To address this, we propose a two-step approach for radiology report generation. First, we extract the content from an image; then, we verbalize the extracted content into a report that matches the style of a specific radiologist. For this, we leverage RadGraph -- a graph representation of reports -- together with large language models (LLMs). In our quantitative evaluations, we find that our approach leads to beneficial performance. Our human evaluation with clinical raters highlights that the AI-generated reports are indistinguishably tailored to the style of individual radiologist despite leveraging only a few examples as context.
Submitted 31 October, 2023; v1 submitted 26 October, 2023;
originally announced October 2023.
-
Active Learning for Semantic Segmentation with Multi-class Label Query
Authors:
Sehyun Hwang,
Sohyun Lee,
Hoyoung Kim,
Minhyeon Oh,
Jungseul Ok,
Suha Kwak
Abstract:
This paper proposes a new active learning method for semantic segmentation. The core of our method lies in a new annotation query design. It samples informative local image regions (e.g., superpixels), and for each of such regions, asks an oracle for a multi-hot vector indicating all classes existing in the region. This multi-class labeling strategy is substantially more efficient than existing ones like segmentation, polygon, and even dominant class labeling in terms of annotation time per click. However, it introduces the class ambiguity issue in training as it assigns partial labels (i.e., a set of candidate classes) to individual pixels. We thus propose a new algorithm for learning semantic segmentation while disambiguating the partial labels in two stages. In the first stage, it trains a segmentation model directly with the partial labels through two new loss functions motivated by partial label learning and multiple instance learning. In the second stage, it disambiguates the partial labels by generating pixel-wise pseudo labels, which are used for supervised learning of the model. Equipped with a new acquisition function dedicated to the multi-class labeling, our method outperforms previous work on Cityscapes and PASCAL VOC 2012 while spending less annotation cost. Our code and results are available at https://github.com/sehyun03/MulActSeg.
Submitted 6 November, 2023; v1 submitted 17 September, 2023;
originally announced September 2023.
-
Universal Metric Learning with Parameter-Efficient Transfer Learning
Authors:
Sungyeon Kim,
Donghyun Kim,
Suha Kwak
Abstract:
A common practice in metric learning is to train and test an embedding model for each dataset. This dataset-specific approach fails to simulate real-world scenarios that involve multiple heterogeneous distributions of data. In this regard, we introduce a novel metric learning paradigm, called Universal Metric Learning (UML), which learns a unified distance metric capable of capturing relations across multiple data distributions. UML presents new challenges, such as imbalanced data distribution and bias towards dominant distributions. To address these challenges, we propose Parameter-efficient Universal Metric leArning (PUMA), which consists of a pre-trained frozen model and two additional modules, stochastic adapter and prompt pool. These modules enable to capture dataset-specific knowledge while avoiding bias towards dominant distributions. Additionally, we compile a new universal metric learning benchmark with a total of 8 different datasets. PUMA outperformed the state-of-the-art dataset-specific models while using about 69 times fewer trainable parameters.
Submitted 16 September, 2023;
originally announced September 2023.
-
Shatter and Gather: Learning Referring Image Segmentation with Text Supervision
Authors:
Dongwon Kim,
Namyup Kim,
Cuiling Lan,
Suha Kwak
Abstract:
Referring image segmentation, the task of segmenting any arbitrary entities described in free-form texts, opens up a variety of vision applications. However, manual labeling of training data for this task is prohibitively costly, leading to lack of labeled data for training. We address this issue by a weakly supervised learning approach using text descriptions of training images as the only source of supervision. To this end, we first present a new model that discovers semantic entities in input image and then combines such entities relevant to text query to predict the mask of the referent. We also present a new loss function that allows the model to be trained without any further supervision. Our method was evaluated on four public benchmarks for referring image segmentation, where it clearly outperformed the existing method for the same task and recent open-vocabulary segmentation models on all the benchmarks.
Submitted 24 October, 2023; v1 submitted 29 August, 2023;
originally announced August 2023.
-
SYNAuG: Exploiting Synthetic Data for Data Imbalance Problems
Authors:
Moon Ye-Bin,
Nam Hyeon-Woo,
Wonseok Choi,
Nayeong Kim,
Suha Kwak,
Tae-Hyun Oh
Abstract:
Data imbalance in training data often leads to biased predictions from trained models, which in turn causes ethical and social issues. A straightforward solution is to carefully curate training data, but given the enormous scale of modern neural networks, this is prohibitively labor-intensive and thus impractical. Inspired by recent developments in generative models, this paper explores the potential of synthetic data to address the data imbalance problem. To be specific, our method, dubbed SYNAuG, leverages synthetic data to equalize the unbalanced distribution of training data. Our experiments demonstrate that, although a domain gap between real and synthetic data exists, training with SYNAuG followed by fine-tuning with a few real samples allows to achieve impressive performance on diverse tasks with different data imbalance issues, surpassing existing task-specific methods for the same purpose.
Submitted 25 April, 2024; v1 submitted 2 August, 2023;
originally announced August 2023.
-
PromptStyler: Prompt-driven Style Generation for Source-free Domain Generalization
Authors:
Junhyeong Cho,
Gilhyun Nam,
Sungyeon Kim,
Hunmin Yang,
Suha Kwak
Abstract:
In a joint vision-language space, a text feature (e.g., from "a photo of a dog") could effectively represent its relevant image features (e.g., from dog photos). Also, a recent study has demonstrated the cross-modal transferability phenomenon of this joint space. From these observations, we propose PromptStyler which simulates various distribution shifts in the joint space by synthesizing diverse styles via prompts without using any images to deal with source-free domain generalization. The proposed method learns to generate a variety of style features (from "a S* style of a") via learnable style word vectors for pseudo-words S*. To ensure that learned styles do not distort content information, we force style-content features (from "a S* style of a [class]") to be located nearby their corresponding content features (from "[class]") in the joint vision-language space. After learning style word vectors, we train a linear classifier using synthesized style-content features. PromptStyler achieves the state of the art on PACS, VLCS, OfficeHome and DomainNet, even though it does not require any images for training.
Submitted 15 August, 2023; v1 submitted 27 July, 2023;
originally announced July 2023.
-
Rate-Splitting Multiple Access for 6G Networks: Ten Promising Scenarios and Applications
Authors:
Jeonghun Park,
Byungju Lee,
Jinseok Choi,
Hoon Lee,
Namyoon Lee,
Seok-Hwan Park,
Kyoung-Jae Lee,
Junil Choi,
Sung Ho Chae,
Sang-Woon Jeon,
Kyung Sup Kwak,
Bruno Clerckx,
Wonjae Shin
Abstract:
In the upcoming 6G era, multiple access (MA) will play an essential role in achieving high throughput performances required in a wide range of wireless applications. Since MA and interference management are closely related issues, the conventional MA techniques are limited in that they cannot provide near-optimal performance in universal interference regimes. Recently, rate-splitting multiple access (RSMA) has been gaining much attention. RSMA splits an individual message into two parts: a common part, decodable by every user, and a private part, decodable only by the intended user. Each user first decodes the common message and then decodes its private message by applying successive interference cancellation (SIC). By doing so, RSMA not only embraces the existing MA techniques as special cases but also provides significant performance gains by efficiently mitigating inter-user interference in a broad range of interference regimes. In this article, we first present the theoretical foundation of RSMA. Subsequently, we put forth four key benefits of RSMA: spectral efficiency, robustness, scalability, and flexibility. Upon this, we describe how RSMA can enable ten promising scenarios and applications along with future research directions to pave the way for 6G.
Submitted 22 June, 2023;
originally announced June 2023.
-
Extending CLIP's Image-Text Alignment to Referring Image Segmentation
Authors:
Seoyeon Kim,
Minguk Kang,
Dongwon Kim,
Jaesik Park,
Suha Kwak
Abstract:
Referring Image Segmentation (RIS) is a cross-modal task that aims to segment an instance described by a natural language expression. Recent methods leverage large-scale pretrained unimodal models as backbones along with fusion techniques for joint reasoning across modalities. However, the inherent cross-modal nature of RIS raises questions about the effectiveness of unimodal backbones. We propose RISCLIP, a novel framework that effectively leverages the cross-modal nature of CLIP for RIS. Observing CLIP's inherent alignment between image and text features, we capitalize on this starting point and introduce simple but strong modules that enhance unimodal feature extraction and leverage rich alignment knowledge in CLIP's image-text shared-embedding space. RISCLIP exhibits outstanding results on all three major RIS benchmarks and also outperforms previous CLIP-based methods, demonstrating the efficacy of our strategy in extending CLIP's image-text alignment to RIS.
Submitted 7 April, 2024; v1 submitted 14 June, 2023;
originally announced June 2023.
-
Adaptive Superpixel for Active Learning in Semantic Segmentation
Authors:
Hoyoung Kim,
Minhyeon Oh,
Sehyun Hwang,
Suha Kwak,
Jungseul Ok
Abstract:
Learning semantic segmentation requires pixel-wise annotations, which can be time-consuming and expensive. To reduce the annotation cost, we propose a superpixel-based active learning (AL) framework, which collects a dominant label per superpixel instead. To be specific, it consists of adaptive superpixel and sieving mechanisms, fully dedicated to AL. At each round of AL, we adaptively merge neighboring pixels of similar learned features into superpixels. We then query a selected subset of these superpixels using an acquisition function assuming no uniform superpixel size. This approach is more efficient than existing methods, which rely only on innate features such as RGB color and assume uniform superpixel sizes. Obtaining a dominant label per superpixel drastically reduces annotators' burden as it requires fewer clicks. However, it inevitably introduces noisy annotations due to mismatches between superpixel and ground truth segmentation. To address this issue, we further devise a sieving mechanism that identifies and excludes potentially noisy annotations from learning. Our experiments on both Cityscapes and PASCAL VOC datasets demonstrate the efficacy of adaptive superpixel and sieving mechanisms.
Submitted 20 August, 2023; v1 submitted 29 March, 2023;
originally announced March 2023.
-
Human Pose Estimation in Extremely Low-Light Conditions
Authors:
Sohyun Lee,
Jaesung Rim,
Boseung Jeong,
Geonu Kim,
Byungju Woo,
Haechan Lee,
Sunghyun Cho,
Suha Kwak
Abstract:
We study human pose estimation in extremely low-light images. This task is challenging due to the difficulty of collecting real low-light images with accurate labels, and severely corrupted inputs that degrade prediction quality significantly. To address the first issue, we develop a dedicated camera system and build a new dataset of real low-light images with accurate pose labels. Thanks to our camera system, each low-light image in our dataset is coupled with an aligned well-lit image, which enables accurate pose labeling and is used as privileged information during training. We also propose a new model and a new training strategy that fully exploit the privileged information to learn representation insensitive to lighting conditions. Our method demonstrates outstanding performance on real extremely low light images, and extensive analyses validate that both of our model and dataset contribute to the success.
Submitted 27 March, 2023;
originally announced March 2023.
-
The dynamic nature of trust: Trust in Human-Robot Interaction revisited
Authors:
Jimin Rhim,
Sonya S. Kwak,
Angelica Lim,
Jason Millar
Abstract:
The role of robots is expanding from tool to collaborator. Socially assistive robots (SARs) are an example of collaborative robots that assist humans in the real world. As robots enter our social sphere, unforeseen risks occur during human-robot interaction (HRI), as everyday human space is full of uncertainties. Risk introduces an element of trust, so understanding human trust in the robot is imperative to initiate and maintain interactions with robots over time. While many scholars have investigated the issue of human-robot trust, a significant portion of that discussion is rooted in the human-automation interaction literature. As robots are no longer mere instruments, but social agents that co-exist with humans, we need a new lens to investigate the longitudinal dynamic nature of trust in HRI. In this position paper, we contend that focusing on the dynamic nature of trust as a new inquiry will help us better design trustworthy robots.
Submitted 8 March, 2023;
originally announced March 2023.
-
HIER: Metric Learning Beyond Class Labels via Hierarchical Regularization
Authors:
Sungyeon Kim,
Boseung Jeong,
Suha Kwak
Abstract:
Supervision for metric learning has long been given in the form of equivalence between human-labeled classes. Although this type of supervision has been a basis of metric learning for decades, we argue that it hinders further advances in the field. In this regard, we propose a new regularization method, dubbed HIER, to discover the latent semantic hierarchy of training data, and to deploy the hierarchy to provide richer and more fine-grained supervision than inter-class separability induced by common metric learning losses. HIER achieves this goal with no annotation for the semantic hierarchy but by learning hierarchical proxies in hyperbolic spaces. The hierarchical proxies are learnable parameters, and each of them is trained to serve as an ancestor of a group of data or other proxies to approximate the semantic hierarchy among them. HIER deals with the proxies along with data in hyperbolic space since the geometric properties of the space are well-suited to represent their hierarchical structure. The efficacy of HIER is evaluated on four standard benchmarks, where it consistently improved the performance of conventional methods when integrated with them, and consequently achieved the best records, surpassing even the existing hyperbolic metric learning technique, in almost all settings.
Submitted 10 April, 2023; v1 submitted 29 December, 2022;
originally announced December 2022.
-
Learning to Detect Semantic Boundaries with Image-level Class Labels
Authors:
Namyup Kim,
Sehyun Hwang,
Suha Kwak
Abstract:
This paper presents the first attempt to learn semantic boundary detection using image-level class labels as supervision. Our method starts by estimating coarse areas of object classes through attentions drawn by an image classification network. Since boundaries will locate somewhere between such areas of different classes, our task is formulated as a multiple instance learning (MIL) problem, where pixels on a line segment connecting areas of two different classes are regarded as a bag of boundary candidates. Moreover, we design a new neural network architecture that can learn to estimate semantic boundaries reliably even with uncertain supervision given by the MIL strategy. Our network is used to generate pseudo semantic boundary labels of training images, which are in turn used to train fully supervised models. The final model trained with our pseudo labels achieves an outstanding performance on the SBD dataset, where it is as competitive as some of previous arts trained with stronger supervision.
Submitted 14 December, 2022;
originally announced December 2022.
-
Improving Cross-Modal Retrieval with Set of Diverse Embeddings
Authors:
Dongwon Kim,
Namyup Kim,
Suha Kwak
Abstract:
Cross-modal retrieval across image and text modalities is a challenging task due to its inherent ambiguity: An image often exhibits various situations, and a caption can be coupled with diverse images. Set-based embedding has been studied as a solution to this problem. It seeks to encode a sample into a set of different embedding vectors that capture different semantics of the sample. In this paper, we present a novel set-based embedding method, which is distinct from previous work in two aspects. First, we present a new similarity function called smooth-Chamfer similarity, which is designed to alleviate the side effects of existing similarity functions for set-based embedding. Second, we propose a novel set prediction module to produce a set of embedding vectors that effectively captures diverse semantics of input by the slot attention mechanism. Our method is evaluated on the COCO and Flickr30K datasets across different visual backbones, where it outperforms existing methods including ones that demand substantially larger computation at inference.
Submitted 24 July, 2023; v1 submitted 30 November, 2022;
originally announced November 2022.
-
Cross-Domain Ensemble Distillation for Domain Generalization
Authors:
Kyungmoon Lee,
Sungyeon Kim,
Suha Kwak
Abstract:
Domain generalization is the task of learning models that generalize to unseen target domains. We propose a simple yet effective method for domain generalization, named cross-domain ensemble distillation (XDED), that learns domain-invariant features while encouraging the model to converge to flat minima, which recently turned out to be a sufficient condition for domain generalization. To this end, our method generates an ensemble of the output logits from training data with the same label but from different domains and then penalizes each output for the mismatch with the ensemble. Also, we present a de-stylization technique that standardizes features to encourage the model to produce style-consistent predictions even in an arbitrary target domain. Our method greatly improves generalization capability in public benchmarks for cross-domain image classification, cross-dataset person re-ID, and cross-dataset semantic segmentation. Moreover, we show that models learned by our method are robust against adversarial attacks and image corruptions.
Submitted 25 November, 2022;
originally announced November 2022.
-
Few-shot Metric Learning: Online Adaptation of Embedding for Retrieval
Authors:
Deunsol Jung,
Dahyun Kang,
Suha Kwak,
Minsu Cho
Abstract:
Metric learning aims to build a distance metric typically by learning an effective embedding function that maps similar objects into nearby points in its embedding space. Despite recent advances in deep metric learning, it remains challenging for the learned metric to generalize to unseen classes with a substantial domain gap. To tackle the issue, we explore a new problem of few-shot metric learning that aims to adapt the embedding function to the target domain with only a few annotated data. We introduce three few-shot metric learning baselines and propose the Channel-Rectifier Meta-Learning (CRML), which effectively adapts the metric space online by adjusting channels of intermediate layers. Experimental analyses on miniImageNet, CUB-200-2011, MPII, as well as a new dataset, miniDeepFashion, demonstrate that our method consistently improves the learned metric by adapting it to target classes and achieves a greater gain in image retrieval when the domain gap from the source classes is larger.
Submitted 14 November, 2022;
originally announced November 2022.
-
Validity Assessment of Legal Will Statements as Natural Language Inference
Authors:
Alice Saebom Kwak,
Jacob O. Israelsen,
Clayton T. Morrison,
Derek E. Bambauer,
Mihai Surdeanu
Abstract:
This work introduces a natural language inference (NLI) dataset that focuses on the validity of statements in legal wills. This dataset is unique because: (a) each entailment decision requires three inputs: the statement from the will, the law, and the conditions that hold at the time of the testator's death; and (b) the included texts are longer than the ones in current NLI datasets. We trained eight neural NLI models in this dataset. All the models achieve more than 80% macro F1 and accuracy, which indicates that neural approaches can handle this task reasonably well. However, group accuracy, a stricter evaluation measure that is calculated with a group of positive and negative examples generated from the same statement as a unit, is in mid 80s at best, which suggests that the models' understanding of the task remains superficial. Further ablative analyses and explanation experiments indicate that all three text segments are used for prediction, but some decisions rely on semantically irrelevant tokens. This indicates that overfitting on these longer texts likely happens, and that additional research is required for this task to be solved.
Submitted 30 October, 2022;
originally announced October 2022.
-
Cognitive Radio-Inspired Rate-Splitting Multiple Access for Semi-Grant-Free Transmissions
Authors:
Hongwu Liu,
Kyeong Jin Kim,
Theodoros A. Tsiftsis,
Bruno Clerckx,
Kyung Sup Kwak,
H. Vincent Poor
Abstract:
In this paper, we propose a cognitive radio-inspired rate-splitting multiple access (CR-RSMA) scheme to assist semi-grant-free (SGF) transmissions in which a grant-based user (GBU) and multiple grant-free users (GFUs) access the base-station (BS) by sharing the same resource block. Using the cognitive radio principle, the GBU and admitted GFU are treated as the primary and secondary users, respectively, and rate-splitting is applied at the admitted GFU to realize SGF transmissions. The admitted GFU's transmit power allocation, target rate allocation, and successive interference cancellation decoding order at the BS are jointly optimized to attain the maximum achievable rate for the admitted GFU without deteriorating the GBU's outage performance compared to orthogonal multiple access. Due to the extended non-outage zone, CR-RSMA-assisted SGF (CR-RSMA-SGF) transmissions achieve a lower outage probability than SGF transmissions assisted by cognitive radio-inspired non-orthogonal multiple access. Exact expressions and asymptotic analysis for the admitted GFU's outage probability are derived to evaluate the system performance achieved by CR-RSMA-SGF transmissions. The superior outage performance and full multiuser diversity gain achieved by CR-RSMA-SGF transmissions are verified by the analytical and simulation results.
Submitted 15 August, 2022;
originally announced August 2022.
-
Combating Label Distribution Shift for Active Domain Adaptation
Authors:
Sehyun Hwang,
Sohyun Lee,
Sungyeon Kim,
Jungseul Ok,
Suha Kwak
Abstract:
We consider the problem of active domain adaptation (ADA) to unlabeled target data, of which subset is actively selected and labeled given a budget constraint. Inspired by recent analysis on a critical issue from label distribution mismatch between source and target in domain adaptation, we devise a method that addresses the issue for the first time in ADA. At its heart lies a novel sampling strategy, which seeks target data that best approximate the entire target distribution as well as being representative, diverse, and uncertain. The sampled target data are then used not only for supervised learning but also for matching label distributions of source and target domains, leading to remarkable performance improvement. On four public benchmarks, our method substantially outperforms existing methods in every adaptation scenario.
Submitted 13 August, 2022;
originally announced August 2022.
-
Learning Debiased Classifier with Biased Committee
Authors:
Nayeong Kim,
Sehyun Hwang,
Sungsoo Ahn,
Jaesik Park,
Suha Kwak
Abstract:
Neural networks are prone to be biased towards spurious correlations between classes and latent attributes exhibited in a major portion of training data, which ruins their generalization capability. We propose a new method for training debiased classifiers with no spurious attribute label. The key idea is to employ a committee of classifiers as an auxiliary module that identifies bias-conflicting data, i.e., data without spurious correlation, and assigns large weights to them when training the main classifier. The committee is learned as a bootstrapped ensemble so that a majority of its classifiers are biased as well as being diverse, and intentionally fail to predict classes of bias-conflicting data accordingly. The consensus within the committee on prediction difficulty thus provides a reliable cue for identifying and weighting bias-conflicting data. Moreover, the committee is also trained with knowledge transferred from the main classifier so that it gradually becomes debiased along with the main classifier and emphasizes more difficult data as training progresses. On five real-world datasets, our method outperforms prior arts using no spurious attribute label like ours and even surpasses those relying on bias labels occasionally.
Submitted 1 May, 2023; v1 submitted 22 June, 2022;
originally announced June 2022.
-
Self-Taught Metric Learning without Labels
Authors:
Sungyeon Kim,
Dongwon Kim,
Minsu Cho,
Suha Kwak
Abstract:
We present a novel self-taught framework for unsupervised metric learning, which alternates between predicting class-equivalence relations between data through a moving average of an embedding model and learning the model with the predicted relations as pseudo labels. At the heart of our framework lies an algorithm that investigates contexts of data on the embedding space to predict their class-equivalence relations as pseudo labels. The algorithm enables efficient end-to-end training since it demands no off-the-shelf module for pseudo labeling. Also, the class-equivalence relations provide rich supervisory signals for learning an embedding space. On standard benchmarks for metric learning, it clearly outperforms existing unsupervised learning methods and sometimes even beats supervised learning models using the same backbone network. It is also applied to semi-supervised metric learning as a way of exploiting additional unlabeled data, and achieves the state of the art by boosting performance of supervised learning substantially.
Submitted 4 May, 2022;
originally announced May 2022.
-
Detector-Free Weakly Supervised Group Activity Recognition
Authors:
Dongkeun Kim,
Jinsung Lee,
Minsu Cho,
Suha Kwak
Abstract:
Group activity recognition is the task of understanding the activity conducted by a group of people as a whole in a multi-person video. Existing models for this task are often impractical in that they demand ground-truth bounding box labels of actors even in testing or rely on off-the-shelf object detectors. Motivated by this, we propose a novel model for group activity recognition that depends neither on bounding box labels nor on object detector. Our model based on Transformer localizes and encodes partial contexts of a group activity by leveraging the attention mechanism, and represents a video clip as a set of partial context embeddings. The embedding vectors are then aggregated to form a single group representation that reflects the entire context of an activity while capturing temporal evolution of each partial context. Our method achieves outstanding performance on two benchmarks, Volleyball and NBA datasets, surpassing not only the state of the art trained with the same level of supervision, but also some of existing models relying on stronger supervision.
Submitted 5 April, 2022;
originally announced April 2022.
-
Semi-supervised Semantic Segmentation with Error Localization Network
Authors:
Donghyeon Kwon,
Suha Kwak
Abstract:
This paper studies semi-supervised learning of semantic segmentation, which assumes that only a small portion of training images are labeled and the others remain unlabeled. The unlabeled images are usually assigned pseudo labels to be used in training, which however often causes the risk of performance degradation due to the confirmation bias towards errors on the pseudo labels. We present a novel method that resolves this chronic issue of pseudo labeling. At the heart of our method lies error localization network (ELN), an auxiliary module that takes an image and its segmentation prediction as input and identifies pixels whose pseudo labels are likely to be wrong. ELN enables semi-supervised learning to be robust against inaccurate pseudo labels by disregarding label noises during training and can be naturally integrated with self-training and contrastive learning. Moreover, we introduce a new learning strategy for ELN that simulates plausible and diverse segmentation errors during training of ELN to enhance its generalization. Our method is evaluated on PASCAL VOC 2012 and Cityscapes, where it outperforms all existing methods in every evaluation setting.
Submitted 31 May, 2022; v1 submitted 5 April, 2022;
originally announced April 2022.
-
FIFO: Learning Fog-invariant Features for Foggy Scene Segmentation
Authors:
Sohyun Lee,
Taeyoung Son,
Suha Kwak
Abstract:
Robust visual recognition under adverse weather conditions is of great importance in real-world applications. In this context, we propose a new method for learning semantic segmentation models robust against fog. Its key idea is to consider the fog condition of an image as its style and close the gap between images with different fog conditions in neural style spaces of a segmentation model. In particular, since the neural style of an image is in general affected by other factors as well as fog, we introduce a fog-pass filter module that learns to extract a fog-relevant factor from the style. Optimizing the fog-pass filter and the segmentation model alternately gradually closes the style gap between different fog conditions and allows to learn fog-invariant features in consequence. Our method substantially outperforms previous work on three real foggy image datasets. Moreover, it improves performance on both foggy and clear weather images, while existing methods often degrade performance on clear scenes.
Submitted 4 April, 2022;
originally announced April 2022.
-
Reflection and Rotation Symmetry Detection via Equivariant Learning
Authors:
Ahyun Seo,
Byungjin Kim,
Suha Kwak,
Minsu Cho
Abstract:
The inherent challenge of detecting symmetries stems from arbitrary orientations of symmetry patterns; a reflection symmetry mirrors itself against an axis with a specific orientation while a rotation symmetry matches its rotated copy with a specific orientation. Discovering such symmetry patterns from an image thus benefits from an equivariant feature representation, which varies consistently with reflection and rotation of the image. In this work, we introduce a group-equivariant convolutional network for symmetry detection, dubbed EquiSym, which leverages equivariant feature maps with respect to a dihedral group of reflection and rotation. The proposed network is built end-to-end with dihedrally-equivariant layers and trained to output a spatial map for reflection axes or rotation centers. We also present a new dataset, DENse and DIverse symmetry (DENDI), which mitigates limitations of existing benchmarks for reflection and rotation symmetry detection. Experiments show that our method achieves the state of the art in symmetry detection on LDRS and DENDI datasets.
Submitted 31 March, 2022;
originally announced March 2022.
-
ReSTR: Convolution-free Referring Image Segmentation Using Transformers
Authors:
Namyup Kim,
Dongwon Kim,
Cuiling Lan,
Wenjun Zeng,
Suha Kwak
Abstract:
Referring image segmentation is an advanced semantic segmentation task where target is not a predefined class but is described in natural language. Most of existing methods for this task rely heavily on convolutional neural networks, which however have trouble capturing long-range dependencies between entities in the language expression and are not flexible enough for modeling interactions between the two different modalities. To address these issues, we present the first convolution-free model for referring image segmentation using transformers, dubbed ReSTR. Since it extracts features of both modalities through transformer encoders, it can capture long-range dependencies between entities within each modality. Also, ReSTR fuses features of the two modalities by a self-attention encoder, which enables flexible and adaptive interactions between the two modalities in the fusion process. The fused features are fed to a segmentation module, which works adaptively according to the image and language expression in hand. ReSTR is evaluated and compared with previous work on all public benchmarks, where it outperforms all existing models.
Submitted 30 March, 2022;
originally announced March 2022.
-
Collaborative Transformers for Grounded Situation Recognition
Authors:
Junhyeong Cho,
Youngseok Yoon,
Suha Kwak
Abstract:
Grounded situation recognition is the task of predicting the main activity, entities playing certain roles within the activity, and bounding-box groundings of the entities in the given image. To effectively deal with this challenging task, we introduce a novel approach where the two processes for activity classification and entity estimation are interactive and complementary. To implement this idea, we propose Collaborative Glance-Gaze TransFormer (CoFormer) that consists of two modules: Glance transformer for activity classification and Gaze transformer for entity estimation. Glance transformer predicts the main activity with the help of Gaze transformer that analyzes entities and their relations, while Gaze transformer estimates the grounded entities by focusing only on the entities relevant to the activity predicted by Glance transformer. Our CoFormer achieves the state of the art in all evaluation metrics on the SWiG dataset. Training code and model weights are available at https://github.com/jhcho99/CoFormer.
Submitted 30 March, 2022;
originally announced March 2022.
-
Extracting Space Situational Awareness Events from News Text
Authors:
Zhengnan Xie,
Alice Saebom Kwak,
Enfa George,
Laura W. Dozal,
Hoang Van,
Moriba Jah,
Roberto Furfaro,
Peter Jansen
Abstract:
Space situational awareness typically makes use of physical measurements from radar, telescopes, and other assets to monitor satellites and other spacecraft for operational, navigational, and defense purposes. In this work we explore using textual input for the space situational awareness task. We construct a corpus of 48.5k news articles spanning all known active satellites between 2009 and 2020. Using a dependency-rule-based extraction system designed to target three high-impact events (spacecraft launches, failures, and decommissionings), we identify 1,787 space-event sentences that are then annotated by humans with 15.9k labels for event slots. We empirically demonstrate that a state-of-the-art neural extraction system achieves an overall F1 between 53 and 91 per slot for event extraction in this low-resource, high-impact domain.
Submitted 14 January, 2022;
originally announced January 2022.
-
Learning to Generate Novel Classes for Deep Metric Learning
Authors:
Kyungmoon Lee,
Sungyeon Kim,
Seunghoon Hong,
Suha Kwak
Abstract:
Deep metric learning aims to learn an embedding space where the distance between data reflects their class equivalence, even when their classes are unseen during training. However, the limited number of classes available in training precludes generalization of the learned embedding space. Motivated by this, we introduce a new data augmentation approach that synthesizes novel classes and their embedding vectors. Our approach can provide rich semantic information to an embedding model and improve its generalization by augmenting training data with novel classes unavailable in the original data. We implement this idea by learning and exploiting a conditional generative model, which, given a class label and noise, produces a random embedding vector of the class. Our proposed generator allows the loss to use richer class relations by augmenting realistic and diverse classes, resulting in better generalization to unseen samples. Experimental results on public benchmark datasets demonstrate that our method clearly enhances the performance of proxy-based losses.
Submitted 4 January, 2022;
originally announced January 2022.
-
Grounded Situation Recognition with Transformers
Authors:
Junhyeong Cho,
Youngseok Yoon,
Hyeonjun Lee,
Suha Kwak
Abstract:
Grounded Situation Recognition (GSR) is the task that not only classifies a salient action (verb), but also predicts entities (nouns) associated with semantic roles and their locations in the given image. Inspired by the remarkable success of Transformers in vision tasks, we propose a GSR model based on a Transformer encoder-decoder architecture. The attention mechanism of our model enables accurate verb classification by capturing high-level semantic feature of an image effectively, and allows the model to flexibly deal with the complicated and image-dependent relations between entities for improved noun classification and localization. Our model is the first Transformer architecture for GSR, and achieves the state of the art in every evaluation metric on the SWiG benchmark. Our code is available at https://github.com/jhcho99/gsrtr .
Submitted 19 November, 2021;
originally announced November 2021.
-
Relational Self-Attention: What's Missing in Attention for Video Understanding
Authors:
Manjin Kim,
Heeseung Kwon,
Chunyu Wang,
Suha Kwak,
Minsu Cho
Abstract:
Convolution has been arguably the most important feature transform for modern neural networks, leading to the advance of deep learning. Recent emergence of Transformer networks, which replace convolution layers with self-attention blocks, has revealed the limitation of stationary convolution kernels and opened the door to the era of dynamic feature transforms. The existing dynamic transforms, including self-attention, however, are all limited for video understanding where correspondence relations in space and time, i.e., motion information, are crucial for effective representation. In this work, we introduce a relational feature transform, dubbed the relational self-attention (RSA), that leverages rich structures of spatio-temporal relations in videos by dynamically generating relational kernels and aggregating relational contexts. Our experiments and ablation studies show that the RSA network substantially outperforms convolution and self-attention counterparts, achieving the state of the art on the standard motion-centric benchmarks for video action recognition, such as Something-Something-V1 & V2, Diving48, and FineGym.
Submitted 2 November, 2021;
originally announced November 2021.
-
Rate Splitting Multiple Access for Semi-Grant-Free Transmissions
Authors:
Hongwu Liu,
Theodoros A. Tsiftsis,
Bruno Clerckx,
Kyeong Jin Kim,
Kyung Sup Kwak,
H. Vincent Poor
Abstract:
Enabled by hybrid grant-based (GB) and grant-free (GF) transmission techniques, GF users of internet of things (IoT) devices and massive machine-type communications (mMTC) have opportunities to share wireless resources with GB users. In this paper, we propose a rate splitting multiple access (RSMA) strategy for an emerging semi-grant-free (SGF) transmission system to increase connectivity and reliability. In the proposed RSMA assisted SGF (RSMA-SGF) scheme, the GF users apply the rate splitting principle to realize distributed contentions and utilize transmit power most effectively for robust transmissions, while keeping themselves transparent to the GB user. Compared to existing non-orthogonal multiple access (NOMA) assisted SGF schemes, the RSMA-SGF scheme significantly decreases outage probability and achieves full multiuser diversity gain without restricting the GB and GF users' target rates to a limited value region. Exact expressions and asymptotic analysis for the outage probability are provided to facilitate the system performance evaluation of the proposed RSMA-SGF scheme. Computer simulation results clarify the superior outage performance of the RSMA-SGF scheme and verify the accuracy of the developed analytical results.
Submitted 5 October, 2021;
originally announced October 2021.
-
WEDGE: Web-Image Assisted Domain Generalization for Semantic Segmentation
Authors:
Namyup Kim,
Taeyoung Son,
Jaehyun Pahk,
Cuiling Lan,
Wenjun Zeng,
Suha Kwak
Abstract:
Domain generalization for semantic segmentation is in high demand in real applications, where a trained model is expected to work well in previously unseen domains. One challenge lies in the lack of data that could cover the diverse distributions of the possible unseen domains for training. In this paper, we propose a WEb-image assisted Domain GEneralization (WEDGE) scheme, which is the first to exploit the diversity of web-crawled images for generalizable semantic segmentation. To explore and exploit the real-world data distributions, we collect web-crawled images which present large diversity in terms of weather conditions, sites, lighting, camera styles, etc. We also present a method which injects styles of the web-crawled images into training images on-the-fly during training, which enables the network to experience images of diverse styles with reliable labels for effective training. Moreover, we use the web-crawled images with their predicted pseudo labels for training to further enhance the capability of the network. Extensive experiments demonstrate that our method clearly outperforms existing domain generalization techniques.
Submitted 2 May, 2023; v1 submitted 29 September, 2021;
originally announced September 2021.
-
ASMR: Learning Attribute-Based Person Search with Adaptive Semantic Margin Regularizer
Authors:
Boseung Jeong,
Jicheol Park,
Suha Kwak
Abstract:
Attribute-based person search is the task of finding person images that are best matched with a set of text attributes given as a query. The main challenge of this task is the large modality gap between attributes and images. To reduce the gap, we present a new loss for learning cross-modal embeddings in the context of attribute-based person search. We regard a set of attributes as a category of people sharing the same traits. In a joint embedding space of the two modalities, our loss pulls images close to their person categories for modality alignment. More importantly, it pushes apart a pair of person categories by a margin determined adaptively by their semantic distance, where the distance metric is learned end-to-end so that the loss considers the importance of each attribute when relating person categories. Our loss guided by the adaptive semantic margin leads to more discriminative and semantically well-arranged distributions of person images. As a consequence, it enables a simple embedding model to achieve state-of-the-art records on public benchmarks without bells and whistles.
Submitted 10 August, 2021;
originally announced August 2021.
-
On The Distribution of Penultimate Activations of Classification Networks
Authors:
Minkyo Seo,
Yoonho Lee,
Suha Kwak
Abstract:
This paper studies probability distributions of penultimate activations of classification networks. We show that, when a classification network is trained with the cross-entropy loss, its final classification layer forms a Generative-Discriminative pair with a generative classifier based on a specific distribution of penultimate activations. More importantly, the distribution is parameterized by the weights of the final fully-connected layer, and can be considered as a generative model that synthesizes the penultimate activations without feeding input data. We empirically demonstrate that this generative model enables stable knowledge distillation in the presence of domain shift, and can transfer knowledge from a classifier to variational autoencoders and generative adversarial networks for class-conditional image generation.
Submitted 5 July, 2021; v1 submitted 5 July, 2021;
originally announced July 2021.
-
Traffic signal prediction on transportation networks using spatio-temporal correlations on graphs
Authors:
Semin Kwak,
Nikolas Geroliminis,
Pascal Frossard
Abstract:
Multivariate time series forecasting poses challenges as the variables are intertwined in time and space, like in the case of traffic signals. Defining signals on graphs relaxes such complexities by representing the evolution of signals over a space using relevant graph kernels such as the heat diffusion kernel. However, this kernel alone does not fully capture the actual dynamics of the data as it only relies on the graph structure. The gap can be filled by combining the graph kernel representation with data-driven models that utilize historical data. This paper proposes a traffic propagation model that merges multiple heat diffusion kernels into a data-driven prediction model to forecast traffic signals. We optimize the model parameters using Bayesian inference to minimize the prediction errors and, consequently, determine the mixing ratio of the two approaches. Such mixing ratio strongly depends on training data size and data anomalies, which typically correspond to the peak hours for traffic data. The proposed model demonstrates prediction accuracy comparable to that of the state-of-the-art deep neural networks with lower computational effort. It notably achieves excellent performance for long-term prediction through the inheritance of periodicity modeling in data-driven models.
Submitted 5 October, 2021; v1 submitted 27 April, 2021;
originally announced April 2021.
-
Embedding Transfer with Label Relaxation for Improved Metric Learning
Authors:
Sungyeon Kim,
Dongwon Kim,
Minsu Cho,
Suha Kwak
Abstract:
This paper presents a novel method for embedding transfer, the task of transferring the knowledge of a learned embedding model to another. Our method exploits pairwise similarities between samples in the source embedding space as the knowledge, and transfers them through a loss used for learning target embedding models. To this end, we design a new loss called relaxed contrastive loss, which employs the pairwise similarities as relaxed labels for inter-sample relations. Our loss provides a rich supervisory signal beyond class equivalence, enables more important pairs to contribute more to training, and imposes no restriction on manifolds of target embedding spaces. Experiments on metric learning benchmarks demonstrate that our method largely improves performance, or reduces sizes and output dimensions of target models effectively. We further show that it can also be used to enhance the quality of self-supervised representation and the performance of classification models. In all the experiments, our method clearly outperforms existing embedding transfer techniques.
Submitted 27 March, 2021;
originally announced March 2021.
-
Learning Self-Similarity in Space and Time as Generalized Motion for Video Action Recognition
Authors:
Heeseung Kwon,
Manjin Kim,
Suha Kwak,
Minsu Cho
Abstract:
Spatio-temporal convolution often fails to learn motion dynamics in videos and thus an effective motion representation is required for video understanding in the wild. In this paper, we propose a rich and robust motion representation based on spatio-temporal self-similarity (STSS). Given a sequence of frames, STSS represents each local region as similarities to its neighbors in space and time. By converting appearance features into relational values, it enables the learner to better recognize structural patterns in space and time. We leverage the whole volume of STSS and let our model learn to extract an effective motion representation from it. The proposed neural block, dubbed SELFY, can be easily inserted into neural architectures and trained end-to-end without additional supervision. With a sufficient volume of the neighborhood in space and time, it effectively captures long-term interaction and fast motion in the video, leading to robust action recognition. Our experimental analysis demonstrates its superiority over previous methods for motion modeling as well as its complementarity to spatio-temporal features from direct convolution. On the standard action recognition benchmarks, Something-Something-V1 & V2, Diving-48, and FineGym, the proposed method achieves state-of-the-art results.
Submitted 2 November, 2021; v1 submitted 14 February, 2021;
originally announced February 2021.
-
A Comprehensive Utility Function for Resource Allocation in Mobile Edge Computing
Authors:
Zaiwar Ali,
Sadia Khaf,
Ziaul Haq Abba,
Ghulam Abbas,
Lei Jiao,
Amna Irshad,
Kyung Sup Kwak,
Muhammad Bilal
Abstract:
In mobile edge computing (MEC), one of the important challenges is how many resources of which mobile edge server (MES) should be allocated to which user equipment (UE). The existing resource allocation schemes only consider CPU as the requested resource and assume utility for MESs as either a random variable or dependent on the requested CPU only. This paper presents a novel comprehensive utility function for resource allocation in MEC. The utility function considers the heterogeneous nature of applications that a UE offloads to MES. The proposed utility function considers all important parameters, including CPU, RAM, hard disk space, required time, and distance, to calculate a more realistic utility value for MESs. Moreover, we improve upon some general algorithms, used for resource allocation in MEC and cloud computing, by considering our proposed utility function. We name the improved versions of these resource allocation schemes as comprehensive resource allocation schemes. The UE requests are modeled to represent the amount of resources requested by the UE as well as the time for which the UE has requested these resources. The utility function depends upon the UE requests and the distance between UEs and MES, and serves as a realistic means of comparison between different types of UE requests. Selecting an optimal MES with the optimal amount of resources to be allocated to each UE request is a challenging task. We show that MES resource allocation is sub-optimal if CPU is the only resource considered. By taking into account the other resources, i.e., RAM, disk space, request time, and distance in the utility function, we demonstrate improvement in the resource allocation algorithms in terms of service rate, utility, and MES energy consumption.
Submitted 18 December, 2020;
originally announced December 2020.