Showing 1–50 of 120 results for author: Sohn, K

  1. arXiv:2405.20829  [pdf, other]

    cs.CV cs.LG

    Rethinking Open-World Semi-Supervised Learning: Distribution Mismatch and Inductive Inference

    Authors: Seongheon Park, Hyuk Kwon, Kwanghoon Sohn, Kibok Lee

    Abstract: Open-world semi-supervised learning (OWSSL) extends conventional semi-supervised learning to open-world scenarios by taking account of novel categories in unlabeled datasets. Despite the recent advancements in OWSSL, the success often relies on the assumptions that 1) labeled and unlabeled datasets share the same balanced class prior distribution, which does not generally hold in real-world applic…

    Submitted 31 May, 2024; originally announced May 2024.

    Comments: CVPR Workshop on Computer Vision in the Wild (CVinW), 2024

  2. arXiv:2405.13951  [pdf, other]

    cs.CV

    Text Prompting for Multi-Concept Video Customization by Autoregressive Generation

    Authors: Divya Kothandaraman, Kihyuk Sohn, Ruben Villegas, Paul Voigtlaender, Dinesh Manocha, Mohammad Babaeizadeh

    Abstract: We present a method for multi-concept customization of pretrained text-to-video (T2V) models. Intuitively, the multi-concept customized video can be derived from the (non-linear) intersection of the video manifolds of the individual concepts, which is not straightforward to find. We hypothesize that sequential and controlled walking towards the intersection of the video manifolds, directed by text…

    Submitted 22 May, 2024; originally announced May 2024.

    Comments: Paper accepted to AI4CC Workshop at CVPR 2024

  3. arXiv:2405.13779  [pdf, other]

    cs.CV cs.AI cs.LG

    Robust Disaster Assessment from Aerial Imagery Using Text-to-Image Synthetic Data

    Authors: Tarun Kalluri, Jihyeon Lee, Kihyuk Sohn, Sahil Singla, Manmohan Chandraker, Joseph Xu, Jeremiah Liu

    Abstract: We present a simple and efficient method to leverage emerging text-to-image generative models in creating large-scale synthetic supervision for the task of damage assessment from aerial images. While significant recent advances have resulted in improved techniques for damage assessment using aerial or satellite imagery, they still suffer from poor robustness to domains where manual labeled data is…

    Submitted 22 May, 2024; originally announced May 2024.

  4. arXiv:2405.04356  [pdf, other]

    cs.CV

    Diffusion-driven GAN Inversion for Multi-Modal Face Image Generation

    Authors: Jihyun Kim, Changjae Oh, Hoseok Do, Soohyun Kim, Kwanghoon Sohn

    Abstract: We present a new multi-modal face image generation method that converts a text prompt and a visual input, such as a semantic mask or scribble map, into a photo-realistic face image. To do this, we combine the strengths of Generative Adversarial networks (GANs) and diffusion models (DMs) by employing the multi-modal features in the DM into the latent space of the pre-trained GANs. We present a simp…

    Submitted 7 May, 2024; originally announced May 2024.

    Comments: Accepted by CVPR 2024

  5. arXiv:2404.09632  [pdf, other]

    cs.CV cs.LG

    Bridging Vision and Language Spaces with Assignment Prediction

    Authors: Jungin Park, Jiyoung Lee, Kwanghoon Sohn

    Abstract: This paper introduces VLAP, a novel approach that bridges pretrained vision models and large language models (LLMs) to make frozen LLMs understand the visual world. VLAP transforms the embedding space of pretrained vision models into the LLMs' word embedding space using a single linear layer for efficient and general-purpose visual and language understanding. Specifically, we harness well-establis…

    Submitted 15 April, 2024; originally announced April 2024.

    Comments: ICLR 2024 Camera-ready

  6. arXiv:2404.00974  [pdf, other]

    cs.CV

    Improving Visual Recognition with Hyperbolical Visual Hierarchy Mapping

    Authors: Hyeongjun Kwon, Jinhyun Jang, Jin Kim, Kwonyoung Kim, Kwanghoon Sohn

    Abstract: Visual scenes are naturally organized in a hierarchy, where a coarse semantic is recursively comprised of several fine details. Exploring such a visual hierarchy is crucial to recognize the complex relations of visual elements, leading to a comprehensive scene understanding. In this paper, we propose a Visual Hierarchy Mapper (Hi-Mapper), a novel approach for enhancing the structured understanding…

    Submitted 1 April, 2024; originally announced April 2024.

    Comments: This paper is accepted to CVPR 2024. The supplementary material is included. The code is available at https://github.com/kwonjunn01/Hi-Mapper

  7. arXiv:2404.00930  [pdf, other]

    cs.CL

    PSYDIAL: Personality-based Synthetic Dialogue Generation using Large Language Models

    Authors: Ji-Eun Han, Jun-Seok Koh, Hyeon-Tae Seo, Du-Seong Chang, Kyung-Ah Sohn

    Abstract: We present a novel end-to-end personality-based synthetic dialogue data generation pipeline, specifically designed to elicit responses from large language models via prompting. We design the prompts to generate more human-like dialogues considering real-world scenarios when users engage with chatbots. We introduce PSYDIAL, the first Korean dialogue dataset focused on personality-based dialogues, c…

    Submitted 1 April, 2024; originally announced April 2024.

    Comments: LREC-COLING 2024 Main

  8. arXiv:2403.14966  [pdf, other]

    cs.CV

    DreamFlow: High-Quality Text-to-3D Generation by Approximating Probability Flow

    Authors: Kyungmin Lee, Kihyuk Sohn, Jinwoo Shin

    Abstract: Recent progress in text-to-3D generation has been achieved through the utilization of score distillation methods: they make use of the pre-trained text-to-image (T2I) diffusion models by distilling via the diffusion model training objective. However, such an approach inevitably results in the use of random timesteps at each update, which increases the variance of the gradient and ultimately prolon…

    Submitted 22 March, 2024; originally announced March 2024.

    Comments: ICLR 2024

  9. arXiv:2402.12170  [pdf, other]

    cs.CL cs.AI

    Where is the answer? Investigating Positional Bias in Language Model Knowledge Extraction

    Authors: Kuniaki Saito, Kihyuk Sohn, Chen-Yu Lee, Yoshitaka Ushiku

    Abstract: Large language models require updates to remain up-to-date or adapt to new domains by fine-tuning them with new documents. One key is memorizing the latest information in a way that the memorized information is extractable with a query prompt. However, LLMs suffer from a phenomenon called perplexity curse; despite minimizing document perplexity during fine-tuning, LLMs struggle to extract informat…

    Submitted 23 May, 2024; v1 submitted 16 February, 2024; originally announced February 2024.

  10. arXiv:2402.12004  [pdf, other]

    cs.CV

    Direct Consistency Optimization for Compositional Text-to-Image Personalization

    Authors: Kyungmin Lee, Sangkyung Kwak, Kihyuk Sohn, Jinwoo Shin

    Abstract: Text-to-image (T2I) diffusion models, when fine-tuned on a few personal images, are able to generate visuals with a high degree of consistency. However, they still lack in synthesizing images of different scenarios or styles that are possible in the original pretrained models. To address this, we propose to fine-tune the T2I model by maximizing consistency to reference images, while penalizing the…

    Submitted 19 February, 2024; originally announced February 2024.

    Comments: Preprint. See our project page (https://dco-t2i.github.io/) for more examples and codes

  11. arXiv:2401.01952  [pdf, other]

    cs.CV cs.AI cs.CL

    Instruct-Imagen: Image Generation with Multi-modal Instruction

    Authors: Hexiang Hu, Kelvin C. K. Chan, Yu-Chuan Su, Wenhu Chen, Yandong Li, Kihyuk Sohn, Yang Zhao, Xue Ben, Boqing Gong, William Cohen, Ming-Wei Chang, Xuhui Jia

    Abstract: This paper presents instruct-imagen, a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks. We introduce *multi-modal instruction* for image generation, a task representation articulating a range of generation intents with precision. It uses natural language to amalgamate disparate modalities (e.g., text, edge, style, subject, etc.), such that abundant gener…

    Submitted 3 January, 2024; originally announced January 2024.

    Comments: 20 pages, 18 figures

  12. arXiv:2312.14125  [pdf, other]

    cs.CV cs.AI

    VideoPoet: A Large Language Model for Zero-Shot Video Generation

    Authors: Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, Krishna Somandepalli, Hassan Akbari, Yair Alon, Yong Cheng, Josh Dillon, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, Mikhail Sirotenko, Kihyuk Sohn, Xuan Yang, Hartwig Adam, et al. (6 additional authors not shown)

    Abstract: We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and tas…

    Submitted 4 June, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

    Comments: To appear at ICML 2024; Project page: http://sites.research.google/videopoet/

  13. arXiv:2312.06662  [pdf, other]

    cs.CV cs.AI cs.LG

    Photorealistic Video Generation with Diffusion Models

    Authors: Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, José Lezama

    Abstract: We present W.A.L.T, a transformer-based approach for photorealistic video generation via diffusion modeling. Our approach has two key design decisions. First, we use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities. Second, for memory and training efficiency, we use a window attention architecture tailored for…

    Submitted 11 December, 2023; originally announced December 2023.

    Comments: Project website https://walt-video-diffusion.github.io/

  14. arXiv:2312.00950  [pdf, other]

    cs.CV

    Improve Supervised Representation Learning with Masked Image Modeling

    Authors: Kaifeng Chen, Daniel Salz, Huiwen Chang, Kihyuk Sohn, Dilip Krishnan, Mojtaba Seyedhosseini

    Abstract: Training visual embeddings with labeled data supervision has been the de facto setup for representation learning in computer vision. Inspired by recent success of adopting masked image modeling (MIM) in self-supervised representation learning, we propose a simple yet effective setup that can easily integrate MIM into existing supervised training paradigms. In our design, in addition to the origina…

    Submitted 1 December, 2023; originally announced December 2023.

  15. arXiv:2311.05858  [pdf, other]

    cs.LG cs.CV

    Layer-wise Auto-Weighting for Non-Stationary Test-Time Adaptation

    Authors: Junyoung Park, Jin Kim, Hyeongjun Kwon, Ilhoon Yoon, Kwanghoon Sohn

    Abstract: Given the inevitability of domain shifts during inference in real-world applications, test-time adaptation (TTA) is essential for model adaptation after deployment. However, the real-world scenario of continuously changing target distributions presents challenges including catastrophic forgetting and error accumulation. Existing TTA methods for non-stationary domain shifts, while effective, incur…

    Submitted 26 November, 2023; v1 submitted 9 November, 2023; originally announced November 2023.

    Comments: WACV 2024

  16. arXiv:2310.16095  [pdf, other]

    cs.CL cs.CE

    CR-COPEC: Causal Rationale of Corporate Performance Changes to Learn from Financial Reports

    Authors: Ye Eun Chun, Sunjae Kwon, Kyunghwan Sohn, Nakwon Sung, Junyoup Lee, Byungki Seo, Kevin Compher, Seung-won Hwang, Jaesik Choi

    Abstract: In this paper, we introduce CR-COPEC called Causal Rationale of Corporate Performance Changes from financial reports. This is a comprehensive large-scale domain-adaptation causal sentence dataset to detect financial performance changes of corporate. CR-COPEC contributes to two major achievements. First, it detects causal rationale from 10-K annual reports of the U.S. companies, which contain exper…

    Submitted 24 October, 2023; originally announced October 2023.

    Comments: Accepted in Findings of EMNLP 2023

  17. arXiv:2310.05737  [pdf, other]

    cs.CV cs.AI cs.MM

    Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    Authors: Lijun Yu, José Lezama, Nitesh B. Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, Alexander G. Hauptmann, Boqing Gong, Ming-Hsuan Yang, Irfan Essa, David A. Ross, Lu Jiang

    Abstract: While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual tokenizer that maps pixel-space inputs to discrete tokens appropriate for LLM learning. In this paper, we introduce MAGVIT-v2, a video tokenizer…

    Submitted 29 March, 2024; v1 submitted 9 October, 2023; originally announced October 2023.

    Comments: ICLR 2024

  18. arXiv:2308.12949  [pdf, other]

    cs.LG cs.CV

    Label Budget Allocation in Multi-Task Learning

    Authors: Ximeng Sun, Kihyuk Sohn, Kate Saenko, Clayton Mellina, Xiao Bian

    Abstract: The cost of labeling data often limits the performance of machine learning systems. In multi-task learning, related tasks provide information to each other and improve overall performance, but the label cost can vary among tasks. How should the label budget (i.e. the amount of money spent on labeling) be allocated among different tasks to achieve optimal multi-task performance? We are the first to…

    Submitted 24 August, 2023; originally announced August 2023.

  19. arXiv:2308.06947  [pdf, other]

    cs.CV cs.LG

    Knowing Where to Focus: Event-aware Transformer for Video Grounding

    Authors: Jinhyun Jang, Jungin Park, Jin Kim, Hyeongjun Kwon, Kwanghoon Sohn

    Abstract: Recent DETR-based video grounding models have made the model directly predict moment timestamps without any hand-crafted components, such as a pre-defined proposal or non-maximum suppression, by learning moment queries. However, their input-agnostic moment queries inevitably overlook an intrinsic temporal structure of a video, providing limited positional information. In this paper, we formulate a…

    Submitted 14 August, 2023; originally announced August 2023.

    Comments: ICCV 2023. Code is available at https://github.com/jinhyunj/EaTR

  20. arXiv:2308.06945  [pdf, other]

    cs.CV cs.LG

    Semantic-aware Network for Aerial-to-Ground Image Synthesis

    Authors: Jinhyun Jang, Taeyong Song, Kwanghoon Sohn

    Abstract: Aerial-to-ground image synthesis is an emerging and challenging problem that aims to synthesize a ground image from an aerial image. Due to the highly different layout and object representation between the aerial and ground images, existing approaches usually fail to transfer the components of the aerial scene into the ground scene. In this paper, we propose a novel framework to explore the challe…

    Submitted 14 August, 2023; originally announced August 2023.

    Comments: ICIP 2021. Code is available at https://github.com/jinhyunj/SANet

  21. arXiv:2308.04016  [pdf, other]

    cs.CV

    Hierarchical Visual Primitive Experts for Compositional Zero-Shot Learning

    Authors: Hanjae Kim, Jiyoung Lee, Seongheon Park, Kwanghoon Sohn

    Abstract: Compositional zero-shot learning (CZSL) aims to recognize unseen compositions with prior knowledge of known primitives (attribute and object). Previous works for CZSL often suffer from grasping the contextuality between attribute and object, as well as the discriminability of visual features, and the long-tailed distribution of real-world compositional data. We propose a simple and scalable framew…

    Submitted 7 August, 2023; originally announced August 2023.

    Comments: ICCV 2023

  22. arXiv:2307.04787  [pdf, other]

    cs.CV cs.LG

    Collaborative Score Distillation for Consistent Visual Synthesis

    Authors: Subin Kim, Kyungmin Lee, June Suk Choi, Jongheon Jeong, Kihyuk Sohn, Jinwoo Shin

    Abstract: Generative priors of large-scale text-to-image diffusion models enable a wide range of new generation and editing applications on diverse visual modalities. However, when adapting these priors to complex visual modalities, often represented as multiple images (e.g., video), achieving consistency across a set of images is challenging. In this paper, we address this challenge with a novel method, Co…

    Submitted 4 July, 2023; originally announced July 2023.

    Comments: Project page with visuals: https://subin-kim-cv.github.io/CSD/

  23. arXiv:2306.00983  [pdf, other]

    cs.CV cs.AI

    StyleDrop: Text-to-Image Generation in Any Style

    Authors: Kihyuk Sohn, Nataniel Ruiz, Kimin Lee, Daniel Castro Chin, Irina Blok, Huiwen Chang, Jarred Barber, Lu Jiang, Glenn Entis, Yuanzhen Li, Yuan Hao, Irfan Essa, Michael Rubinstein, Dilip Krishnan

    Abstract: Pre-trained large text-to-image models synthesize impressive images with an appropriate use of text prompts. However, ambiguities inherent in natural language and out-of-distribution effects make it hard to synthesize image styles, that leverage a specific design pattern, texture or material. In this paper, we introduce StyleDrop, a method that enables the synthesis of images that faithfully follo…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: Preprint. Project page at https://styledrop.github.io

  24. arXiv:2306.00763  [pdf, other]

    cs.CV cs.AI

    Learning Disentangled Prompts for Compositional Image Synthesis

    Authors: Kihyuk Sohn, Albert Shaw, Yuan Hao, Han Zhang, Luisa Polania, Huiwen Chang, Lu Jiang, Irfan Essa

    Abstract: We study domain-adaptive image synthesis, the problem of teaching pretrained image generative models a new style or concept from as few as one image to synthesize novel images, to better understand the compositional image synthesis. We present a framework that leverages a pretrained class-conditional generation model and visual prompt tuning. Specifically, we propose a novel source class distilled…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: tech report

  25. arXiv:2305.02549  [pdf, other]

    cs.CL cs.CV cs.LG

    FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction

    Authors: Chen-Yu Lee, Chun-Liang Li, Hao Zhang, Timothy Dozat, Vincent Perot, Guolong Su, Xiang Zhang, Kihyuk Sohn, Nikolai Glushnev, Renshen Wang, Joshua Ainslie, Shangbang Long, Siyang Qin, Yasuhisa Fujii, Nan Hua, Tomas Pfister

    Abstract: The recent advent of self-supervised pre-training techniques has led to a surge in the use of multimodal learning in form document understanding. However, existing approaches that extend the mask language modeling to other modalities require careful multi-task tuning, complex reconstruction target designs, or additional pre-training data. In FormNetV2, we introduce a centralized multimodal graph c…

    Submitted 13 June, 2023; v1 submitted 4 May, 2023; originally announced May 2023.

    Comments: Accepted to ACL 2023

  26. arXiv:2304.01537  [pdf, other]

    cs.CV

    PartMix: Regularization Strategy to Learn Part Discovery for Visible-Infrared Person Re-identification

    Authors: Minsu Kim, Seungryong Kim, JungIn Park, Seongheon Park, Kwanghoon Sohn

    Abstract: Modern data augmentation using a mixture-based technique can regularize the models from overfitting to the training data in various computer vision applications, but a proper data augmentation technique tailored for the part-based Visible-Infrared person Re-IDentification (VI-ReID) models remains unexplored. In this paper, we present a novel data augmentation technique, dubbed PartMix, that synthe…

    Submitted 4 April, 2023; originally announced April 2023.

    Comments: CVPR 2023

  27. arXiv:2304.00779  [pdf, other]

    cs.CV cs.AI

    Probabilistic Prompt Learning for Dense Prediction

    Authors: Hyeongjun Kwon, Taeyong Song, Somi Jeong, Jin Kim, Jinhyun Jang, Kwanghoon Sohn

    Abstract: Recent progress in deterministic prompt learning has become a promising alternative to various downstream vision tasks, enabling models to learn powerful visual representations with the help of pre-trained vision-language models. However, this approach results in limited performance for dense prediction tasks that require handling more complex and diverse objects, since a single and deterministic…

    Submitted 3 April, 2023; originally announced April 2023.

    Comments: accepted to CVPR 2023

  28. arXiv:2303.09857  [pdf, other]

    cs.CV cs.LG

    Dual-path Adaptation from Image to Video Transformers

    Authors: Jungin Park, Jiyoung Lee, Kwanghoon Sohn

    Abstract: In this paper, we efficiently transfer the surpassing representation power of the vision foundation models, such as ViT and Swin, for video understanding with only a few trainable parameters. Previous adaptation methods have simultaneously considered spatial and temporal modeling with a unified learnable module but still suffered from fully leveraging the representative capabilities of image trans…

    Submitted 17 March, 2023; originally announced March 2023.

    Comments: CVPR 2023. Code is available at https://github.com/park-jungin/DualPath

  29. arXiv:2303.09055  [pdf, other]

    cs.CV

    TemporalMaxer: Maximize Temporal Context with only Max Pooling for Temporal Action Localization

    Authors: Tuan N. Tang, Kwonyoung Kim, Kwanghoon Sohn

    Abstract: Temporal Action Localization (TAL) is a challenging task in video understanding that aims to identify and localize actions within a video sequence. Recent studies have emphasized the importance of applying long-term temporal context modeling (TCM) blocks to the extracted video clip features such as employing complex self-attention mechanisms. In this paper, we present the simplest method ever to a…

    Submitted 15 March, 2023; originally announced March 2023.

  30. arXiv:2302.07685  [pdf, other]

    cs.CV cs.LG

    Video Probabilistic Diffusion Models in Projected Latent Space

    Authors: Sihyun Yu, Kihyuk Sohn, Subin Kim, Jinwoo Shin

    Abstract: Despite the remarkable progress in deep generative models, synthesizing high-resolution and temporally coherent videos still remains a challenge due to their high-dimensionality and complex temporal dynamics along with large spatial variations. Recent works on diffusion models have shown their potential to solve this challenge, yet they suffer from severe computation- and memory-inefficiency that…

    Submitted 30 March, 2023; v1 submitted 15 February, 2023; originally announced February 2023.

    Comments: CVPR 2023. Project page: https://sihyun.me/PVDM

  31. arXiv:2302.05496  [pdf, other]

    cs.CV cs.AI

    MaskSketch: Unpaired Structure-guided Masked Image Generation

    Authors: Dina Bashkirova, Jose Lezama, Kihyuk Sohn, Kate Saenko, Irfan Essa

    Abstract: Recent conditional image generation methods produce images of remarkable diversity, fidelity and realism. However, the majority of these methods allow conditioning only on labels or text prompts, which limits their level of control over the generation result. In this paper, we introduce MaskSketch, an image generation method that allows spatial conditioning of the generation result using a guiding…

    Submitted 10 February, 2023; originally announced February 2023.

  32. arXiv:2302.03084  [pdf, other]

    cs.CV

    Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval

    Authors: Kuniaki Saito, Kihyuk Sohn, Xiang Zhang, Chun-Liang Li, Chen-Yu Lee, Kate Saenko, Tomas Pfister

    Abstract: In Composed Image Retrieval (CIR), a user combines a query image with text to describe their intended target. Existing methods rely on supervised learning of CIR models using labeled triplets consisting of the query image, text specification, and the target image. Labeling such triplets is expensive and hinders broad applicability of CIR. In this work, we propose to study an important task, Zero-S…

    Submitted 15 May, 2023; v1 submitted 6 February, 2023; originally announced February 2023.

    Comments: CVPR2023

  33. arXiv:2212.05199  [pdf, other]

    cs.CV

    MAGVIT: Masked Generative Video Transformer

    Authors: Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G. Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, Lu Jiang

    Abstract: We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model. We introduce a 3D tokenizer to quantize a video into spatial-temporal visual tokens and propose an embedding method for masked video token modeling to facilitate multi-task learning. We conduct extensive experiments to demonstrate the quality, efficiency, and flexibility of MA…

    Submitted 4 April, 2023; v1 submitted 9 December, 2022; originally announced December 2022.

    Comments: CVPR 2023 highlight

  34. arXiv:2212.00173  [pdf, other]

    cs.LG

    SPADE: Semi-supervised Anomaly Detection under Distribution Mismatch

    Authors: Jinsung Yoon, Kihyuk Sohn, Chun-Liang Li, Sercan O. Arik, Tomas Pfister

    Abstract: Semi-supervised anomaly detection is a common problem, as often the datasets containing anomalies are partially labeled. We propose a canonical framework: Semi-supervised Pseudo-labeler Anomaly Detection with Ensembling (SPADE) that isn't limited by the assumption that labeled and unlabeled data come from the same distribution. Indeed, the assumption is often violated in many applications - for ex…

    Submitted 30 November, 2022; originally announced December 2022.

  35. arXiv:2211.04905  [pdf, other]

    cs.CV

    SimOn: A Simple Framework for Online Temporal Action Localization

    Authors: Tuan N. Tang, Jungin Park, Kwonyoung Kim, Kwanghoon Sohn

    Abstract: Online Temporal Action Localization (On-TAL) aims to immediately provide action instances from untrimmed streaming videos. The model is not allowed to utilize future frames and any processing techniques to modify past predictions, making On-TAL much more challenging. In this paper, we propose a simple yet effective framework, termed SimOn, that learns to predict action instances using the popular…

    Submitted 7 November, 2022; originally announced November 2022.

  36. arXiv:2211.00243  [pdf, other]

    cs.CL cs.AI cs.SI

    Why Is It Hate Speech? Masked Rationale Prediction for Explainable Hate Speech Detection

    Authors: Jiyun Kim, Byounghan Lee, Kyung-Ah Sohn

    Abstract: In a hate speech detection model, we should consider two critical aspects in addition to detection performance: bias and explainability. Hate speech cannot be identified based solely on the presence of specific words: the model should be able to reason like humans and be explainable. To improve the performance concerning the two aspects, we propose Masked Rationale Prediction (MRP) as an intermedia…

    Submitted 31 October, 2022; originally announced November 2022.

    Comments: 10 pages, 4 figures, 3 tables. Accepted at COLING 2022

    MSC Class: 68T50 (Primary); 91F20 (Secondary)

  37. arXiv:2210.12977  [pdf, other]

    cs.CV

    Language-free Training for Zero-shot Video Grounding

    Authors: Dahye Kim, Jungin Park, Jiyoung Lee, Seongheon Park, Kwanghoon Sohn

    Abstract: Given an untrimmed video and a language query depicting a specific temporal moment in the video, video grounding aims to localize the time interval by understanding the text and video simultaneously. One of the most challenging issues is an extremely time- and cost-consuming annotation collection, including video captions in a natural language form and their corresponding temporal regions. In this…

    Submitted 24 October, 2022; originally announced October 2022.

    Comments: Accepted to WACV 2023

  38. arXiv:2210.00990  [pdf, other]

    cs.CV cs.AI

    Visual Prompt Tuning for Generative Transfer Learning

    Authors: Kihyuk Sohn, Yuan Hao, José Lezama, Luisa Polania, Huiwen Chang, Han Zhang, Irfan Essa, Lu Jiang

    Abstract: Transferring knowledge from an image synthesis model trained on a large dataset is a promising direction for learning generative image models from various domains efficiently. While previous works have studied GAN models, we present a recipe for learning vision transformers by generative knowledge transfer. We base our framework on state-of-the-art generative vision transformers that represent an…

    Submitted 3 October, 2022; originally announced October 2022.

    Comments: technical report

  39. arXiv:2209.05771  [pdf, other]

    eess.IV cs.CV

    Moving from 2D to 3D: volumetric medical image classification for rectal cancer staging

    Authors: Joohyung Lee, Jieun Oh, Inkyu Shin, You-sung Kim, Dae Kyung Sohn, Tae-sung Kim, In So Kweon

    Abstract: Volumetric images from Magnetic Resonance Imaging (MRI) provide invaluable information in preoperative staging of rectal cancer. Above all, accurate preoperative discrimination between T2 and T3 stages is arguably both the most challenging and clinically significant task for rectal cancer treatment, as chemo-radiotherapy is usually recommended to patients with T3 (or greater) stage cancer. In this…

    Submitted 13 September, 2022; originally announced September 2022.

    Comments: 11 pages, 2 figures, accepted to MICCAI 2022

  40. arXiv:2207.13340  [pdf, other]

    cs.CV cs.LG

    PointFix: Learning to Fix Domain Bias for Robust Online Stereo Adaptation

    Authors: Kwonyoung Kim, Jungin Park, Jiyoung Lee, Dongbo Min, Kwanghoon Sohn

    Abstract: Online stereo adaptation tackles the domain shift problem, caused by different environments between synthetic (training) and real (test) datasets, to promptly adapt stereo models in dynamic real-world applications such as autonomous driving. However, previous methods often fail to counteract particular regions related to dynamic objects with more severe environmental changes. To mitigate this issu…

    Submitted 27 July, 2022; originally announced July 2022.

    Comments: Accepted to ECCV 2022

  41. arXiv:2206.01125  [pdf, other]

    cs.CV

    Prefix Conditioning Unifies Language and Label Supervision

    Authors: Kuniaki Saito, Kihyuk Sohn, Xiang Zhang, Chun-Liang Li, Chen-Yu Lee, Kate Saenko, Tomas Pfister

    Abstract: Image-classification datasets have been used to pretrain image recognition models. Recently, web-scale image-caption datasets have emerged as a source of powerful pretraining alternative. Image-caption datasets are more "open-domain", containing a wider variety of scene types and vocabulary words than traditional classification datasets, and models trained on these datasets have demonstrated str…

    Submitted 15 May, 2023; v1 submitted 2 June, 2022; originally announced June 2022.

    Comments: CVPR2023

  42. arXiv:2205.13921  [pdf, other]

    cs.LG cs.AI

    Federated Semi-Supervised Learning with Prototypical Networks

    Authors: Woojung Kim, Keondo Park, Kihyuk Sohn, Raphael Shu, Hyung-Sin Kim

    Abstract: With the increasing computing power of edge devices, Federated Learning (FL) emerges to enable model training without privacy concerns. The majority of existing studies assume the data are fully labeled on the client side. In practice, however, the amount of labeled data is often limited. Recently, federated semi-supervised learning (FSSL) is explored as a way to effectively utilize unlabeled data…

    Submitted 30 May, 2022; v1 submitted 27 May, 2022; originally announced May 2022.

  43. arXiv:2204.03946  [pdf, other]

    cs.CV

    Probabilistic Representations for Video Contrastive Learning

    Authors: Jungin Park, Jiyoung Lee, Ig-Jae Kim, Kwanghoon Sohn

    Abstract: This paper presents Probabilistic Video Contrastive Learning, a self-supervised representation learning method that bridges contrastive learning with probabilistic representation. We hypothesize that the clips composing the video have different distributions in short-term duration, but can represent the complicated and sophisticated video distribution through combination in a common embedding spac…

    Submitted 8 April, 2022; originally announced April 2022.

    Comments: CVPR 2022

  44. arXiv:2204.03609  [pdf, other]

    cs.CV cs.LG

    Pin the Memory: Learning to Generalize Semantic Segmentation

    Authors: Jin Kim, Jiyoung Lee, Jungin Park, Dongbo Min, Kwanghoon Sohn

    Abstract: The rise of deep neural networks has led to several breakthroughs for semantic segmentation. In spite of this, a model trained on source domain often fails to work properly in new challenging domains, that is directly concerned with the generalization capability of the model. In this paper, we present a novel memory-guided domain generalization method for semantic segmentation based on meta-learni…

    Submitted 30 May, 2022; v1 submitted 7 April, 2022; originally announced April 2022.

    Comments: Accepted to CVPR 2022

  45. arXiv:2202.06513  [pdf, other]

    cs.CV

    Context-Preserving Instance-Level Augmentation and Deformable Convolution Networks for SAR Ship Detection

    Authors: Taeyong Song, Sunok Kim, SungTai Kim, Jaeseok Lee, Kwanghoon Sohn

    Abstract: Shape deformation of targets in SAR image due to random orientation and partial information loss caused by occlusion of the radar signal, is an essential challenge in SAR ship detection. In this paper, we propose a data augmentation method to train a deep network that is robust to partial information loss within the targets. Taking advantage of ground-truth annotations for bounding box and instanc…

    Submitted 14 February, 2022; originally announced February 2022.

    Comments: Accepted to 2022 IEEE Radar Conference

  46. arXiv:2202.02779  [pdf, other]

    cs.CV

    Multi-domain Unsupervised Image-to-Image Translation with Appearance Adaptive Convolution

    Authors: Somi Jeong, Jiyoung Lee, Kwanghoon Sohn

    Abstract: Over the past few years, image-to-image (I2I) translation methods have been proposed to translate a given image into diverse outputs. Despite the impressive results, they mainly focus on the I2I translation between two domains, so the multi-domain I2I translation still remains a challenge. To address this problem, we propose a novel multi-domain unsupervised image-to-image translation (MDUIT) fram…

    Submitted 6 February, 2022; originally announced February 2022.

    Comments: ICASSP 2022

  47. arXiv:2201.03668  [pdf, other]

    cs.LG cs.AI cs.CV stat.ML

    Towards Group Robustness in the presence of Partial Group Labels

    Authors: Vishnu Suresh Lokhande, Kihyuk Sohn, Jinsung Yoon, Madeleine Udell, Chen-Yu Lee, Tomas Pfister

    Abstract: Learning invariant representations is an important requirement when training machine learning models that are driven by spurious correlations in the datasets. These spurious correlations, between input samples and the target labels, wrongly direct the neural network predictions resulting in poor performance on certain groups, especially the minority groups. Robust training against these spurious c…

    Submitted 10 January, 2022; originally announced January 2022.

  48. Memory-guided Image De-raining Using Time-Lapse Data

    Authors: Jaehoon Cho, Seungryong Kim, Kwanghoon Sohn

    Abstract: This paper addresses the problem of single image de-raining, that is, the task of recovering clean and rain-free background scenes from a single image obscured by a rainy artifact. Although recent advances adopt real-world time-lapse data to overcome the need for paired rain-clean images, they are limited to fully exploit the time-lapse data. The main cause is that, in terms of network architectur…

    Submitted 5 January, 2022; originally announced January 2022.

  49. arXiv:2112.11573  [pdf, other]

    cs.CV

    Anomaly Clustering: Grouping Images into Coherent Clusters of Anomaly Types

    Authors: Kihyuk Sohn, Jinsung Yoon, Chun-Liang Li, Chen-Yu Lee, Tomas Pfister

    Abstract: We study anomaly clustering, grouping data into coherent clusters of anomaly types. This is different from anomaly detection that aims to divide anomalies from normal data. Unlike object-centered image clustering, anomaly clustering is particularly challenging as anomalous patterns are subtle and local. We present a simple yet effective clustering framework using a patch-based pretrained deep embe…

    Submitted 14 October, 2022; v1 submitted 21 December, 2021; originally announced December 2021.

    Comments: WACV2023

  50. arXiv:2111.04982  [pdf, other]

    cs.CV

    Dual Prototypical Contrastive Learning for Few-shot Semantic Segmentation

    Authors: Hyeongjun Kwon, Somi Jeong, Sunok Kim, Kwanghoon Sohn

    Abstract: We address the problem of few-shot semantic segmentation (FSS), which aims to segment novel class objects in a target image with a few annotated samples. Though recent advances have been made by incorporating prototype-based metric learning, existing methods still show limited performance under extreme intra-class object variations and semantically similar inter-class objects due to their poor fea…

    Submitted 9 November, 2021; originally announced November 2021.

    Comments: 8 pages, 7 figures, https://github.com/kwonjunn01/DPCL