Showing 1–50 of 312 results for author: Kim, W

  1. arXiv:2407.10733  [pdf, other]

    cs.CV

    Joint-Embedding Predictive Architecture for Self-Supervised Learning of Mask Classification Architecture

    Authors: Dong-Hee Kim, Sungduk Cho, Hyeonwoo Cho, Chanmin Park, Jinyoung Kim, Won Hwa Kim

    Abstract: In this work, we introduce Mask-JEPA, a self-supervised learning framework tailored for mask classification architectures (MCA), to overcome the traditional constraints associated with training segmentation models. Mask-JEPA combines a Joint Embedding Predictive Architecture with MCA to adeptly capture intricate semantics and precise object boundaries. Our approach addresses two critical challenge…

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: 27 pages, 5 figures

  2. arXiv:2407.09303  [pdf, other]

    cs.CV

    ProDepth: Boosting Self-Supervised Multi-Frame Monocular Depth with Probabilistic Fusion

    Authors: Sungmin Woo, Wonjoon Lee, Woo Jin Kim, Dogyoon Lee, Sangyoun Lee

    Abstract: Self-supervised multi-frame monocular depth estimation relies on the geometric consistency between successive frames under the assumption of a static scene. However, the presence of moving objects in dynamic scenes introduces inevitable inconsistencies, causing misaligned multi-frame feature matching and misleading self-supervision during training. In this paper, we propose a novel framework calle…

    Submitted 12 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV 2024. Project Page: https://sungmin-woo.github.io/prodepth/

  3. RISC-V R-Extension: Advancing Efficiency with Rented-Pipeline for Edge DNN Processing

    Authors: Won Hyeok Kim, Hyeong Jin Kim, Tae Hee Han

    Abstract: The proliferation of edge devices necessitates efficient computational architectures for lightweight tasks, particularly deep neural network (DNN) inference. Traditional NPUs, though effective for such operations, face challenges in power, cost, and area when integrated into lightweight edge devices. The RISC-V architecture, known for its modularity and open-source nature, offers a viable alternat…

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: 6 pages, 6 figures, ICAIIC 2024

  4. arXiv:2406.19148  [pdf, other]

    cs.CV cs.AI

    BackMix: Mitigating Shortcut Learning in Echocardiography with Minimal Supervision

    Authors: Kit Mills Bransby, Arian Beqiri, Woo-Jin Cho Kim, Jorge Oliveira, Agisilaos Chartsias, Alberto Gomez

    Abstract: Neural networks can learn spurious correlations that lead to the correct prediction in a validation set, but generalise poorly because the predictions are right for the wrong reason. This undesired learning of naive shortcuts (Clever Hans effect) can happen for example in echocardiogram view classification when background cues (e.g. metadata) are biased towards a class and the model learns to focu…

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: Accepted at MICCAI 2024 (Pre-print)

  5. arXiv:2406.17869  [pdf, other]

    cs.CV

    Burst Image Super-Resolution with Base Frame Selection

    Authors: Sanghyun Kim, Min Jung Lee, Woohyeok Kim, Deunsol Jung, Jaesung Rim, Sunghyun Cho, Minsu Cho

    Abstract: Burst image super-resolution has been a topic of active research in recent years due to its ability to obtain a high-resolution image by using complementary information between multiple frames in the burst. In this work, we explore using burst shots with non-uniform exposures to confront real-world practical scenarios by introducing a new benchmark dataset, dubbed Non-uniformly Exposed Burst Image…

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: CVPR2024W NTIRE accepted

  6. arXiv:2406.12246  [pdf, other]

    cs.LG cs.CL cs.CV

    TroL: Traversal of Layers for Large Language and Vision Models

    Authors: Byung-Kwan Lee, Sangyun Chung, Chae Won Kim, Beomchan Park, Yong Man Ro

    Abstract: Large language and vision models (LLVMs) have been driven by the generalization power of large language models (LLMs) and the advent of visual instruction tuning. Along with scaling them up directly, these models enable LLVMs to showcase powerful vision language (VL) performances by covering diverse tasks via natural language instructions. However, existing open-source LLVMs that perform comparabl…

    Submitted 19 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

    Comments: Code is available in https://github.com/ByungKwanLee/TroL

  7. arXiv:2406.12095  [pdf, other]

    cs.CV cs.AI cs.RO

    DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features

    Authors: Letian Wang, Seung Wook Kim, Jiawei Yang, Cunjun Yu, Boris Ivanovic, Steven L. Waslander, Yue Wang, Sanja Fidler, Marco Pavone, Peter Karkus

    Abstract: We propose DistillNeRF, a self-supervised learning framework addressing the challenge of understanding 3D environments from limited 2D observations in autonomous driving. Our method is a generalizable feedforward model that predicts a rich neural scene representation from sparse, single-frame multi-view camera inputs, and is trained self-supervised with differentiable rendering to reconstruct RGB,…

    Submitted 17 June, 2024; originally announced June 2024.

  8. arXiv:2406.11427  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer

    Authors: Keon Lee, Dong Won Kim, Jaehyeon Kim, Jaewoong Cho

    Abstract: Large-scale diffusion models have shown outstanding generative abilities across multiple modalities including images, videos, and audio. However, text-to-speech (TTS) systems typically involve domain-specific modeling factors (e.g., phonemes and phoneme-level durations) to ensure precise temporal alignments between text and speech, which hinders the efficiency and scalability of diffusion models f…

    Submitted 17 June, 2024; originally announced June 2024.

  9. arXiv:2406.11313  [pdf, other]

    cs.CV

    Semi-Supervised Domain Adaptation Using Target-Oriented Domain Augmentation for 3D Object Detection

    Authors: Yecheol Kim, Junho Lee, Changsoo Park, Hyoung won Kim, Inho Lim, Christopher Chang, Jun Won Choi

    Abstract: 3D object detection is crucial for applications like autonomous driving and robotics. However, in real-world environments, variations in sensor data distribution due to sensor upgrades, weather changes, and geographic differences can adversely affect detection performance. Semi-Supervised Domain Adaptation (SSDA) aims to mitigate these challenges by transferring knowledge from a source domain, abu…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: Accepted to IEEE Transactions on Intelligent Vehicles (T-IV). The code is available at: https://github.com/rasd3/TODA

  10. arXiv:2406.10324  [pdf, other]

    cs.CV cs.LG

    L4GM: Large 4D Gaussian Reconstruction Model

    Authors: Jiawei Ren, Kevin Xie, Ashkan Mirzaei, Hanxue Liang, Xiaohui Zeng, Karsten Kreis, Ziwei Liu, Antonio Torralba, Sanja Fidler, Seung Wook Kim, Huan Ling

    Abstract: We present L4GM, the first 4D Large Reconstruction Model that produces animated objects from a single-view video input -- in a single feed-forward pass that takes only a second. Key to our success is a novel dataset of multiview videos containing curated, rendered animated objects from Objaverse. This dataset depicts 44K diverse objects with 110K animations rendered in 48 viewpoints, resulting in…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Project page: https://research.nvidia.com/labs/toronto-ai/l4gm

  11. arXiv:2406.09188  [pdf, ps, other]

    cs.CV cs.IR

    Reducing Task Discrepancy of Text Encoders for Zero-Shot Composed Image Retrieval

    Authors: Jaeseok Byun, Seokhyeon Jeong, Wonjae Kim, Sanghyuk Chun, Taesup Moon

    Abstract: Composed Image Retrieval (CIR) aims to retrieve a target image based on a reference image and conditioning text, enabling controllable searches. Due to the expensive dataset construction cost for CIR triplets, a zero-shot (ZS) CIR setting has been actively studied to eliminate the need for human-collected triplet datasets. The mainstream of ZS-CIR employs an efficient projection module that projec…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: 17 pages

  12. arXiv:2406.07867  [pdf, other]

    cs.CV cs.AI cs.HC

    Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation

    Authors: Se Jin Park, Chae Won Kim, Hyeongseop Rha, Minsu Kim, Joanna Hong, Jeong Hun Yeo, Yong Man Ro

    Abstract: In this paper, we introduce a novel Face-to-Face spoken dialogue model. It processes audio-visual speech from user input and generates audio-visual speech as the response, marking the initial step towards creating an avatar chatbot system without relying on intermediate text. To this end, we newly introduce MultiDialog, the first large-scale multimodal (i.e., audio and visual) spoken dialogue corp…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted to ACL 2024

  13. arXiv:2406.06650  [pdf, other]

    eess.IV cs.CV

    Predicting the risk of early-stage breast cancer recurrence using H&E-stained tissue images

    Authors: Geongyu Lee, Joonho Lee, Tae-Yeong Kwak, Sun Woo Kim, Youngmee Kwon, Chungyeul Kim, Hyeyoon Chang

    Abstract: Accurate prediction of the likelihood of recurrence is important in the selection of postoperative treatment for patients with early-stage breast cancer. In this study, we investigated whether deep learning algorithms can predict patients' risk of recurrence by analyzing the pathology images of their cancer histology. A total of 125 hematoxylin and eosin stained breast cancer whole slide images la…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: 12 pages, 7 figures

  14. arXiv:2406.06149  [pdf, other]

    cs.LG stat.ML

    Decoupled Marked Temporal Point Process using Neural Ordinary Differential Equations

    Authors: Yujee Song, Donghyun Lee, Rui Meng, Won Hwa Kim

    Abstract: A Marked Temporal Point Process (MTPP) is a stochastic process whose realization is a set of event-time data. MTPP is often used to understand complex dynamics of asynchronous temporal events such as money transaction, social media, healthcare, etc. Recent studies have utilized deep neural networks to capture complex temporal dependencies of events and generate embedding that aptly represent the o…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: 18 pages, 8 figures, The Twelfth International Conference on Learning Representations (ICLR 2024)

  15. arXiv:2405.19961  [pdf, other]

    cs.LG

    Collective Variable Free Transition Path Sampling with Generative Flow Network

    Authors: Kiyoung Seong, Seonghyun Park, Seonghwan Kim, Woo Youn Kim, Sungsoo Ahn

    Abstract: Understanding transition paths between meta-stable states in molecular systems is fundamental for material design and drug discovery. However, sampling these paths via molecular dynamics simulations is computationally prohibitive due to the high-energy barriers between the meta-stable states. Recent machine learning approaches are often restricted to simple systems or rely on collective variables…

    Submitted 31 May, 2024; v1 submitted 30 May, 2024; originally announced May 2024.

    Comments: 9 pages, 5 figures, 2 tables

  16. arXiv:2405.16861  [pdf, other]

    q-bio.BM cs.LG physics.bio-ph

    NCIDiff: Non-covalent Interaction-generative Diffusion Model for Improving Reliability of 3D Molecule Generation Inside Protein Pocket

    Authors: Joongwon Lee, Wonho Zhung, Woo Youn Kim

    Abstract: Advancements in deep generative modeling have changed the paradigm of drug discovery. Among such approaches, target-aware methods that exploit 3D structures of protein pockets were spotlighted for generating ligand molecules with their plausible binding modes. While docking scores superficially assess the quality of generated ligands, closer inspection of the binding structures reveals the inconsi…

    Submitted 27 May, 2024; originally announced May 2024.

  17. arXiv:2405.15574  [pdf, other]

    cs.CV

    Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models

    Authors: Byung-Kwan Lee, Chae Won Kim, Beomchan Park, Yong Man Ro

    Abstract: The rapid development of large language and vision models (LLVMs) has been driven by advances in visual instruction tuning. Recently, open-source LLVMs have curated high-quality visual instruction tuning datasets and utilized additional vision encoders or multiple computer vision models in order to narrow the performance gap with powerful closed-source LLVMs. These advancements are attributed to m…

    Submitted 27 May, 2024; v1 submitted 24 May, 2024; originally announced May 2024.

    Comments: Code is available in https://github.com/ByungKwanLee/Meteor

  18. arXiv:2405.14126  [pdf, other]

    cs.LG cs.AI cs.CV

    The Disappearance of Timestep Embedding in Modern Time-Dependent Neural Networks

    Authors: Bum Jun Kim, Yoshinobu Kawahara, Sang Woo Kim

    Abstract: Dynamical systems are often time-varying, whose modeling requires a function that evolves with respect to time. Recent studies such as the neural ordinary differential equation proposed a time-dependent neural network, which provides a neural network varying with respect to time. However, we claim that the architectural choice to build a time-dependent neural network significantly affects its time…

    Submitted 22 May, 2024; originally announced May 2024.

    Comments: 14 pages, 7 figures

  19. arXiv:2405.14115  [pdf, other]

    cs.CV cs.AI cs.LG

    Configuring Data Augmentations to Reduce Variance Shift in Positional Embedding of Vision Transformers

    Authors: Bum Jun Kim, Sang Woo Kim

    Abstract: Vision transformers (ViTs) have demonstrated remarkable performance in a variety of vision tasks. Despite their promising capabilities, training a ViT requires a large amount of diverse data. Several studies empirically found that using rich data augmentations, such as Mixup, Cutmix, and random erasing, is critical to the successful training of ViTs. Now, the use of rich data augmentations has bec…

    Submitted 22 May, 2024; originally announced May 2024.

    Comments: 16 pages, 4 figures

  20. arXiv:2405.02066  [pdf, other]

    cs.CV eess.IV

    WateRF: Robust Watermarks in Radiance Fields for Protection of Copyrights

    Authors: Youngdong Jang, Dong In Lee, MinHyuk Jang, Jong Wook Kim, Feng Yang, Sangpil Kim

    Abstract: The advances in the Neural Radiance Fields (NeRF) research offer extensive applications in diverse domains, but protecting their copyrights has not yet been researched in depth. Recently, NeRF watermarking has been considered one of the pivotal solutions for safely deploying NeRF-based 3D representations. However, existing methods are designed to apply only to implicit or explicit NeRF representat…

    Submitted 11 July, 2024; v1 submitted 3 May, 2024; originally announced May 2024.

  21. arXiv:2405.00748  [pdf, other]

    cs.HC cs.AI cs.CY

    ChatGPT in Data Visualization Education: A Student Perspective

    Authors: Nam Wook Kim, Hyung-Kwon Ko, Grace Myers, Benjamin Bach

    Abstract: Unlike traditional educational chatbots that rely on pre-programmed responses, large-language model-driven chatbots, such as ChatGPT, demonstrate remarkable versatility and have the potential to serve as a dynamic resource for addressing student needs from understanding advanced concepts to solving complex problems. This work explores the impact of such technology on student learning in an interdi…

    Submitted 30 April, 2024; originally announced May 2024.

    Comments: 12 pages; 3 figures

  22. arXiv:2405.00021  [pdf, other]

    cs.CV cs.AI cs.CL

    SIMPLOT: Enhancing Chart Question Answering by Distilling Essentials

    Authors: Wonjoong Kim, Sangwu Park, Yeonjun In, Seokwon Han, Chanyoung Park

    Abstract: Recently, interpreting complex charts with logical reasoning has emerged as challenges due to the development of vision-language models. A prior state-of-the-art (SOTA) model has presented an end-to-end method that leverages the vision-language model to convert charts into table format utilizing Large Language Model (LLM) for reasoning. However, unlike natural images, charts contain a mix of essen…

    Submitted 17 June, 2024; v1 submitted 22 February, 2024; originally announced May 2024.

  23. arXiv:2404.17507  [pdf, other]

    cs.CV

    HYPE: Hyperbolic Entailment Filtering for Underspecified Images and Texts

    Authors: Wonjae Kim, Sanghyuk Chun, Taekyung Kim, Dongyoon Han, Sangdoo Yun

    Abstract: In an era where the volume of data drives the effectiveness of self-supervised learning, the specificity and clarity of data semantics play a crucial role in model training. Addressing this, we introduce HYPerbolic Entailment filtering (HYPE), a novel methodology designed to meticulously extract modality-wise meaningful and well-aligned data from extensive, noisy image-text pair datasets. Our appr…

    Submitted 16 July, 2024; v1 submitted 26 April, 2024; originally announced April 2024.

    Comments: ECCV 2024; 33 pages, 4.5MB

  24. arXiv:2404.10765  [pdf, other]

    cs.CV

    RefFusion: Reference Adapted Diffusion Models for 3D Scene Inpainting

    Authors: Ashkan Mirzaei, Riccardo De Lutio, Seung Wook Kim, David Acuna, Jonathan Kelly, Sanja Fidler, Igor Gilitschenski, Zan Gojcic

    Abstract: Neural reconstruction approaches are rapidly emerging as the preferred representation for 3D scenes, but their limited editability is still posing a challenge. In this work, we propose an approach for 3D scene inpainting -- the task of coherently replacing parts of the reconstructed scene with desired content. Scene inpainting is an inherently ill-posed task as there exist many solutions that plau…

    Submitted 16 April, 2024; originally announced April 2024.

    Comments: Project page: https://reffusion.github.io

  25. arXiv:2404.10633  [pdf, other]

    cs.CV

    Contextrast: Contextual Contrastive Learning for Semantic Segmentation

    Authors: Changki Sung, Wanhee Kim, Jungho An, Wooju Lee, Hyungtae Lim, Hyun Myung

    Abstract: Despite great improvements in semantic segmentation, challenges persist because of the lack of local/global contexts and the relationship between them. In this paper, we propose Contextrast, a contrastive learning-based semantic segmentation method that allows to capture local/global contexts and comprehend their relationships. Our proposed method comprises two parts: a) contextual contrastive lea…

    Submitted 16 April, 2024; originally announced April 2024.

  26. arXiv:2404.02583  [pdf, other]

    cs.LG

    Transformer-based Stagewise Decomposition for Large-Scale Multistage Stochastic Optimization

    Authors: Chanyeong Kim, Jongwoong Park, Hyunglip Bae, Woo Chang Kim

    Abstract: Solving large-scale multistage stochastic programming (MSP) problems poses a significant challenge as commonly used stagewise decomposition algorithms, including stochastic dual dynamic programming (SDDP), face growing time complexity as the subproblem size and problem count increase. Traditional approaches approximate the value functions as piecewise linear convex functions by incrementally accum…

    Submitted 3 April, 2024; originally announced April 2024.

    Comments: Accepted at ICML 2023

  27. arXiv:2404.01954  [pdf, other]

    cs.CL cs.AI

    HyperCLOVA X Technical Report

    Authors: Kang Min Yoo, Jaegeun Han, Sookyo In, Heewon Jeon, Jisu Jeong, Jaewook Kang, Hyunwook Kim, Kyung-Min Kim, Munhyong Kim, Sungju Kim, Donghyun Kwak, Hanock Kwak, Se Jung Kwon, Bado Lee, Dongsoo Lee, Gichang Lee, Jooho Lee, Baeseong Park, Seongjin Shin, Joonsang Yu, Seolki Baek, Sumin Byeon, Eungsup Cho, Dooseok Choe, Jeesung Han, et al. (371 additional authors not shown)

    Abstract: We introduce HyperCLOVA X, a family of large language models (LLMs) tailored to the Korean language and culture, along with competitive capabilities in English, math, and coding. HyperCLOVA X was trained on a balanced mix of Korean, English, and code data, followed by instruction-tuning with high-quality human-annotated datasets while abiding by strict safety guidelines reflecting our commitment t…

    Submitted 13 April, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

    Comments: 44 pages; updated authors list and fixed author names

  28. arXiv:2403.18406  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM

    Authors: Wonkyun Kim, Changin Choi, Wonseok Lee, Wonjong Rhee

    Abstract: Stimulated by the sophisticated reasoning capabilities of recent Large Language Models (LLMs), a variety of strategies for bridging video modality have been devised. A prominent strategy involves Video Language Models (VideoLMs), which train a learnable interface with video data to connect advanced vision encoders with LLMs. Recently, an alternative strategy has surfaced, employing readily availab…

    Submitted 27 March, 2024; originally announced March 2024.

    Comments: Our code is available at https://github.com/imagegridworth/IG-VLM

  29. arXiv:2403.18062  [pdf, other]

    cs.RO cs.AI

    ShapeGrasp: Zero-Shot Task-Oriented Grasping with Large Language Models through Geometric Decomposition

    Authors: Samuel Li, Sarthak Bhagat, Joseph Campbell, Yaqi Xie, Woojun Kim, Katia Sycara, Simon Stepputtis

    Abstract: Task-oriented grasping of unfamiliar objects is a necessary skill for robots in dynamic in-home environments. Inspired by the human capability to grasp such objects through intuition about their shape and structure, we present a novel zero-shot task-oriented grasping method leveraging a geometric decomposition of the target object into simple, convex shapes that we represent in a graph structure,…

    Submitted 26 March, 2024; originally announced March 2024.

    Comments: 8 pages

  30. arXiv:2403.14353  [pdf, other]

    cs.AR cs.LG cs.RO

    DaCapo: Accelerating Continuous Learning in Autonomous Systems for Video Analytics

    Authors: Yoonsung Kim, Changhun Oh, Jinwoo Hwang, Wonung Kim, Seongryong Oh, Yubin Lee, Hardik Sharma, Amir Yazdanbakhsh, Jongse Park

    Abstract: Deep neural network (DNN) video analytics is crucial for autonomous systems such as self-driving vehicles, unmanned aerial vehicles (UAVs), and security robots. However, real-world deployment faces challenges due to their limited computational resources and battery power. To tackle these challenges, continuous learning exploits a lightweight "student" model at deployment (inference), leverages a l…

    Submitted 16 July, 2024; v1 submitted 21 March, 2024; originally announced March 2024.

  31. arXiv:2403.14111  [pdf, other]

    cs.CR cs.LG

    HETAL: Efficient Privacy-preserving Transfer Learning with Homomorphic Encryption

    Authors: Seewoo Lee, Garam Lee, Jung Woo Kim, Junbum Shin, Mun-Kyu Lee

    Abstract: Transfer learning is a de facto standard method for efficiently training machine learning models for data-scarce problems by adding and fine-tuning new classification layers to a model pre-trained on large datasets. Although numerous previous studies proposed to use homomorphic encryption to resolve the data privacy issue in transfer learning in the machine learning as a service setting, most of t…

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: ICML 2023, Appendix D includes some updates after official publication

    Journal ref: PMLR 202:19010-19035, 2023

  32. arXiv:2403.08187  [pdf, other]

    cs.CL cs.SD eess.AS

    Automatic Speech Recognition (ASR) for the Diagnosis of pronunciation of Speech Sound Disorders in Korean children

    Authors: Taekyung Ahn, Yeonjung Hong, Younggon Im, Do Hyung Kim, Dayoung Kang, Joo Won Jeong, Jae Won Kim, Min Jung Kim, Ah-ra Cho, Dae-Hyun Jang, Hosung Nam

    Abstract: This study presents a model of automatic speech recognition (ASR) designed to diagnose pronunciation issues in children with speech sound disorders (SSDs) to replace manual transcriptions in clinical procedures. Since ASR models trained for general purposes primarily predict input speech into real words, employing a well-known high-performance ASR model for evaluating pronunciation in children wit…

    Submitted 12 March, 2024; originally announced March 2024.

    Comments: 12 pages, 2 figures

    ACM Class: I.2.7

  33. arXiv:2403.07508  [pdf, other]

    cs.CV

    MoAI: Mixture of All Intelligence for Large Language and Vision Models

    Authors: Byung-Kwan Lee, Beomchan Park, Chae Won Kim, Yong Man Ro

    Abstract: The rise of large language models (LLMs) and instruction tuning has led to the current trend of instruction-tuned large language and vision models (LLVMs). This trend involves either meticulously curating numerous instruction tuning datasets tailored to specific objectives or enlarging LLVMs to manage vast amounts of vision language (VL) data. However, current LLVMs have disregarded the detailed a…

    Submitted 14 July, 2024; v1 submitted 12 March, 2024; originally announced March 2024.

    Comments: ECCV 2024. Code available: https://github.com/ByungKwanLee/MoAI

  34. arXiv:2403.05949  [pdf, other]

    cs.CV cs.LG q-bio.TO

    General surgery vision transformer: A video pre-trained foundation model for general surgery

    Authors: Samuel Schmidgall, Ji Woong Kim, Jeffrey Jopling, Axel Krieger

    Abstract: The absence of openly accessible data and specialized foundation models is a major barrier for computational research in surgery. Toward this, (i) we open-source the largest dataset of general surgery videos to-date, consisting of 680 hours of surgical videos, including data from robotic and laparoscopic techniques across 28 procedures; (ii) we propose a technique for video pre-training a general…

    Submitted 12 April, 2024; v1 submitted 9 March, 2024; originally announced March 2024.

  35. arXiv:2403.05530  [pdf, other]

    cs.CL cs.AI

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Authors: Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, Fred Alcober, Roy Frostig, Mark Omernick, Lexi Walker, Cosmin Paduraru, Christina Sorokin, Andrea Tacchetti, Colin Gaffney, Samira Daruki, Olcan Sercinoglu, Zach Gleicher, Juliette Love, et al. (1092 additional authors not shown)

    Abstract: In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February…

    Submitted 14 June, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

  36. arXiv:2403.05086  [pdf, other]

    cs.CV

    UFORecon: Generalizable Sparse-View Surface Reconstruction from Arbitrary and UnFavOrable Sets

    Authors: Youngju Na, Woo Jae Kim, Kyu Beom Han, Suhyeon Ha, Sung-eui Yoon

    Abstract: Generalizable neural implicit surface reconstruction aims to obtain an accurate underlying geometry given a limited number of multi-view images from unseen scenes. However, existing methods select only informative and relevant views using predefined scores for training and testing phases. This constraint renders the model impractical in real-world scenarios, where the availability of favorable com…

    Submitted 17 May, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

    Comments: Accepted at CVPR 2024. Project page: https://youngju-na.github.io/uforecon.github.io/

  37. arXiv:2403.04981  [pdf, other]

    cs.ET

    Paving the Way for Pass Disturb Free Vertical NAND Storage via A Dedicated and String-Compatible Pass Gate

    Authors: Zijian Zhao, Sola Woo, Khandker Akif Aabrar, Sharadindu Gopal Kirtania, Zhouhang Jiang, Shan Deng, Yi Xiao, Halid Mulaosmanovic, Stefan Duenkel, Dominik Kleimaier, Steven Soss, Sven Beyer, Rajiv Joshi, Scott Meninger, Mohamed Mohamed, Kijoon Kim, Jongho Woo, Suhwan Lim, Kwangsoo Kim, Wanki Kim, Daewon Ha, Vijaykrishnan Narayanan, Suman Datta, Shimeng Yu, Kai Ni

    Abstract: In this work, we propose a dual-port cell design to address the pass disturb in vertical NAND storage, which can pass signals through a dedicated and string-compatible pass gate. We demonstrate that: i) the pass disturb-free feature originates from weakening of the depolarization field by the pass bias at the high-${V}_{TH}$ (HVT) state and the screening of the applied field by channel at the low-…

    Submitted 7 March, 2024; originally announced March 2024.

    Comments: 29 pages, 7 figures

  38. arXiv:2403.04212  [pdf, other]

    cs.CL

    Persona Extraction Through Semantic Similarity for Emotional Support Conversation Generation

    Authors: Seunghee Han, Se Jin Park, Chae Won Kim, Yong Man Ro

    Abstract: Providing emotional support through dialogue systems is becoming increasingly important in today's world, as it can support both mental health and social interactions in many conversation scenarios. Previous works have shown that using persona is effective for generating empathetic and supportive responses. They have often relied on pre-provided persona rather than inferring them during conversati…

    Submitted 6 March, 2024; originally announced March 2024.

    Comments: Accepted by ICASSP2024

  39. arXiv:2403.01875  [pdf, other]

    cs.LG cs.AI

    ICLN: Input Convex Loss Network for Decision Focused Learning

    Authors: Haeun Jeon, Hyunglip Bae, Minsu Park, Chanyeong Kim, Woo Chang Kim

    Abstract: In decision-making problem under uncertainty, predicting unknown parameters is often considered independent of the optimization part. Decision-focused Learning (DFL) is a task-oriented framework to integrate prediction and optimization by adapting predictive model to give better decision for the corresponding task. Here, an inevitable challenge arises when computing gradients of the optimal decisi…

    Submitted 4 March, 2024; originally announced March 2024.

  40. arXiv:2403.00225  [pdf, other]

    cs.LG cs.AI cs.RO

    Robust Policy Learning via Offline Skill Diffusion

    Authors: Woo Kyung Kim, Minjong Yoo, Honguk Woo

    Abstract: Skill-based reinforcement learning (RL) approaches have shown considerable promise, especially in solving long-horizon tasks via hierarchical structures. These skills, learned task-agnostically from offline datasets, can accelerate the policy learning process for new tasks. Yet, the application of these skills in different domains remains restricted due to their inherent dependency on the datasets…

    Submitted 5 March, 2024; v1 submitted 29 February, 2024; originally announced March 2024.

    Comments: Accepted for AAAI 2024

  41. DSLR: Diversity Enhancement and Structure Learning for Rehearsal-based Graph Continual Learning

    Authors: Seungyoon Choi, Wonjoong Kim, Sungwon Kim, Yeonjun In, Sein Kim, Chanyoung Park

    Abstract: We investigate the replay buffer in rehearsal-based approaches for graph continual learning (GCL) methods. Existing rehearsal-based GCL methods select the most representative nodes for each class and store them in a replay buffer for later use in training subsequent tasks. However, we discovered that considering only the class representativeness of each replayed node makes the replayed nodes to be…

    Submitted 3 March, 2024; v1 submitted 21 February, 2024; originally announced February 2024.

    Comments: Accepted at ACM TheWebConf 2024 (WWW 2024) (Oral presentation)

  42. arXiv:2402.11248  [pdf, other]

    cs.CV

    CoLLaVO: Crayon Large Language and Vision mOdel

    Authors: Byung-Kwan Lee, Beomchan Park, Chae Won Kim, Yong Man Ro

    Abstract: The remarkable success of Large Language Models (LLMs) and instruction tuning drives the evolution of Vision Language Models (VLMs) towards a versatile general-purpose model. Yet, it remains unexplored whether current VLMs genuinely possess quality object-level image understanding capabilities determined from 'what objects are in the image?' or 'which object corresponds to a specified bounding box…

    Submitted 2 June, 2024; v1 submitted 17 February, 2024; originally announced February 2024.

    Comments: ACL 2024 Findings. Code available: https://github.com/ByungKwanLee/CoLLaVO

  43. arXiv:2402.08369  [pdf, other]

    cs.AI

    One-shot Imitation in a Non-Stationary Environment via Multi-Modal Skill

    Authors: Sangwoo Shin, Daehee Lee, Minjong Yoo, Woo Kyung Kim, Honguk Woo

    Abstract: One-shot imitation is to learn a new task from a single demonstration, yet it is a challenging problem to adopt it for complex tasks with the high domain diversity inherent in a non-stationary environment. To tackle the problem, we explore the compositionality of complex tasks, and present a novel skill-based imitation learning framework enabling one-shot imitation and zero-shot adaptation; from a…

    Submitted 13 February, 2024; originally announced February 2024.

    Comments: ICML-2023 Camera Ready Version

  44. arXiv:2402.08113  [pdf, other]

    cs.CL cs.HC

    Addressing cognitive bias in medical language models

    Authors: Samuel Schmidgall, Carl Harris, Ime Essien, Daniel Olshvang, Tawsifur Rahman, Ji Woong Kim, Rojin Ziaei, Jason Eshraghian, Peter Abadir, Rama Chellappa

    Abstract: There is increasing interest in the application of large language models (LLMs) to the medical field, in part because of their impressive performance on medical exam questions. While promising, exam questions do not reflect the complexity of real patient-doctor interactions. In reality, physicians' decisions are shaped by many complex factors, such as patient compliance, personal experience, ethical…

    Submitted 20 February, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

  45. arXiv:2402.02017  [pdf, other]

    cs.LG

    Value-Aided Conditional Supervised Learning for Offline RL

    Authors: Jeonghye Kim, Suyoung Lee, Woojun Kim, Youngchul Sung

    Abstract: Offline reinforcement learning (RL) has seen notable advancements through return-conditioned supervised learning (RCSL) and value-based methods, yet each approach comes with its own set of practical challenges. Addressing these, we propose Value-Aided Conditional Supervised Learning (VCS), a method that effectively synergizes the stability of RCSL with the stitching ability of value-based methods.…

    Submitted 2 February, 2024; originally announced February 2024.

  46. arXiv:2402.01149  [pdf, other]

    cs.CV

    Scale Equalization for Multi-Level Feature Fusion

    Authors: Bum Jun Kim, Sang Woo Kim

    Abstract: Deep neural networks have exhibited remarkable performance in a variety of computer vision fields, especially in semantic segmentation tasks. Their success is often attributed to multi-level feature fusion, which enables them to understand both global and local information from an image. However, we found that multi-level features from parallel branches are on different scales. The scale disequili…

    Submitted 2 February, 2024; originally announced February 2024.

    Comments: 10 pages, 3 figures

  47. arXiv:2402.00977  [pdf, other]

    cs.CV eess.IV

    Enhanced fringe-to-phase framework using deep learning

    Authors: Won-Hoe Kim, Bongjoong Kim, Hyung-Gun Chi, Jae-Sang Hyun

    Abstract: In Fringe Projection Profilometry (FPP), achieving robust and accurate 3D reconstruction with a limited number of fringe patterns remains a challenge in structured light 3D imaging. Conventional methods require a set of fringe images, but using only one or two patterns complicates phase recovery and unwrapping. In this study, we introduce SFNet, a symmetric fusion network that transforms two fring…

    Submitted 1 February, 2024; originally announced February 2024.

    Comments: 35 pages, 13 figures, 6 tables

  48. arXiv:2401.18006  [pdf, other]

    q-bio.QM cs.LG eess.SP

    EEG-GPT: Exploring Capabilities of Large Language Models for EEG Classification and Interpretation

    Authors: Jonathan W. Kim, Ahmed Alaa, Danilo Bernardo

    Abstract: In conventional machine learning (ML) approaches applied to electroencephalography (EEG), there is often a limited focus, isolating specific brain activities occurring across disparate temporal scales (from transient spikes in milliseconds to seizures lasting minutes) and spatial scales (from localized high-frequency oscillations to global sleep activity). This siloed approach limits the developmen…

    Submitted 3 February, 2024; v1 submitted 31 January, 2024; originally announced January 2024.

  49. arXiv:2401.16808  [pdf, other]

    cs.LG cs.AI

    Encoding Temporal Statistical-space Priors via Augmented Representation

    Authors: Insu Choi, Woosung Koh, Gimin Kang, Yuntae Jang, Woo Chang Kim

    Abstract: Modeling time series data remains a pervasive issue as the temporal dimension is inherent to numerous domains. Despite significant strides in time series forecasting, high noise-to-signal ratio, non-normality, non-stationarity, and lack of data continue to challenge practitioners. In response, we leverage a simple representation augmentation technique to overcome these challenges. Our augmented rep…

    Submitted 3 February, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

    Comments: pre-print

  50. arXiv:2401.14587  [pdf, other]

    cs.CV

    CNG-SFDA: Clean-and-Noisy Region Guided Online-Offline Source-Free Domain Adaptation

    Authors: Hyeonwoo Cho, Chanmin Park, Donghee Kim, Jinyoung Kim, Won Hwa Kim

    Abstract: Domain shift occurs when training (source) and test (target) data diverge in their distribution. Source-Free Domain Adaptation (SFDA) addresses this domain shift problem, aiming to adapt a model trained on the source domain to the target domain in a scenario where only a well-trained source model and unlabeled target data are available. In this scenario, handling false labels in the target domain…

    Submitted 4 July, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

    Comments: 15 pages, 5 figures