Enhancing Temporal Action Localization: Advanced S6 Modeling with Recurrent Mechanism

Authors: Sangyoun Lee, Juho Jung, Changdae Oh, Sunghee Yun

Abstract: Temporal Action Localization (TAL) is a critical task in video analysis, identifying precise start and end times of actions. Existing methods like CNNs, RNNs, GCNs, and Transformers have limitations in capturing long-range dependencies and temporal causality. To address these challenges, we propose a novel TAL architecture leveraging the Selective State Space Model (S6). Our approach integrates the Feature Aggregated Bi-S6 block, Dual Bi-S6 structure, and a recurrent mechanism to enhance temporal and channel-wise dependency modeling without increasing parameter complexity. Extensive experiments on benchmark datasets demonstrate state-of-the-art results with mAP scores of 74.2% on THUMOS-14, 42.9% on ActivityNet, 29.6% on FineAction, and 45.8% on HACS. Ablation studies validate our method's effectiveness, showing that the Dual structure in the Stem module and the recurrent mechanism outperform traditional approaches. Our findings demonstrate the potential of S6-based models in TAL tasks, paving the way for future research.

Submitted 17 July, 2024; originally announced July 2024.

Comments: 8 pages, 3 figures, Preprint

arXiv:2407.08245 [pdf, other]

Feature Diversification and Adaptation for Federated Domain Generalization

Authors: Seunghan Yang, Seokeon Choi, Hyunsin Park, Sungha Choi, Simyung Chang, Sungrack Yun

Abstract: Federated learning, a distributed learning paradigm, utilizes multiple clients to build a robust global model. In real-world applications, local clients often operate within their limited domains, leading to a 'domain shift' across clients. Privacy concerns limit each client's learning to its own domain data, which increases the risk of overfitting. Moreover, the process of aggregating models trained on their own limited domains can potentially lead to a significant degradation in the global model performance. To deal with these challenges, we introduce the concept of federated feature diversification. Each client diversifies its own limited domain data by leveraging global feature statistics, i.e., the aggregated average statistics over all participating clients, shared through the global model's parameters. This data diversification helps local models to learn client-invariant representations while preserving privacy. Our resultant global model shows robust performance on unseen test domain data. To enhance performance further, we develop an instance-adaptive inference approach tailored for test domain data. Our proposed instance feature adapter dynamically adjusts feature statistics to align with the test input, thereby reducing the domain gap between the test and training domains. We show that our method achieves state-of-the-art performance on several domain generalization benchmarks within a federated learning setting.

Submitted 11 July, 2024; originally announced July 2024.

Comments: Accepted to ECCV 2024

arXiv:2407.06123 [pdf, other]

Investigating User Perceptions of Collaborative Agenda Setting in Virtual Health Counseling Session

Authors: Mina Fallah, Farnaz Nouraei, Hye Sun Yun, Timothy Bickmore

Abstract: Virtual health counselors offer the potential to provide users with information and counseling in complex areas such as disease management and health education. However, ensuring user engagement is challenging, particularly when the volume of information and length of counseling sessions increase. Agenda setting, a clinical counseling technique where a patient and clinician collaboratively decide on session topics, is an effective approach to tailoring discussions for individual patient needs and sustaining engagement. We explore the effectiveness of agenda setting in a virtual counselor system designed to counsel women for breast cancer genetic testing. In a between-subjects study, we assessed three versions of the system with varying levels of user control in the system's agenda setting approach. We found that participants' knowledge improved across all conditions. Although our results showed that any type of agenda setting was perceived as useful, regardless of user control, interviews revealed a preference for more collaboration and user involvement in the agenda setting process. Our study highlights the importance of using patient-centered approaches, such as tailored discussions, when using virtual counselors in healthcare.

Submitted 8 July, 2024; originally announced July 2024.

arXiv:2407.03563 [pdf, other]

Learning Video Temporal Dynamics with Cross-Modal Attention for Robust Audio-Visual Speech Recognition

Authors: Sungnyun Kim, Kangwook Jang, Sangmin Bae, Hoirin Kim, Se-Young Yun

Abstract: Audio-visual speech recognition (AVSR) aims to transcribe human speech using both audio and video modalities. In practical environments with noise-corrupted audio, the role of video information becomes crucial. However, prior works have primarily focused on enhancing audio features in AVSR, overlooking the importance of video features. In this study, we strengthen the video features by learning three temporal dynamics in video data: context order, playback direction, and the speed of video frames. Cross-modal attention modules are introduced to enrich video features with audio information so that speech variability can be taken into account when training on the video temporal dynamics. Based on our approach, we achieve state-of-the-art performance on the LRS2 and LRS3 AVSR benchmarks for the noise-dominant settings. Our approach excels especially in scenarios with babble and speech noise, indicating the ability to distinguish the speech signal that should be recognized from lip movements in the video modality. We support the validity of our methodology with ablation experiments on the temporal dynamics losses and the cross-modal attention architecture design.

Submitted 3 July, 2024; originally announced July 2024.

arXiv:2407.01639 [pdf, other]

ModelVerification.jl: a Comprehensive Toolbox for Formally Verifying Deep Neural Networks

Authors: Tianhao Wei, Luca Marzari, Kai S. Yun, Hanjiang Hu, Peizhi Niu, Xusheng Luo, Changliu Liu

Abstract: Deep Neural Networks (DNN) are crucial in approximating nonlinear functions across diverse applications, ranging from image classification to control. Verifying specific input-output properties can be a highly challenging task due to the lack of a single, self-contained framework that allows a complete range of verification types. To this end, we present ModelVerification.jl (MV), the first comprehensive, cutting-edge toolbox that contains a suite of state-of-the-art methods for verifying different types of DNNs and safety specifications. This versatile toolbox is designed to empower developers and machine learning practitioners with robust tools for verifying and ensuring the trustworthiness of their DNN models.

Submitted 30 June, 2024; originally announced July 2024.

arXiv:2407.01624 [pdf, other]

Guided Trajectory Generation with Diffusion Models for Offline Model-based Optimization

Authors: Taeyoung Yun, Sujin Yun, Jaewoo Lee, Jinkyoo Park

Abstract: Optimizing complex and high-dimensional black-box functions is ubiquitous in science and engineering fields. Unfortunately, the online evaluation of these functions is restricted due to time and safety constraints in most cases. In offline model-based optimization (MBO), we aim to find a design that maximizes the target function using only a pre-existing offline dataset. While prior methods consider forward or inverse approaches to address the problem, these approaches are limited by conservatism and the difficulty of learning highly multi-modal mappings. Recently, there has been an emerging paradigm of learning to improve solutions with synthetic trajectories constructed from the offline dataset. In this paper, we introduce a novel conditional generative modeling approach to produce trajectories toward high-scoring regions. First, we construct synthetic trajectories toward high-scoring regions using the dataset while injecting locality bias for consistent improvement directions. Then, we train a conditional diffusion model to generate trajectories conditioned on their scores. Lastly, we sample multiple trajectories from the trained model with guidance to explore high-scoring regions beyond the dataset and select high-fidelity designs among generated trajectories with the proxy function. Extensive experimental results demonstrate that our method outperforms competitive baselines on Design-Bench and its practical variants. The code is publicly available at https://github.com/dbsxodud-11/GTG.

Submitted 29 June, 2024; originally announced July 2024.

Comments: 29 pages, 11 figures, 17 tables

arXiv:2407.00693 [pdf, other]

BAPO: Base-Anchored Preference Optimization for Personalized Alignment in Large Language Models

Authors: Gihun Lee, Minchan Jeong, Yujin Kim, Hojung Jung, Jaehoon Oh, Sangmook Kim, Se-Young Yun

Abstract: While learning to align Large Language Models (LLMs) with human preferences has shown remarkable success, aligning these models to meet diverse user preferences presents further challenges in preserving previous knowledge. This paper examines the impact of personalized preference optimization on LLMs, revealing that the extent of knowledge loss varies significantly with preference heterogeneity. Although previous approaches have utilized the KL constraint between the reference model and the policy model, we observe that they fail to maintain general knowledge and alignment when facing personalized preferences. To this end, we introduce Base-Anchored Preference Optimization (BAPO), a simple yet effective approach that utilizes the initial responses of the reference model to mitigate forgetting while accommodating personalized alignment. BAPO effectively adapts to diverse user preferences while minimally affecting global knowledge or general alignment. Our experiments demonstrate the efficacy of BAPO in various setups.

Submitted 30 June, 2024; originally announced July 2024.

Comments: under review

arXiv:2406.20098 [pdf, other]

Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs

Authors: Sukmin Yun, Haokun Lin, Rusiru Thushara, Mohammad Qazim Bhat, Yongxin Wang, Zutao Jiang, Mingkai Deng, Jinhong Wang, Tianhua Tao, Junbo Li, Haonan Li, Preslav Nakov, Timothy Baldwin, Zhengzhong Liu, Eric P. Xing, Xiaodan Liang, Zhiqiang Shen

Abstract: Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio in a variety of understanding and generation tasks. However, current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code. To address this problem, we propose Web2Code, a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning and an evaluation framework for the webpage understanding and HTML code translation abilities of MLLMs. For dataset construction, we leverage pretrained LLMs to enhance existing webpage-to-code datasets as well as generate a diverse pool of new webpages rendered into images. Specifically, the inputs are webpage images and instructions, while the responses are the webpage's HTML code. We further include diverse natural language QA pairs about the webpage content in the responses to enable a more comprehensive understanding of the web content. To evaluate model performance in these tasks, we develop an evaluation framework for testing MLLMs' abilities in webpage understanding and web-to-code generation. Extensive experiments show that our proposed dataset is beneficial not only to our proposed tasks but also in the general visual domain, while previous datasets result in worse performance. We hope our work will contribute to the development of general MLLMs suitable for web-based content generation and task automation. Our data and code will be available at https://github.com/MBZUAI-LLM/web2code.

Submitted 28 June, 2024; originally announced June 2024.

Comments: Website at https://mbzuai-llm.github.io/webpage2code/

arXiv:2406.18815 [pdf, other]

MissionGNN: Hierarchical Multimodal GNN-based Weakly Supervised Video Anomaly Recognition with Mission-Specific Knowledge Graph Generation

Authors: Sanggeon Yun, Ryozo Masukawa, Minhyoung Na, Mohsen Imani

Abstract: In the context of escalating safety concerns across various domains, the tasks of Video Anomaly Detection (VAD) and Video Anomaly Recognition (VAR) have emerged as critically important for applications in intelligent surveillance, evidence investigation, violence alerting, etc. These tasks, aimed at identifying and classifying deviations from normal behavior in video data, face significant challenges due to the rarity of anomalies which leads to extremely imbalanced data and the impracticality of extensive frame-level data annotation for supervised learning. This paper introduces a novel hierarchical graph neural network (GNN) based model MissionGNN that addresses these challenges by leveraging a state-of-the-art large language model and a comprehensive knowledge graph for efficient weakly supervised learning in VAR. Our approach circumvents the limitations of previous methods by avoiding heavy gradient computations on large multimodal models and enabling fully frame-level training without fixed video segmentation. Utilizing automated, mission-specific knowledge graph generation, our model provides a practical and efficient solution for real-time video analysis without the constraints of previous segmentation-based or multimodal approaches. Experimental validation on benchmark datasets demonstrates our model's performance in VAD and VAR, highlighting its potential to redefine the landscape of anomaly detection and recognition in video surveillance systems.

Submitted 26 June, 2024; originally announced June 2024.

arXiv:2406.16758 [pdf, other]

Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters

Authors: Euiin Yi, Taehyeon Kim, Hongseok Jeung, Du-Seong Chang, Se-Young Yun

Abstract: Large language models (LLMs) have revolutionized natural language processing and broadened their applicability across diverse commercial applications. However, the deployment of these models is constrained by high inference time in multilingual settings. To mitigate this challenge, this paper explores a training recipe for an assistant model in speculative decoding, which is leveraged to draft future tokens that are then verified by the target LLM. We show that language-specific draft models, optimized through a targeted pretrain-and-finetune strategy, bring a substantial speedup in inference time compared to previous methods. We validate these models across various languages in inference time, out-of-domain speedup, and GPT-4o evaluation.

Submitted 24 June, 2024; originally announced June 2024.

arXiv:2406.07975 [pdf, other]

FINER: Far-Infrared Nebular Emission Receiver for the Large Millimeter Telescope

Authors: Yoichi Tamura, Takeshi Sakai, Ryohei Kawabe, Takafumi Kojima, Akio Taniguchi, Tatsuya Takekoshi, Haoran Kang, Wenlei Shan, Masato Hagimoto, Norika Okauchi, Airi Tetsuka, Akio K. Inoue, Kotaro Kohno, Kunihiko Tanaka, Tom J. L. C. Bakx, Yoshinobu Fudamoto, Kazuyuki Fujita, Yuichi Harikane, Takuya Hashimoto, Bunyo Hatsukade, David H. Hughes, Takahiro Iino, Yuki Kimura, Hiroyuki Maezawa, Yuichi Matsuda, et al. (12 additional authors not shown)

Abstract: Unveiling the emergence and prevalence of massive/bright galaxies during the epoch of reionization and beyond, within the first 600 million years of the Universe, stands as a pivotal pursuit in astronomy. Remarkable progress has been made by JWST in identifying an immense population of bright galaxies, which hints at exceptionally efficient galaxy assembly processes. However, the underlying physical mechanisms propelling their rapid growth remain unclear. With this in mind, millimeter and submillimeter-wave spectroscopic observations of redshifted far-infrared spectral lines, particularly the [O III] 88 micron and [C II] 158 micron lines, offer a crucial pathway to address this fundamental query. To this end, we develop a dual-polarization sideband-separating superconductor-insulator-superconductor (SIS) mixer receiver, FINER, for the Large Millimeter Telescope (LMT) situated in Mexico. Harnessing advancements from ALMA's wideband sensitivity upgrade (WSU) technology, FINER covers radio frequencies spanning 120-360 GHz, delivering an instantaneous intermediate frequency (IF) of 3-21 GHz per sideband per polarization, which is followed by a set of 10.24 GHz-wide digital spectrometers. At 40% of ALMA's light-collecting area, the LMT's similar atmospheric transmittance and FINER's 5 times wider bandwidth compared to ALMA culminate in an unparalleled spectral scanning capability in the northern hemisphere, paving the way for finer spectral-resolution detection of distant galaxies.

Submitted 12 June, 2024; originally announced June 2024.

Comments: 12 pages, 8 figures, and 3 tables. Proceedings paper presented in SPIE Astronomical Telescope and Instrumentation 2024

arXiv:2406.02657 [pdf, other]

Block Transformer: Global-to-Local Language Modeling for Fast Inference

Authors: Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun

Abstract: This paper presents the Block Transformer architecture which adopts hierarchical global-to-local modeling to autoregressive transformers to mitigate the inference bottlenecks of self-attention. To apply self-attention, the key-value (KV) cache of all previous sequences must be retrieved from memory at every decoding step. Thereby, this KV cache IO becomes a significant bottleneck in batch inference. We notice that these costs stem from applying self-attention on the global context; therefore, we isolate the expensive bottlenecks of global modeling to lower layers and apply fast local modeling in upper layers. To mitigate the remaining costs in the lower layers, we aggregate input tokens into fixed-size blocks and then apply self-attention at this coarse level. Context information is aggregated into a single embedding to enable upper layers to decode the next block of tokens, without global attention. Free of global attention bottlenecks, the upper layers can fully utilize the compute hardware to maximize inference throughput. By leveraging global and local modules, the Block Transformer architecture demonstrates 10-20x gains in inference throughput compared to vanilla transformers with equivalent perplexity. Our work introduces a new approach to optimize language model inference through a novel application of global-to-local modeling. Code is available at https://github.com/itsnamgyu/block-transformer.

Submitted 4 June, 2024; originally announced June 2024.

Comments: 30 pages, 21 figures, 5 tables

arXiv:2406.02355 [pdf, other]

FedDr+: Stabilizing Dot-regression with Global Feature Distillation for Federated Learning

Authors: Seongyoon Kim, Minchan Jeong, Sungnyun Kim, Sungwoo Cho, Sumyeong Ahn, Se-Young Yun

Abstract: Federated Learning (FL) has emerged as a pivotal framework for the development of effective global models (global FL) or personalized models (personalized FL) across clients with heterogeneous, non-iid data distribution. A key challenge in FL is client drift, where data heterogeneity impedes the aggregation of scattered knowledge. Recent studies have tackled the client drift issue by identifying significant divergence in the last classifier layer. To mitigate this divergence, strategies such as freezing the classifier weights and aligning the feature extractor accordingly have proven effective. Although the local alignment between classifier and feature extractor has been studied as a crucial factor in FL, we observe that it may lead the model to overemphasize the observed classes within each client. Thus, our objectives are twofold: (1) enhancing local alignment while (2) preserving the representation of unseen class samples. This approach aims to effectively integrate knowledge from individual clients, thereby improving performance for both global and personalized FL. To achieve this, we introduce a novel algorithm named FedDr+, which empowers local model alignment using dot-regression loss. FedDr+ freezes the classifier as a simplex ETF to align the features and improves aggregated global models by employing a feature distillation mechanism to retain information about unseen/missing classes. Consequently, we provide empirical evidence demonstrating that our algorithm surpasses existing methods that use a frozen classifier to boost alignment across the diverse distribution.

Submitted 4 June, 2024; originally announced June 2024.

arXiv:2406.02021 [pdf, other]

MetaMixer Is All You Need

Authors: Seokju Yun, Dongheon Lee, Youngmin Ro

Abstract: Transformer, composed of self-attention and Feed-Forward Network, has revolutionized the landscape of network design across various vision tasks. FFN is a versatile operator seamlessly integrated into nearly all AI models to effectively harness rich representations. Recent works also show that FFN functions like key-value memories. Thus, akin to the query-key-value mechanism within self-attention, FFN can be viewed as a memory network, where the input serves as query and the two projection weights operate as keys and values, respectively. We hypothesize that the importance lies in the query-key-value framework itself rather than in self-attention. To verify this, we propose converting self-attention into a more FFN-like efficient token mixer with only convolutions while retaining the query-key-value framework, namely FFNification. Specifically, FFNification replaces query-key and attention coefficient-value interactions with large kernel convolutions and adopts GELU activation function instead of softmax. The derived token mixer, FFNified attention, serves as key-value memories for detecting locally distributed spatial patterns, and operates in the opposite dimension to the ConvNeXt block within each corresponding sub-operation of the query-key-value framework. Building upon the above two modules, we present a family of Fast-Forward Networks. Our FFNet achieves remarkable performance improvements over previous state-of-the-art methods across a wide range of tasks. The strong and general performance of our proposed method validates our hypothesis and leads us to introduce MetaMixer, a general mixer architecture that does not specify sub-operations within the query-key-value framework. We show that using only simple operations like convolution and GELU in the MetaMixer can achieve superior performance.

Submitted 4 June, 2024; originally announced June 2024.

Comments: Code: https://github.com/ysj9909/FFNet

arXiv:2405.19806 [pdf, other]

Preference Alignment with Flow Matching

Authors: Minu Kim, Yongsik Lee, Sehyeok Kang, Jihwan Oh, Song Chong, Seyoung Yun

Abstract: We present Preference Flow Matching (PFM), a new framework for preference-based reinforcement learning (PbRL) that streamlines the integration of preferences into an arbitrary class of pre-trained models. Existing PbRL methods require fine-tuning pre-trained models, which presents challenges such as scalability, inefficiency, and the need for model modifications, especially with black-box APIs like GPT-4. In contrast, PFM utilizes flow matching techniques to directly learn from preference data, thereby reducing the dependency on extensive fine-tuning of pre-trained models. By leveraging flow-based models, PFM transforms less preferred data into preferred outcomes, and effectively aligns model outputs with human preferences without relying on explicit or implicit reward function estimation, thus avoiding common issues like overfitting in reward models. We provide theoretical insights that support our method's alignment with standard PbRL objectives. Experimental results indicate the practical effectiveness of our method, offering a new direction in aligning a pre-trained model to preference.

Submitted 30 May, 2024; originally announced May 2024.

arXiv:2405.18027 [pdf, other]

TimeChara: Evaluating Point-in-Time Character Hallucination of Role-Playing Large Language Models

Authors: Jaewoo Ahn, Taehyun Lee, Junyoung Lim, Jin-Hwa Kim, Sangdoo Yun, Hwaran Lee, Gunhee Kim

Abstract: While Large Language Models (LLMs) can serve as agents to simulate human behaviors (i.e., role-playing agents), we emphasize the importance of point-in-time role-playing. This situates characters at specific moments in the narrative progression for three main reasons: (i) enhancing users' narrative immersion, (ii) avoiding spoilers, and (iii) fostering engagement in fandom role-playing. To accurately represent characters at specific time points, agents must avoid character hallucination, where they display knowledge that contradicts their characters' identities and historical timelines. We introduce TimeChara, a new benchmark designed to evaluate point-in-time character hallucination in role-playing LLMs. Comprising 10,895 instances generated through an automated pipeline, this benchmark reveals significant hallucination issues in current state-of-the-art LLMs (e.g., GPT-4o). To counter this challenge, we propose Narrative-Experts, a method that decomposes the reasoning steps and utilizes narrative experts to reduce point-in-time character hallucinations effectively. Still, our findings with TimeChara highlight the ongoing challenges of point-in-time character hallucination, calling for further study.

Submitted 28 May, 2024; originally announced May 2024.

Comments: ACL 2024 Findings. Code and dataset are released at https://ahnjaewoo.github.io/timechara

arXiv:2405.17995 [pdf, other]

DMT-JEPA: Discriminative Masked Targets for Joint-Embedding Predictive Architecture

Authors: Shentong Mo, Sukmin Yun

Abstract: The joint-embedding predictive architecture (JEPA) has recently shown impressive results in extracting visual representations from unlabeled imagery under a masking strategy. However, we reveal its disadvantages, notably its insufficient understanding of local semantics. This deficiency originates from masked modeling in the embedding space, which reduces discriminative power and can even lead to the neglect of critical local semantics. To bridge this gap, we introduce DMT-JEPA, a novel masked modeling objective rooted in JEPA, specifically designed to generate discriminative latent targets from neighboring information. Our key idea is simple: we consider a set of semantically similar neighboring patches as a target of a masked patch. To be specific, the proposed DMT-JEPA (a) computes feature similarities between each masked patch and its corresponding neighboring patches to select patches having semantically meaningful relations, and (b) employs lightweight cross-attention heads to aggregate features of neighboring patches as the masked targets. Consequently, DMT-JEPA demonstrates strong discriminative power, offering benefits across a diverse spectrum of downstream tasks. Through extensive experiments, we demonstrate our effectiveness across various visual benchmarks, including ImageNet-1K image classification, ADE20K semantic segmentation, and COCO object detection tasks. Code is available at: https://github.com/DMTJEPA/DMTJEPA.

Submitted 28 May, 2024; originally announced May 2024.

arXiv:2405.16907 [pdf, other]

GTA: Generative Trajectory Augmentation with Guidance for Offline Reinforcement Learning

Authors: Jaewoo Lee, Sujin Yun, Taeyoung Yun, Jinkyoo Park

Abstract: Offline Reinforcement Learning (Offline RL) presents challenges of learning effective decision-making policies from static datasets without any online interactions. Data augmentation techniques, such as noise injection and data synthesizing, aim to improve Q-function approximation by smoothing the learned state-action region. However, these methods often fall short of directly improving the quality of offline datasets, leading to suboptimal results. In response, we introduce GTA, Generative Trajectory Augmentation, a novel generative data augmentation approach designed to enrich offline data by augmenting trajectories to be both high-rewarding and dynamically plausible. GTA applies a diffusion model within the data augmentation framework. GTA partially noises original trajectories and then denoises them with classifier-free guidance via conditioning on amplified return value. Our results show that GTA, as a general data augmentation strategy, enhances the performance of widely used offline RL algorithms in both dense and sparse reward settings. Furthermore, we conduct a quality analysis of data augmented by GTA and demonstrate that GTA improves the quality of the data. Our code is available at https://github.com/Jaewoopudding/GTA

Submitted 12 June, 2024; v1 submitted 27 May, 2024; originally announced May 2024.

Comments: Accepted (Spotlight) to ICLR 2024 Workshop on Generative Models for Decision Making. Jaewoo Lee and Sujin Yun are equal contribution authors

arXiv:2405.13396 [pdf, other]

Why In-Context Learning Transformers are Tabular Data Classifiers

Authors: Felix den Breejen, Sangmin Bae, Stephen Cha, Se-Young Yun

Abstract: The recently introduced TabPFN pretrains an In-Context Learning (ICL) transformer on synthetic data to perform tabular data classification. As synthetic data does not share features or labels with real-world data, the underlying mechanism that contributes to the success of this method remains unclear. This study provides an explanation by demonstrating that ICL-transformers acquire the ability to create complex decision boundaries during pretraining. To validate our claim, we develop a novel forest dataset generator which creates datasets that are unrealistic, but have complex decision boundaries. Our experiments confirm the effectiveness of ICL-transformers pretrained on this data. Furthermore, we create TabForestPFN, the ICL-transformer pretrained on both the original TabPFN synthetic dataset generator and our forest dataset generator. By fine-tuning this model, we reach the current state-of-the-art on tabular data classification. Code is available at https://github.com/FelixdenBreejen/TabForestPFN.

Submitted 22 May, 2024; originally announced May 2024.

Comments: 9 pages main body, 22 pages total. Preprint under review

arXiv:2405.07986 [pdf, other]

JWST's PEARLS: resolved study of the stellar and dust components in starburst galaxies at cosmic noon

Authors: M. Polletta, B. L. Frye, N. Garuda, S. P. Willner, S. Berta, R. Kneissl, H. Dole, R. A. Jansen, M. D. Lehnert, S. H. Cohen, J. Summers, R. A. Windhorst, J. C. J. D'Silva, A. M. Koekemoer, D. Coe, C. J. Conselice, S. P. Driver, N. A. Grogin, M. A. Marshall, M. Nonino, R. Ortiz III, N. Pirzkal, A. Robotham, R. E. Ryan, Jr., C. N. A. Willmer, et al. (13 additional authors not shown)

Abstract: Dusty star-forming galaxies (DSFGs) contribute significantly to the stellar buildup at cosmic noon. Major mergers and gas accretion are often invoked to explain DSFGs' prodigious star-formation rates (SFRs) and large stellar masses. We conducted a spatially-resolved morphological analysis of the rest-frame UV/NIR emission in three DSFGs at z~2.5. Initially discovered as CO emitters by NOEMA observations of a bright Herschel source, we observed them with the JWST/NIRCam as part of the PEARLS program. The NIRCam data reveal the galaxies' stellar population and dust distribution on scales of 250 pc. Spatial variations in stellar mass, SFR, and dust extinction are determined in resolved maps obtained through pixel-based SED fitting. The CO emitters are massive, dusty starburst galaxies with SFRs ranging from 340 to 2500 Msun/yr, positioning them among the most active SFGs at 2<z<3. Notably, they belong to the ~1.5% of the entire JWST population with extremely red colors. Their morphologies are disk-like, with effective radii of 2.0-4.4 kpc, and exhibit sub-structures such as clumps and spiral arms. The galaxies have dust extinctions up to Av=5-7 mag with asymmetric distributions extending over several kpc and including off-center regions resembling bent spiral arms and clumps. The NIR dust-attenuation curve in these sources deviates from standard laws, implying different dust grain properties than commonly assumed in starburst galaxies.
The proximity of galaxies with consistent redshifts, strong color gradients, overall disturbed appearance, asymmetric dust obscuration, and wide-spread star formation favor interactions (minor mergers and flybys) as the mechanism driving the CO galaxies' exceptional SFRs. Their large masses and rich environment hint at membership in two proto-structures, as initially inferred from their association with a Planck-selected high-z source.

Submitted 13 May, 2024; originally announced May 2024.

Comments: 24 pages, 21 figures + appendix. Submitted to A&A. Comments welcome!

arXiv:2405.07857 [pdf, other]

Synergistic Integration of Coordinate Network and Tensorial Feature for Improving Neural Radiance Fields from Sparse Inputs

Authors: Mingyu Kim, Jun-Seong Kim, Se-Young Yun, Jin-Hwa Kim

Abstract: The multi-plane representation has been highlighted for its fast training and inference across static and dynamic neural radiance fields. This approach constructs relevant features via projection onto learnable grids and interpolating adjacent vertices. However, it has limitations in capturing low-frequency details and tends to overuse parameters for low-frequency features due to its bias toward fine details, despite its multi-resolution concept. This phenomenon leads to instability and inefficiency when training poses are sparse. In this work, we propose a method that synergistically integrates multi-plane representation with a coordinate-based MLP network known for strong bias toward low-frequency signals. The coordinate-based network is responsible for capturing low-frequency details, while the multi-plane representation focuses on capturing fine-grained details. We demonstrate that using residual connections between them seamlessly preserves their own inherent properties. Additionally, the proposed progressive training scheme accelerates the disentanglement of these two features. We demonstrate empirically that our proposed method not only outperforms baseline models for both static and dynamic NeRFs with sparse inputs, but also achieves comparable results with fewer parameters.

Submitted 5 June, 2024; v1 submitted 13 May, 2024; originally announced May 2024.

Comments: ICML2024 ; Project page is accessible at https://mingyukim87.github.io/SynergyNeRF ; Code is available at https://github.com/MingyuKim87/SynergyNeRF

arXiv:2405.04819 [pdf, other]

DALK: Dynamic Co-Augmentation of LLMs and KG to answer Alzheimer's Disease Questions with Scientific Literature

Authors: Dawei Li, Shu Yang, Zhen Tan, Jae Young Baik, Sukwon Yun, Joseph Lee, Aaron Chacko, Bojian Hou, Duy Duong-Tran, Ying Ding, Huan Liu, Li Shen, Tianlong Chen

Abstract: Recent advancements in large language models (LLMs) have achieved promising performances across various applications. Nonetheless, the ongoing challenge of integrating long-tail knowledge continues to impede the seamless adoption of LLMs in specialized domains. In this work, we introduce DALK, a.k.a. Dynamic Co-Augmentation of LLMs and KG, to address this limitation and demonstrate its ability in studying Alzheimer's Disease (AD), a specialized sub-field in biomedicine and a global health priority. With a synergized framework of LLM and KG mutually enhancing each other, we first leverage LLM to construct an evolving AD-specific knowledge graph (KG) sourced from AD-related scientific literature, and then we utilize a coarse-to-fine sampling method with a novel self-aware knowledge retrieval approach to select appropriate knowledge from the KG to augment LLM inference capabilities. The experimental results, conducted on our constructed AD question answering (ADQA) benchmark, underscore the efficacy of DALK. Additionally, we perform a series of detailed analyses that can offer valuable insights and guidelines for the emerging topic of mutually enhancing KG and LLM. We will release the code and data at https://github.com/David-Li0406/DALK.

Submitted 12 May, 2024; v1 submitted 8 May, 2024; originally announced May 2024.

Comments: Under Review; Incorrect author name revised

arXiv:2405.04497 [pdf, other]

Unveiling Disparities in Web Task Handling Between Human and Web Agent

Authors: Kihoon Son, Jinhyeon Kwon, DaEun Choi, Tae Soo Kim, Young-Ho Kim, Sangdoo Yun, Juho Kim

Abstract: With the advancement of Large-Language Models (LLMs) and Large Vision-Language Models (LVMs), agents have shown significant capabilities in various tasks, such as data analysis, gaming, or code generation. Recently, there has been a surge in research on web agents, capable of performing tasks within the web environment. However, the web poses unforeseeable scenarios, challenging the generalizability of these agents. This study investigates the disparities between human and web agents' performance in web tasks (e.g., information search) by concentrating on planning, action, and reflection aspects during task execution. We conducted a web task study with a think-aloud protocol, revealing distinct cognitive actions and operations on websites employed by humans. Comparative examination of existing agent structures and human behavior with thought processes highlighted differences in knowledge updating and ambiguity handling when performing the task. Humans demonstrated a propensity for exploring and modifying plans based on additional information and investigating reasons for failure. These findings offer insights into designing planning, reflection, and information discovery modules for web agents and designing the capturing method for implicit human knowledge in a web task.

Submitted 8 May, 2024; v1 submitted 7 May, 2024; originally announced May 2024.

arXiv:2405.01686 [pdf, other]

Automatically Extracting Numerical Results from Randomized Controlled Trials with Large Language Models

Authors: Hye Sun Yun, David Pogrebitskiy, Iain J. Marshall, Byron C. Wallace

Abstract: Meta-analyses statistically aggregate the findings of different randomized controlled trials (RCTs) to assess treatment effectiveness. Because this yields robust estimates of treatment effectiveness, results from meta-analyses are considered the strongest form of evidence. However, rigorous evidence syntheses are time-consuming and labor-intensive, requiring manual extraction of data from individual trials to be synthesized. Ideally, language technologies would permit fully automatic meta-analysis, on demand. This requires accurately extracting numerical results from individual trials, which has been beyond the capabilities of natural language processing (NLP) models to date. In this work, we evaluate whether modern large language models (LLMs) can reliably perform this task. We annotate (and release) a modest but granular evaluation dataset of clinical trial reports with numerical findings attached to interventions, comparators, and outcomes. Using this dataset, we evaluate the performance of seven LLMs applied zero-shot for the task of conditionally extracting numerical findings from trial reports. We find that massive LLMs that can accommodate lengthy inputs are tantalizingly close to realizing fully automatic meta-analysis, especially for dichotomous (binary) outcomes (e.g., mortality). However, LLMs -- including ones trained on biomedical texts -- perform poorly when the outcome measures are complex and tallying the results requires inference.
This work charts a path toward fully automatic meta-analysis of RCTs via LLMs, while also highlighting the limitations of existing models for this aim.

Submitted 2 May, 2024; originally announced May 2024.

Comments: 24 pages, 7 figures, 6 tables

arXiv:2405.01588 [pdf, other]

Towards Unbiased Evaluation of Detecting Unanswerable Questions in EHRSQL

Authors: Yongjin Yang, Sihyeon Kim, SangMook Kim, Gyubok Lee, Se-Young Yun, Edward Choi

Abstract: Incorporating unanswerable questions into EHR QA systems is crucial for testing the trustworthiness of a system, as providing non-existent responses can mislead doctors in their diagnoses. The EHRSQL dataset stands out as a promising benchmark because it is the only dataset that incorporates unanswerable questions in the EHR QA system alongside practical questions. However, in this work, we identify a data bias in these unanswerable questions; they can often be discerned simply by filtering with specific N-gram patterns. Such biases jeopardize the authenticity and reliability of QA system evaluations. To tackle this problem, we propose a simple debiasing method of adjusting the split between the validation and test sets to neutralize the undue influence of N-gram filtering. By experimenting on the MIMIC-III dataset, we demonstrate both the existing data bias in EHRSQL and the effectiveness of our data split strategy in mitigating this bias.

Submitted 28 April, 2024; originally announced May 2024.

Comments: DPFM Workshop, ICLR 2024

arXiv:2404.17507 [pdf, other]

HYPE: Hyperbolic Entailment Filtering for Underspecified Images and Texts

Authors: Wonjae Kim, Sanghyuk Chun, Taekyung Kim, Dongyoon Han, Sangdoo Yun

Abstract: In an era where the volume of data drives the effectiveness of self-supervised learning, the specificity and clarity of data semantics play a crucial role in model training. Addressing this, we introduce HYPerbolic Entailment filtering (HYPE), a novel methodology designed to meticulously extract modality-wise meaningful and well-aligned data from extensive, noisy image-text pair datasets. Our approach leverages hyperbolic embeddings and the concept of entailment cones to evaluate and filter out samples with meaningless or underspecified semantics, focusing on enhancing the specificity of each data sample. HYPE not only demonstrates a significant improvement in filtering efficiency but also sets a new state-of-the-art in the DataComp benchmark when combined with existing filtering techniques. This breakthrough showcases the potential of HYPE to refine the data selection process, thereby contributing to the development of more accurate and efficient self-supervised learning models. Additionally, the image specificity $ε_{i}$ can be independently applied to induce an image-only dataset from an image-text or image-only data pool for training image-only self-supervised models, and it shows superior performance compared to the dataset induced by CLIP score.

Submitted 16 July, 2024; v1 submitted 26 April, 2024; originally announced April 2024.

Comments: ECCV 2024; 33pages, 4.5MB

arXiv:2404.14202 [pdf, other]

An Adaptive Approach for Infinitely Many-armed Bandits under Generalized Rotting Constraints

Authors: Jung-hun Kim, Milan Vojnovic, Se-Young Yun

Abstract: In this study, we consider the infinitely many-armed bandit problems in a rested rotting setting, where the mean reward of an arm may decrease with each pull, while otherwise, it remains unchanged. We explore two scenarios regarding the rotting of rewards: one in which the cumulative amount of rotting is bounded by $V_T$, referred to as the slow-rotting case, and the other in which the cumulative number of rotting instances is bounded by $S_T$, referred to as the abrupt-rotting case. To address the challenge posed by rotting rewards, we introduce an algorithm that utilizes UCB with an adaptive sliding window, designed to manage the bias and variance trade-off arising due to rotting rewards. Our proposed algorithm achieves tight regret bounds for both slow and abrupt rotting scenarios. Lastly, we demonstrate the performance of our algorithm using numerical experiments.

Submitted 24 May, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

arXiv:2404.13949 [pdf, other]

PeLiCal: Targetless Extrinsic Calibration via Penetrating Lines for RGB-D Cameras with Limited Co-visibility

Authors: Jaeho Shin, Seungsang Yun, Ayoung Kim

Abstract: RGB-D cameras are crucial in robotic perception, given their ability to produce images augmented with depth data. However, their limited FOV often requires multiple cameras to cover a broader area. In multi-camera RGB-D setups, the goal is typically to reduce camera overlap, optimizing spatial coverage with as few cameras as possible. The extrinsic calibration of these systems introduces additional complexities. Existing methods for extrinsic calibration either necessitate specific tools or highly depend on the accuracy of camera motion estimation. To address these issues, we present PeLiCal, a novel line-based calibration approach for RGB-D camera systems exhibiting limited overlap. Our method leverages long line features from surroundings, and filters out outliers with a novel convergence voting algorithm, achieving targetless, real-time, and outlier-robust performance compared to existing methods. We open source our implementation on https://github.com/joomeok/PeLiCal.git.

Submitted 23 April, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

arXiv:2404.11848 [pdf, other]

Partial Large Kernel CNNs for Efficient Super-Resolution

Authors: Dongheon Lee, Seokju Yun, Youngmin Ro

Abstract: Recently, in the super-resolution (SR) domain, transformers have outperformed CNNs with fewer FLOPs and fewer parameters since they can deal with long-range dependency and adaptively adjust weights based on instance. In this paper, we demonstrate that CNNs, although less focused on in the current SR domain, surpass Transformers in direct efficiency measures. By incorporating the advantages of Transformers into CNNs, we aim to achieve both computational efficiency and enhanced performance. However, using a large kernel in the SR domain, which mainly processes large images, incurs a large computational overhead. To overcome this, we propose novel approaches to employing the large kernel, which can reduce latency by 86% compared to the naive large kernel, and leverage an Element-wise Attention module to imitate instance-dependent weights. As a result, we introduce Partial Large Kernel CNNs for Efficient Super-Resolution (PLKSR), which achieves state-of-the-art performance on four datasets at a scale of ×4, with reductions of 68.1% in latency and 80.2% in maximum GPU memory occupancy compared to SRFormer-light.

Submitted 17 April, 2024; originally announced April 2024.

arXiv:2404.11025 [pdf, other]

NeuroHash: A Hyperdimensional Neuro-Symbolic Framework for Spatially-Aware Image Hashing and Retrieval

Authors: Sanggeon Yun, Ryozo Masukawa, SungHeon Jeong, Mohsen Imani

Abstract: Customizable image retrieval from large datasets remains a critical challenge, particularly when preserving spatial relationships within images. Traditional hashing methods, primarily based on deep learning, often fail to capture spatial information adequately and lack transparency. In this paper, we introduce NeuroHash, a novel neuro-symbolic framework leveraging Hyperdimensional Computing (HDC) to enable highly customizable, spatially-aware image retrieval. NeuroHash combines pre-trained deep neural network models with HDC-based symbolic models, allowing for flexible manipulation of hash values to support conditional image retrieval. Our method includes a self-supervised context-aware HDC encoder and novel loss terms for optimizing lower-dimensional bipolar hashing using multilinear hyperplanes. We evaluate NeuroHash on two benchmark datasets, demonstrating superior performance compared to state-of-the-art hashing methods, as measured by mAP@5K scores and our newly introduced metric, mAP@5Kr, which assesses spatial alignment. The results highlight NeuroHash's ability to achieve competitive performance while offering significant advantages in flexibility and customization, paving the way for more advanced and versatile image retrieval systems.

Submitted 22 May, 2024; v1 submitted 16 April, 2024; originally announced April 2024.

arXiv:2404.10709 [pdf, other]

doi 10.17909/jtd6-af15

PEARLS: Discovery of Point-Source Features Within Galaxies in the North Ecliptic Pole Time Domain Field

Authors: Rafael Ortiz III, Rogier A. Windhorst, Seth H. Cohen, S. P. Willner, Rolf A. Jansen, Timothy Carleton, Patrick S. Kamieneski, Michael J. Rutkowski, Brent Smith, Jake Summers, Tyler J. McCabe, Rosalia O'Brien, Jose M. Diego, Min S. Yun, Jordan C. J. D'Silva, Juno Li, Hansung B. Gim, Nimish P. Hathi, Benne W. Holwerda, Adi Zitrin, Cheng Cheng, Noah J. McLeod, Christopher J. Conselice, Simon P. Driver, Haojing Yan, et al. (9 additional authors not shown)

Abstract: The first public 0.9-4.4 $μ$m NIRCam images of the North Ecliptic Pole (NEP) Time Domain Field (TDF) uncovered many galaxies that display point-source features in their cores as seen in the longer wavelength filters. We visually identified a sample of 66 galaxies ($\sim$1 galaxy per arcmin$^2$) with point-like cores and fit their spectral energy distributions (SEDs) using EAZY and CIGALE to characterize the sample's active galactic nucleus (AGN) and host galaxy parameters. Single-template fitting best fits $70\%$ of the sample with a Seyfert-blended SED. With CIGALE we compute the median fractional AGN contribution to the 0.1-30.0 $μ$m flux to be $0.30\pm0.06$, and that $56\%$ of the 66 galaxies have star-formation rates in the starburst range whereas the remainder are near the star-formation main sequence. There are Very Large Array (VLA) 3 GHz detections for 24/66 galaxies, implying some combination of AGN emission and vigorous star formation. We present a novel sample selection procedure in tandem to the morphological sample selection based on objects' light profiles at 4.4 $μ$m. This procedure identifies a parameter space which probes point-source features and automatically recovers our visual sample with minimal contamination from both stars and brighter galaxies. This procedure may be used in other extant and future NIRCam images to streamline the search for galaxies with point-like cores. The morphological approach to recognizing AGN is being resurrected by the James Webb Space Telescope (JWST) by virtue of its superb angular resolution at infrared wavelengths.

Submitted 16 April, 2024; originally announced April 2024.

Comments: 11 pages, 6 figures, 1 table

arXiv:2404.10308 [pdf, other]

Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs

Authors: Woomin Song, Seunghyuk Oh, Sangwoo Mo, Jaehyung Kim, Sukmin Yun, Jung-Woo Ha, Jinwoo Shin

Abstract: Large language models (LLMs) have shown remarkable performance in various natural language processing tasks. However, a primary constraint they face is the context limit, i.e., the maximum number of tokens they can process. Previous works have explored architectural changes and modifications in positional encoding to relax the constraint, but they often require expensive training or do not address the computational demands of self-attention. In this paper, we present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the limitations. HOMER uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks. Each chunk is then processed collectively, employing a hierarchical strategy that merges adjacent chunks at progressive transformer layers. A token reduction technique precedes each merging, ensuring memory usage efficiency. We also propose an optimized computational order reducing the memory requirement to logarithmically scale with respect to input length, making it especially favorable for environments with tight memory restrictions. Our experiments demonstrate the proposed method's superior performance and memory efficiency, enabling the broader use of LLMs in contexts requiring extended context. Code is available at https://github.com/alinlab/HOMER.

Submitted 16 April, 2024; originally announced April 2024.

Comments: Accepted to ICLR 2024. The first two authors contributed equally

arXiv:2404.09207 [pdf, other]

DEGNN: Dual Experts Graph Neural Network Handling Both Edge and Node Feature Noise

Authors: Tai Hasegawa, Sukwon Yun, Xin Liu, Yin Jun Phua, Tsuyoshi Murata

Abstract: Graph Neural Networks (GNNs) have achieved notable success in various applications over graph data. However, recent research has revealed that real-world graphs often contain noise, and GNNs are susceptible to noise in the graph. To address this issue, several Graph Structure Learning (GSL) models have been introduced. While GSL models are tailored to enhance robustness against edge noise through edge reconstruction, a significant limitation surfaces: their high reliance on node features. This inherent dependence amplifies their susceptibility to noise within node features. Recognizing this vulnerability, we present DEGNN, a novel GNN model designed to adeptly mitigate noise in both edges and node features. The core idea of DEGNN is to design two separate experts: an edge expert and a node feature expert. These experts utilize self-supervised learning techniques to produce modified edges and node features. Leveraging these modified representations, DEGNN subsequently addresses downstream tasks, ensuring robustness against noise present in both edges and node features of real-world graphs. Notably, the modification process can be trained end-to-end, empowering DEGNN to adjust dynamically and achieve optimal edge and node representations for specific tasks. Comprehensive experiments demonstrate DEGNN's efficacy in managing noise, both in original real-world graphs and in graphs with synthetic noise.

Submitted 14 April, 2024; originally announced April 2024.

Comments: PAKDD 2024, the code is available at https://github.com/TaiHasegawa/DEGNN

arXiv:2404.08058 [pdf, other]

Birds of a Feather: Resolving Stellar Mass Assembly With JWST/NIRCam in a Pair of Kindred $z \sim 2$ Dusty Star-forming Galaxies Lensed by the PLCK G165.7+67.0 Cluster

Authors: Patrick S. Kamieneski, Brenda L. Frye, Rogier A. Windhorst, Kevin C. Harrington, Min S. Yun, Allison Noble, Massimo Pascale, Nicholas Foo, Seth H. Cohen, Rolf A. Jansen, Timothy Carleton, Anton M. Koekemoer, Christopher N. A. Willmer, Jake S. Summers, Nikhil Garuda, Reagen Leimbach, Benne W. Holwerda, Justin D. R. Pierel, Eric F. Jimenez-Andrade, S. P. Willner, Belen Alcalde Pampliega, Amit Vishwas, William C. Keel, Q. Daniel Wang, Cheng Cheng , et al. (16 additional authors not shown)

Abstract: We present a new parametric lens model for the G165.7+67.0 galaxy cluster, which was discovered with $Planck$ through its bright submillimeter flux, originating from a pair of extraordinary dusty star-forming galaxies (DSFGs) at $z\approx 2.2$. Using JWST and interferometric mm/radio observations, we characterize the intrinsic physical properties of the DSFGs, which are separated by only $\sim 1^{\prime\prime}$ (8 kpc) and a velocity difference $\Delta V \lesssim 600~{\rm km}~{\rm s}^{-1}$ in the source plane, and thus likely undergoing a major merger. Boasting intrinsic star formation rates ${\rm SFR}_{\rm IR} = 320 \pm 70$ and $400 \pm 80~ M_\odot~{\rm yr}^{-1}$, stellar masses ${\rm log}[M_\star/M_\odot] = 10.2 \pm 0.1$ and $10.3 \pm 0.1$, and dust attenuations $A_V = 1.5 \pm 0.3$ and $1.2 \pm 0.3$, they are remarkably similar objects. We perform spatially-resolved pixel-by-pixel SED fitting using rest-frame near-UV to near-IR imaging from JWST/NIRCam for both galaxies, resolving some stellar structures down to 100 pc scales. Based on their resolved specific SFRs and $UVJ$ colors, both DSFGs are experiencing significant galaxy-scale star formation events. If they are indeed interacting gravitationally, this strong starburst could be the hallmark of gas that has been disrupted by an initial close passage. In contrast, the host galaxy of the recently discovered triply-imaged SN H0pe has a much lower SFR than the DSFGs, and we present evidence for the onset of inside-out quenching and large column densities of dust even in regions of low specific SFR.
Based on the intrinsic SFRs of the DSFGs inferred from UV through FIR SED modeling, this pair of objects alone is predicted to yield an observable $1.1 \pm 0.2~{\rm CCSNe~yr}^{-1}$, making this cluster field ripe for continued monitoring.

Submitted 11 April, 2024; originally announced April 2024.

Comments: 47 pages, 21 figures, 5 tables. Submitted to ApJ, comments welcome!

arXiv:2403.19997 [pdf, other]

Size-dependent fracture in elastomers: experiments and continuum modeling

Authors: Jaehee Lee, Jeongun Lee, Seounghee Yun, Sanha Kim, Shawn A. Chester, Hansohl Cho

Abstract: Elastomeric materials display a complicated set of stretchability and fracture properties that strongly depend on the flaw size, which has long been of interest to engineers and materials scientists. Here, we combine experiments and numerical simulations for a comprehensive understanding of the nonlocal, size-dependent features of fracture in elastomers. We show the size-dependent fracture behavior is quantitatively described through a nonlocal continuum model. The key ingredient of the nonlocal model is the use of an intrinsic length scale associated with a finite fracture process zone, which is inferred from experiments. Of particular importance, our experimental and theoretical approach passes the critical test of capturing key aspects of the size-dependent fracture in elastomers. Applications to a wide range of synthetic elastomers that exhibit moderate (~100%) to extreme stretchability (~1000%) are presented, which are also used to demonstrate the applicability of our approach in elastomeric specimens with complex geometries.

Submitted 29 March, 2024; originally announced March 2024.

arXiv:2403.19522 [pdf, other]

Model Stock: All we need is just a few fine-tuned models

Authors: Dong-Hwan Jang, Sangdoo Yun, Dongyoon Han

Abstract: This paper introduces an efficient fine-tuning method for large pre-trained models, offering strong in-distribution (ID) and out-of-distribution (OOD) performance. Breaking away from traditional practices that need a multitude of fine-tuned models for averaging, our approach employs significantly fewer models to achieve final weights yet yields superior accuracy. Drawing from key insights in the weight space of fine-tuned weights, we uncover a strong link between the performance and proximity to the center of weight space. Based on this, we introduce a method that approximates a center-close weight using only two fine-tuned models, applicable during or after training. Our innovative layer-wise weight averaging technique surpasses state-of-the-art model methods such as Model Soup, utilizing only two fine-tuned models. This strategy can be aptly coined Model Stock, highlighting its reliance on selecting a minimal number of models to draw a more optimized-averaged model. We demonstrate the efficacy of Model Stock with fine-tuned models based upon pre-trained CLIP architectures, achieving remarkable performance on both ID and OOD tasks on the standard benchmarks, all while barely bringing extra computational demands. Our code and pre-trained models are available at https://github.com/naver-ai/model-stock.

Submitted 28 March, 2024; originally announced March 2024.

Comments: Code at https://github.com/naver-ai/model-stock

arXiv:2403.18260 [pdf, other]

Toward Interactive Regional Understanding in Vision-Large Language Models

Authors: Jungbeom Lee, Sanghyuk Chun, Sangdoo Yun

Abstract: Recent Vision-Language Pre-training (VLP) models have demonstrated significant advancements. Nevertheless, these models heavily rely on image-text pairs that capture only coarse and global information of an image, leading to a limitation in their regional understanding ability. In this work, we introduce RegionVLM, equipped with explicit regional modeling capabilities, allowing it to understand user-indicated image regions. To achieve this, we design a simple yet innovative architecture, requiring no modifications to the model architecture or objective function. Additionally, we leverage a dataset that contains a novel source of information, namely Localized Narratives, which has been overlooked in previous VLP research. Our experiments demonstrate that our single generalist model not only achieves an interactive dialogue system but also exhibits superior performance on various zero-shot region understanding tasks, without compromising its ability for global image understanding.

Submitted 27 March, 2024; originally announced March 2024.

Comments: NAACL 2024 Main Conference

arXiv:2403.14027 [pdf, other]

EcoSense: Energy-Efficient Intelligent Sensing for In-Shore Ship Detection through Edge-Cloud Collaboration

Authors: Wenjun Huang, Hanning Chen, Yang Ni, Arghavan Rezvani, Sanggeon Yun, Sungheon Jeon, Eric Pedley, Mohsen Imani

Abstract: Detecting marine objects inshore presents challenges owing to algorithmic intricacies and complexities in system deployment. We propose a difficulty-aware edge-cloud collaborative sensing system that splits the task into object localization and fine-grained classification. Objects are classified either at the edge or within the cloud, based on their estimated difficulty. The framework comprises a low-power device-tailored front-end model for object localization, classification, and difficulty estimation, along with a transformer-graph convolutional network-based back-end model for fine-grained classification. Our system demonstrates superior performance (mAP@0.5 +4.3%) on widely used marine object detection datasets, significantly reducing both data transmission volume (by 95.43%) and energy consumption (by 72.7%) at the system level. We validate the proposed system across various embedded system platforms and in real-world scenarios involving drone deployment.

Submitted 26 March, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

arXiv:2403.13298 [pdf, other]

Rotary Position Embedding for Vision Transformer

Authors: Byeongho Heo, Song Park, Dongyoon Han, Sangdoo Yun

Abstract: Rotary Position Embedding (RoPE) performs remarkably on language models, especially for length extrapolation of Transformers. However, the impacts of RoPE on computer vision domains have been underexplored, even though RoPE appears capable of enhancing Vision Transformer (ViT) performance in a way similar to the language domain. This study provides a comprehensive analysis of RoPE when applied to ViTs, utilizing practical implementations of RoPE for 2D vision data. The analysis reveals that RoPE demonstrates impressive extrapolation performance, i.e., maintaining precision while increasing image resolution at inference. It eventually leads to performance improvement for ImageNet-1k, COCO detection, and ADE-20k segmentation. We believe this study provides thorough guidelines to apply RoPE to ViT, promising improved backbone performance with minimal extra computational overhead. Our code and pre-trained models are available at https://github.com/naver-ai/rope-vit
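As background for this entry, the rotation at the heart of RoPE can be sketched in a few lines. The sketch below shows only the standard 1D form used in language models, not the 2D variants studied in the paper; the function names and the `base` value of 10000 are illustrative assumptions.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles.

    Pair number k (elements 2k and 2k+1) is rotated by the angle
    pos * base ** (-2k / d), where d is the vector length.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):  # i = 2k
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))
```

Because each pair is rotated by an angle proportional to its position, the dot product between a rotated query at position m and a rotated key at position n depends only on the offset m - n, which is why the encoding is described as relative.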

Submitted 16 July, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

Comments: Accepted to ECCV 2024

arXiv:2403.08108 [pdf, other]

TaskCLIP: Extend Large Vision-Language Model for Task Oriented Object Detection

Authors: Hanning Chen, Wenjun Huang, Yang Ni, Sanggeon Yun, Fei Wen, Hugo Latapie, Mohsen Imani

Abstract: Task-oriented object detection aims to find objects suitable for accomplishing specific tasks. As a challenging task, it requires simultaneous visual data processing and reasoning under ambiguous semantics. Recent solutions are mainly all-in-one models. However, the object detection backbones are pre-trained without text supervision. Thus, to incorporate task requirements, their intricate models undergo extensive learning on a highly imbalanced and scarce dataset, resulting in capped performance, laborious training, and poor generalizability. In contrast, we propose TaskCLIP, a more natural two-stage design composed of general object detection and task-guided object selection. Particularly for the latter, we resort to the recently successful large Vision-Language Models (VLMs) as our backbone, which provides rich semantic knowledge and a uniform embedding space for images and texts. Nevertheless, the naive application of VLMs leads to sub-optimal quality, due to the misalignment between embeddings of object images and their visual attributes, which are mainly adjective phrases. To this end, we design a transformer-based aligner after the pre-trained VLMs to re-calibrate both embeddings. Finally, we employ a trainable score function to post-process the VLM matching results for object selection. Experimental results demonstrate that our TaskCLIP outperforms the state-of-the-art DETR-based model TOIST by 3.5% and only requires a single NVIDIA RTX 4090 for both training and inference.

Submitted 12 March, 2024; originally announced March 2024.

arXiv:2403.07145

Electrically Programmable Pixelated Graphene-Integrated Plasmonic Metasurfaces for Coherent Mid-Infrared Emission

Authors: Xiu Liu, Yibai Zhong, Zexiao Wang, Tianyi Huang, Sen Lin, Jingyi Zou, Haozhe Wang, Zhien Wang, Zhuo Li, Xiao Luo, Rui Cheng, Jiayu Li, Hyeong Seok Yun, Han Wang, Jing Kong, Xu Zhang, Sheng Shen

Abstract: Active metasurfaces have recently emerged as compact, lightweight, and efficient platforms for dynamic control of electromagnetic fields and optical responses. However, the complexities associated with their post-fabrication tunability significantly hinder their widespread applications, especially for the mid-infrared range due to material scarcity and design intricacy. Here, we experimentally demonstrate highly dynamic, pixelated modulations of coherent mid-infrared emission based on an electrically programmable plasmonic metasurface integrated with graphene field effect transistors (Gr-FETs). The ultrabroad infrared transparency of graphene allows for free-form control over plasmonic meta-atoms, thus achieving coherent mid-infrared states across a broad range of wavelengths and polarizations. The spatial temperature modulation generated by Gr-FETs is effectively synergized with the emissivity control by the localized surface plasmon polaritons from gold nanoantennas. This integrated temperature-emissivity modulation of metasurfaces is systematically extended to form a pixelated 2D array, envisioning new approaches toward scalable 2D electrical wiring for densely packed, independently controlled pixels.

Submitted 6 May, 2024; v1 submitted 11 March, 2024; originally announced March 2024.

Comments: Needs more updates for the experiments

arXiv:2403.06342 [pdf, other]

Separable Physics-informed Neural Networks for Solving the BGK Model of the Boltzmann Equation

Authors: Jaemin Oh, Seung Yeon Cho, Seok-Bae Yun, Eunbyung Park, Youngjoon Hong

Abstract: In this study, we introduce a method based on Separable Physics-Informed Neural Networks (SPINNs) for effectively solving the BGK model of the Boltzmann equation. While the mesh-free nature of PINNs offers significant advantages in handling high-dimensional partial differential equations (PDEs), challenges arise when applying quadrature rules for accurate integral evaluation in the BGK operator, which can compromise the mesh-free benefit and increase computational costs. To address this, we leverage the canonical polyadic decomposition structure of SPINNs and the linear nature of moment calculation, achieving a substantial reduction in computational expense for quadrature rule application. The multi-scale nature of the particle density function poses difficulties in precisely approximating macroscopic moments using neural networks. To improve SPINN training, we introduce the integration of Gaussian functions into SPINNs, coupled with a relative loss approach. This modification enables SPINNs to decay as rapidly as Maxwellian distributions, thereby enhancing the accuracy of macroscopic moment approximations. The relative loss design further ensures that both large and small-scale features are effectively captured by the SPINNs. The efficacy of our approach is demonstrated through a series of five numerical experiments, including the solution to a challenging 3D Riemann problem. These results highlight the potential of our novel method in efficiently and accurately addressing complex challenges in computational physics.

Submitted 10 March, 2024; originally announced March 2024.

MSC Class: 68T20; 35R09

arXiv:2403.05973 [pdf, other]

Calibrating Large Language Models Using Their Generations Only

Authors: Dennis Ulmer, Martin Gubri, Hwaran Lee, Sangdoo Yun, Seong Joon Oh

Abstract: As large language models (LLMs) are increasingly deployed in user-facing applications, building trust and maintaining safety by accurately quantifying a model's confidence in its prediction becomes even more important. However, finding effective ways to calibrate LLMs - especially when the only interface to the models is their generated text - remains a challenge. We propose APRICOT (auxiliary prediction of confidence targets): A method to set confidence targets and train an additional model that predicts an LLM's confidence based on its textual input and output alone. This approach has several advantages: It is conceptually simple, does not require access to the target model beyond its output, does not interfere with the language generation, and has a multitude of potential usages, for instance by verbalizing the predicted confidence or adjusting the given answer based on the confidence. We show how our approach performs competitively in terms of calibration error for white-box and black-box LLMs on closed-book question-answering to detect incorrect LLM answers.
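The abstract reports results in terms of calibration error without naming a specific metric. One widely used choice is the expected calibration error (ECE) with equal-width confidence bins; the sketch below assumes that metric, and the function name and the default of 10 bins are illustrative rather than taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected calibration error with equal-width confidence bins.

    confidences: predicted confidences in [0, 1]
    correct:     booleans, True where the prediction was right
    Returns the bin-size-weighted mean of |accuracy - mean confidence|.
    """
    if len(confidences) != len(correct):
        raise ValueError("inputs must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if ok else 0.0))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece
```

A model that answers with 90% confidence and is right nine times out of ten scores close to zero, while one that is fully confident and always wrong scores 1.0.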

Submitted 9 March, 2024; originally announced March 2024.

arXiv:2403.05763 [pdf, other]

HDReason: Algorithm-Hardware Codesign for Hyperdimensional Knowledge Graph Reasoning

Authors: Hanning Chen, Yang Ni, Ali Zakeri, Zhuowen Zou, Sanggeon Yun, Fei Wen, Behnam Khaleghi, Narayan Srinivasa, Hugo Latapie, Mohsen Imani

Abstract: In recent times, a plethora of hardware accelerators have been put forth for graph learning applications such as vertex classification and graph classification. However, previous works have paid little attention to Knowledge Graph Completion (KGC), a task that is well-known for its significantly higher algorithm complexity. The state-of-the-art KGC solutions based on graph convolution neural network (GCN) involve extensive vertex/relation embedding updates and complicated score functions, which are inherently cumbersome for acceleration. As a result, existing accelerator designs are no longer optimal, and a novel algorithm-hardware co-design for KG reasoning is needed. Recently, brain-inspired HyperDimensional Computing (HDC) has been introduced as a promising solution for lightweight machine learning, particularly for graph learning applications. In this paper, we leverage HDC for an intrinsically more efficient and acceleration-friendly KGC algorithm. We also co-design an acceleration framework named HDReason targeting FPGA platforms. On the algorithm level, HDReason achieves a balance between high reasoning accuracy, strong model interpretability, and lower computational complexity. In terms of architecture, HDReason offers reconfigurability, high training throughput, and low energy consumption. When compared with NVIDIA RTX 4090 GPU, the proposed accelerator achieves an average 10.6x speedup and 65x energy efficiency improvement.
In cross-model and cross-platform comparisons, HDReason yields an average 4.2x higher performance and 3.4x better energy efficiency with similar accuracy versus the state-of-the-art FPGA-based GCN training platform.

Submitted 8 March, 2024; originally announced March 2024.

arXiv:2402.19321 [pdf, ps, other]

Connections between Planetary Populations and the Chemical Characteristics of their Host Stars

Authors: Sol Yun, Young Sun Lee, Young Kwang Kim, Timothy C. Beers, Togay Berfin, Dongwook Lim

Abstract: Chemical anomalies in planet-hosting stars (PHSs) are studied in order to assess how the planetary nature and multiplicity affect the atmospheric chemical abundances of their host stars. We employ APOGEE DR17 to select thin-disk stars of the Milky Way, and cross-match them with the Kepler Input Catalog to identify confirmed PHSs, which results in 227 PHSs with available chemical-abundance ratios for six refractory elements. We also examine an ensemble of stars without planet signals, which are equivalent to the selected PHSs in terms of evolutionary stage and stellar parameters, to correct for Galactic chemical-evolution effects, and derive the abundance gradient of refractory elements over the condensation temperature for the PHSs. Using the Galactic chemical-evolution corrected abundances, we found that PHSs do not show a significant difference in abundance slope from the stars without planets. Furthermore, we examine the depletion trends of refractory elements of PHSs depending on total number of planets and their types, and find that the PHSs with giant planets are more depleted in refractory elements than those with rocky planets. Among the PHSs with rocky planets, the refractory-depletion trends are potentially correlated with the terrestrial planets' radii and multiplicity. In the cases of PHSs with giant planets, sub-Jovian PHSs demonstrated more depleted refractory trends than stars hosting Jovian-mass planets, raising questions on different planetary-formation processes for Neptune-like and Jupiter-like planets.

Submitted 29 February, 2024; originally announced February 2024.

Comments: 14 pages, 6 figures

arXiv:2402.12991 [pdf, other]

TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification

Authors: Martin Gubri, Dennis Ulmer, Hwaran Lee, Sangdoo Yun, Seong Joon Oh

Abstract: Large Language Model (LLM) services and models often come with legal rules on who can use them and how they must use them. Assessing the compliance of the released LLMs is crucial, as these rules protect the interests of the LLM contributor and prevent misuse. In this context, we describe the novel fingerprinting problem of Black-box Identity Verification (BBIV). The goal is to determine whether a third-party application uses a certain LLM through its chat function. We propose a method called Targeted Random Adversarial Prompt (TRAP) that identifies the specific LLM in use. We repurpose adversarial suffixes, originally proposed for jailbreaking, to get a pre-defined answer from the target LLM, while other models give random answers. TRAP detects the target LLMs with over 95% true positive rate at under 0.2% false positive rate even after a single interaction. TRAP remains effective even if the LLM has minor changes that do not significantly alter the original function.

Submitted 6 June, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

Comments: Accepted at ACL 2024 (findings)

arXiv:2402.08594 [pdf, other]

doi 10.18653/v1/2023.findings-emnlp.329

Bayesian Multi-Task Transfer Learning for Soft Prompt Tuning

Authors: Haeju Lee, Minchan Jeong, Se-Young Yun, Kee-Eung Kim

Abstract: Prompt tuning, in which prompts are optimized to adapt large-scale pre-trained language models to downstream tasks instead of fine-tuning the full model parameters, has been shown to be particularly effective when the prompts are trained in a multi-task transfer learning setting. These methods generally involve individually training prompts for each source task and then aggregating them to provide the initialization of the prompt for the target task. However, this approach critically ignores the fact that some of the source tasks could be negatively or positively interfering with each other. We argue that when we extract knowledge from source tasks via training source prompts, we need to consider this correlation among source tasks for better transfer to target tasks. To this end, we propose a Bayesian approach where we work with the posterior distribution of prompts across source tasks. We obtain representative source prompts corresponding to the samples from the posterior utilizing Stein Variational Gradient Descent, which are then aggregated to constitute the initial target prompt. We show extensive experimental results on the standard benchmark NLP tasks, where our Bayesian multi-task transfer learning approach outperforms the state-of-the-art methods in many settings. Furthermore, our approach requires no auxiliary models other than the prompt itself, achieving a high degree of parameter efficiency.

Submitted 13 February, 2024; originally announced February 2024.

Comments: The first two authors equally contributed to this work. Findings of EMNLP 2023

arXiv:2402.06974 [pdf, other]

Hypernetwork-Driven Model Fusion for Federated Domain Generalization

Authors: Marc Bartholet, Taehyeon Kim, Ami Beuret, Se-Young Yun, Joachim M. Buhmann

Abstract: Federated Learning (FL) faces significant challenges with domain shifts in heterogeneous data, degrading performance. Traditional domain generalization aims to learn domain-invariant features, but the federated nature of model averaging often limits this due to its linear aggregation of local learning. To address this, we propose a robust framework, coined as hypernetwork-based Federated Fusion (hFedF), using hypernetworks for non-linear aggregation, facilitating generalization to unseen domains. Our method employs client-specific embeddings and gradient alignment techniques to manage domain generalization effectively. Evaluated in both zero-shot and few-shot settings, hFedF demonstrates superior performance in handling domain shifts. Comprehensive comparisons on PACS, Office-Home, and VLCS datasets show that hFedF consistently achieves the highest in-domain and out-of-domain accuracy with reliable predictions. Our study contributes significantly to the under-explored field of Federated Domain Generalization (FDG), setting a new benchmark for performance in this area.

Submitted 28 May, 2024; v1 submitted 10 February, 2024; originally announced February 2024.

arXiv:2402.05353 [pdf, other]

Revisiting Early-Learning Regularization When Federated Learning Meets Noisy Labels

Authors: Taehyeon Kim, Donggyu Kim, Se-Young Yun

Abstract: In the evolving landscape of federated learning (FL), addressing label noise presents unique challenges due to the decentralized and diverse nature of data collection across clients. Traditional centralized learning approaches to mitigate label noise are constrained in FL by privacy concerns and the heterogeneity of client data. This paper revisits early-learning regularization, introducing an innovative strategy, Federated Label-mixture Regularization (FLR). FLR adeptly adapts to FL's complexities by generating new pseudo labels, blending local and global model predictions. This method not only enhances the accuracy of the global model in both i.i.d. and non-i.i.d. settings but also effectively counters the memorization of noisy labels. Demonstrating compatibility with existing label noise and FL techniques, FLR paves the way for improved generalization in FL environments fraught with label inaccuracies.

Submitted 7 February, 2024; originally announced February 2024.

arXiv:2402.03898 [pdf, other]

DistiLLM: Towards Streamlined Distillation for Large Language Models

Authors: Jongwoo Ko, Sungnyun Kim, Tianyi Chen, Se-Young Yun

Abstract: Knowledge distillation (KD) is widely used for compressing a teacher model to a smaller student model, reducing its inference cost and memory footprint while preserving model capabilities. However, current KD methods for auto-regressive sequence models (e.g., large language models) suffer from missing a standardized objective function. Moreover, the recent use of student-generated outputs to address training-inference mismatches has significantly escalated computational costs. To tackle these issues, we introduce DistiLLM, a more effective and efficient KD framework for auto-regressive language models. DistiLLM comprises two components: (1) a novel skew Kullback-Leibler divergence loss, where we unveil and leverage its theoretical properties, and (2) an adaptive off-policy approach designed to enhance the efficiency in utilizing student-generated outputs. Extensive experiments, including instruction-following tasks, demonstrate the effectiveness of DistiLLM in building high-performing student models while achieving up to 4.3$\times$ speedup compared to recent KD methods.

Submitted 3 July, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

Comments: ICML 2024; Code is available at https://github.com/jongwooko/distillm
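
The DistiLLM abstract above mentions a skew Kullback-Leibler divergence loss. A minimal sketch of that idea follows, assuming the common definition of the alpha-skew KL divergence as KL(p || alpha * p + (1 - alpha) * q), where p is the teacher distribution and q the student distribution. The function names, the example distributions, and the value alpha = 0.1 are illustrative assumptions, not taken from the listing or from the authors' code.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    # Terms with p_i == 0 contribute nothing; q_i is floored at eps to avoid log(0).
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def skew_kl_divergence(p, q, alpha=0.1):
    """Alpha-skew KL (assumed definition): KL(p || alpha * p + (1 - alpha) * q)."""
    mixture = [alpha * pi + (1 - alpha) * qi for pi, qi in zip(p, q)]
    return kl_divergence(p, mixture)


# Hypothetical next-token distributions over a three-token vocabulary.
teacher = [0.7, 0.2, 0.1]
student = [0.98, 0.02, 0.0]  # assigns zero mass to a token the teacher uses

plain = kl_divergence(teacher, student)
skewed = skew_kl_divergence(teacher, student, alpha=0.1)
print(f"KL={plain:.3f}, skew KL={skewed:.3f}")
```

Mixing a fraction of the teacher into the second argument keeps the denominator away from zero wherever the teacher has mass, so the skewed value stays moderate even when the student assigns zero probability to a token the teacher considers likely.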

Showing 1–50 of 636 results for author: Yun, S