
Showing 1–25 of 25 results for author: Lim, J H

  1. arXiv:2407.04903 [pdf, other]

    cs.CL cs.AI cs.CV

    MMSci: A Multimodal Multi-Discipline Dataset for PhD-Level Scientific Comprehension

    Authors: Zekun Li, Xianjun Yang, Kyuri Choi, Wanrong Zhu, Ryan Hsieh, HyeonJung Kim, Jin Hyuk Lim, Sungyoung Ji, Byungju Lee, Xifeng Yan, Linda Ruth Petzold, Stephen D. Wilson, Woosang Lim, William Yang Wang

    Abstract: The rapid advancement of Large Language Models (LLMs) and Large Multimodal Models (LMMs) has heightened the demand for AI-based scientific assistants capable of understanding scientific articles and figures. Despite progress, there remains a significant gap in evaluating models' comprehension of professional, graduate-level, and even PhD-level scientific content. Current datasets and benchmarks pr…

    Submitted 5 July, 2024; originally announced July 2024.

    Comments: Code and data are available at https://github.com/Leezekun/MMSci

  2. arXiv:2405.16496 [pdf, other]

    cs.CV cs.AI cs.LG

    Exploring a Multimodal Fusion-based Deep Learning Network for Detecting Facial Palsy

    Authors: Nicole Heng Yim Oo, Min Hun Lee, Jeong Hoon Lim

    Abstract: Algorithmic detection of facial palsy offers the potential to improve current practices, which usually involve labor-intensive and subjective assessment by clinicians. In this paper, we present a multimodal fusion-based deep learning model that utilizes unstructured data (i.e. an image frame with facial line segments) and structured data (i.e. features of facial expressions) to detect facial palsy…

    Submitted 26 May, 2024; originally announced May 2024.

  3. arXiv:2405.12538 [pdf, other]

    cs.CV

    Bridging the Intent Gap: Knowledge-Enhanced Visual Generation

    Authors: Yi Cheng, Ziwei Xu, Dongyun Lin, Harry Cheng, Yongkang Wong, Ying Sun, Joo Hwee Lim, Mohan Kankanhalli

    Abstract: For visual content generation, discrepancies between user intentions and the generated content have been a longstanding problem. This discrepancy arises from two main factors. First, user intentions are inherently complex, with subtle details not fully captured by input prompts. The absence of such details makes it challenging for generative models to accurately reflect the intended meaning, leadi…

    Submitted 21 May, 2024; originally announced May 2024.

  4. arXiv:2404.01954 [pdf, other]

    cs.CL cs.AI

    HyperCLOVA X Technical Report

    Authors: Kang Min Yoo, Jaegeun Han, Sookyo In, Heewon Jeon, Jisu Jeong, Jaewook Kang, Hyunwook Kim, Kyung-Min Kim, Munhyong Kim, Sungju Kim, Donghyun Kwak, Hanock Kwak, Se Jung Kwon, Bado Lee, Dongsoo Lee, Gichang Lee, Jooho Lee, Baeseong Park, Seongjin Shin, Joonsang Yu, Seolki Baek, Sumin Byeon, Eungsup Cho, Dooseok Choe, Jeesung Han, et al. (371 additional authors not shown)

    Abstract: We introduce HyperCLOVA X, a family of large language models (LLMs) tailored to the Korean language and culture, along with competitive capabilities in English, math, and coding. HyperCLOVA X was trained on a balanced mix of Korean, English, and code data, followed by instruction-tuning with high-quality human-annotated datasets while abiding by strict safety guidelines reflecting our commitment t…

    Submitted 13 April, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

    Comments: 44 pages; updated authors list and fixed author names

  5. arXiv:2309.09311 [pdf, other]

    cs.CV cs.IR cs.MM

    Towards Debiasing Frame Length Bias in Text-Video Retrieval via Causal Intervention

    Authors: Burak Satar, Hongyuan Zhu, Hanwang Zhang, Joo Hwee Lim

    Abstract: Many studies focus on improving pretraining or developing new backbones in text-video retrieval. However, existing methods may suffer from the learning and inference bias issue, as recent research suggests in other text-video-related tasks. For instance, spatial appearance features on action recognition or temporal object co-occurrences on video scene graph generation could induce spurious correla…

    Submitted 17 September, 2023; originally announced September 2023.

    Comments: Accepted by the British Machine Vision Conference (BMVC) 2023. Project Page: https://buraksatar.github.io/FrameLengthBias

  6. arXiv:2306.04345 [pdf, other]

    cs.CV cs.IR cs.MM

    An Overview of Challenges in Egocentric Text-Video Retrieval

    Authors: Burak Satar, Hongyuan Zhu, Hanwang Zhang, Joo Hwee Lim

    Abstract: Text-video retrieval contains various challenges, including biases coming from diverse sources. We highlight some of them supported by illustrations to open a discussion. Besides, we address one of the biases, frame length bias, with a simple method which brings a very incremental but promising increase. We conclude with future directions.

    Submitted 7 June, 2023; originally announced June 2023.

    Comments: 4 pages, CVPR 2023 Joint Ego4D&EPIC Workshop, Extended Abstract

  7. Adaptive Learning based Upper-Limb Rehabilitation Training System with Collaborative Robot

    Authors: Jun Hong Lim, Kaibo He, Zeji Yi, Chen Hou, Chen Zhang, Yanan Sui, Luming Li

    Abstract: Rehabilitation training for patients with motor disabilities usually requires specialized devices in rehabilitation centers. Home-based multi-purpose training would significantly increase treatment accessibility and reduce medical costs. While it is unlikely to equip a set of rehabilitation robots at home, we investigate the feasibility of using the general-purpose collaborative robot for rehabilita…

    Submitted 12 July, 2023; v1 submitted 17 May, 2023; originally announced May 2023.

    Journal ref: EMBC2023

  8. arXiv:2302.07400 [pdf, other]

    cs.LG math.FA stat.ML

    Score-based Diffusion Models in Function Space

    Authors: Jae Hyun Lim, Nikola B. Kovachki, Ricardo Baptista, Christopher Beckham, Kamyar Azizzadenesheli, Jean Kossaifi, Vikram Voleti, Jiaming Song, Karsten Kreis, Jan Kautz, Christopher Pal, Arash Vahdat, Anima Anandkumar

    Abstract: Diffusion models have recently emerged as a powerful framework for generative modeling. They consist of a forward process that perturbs input data with Gaussian white noise and a reverse process that learns a score function to generate samples by denoising. Despite their tremendous success, they are mostly formulated on finite-dimensional spaces, e.g. Euclidean, limiting their applications to many…

    Submitted 22 November, 2023; v1 submitted 14 February, 2023; originally announced February 2023.

    Comments: 52 pages

    MSC Class: 46B09 (Primary); 60J22 (Secondary)

    ACM Class: I.2.6; J.2

  9. arXiv:2212.04614 [pdf, other]

    cs.LG

    Is Bio-Inspired Learning Better than Backprop? Benchmarking Bio Learning vs. Backprop

    Authors: Manas Gupta, Sarthak Ketanbhai Modi, Hang Zhang, Joon Hei Lee, Joo Hwee Lim

    Abstract: Bio-inspired learning has been gaining popularity recently given that Backpropagation (BP) is not considered biologically plausible. Many algorithms have been proposed in the literature which are all more biologically plausible than BP. However, apart from overcoming the biological implausibility of BP, a strong motivation for using Bio-inspired algorithms remains lacking. In this study, we undert…

    Submitted 30 August, 2023; v1 submitted 8 December, 2022; originally announced December 2022.

  10. arXiv:2211.11174 [pdf, other]

    cs.CV

    Unveiling the Tapestry: the Interplay of Generalization and Forgetting in Continual Learning

    Authors: Zenglin Shi, Jing Jie, Ying Sun, Joo Hwee Lim, Mengmi Zhang

    Abstract: In AI, generalization refers to a model's ability to perform well on out-of-distribution data related to the given task, beyond the data it was trained on. For an AI agent to excel, it must also possess the continual learning capability, whereby an agent incrementally learns to perform a sequence of tasks without forgetting the previously acquired knowledge to solve the old tasks. Intuitively, gen…

    Submitted 17 January, 2024; v1 submitted 20 November, 2022; originally announced November 2022.

  11. Portmanteauing Features for Scene Text Recognition

    Authors: Yew Lee Tan, Ernest Yu Kai Chew, Adams Wai-Kin Kong, Jung-Jae Kim, Joo Hwee Lim

    Abstract: Scene text images have different shapes and are subjected to various distortions, e.g. perspective distortions. To handle these challenges, the state-of-the-art methods rely on a rectification network, which is connected to the text recognition network. They form a linear pipeline which uses text rectification on all input images, even for images that can be recognized without it. Undoubtedly, the…

    Submitted 9 November, 2022; originally announced November 2022.

    Comments: Accepted in ICPR 2022

  12. arXiv:2208.01897 [pdf, other]

    cs.CV

    Combined CNN Transformer Encoder for Enhanced Fine-grained Human Action Recognition

    Authors: Mei Chee Leong, Haosong Zhang, Hui Li Tan, Liyuan Li, Joo Hwee Lim

    Abstract: Fine-grained action recognition is a challenging task in computer vision. As fine-grained datasets have small inter-class variations in spatial and temporal space, a fine-grained action recognition model requires good temporal reasoning and discrimination of attribute action semantics. Leveraging on CNN's ability in capturing high level spatial-temporal feature representations and Transformer's mode…

    Submitted 3 August, 2022; originally announced August 2022.

    Comments: The Ninth Workshop on Fine-Grained Visual Categorization (FGVC9) @ CVPR2022

  13. arXiv:2206.14381 [pdf, other]

    cs.CV cs.IR cs.LG

    Exploiting Semantic Role Contextualized Video Features for Multi-Instance Text-Video Retrieval EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022

    Authors: Burak Satar, Hongyuan Zhu, Hanwang Zhang, Joo Hwee Lim

    Abstract: In this report, we present our approach for EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022. We first parse sentences into semantic roles corresponding to verbs and nouns; then utilize self-attentions to exploit semantic role contextualized video features along with textual features via triplet losses in multiple embedding spaces. Our method surpasses the strong baseline in normalized D…

    Submitted 26 September, 2023; v1 submitted 28 June, 2022; originally announced June 2022.

    Comments: Ranked joint 3rd place in the Multi-Instance Retrieval Challenge at EPIC@CVPR2022. (v2: ref error is corrected)

  14. Semantic Role Aware Correlation Transformer for Text to Video Retrieval

    Authors: Burak Satar, Hongyuan Zhu, Xavier Bresson, Joo Hwee Lim

    Abstract: With the emergence of social media, voluminous video clips are uploaded every day, and retrieving the most relevant visual content with a language query becomes critical. Most approaches aim to learn a joint embedding space for plain textual and visual contents without adequately exploiting their intra-modality structures and inter-modality correlations. This paper proposes a novel transformer tha…

    Submitted 26 June, 2022; originally announced June 2022.

    Comments: Camera-ready for ICIP 2021

    Journal ref: IEEE International Conference on Image Processing (ICIP), 2021, pp. 1334-1338

  15. arXiv:2206.12845 [pdf, other]

    cs.CV cs.IR cs.LG

    RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval

    Authors: Burak Satar, Hongyuan Zhu, Hanwang Zhang, Joo Hwee Lim

    Abstract: Seas of videos are uploaded daily with the popularity of social channels; thus, retrieving the most related video contents with user textual queries plays a more crucial role. Most methods consider only one joint embedding space between global visual and textual features without considering the local structures of each modality. Some other approaches consider multiple embedding spaces consisting o…

    Submitted 26 June, 2022; originally announced June 2022.

    Comments: Preprint, under review in TCSVT Journal

  16. arXiv:2111.14145 [pdf, other]

    cs.CV cs.AI

    FashionSearchNet-v2: Learning Attribute Representations with Localization for Image Retrieval with Attribute Manipulation

    Authors: Kenan E. Ak, Joo Hwee Lim, Ying Sun, Jo Yew Tham, Ashraf A. Kassim

    Abstract: The focus of this paper is on the problem of image retrieval with attribute manipulation. Our proposed work is able to manipulate the desired attributes of the query image while maintaining its other attributes. For example, the collar attribute of the query image can be changed from round to v-neck to retrieve similar images from a large dataset. A key challenge in e-commerce is that images have…

    Submitted 28 November, 2021; originally announced November 2021.

    Comments: 15 pages

  17. Joint Learning On The Hierarchy Representation for Fine-Grained Human Action Recognition

    Authors: Mei Chee Leong, Hui Li Tan, Haosong Zhang, Liyuan Li, Feng Lin, Joo Hwee Lim

    Abstract: Fine-grained human action recognition is a core research topic in computer vision. Inspired by the recently proposed hierarchy representation of fine-grained actions in FineGym and SlowFast network for action recognition, we propose a novel multi-task network which exploits the FineGym hierarchy representation to achieve effective joint learning and prediction for fine-grained human action recogni…

    Submitted 12 October, 2021; originally announced October 2021.

    Comments: Camera ready for IEEE ICIP 2021

    Journal ref: 2021 IEEE International Conference on Image Processing (ICIP)

  18. arXiv:2106.02808 [pdf, other]

    cs.LG

    A Variational Perspective on Diffusion-Based Generative Models and Score Matching

    Authors: Chin-Wei Huang, Jae Hyun Lim, Aaron Courville

    Abstract: Discrete-time diffusion-based generative models and score matching methods have shown promising results in modeling high-dimensional image data. Recently, Song et al. (2021) show that diffusion processes that transform data into noise can be reversed via learning the score function, i.e. the gradient of the log-density of the perturbed data. They propose to plug the learned score function into an…

    Submitted 29 September, 2021; v1 submitted 5 June, 2021; originally announced June 2021.

  19. arXiv:2008.11009 [pdf, other]

    cs.CV cs.CR

    Protect, Show, Attend and Tell: Empowering Image Captioning Models with Ownership Protection

    Authors: Jian Han Lim, Chee Seng Chan, Kam Woh Ng, Lixin Fan, Qiang Yang

    Abstract: By and large, existing Intellectual Property (IP) protection on deep neural networks typically i) focus on image classification task only, and ii) follow a standard digital watermarking framework that was conventionally used to protect the ownership of multimedia and video content. This paper demonstrates that the current digital watermarking framework is insufficient to protect image captioning t…

    Submitted 31 August, 2021; v1 submitted 25 August, 2020; originally announced August 2020.

    Comments: Accepted at Pattern Recognition, 17 pages

  20. arXiv:2006.05164 [pdf, other]

    cs.LG stat.ML

    AR-DAE: Towards Unbiased Neural Entropy Gradient Estimation

    Authors: Jae Hyun Lim, Aaron Courville, Christopher Pal, Chin-Wei Huang

    Abstract: Entropy is ubiquitous in machine learning, but it is in general intractable to compute the entropy of the distribution of an arbitrary continuous random variable. In this paper, we propose the amortized residual denoising autoencoder (AR-DAE) to approximate the gradient of the log density function, which can be used to estimate the gradient of entropy. Amortization allows us to significantly reduc…

    Submitted 9 June, 2020; originally announced June 2020.

    Comments: accepted in ICML 2020

  21. arXiv:1910.02344 [pdf, other]

    cs.LG stat.ML

    Neural Multisensory Scene Inference

    Authors: Jae Hyun Lim, Pedro O. Pinheiro, Negar Rostamzadeh, Christopher Pal, Sungjin Ahn

    Abstract: For embodied agents to infer representations of the underlying 3D physical world they inhabit, they should efficiently combine multisensory cues from numerous trials, e.g., by looking at and touching objects. Despite its importance, multisensory 3D scene representation learning has received less attention compared to the unimodal setting. In this paper, we propose the Generative Multisensory Netwo…

    Submitted 7 November, 2019; v1 submitted 5 October, 2019; originally announced October 2019.

  22. arXiv:1905.09447 [pdf, other]

    cs.CV cs.AI

    Variational Prototype Replays for Continual Learning

    Authors: Mengmi Zhang, Tao Wang, Joo Hwee Lim, Gabriel Kreiman, Jiashi Feng

    Abstract: Continual learning refers to the ability to acquire and transfer knowledge without catastrophically forgetting what was previously learned. In this work, we consider few-shot continual learning in classification tasks, and we propose a novel method, Variational Prototype Replays, that efficiently consolidates and recalls previous knowledge to avoid catastrophic forgetting. In each classific…

    Submitted 15 February, 2020; v1 submitted 22 May, 2019; originally announced May 2019.

    Comments: under submission

  23. arXiv:1807.11929 [pdf, other]

    cs.CV cs.RO

    Egocentric Spatial Memory

    Authors: Mengmi Zhang, Keng Teck Ma, Shih-Cheng Yen, Joo Hwee Lim, Qi Zhao, Jiashi Feng

    Abstract: Egocentric spatial memory (ESM) defines a memory system with encoding, storing, recognizing and recalling the spatial information about the environment from an egocentric perspective. We introduce an integrated deep neural network architecture for modeling ESM. It learns to estimate the occupancy state of the world and progressively construct top-down 2D global maps from egocentric views in a spat…

    Submitted 31 July, 2018; originally announced July 2018.

    Comments: 8 pages, 6 figures, accepted in IROS 2018

  24. arXiv:1807.10587 [pdf]

    cs.CV cs.AI q-bio.NC

    Finding any Waldo: zero-shot invariant and efficient visual search

    Authors: Mengmi Zhang, Jiashi Feng, Keng Teck Ma, Joo Hwee Lim, Qi Zhao, Gabriel Kreiman

    Abstract: Searching for a target object in a cluttered scene constitutes a fundamental challenge in daily vision. Visual search must be selective enough to discriminate the target from distractors, invariant to changes in the appearance of the target, efficient to avoid exhaustive exploration of the image, and must generalize to locate novel target objects with zero-shot training. Previous work has focused…

    Submitted 17 July, 2018; originally announced July 2018.

    Comments: Number of figures: 6; number of supplementary figures: 14

  25. arXiv:1705.02894 [pdf, other]

    stat.ML cond-mat.dis-nn cs.AI cs.CV cs.LG

    Geometric GAN

    Authors: Jae Hyun Lim, Jong Chul Ye

    Abstract: Generative Adversarial Nets (GANs) represent an important milestone for effective generative models, which has inspired numerous variants seemingly different from each other. One of the main contributions of this paper is to reveal a unified geometric structure in GAN and its variants. Specifically, we show that the adversarial generative model training can be decomposed into three geometric steps…

    Submitted 8 May, 2017; v1 submitted 8 May, 2017; originally announced May 2017.