subscribe to arXiv mailings

Efficient and Accurate Memorable Conversation Model using DPO based on sLLM

Authors: Youngkyung Seo, Yoonseok Heo, Jun-Seok Koh, Du-Seoung Chang

Abstract: In multi-session dialog system, it is essential to continuously update the memory as the session progresses. Simply accumulating memory can make it difficult to focus on the content of the conversation for inference due to the limited input sentence size. Therefore, efficient and accurate conversation model that is capable of managing memory to reflect the conversation history continuously is nece… ▽ More In multi-session dialog system, it is essential to continuously update the memory as the session progresses. Simply accumulating memory can make it difficult to focus on the content of the conversation for inference due to the limited input sentence size. Therefore, efficient and accurate conversation model that is capable of managing memory to reflect the conversation history continuously is necessary. This paper presents a conversation model that efficiently manages memory as sessions progress and incorporates this into the model to reflect the conversation history accurately with 3 methodologies: SFT, DPO and DPO with SFT model. Our model using DPO algorithm shows an improvement about 0.0591 of BERTScore in memory accuracy, and the rate of responses reflecting the memory increased as well. Also, response generation performance enhanced about 4.292 in fluency, 3.935 in coherence, and 2.896 in consistency. This paper describes a training method that yields better performance than models with more than twice the parameter size, even when the model size is smaller. Thus, our model demonstrates efficiency not only in terms of accuracy but also in resource utilization. △ Less

Submitted 9 July, 2024; originally announced July 2024.

arXiv:2407.01476 [pdf, other]

Tree Search for Language Model Agents

Authors: Jing Yu Koh, Stephen McAleer, Daniel Fried, Ruslan Salakhutdinov

Abstract: Autonomous agents powered by language models (LMs) have demonstrated promise in their ability to perform decision-making tasks such as web automation. However, a key limitation remains: LMs, primarily optimized for natural language understanding and generation, struggle with multi-step reasoning, planning, and using environmental feedback when attempting to solve realistic computer tasks. Towards… ▽ More Autonomous agents powered by language models (LMs) have demonstrated promise in their ability to perform decision-making tasks such as web automation. However, a key limitation remains: LMs, primarily optimized for natural language understanding and generation, struggle with multi-step reasoning, planning, and using environmental feedback when attempting to solve realistic computer tasks. Towards addressing this, we propose an inference-time search algorithm for LM agents to explicitly perform exploration and multi-step planning in interactive web environments. Our approach is a form of best-first tree search that operates within the actual environment space, and is complementary with most existing state-of-the-art agents. It is the first tree search algorithm for LM agents that shows effectiveness on realistic web tasks. On the challenging VisualWebArena benchmark, applying our search algorithm on top of a GPT-4o agent yields a 39.7% relative increase in success rate compared to the same baseline without search, setting a state-of-the-art success rate of 26.4%. On WebArena, search also yields a 28.0% relative improvement over a baseline agent, setting a competitive success rate of 19.2%. Our experiments highlight the effectiveness of search for web agents, and we demonstrate that performance scales with increased test-time compute. We conduct a thorough analysis of our results to highlight improvements from search, limitations, and promising directions for future work. Our code and models are publicly released at https://jykoh.com/search-agents. △ Less

Submitted 1 July, 2024; originally announced July 2024.

Comments: 11 pages. Models and code available at https://jykoh.com/search-agents

arXiv:2406.12814 [pdf, other]

Adversarial Attacks on Multimodal Agents

Authors: Chen Henry Wu, Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried, Aditi Raghunathan

Abstract: Vision-enabled language models (VLMs) are now used to build autonomous multimodal agents capable of taking actions in real environments. In this paper, we show that multimodal agents raise new safety risks, even though attacking agents is more challenging than prior attacks due to limited access to and knowledge about the environment. Our attacks use adversarial text strings to guide gradient-base… ▽ More Vision-enabled language models (VLMs) are now used to build autonomous multimodal agents capable of taking actions in real environments. In this paper, we show that multimodal agents raise new safety risks, even though attacking agents is more challenging than prior attacks due to limited access to and knowledge about the environment. Our attacks use adversarial text strings to guide gradient-based perturbation over one trigger image in the environment: (1) our captioner attack attacks white-box captioners if they are used to process images into captions as additional inputs to the VLM; (2) our CLIP attack attacks a set of CLIP models jointly, which can transfer to proprietary VLMs. To evaluate the attacks, we curated VisualWebArena-Adv, a set of adversarial tasks based on VisualWebArena, an environment for web-based multimodal agent tasks. Within an L-infinity norm of $16/256$ on a single image, the captioner attack can make a captioner-augmented GPT-4V agent execute the adversarial goals with a 75% success rate. When we remove the captioner or use GPT-4V to generate its own captions, the CLIP attack can achieve success rates of 21% and 43%, respectively. Experiments on agents based on other VLMs, such as Gemini-1.5, Claude-3, and GPT-4o, show interesting differences in their robustness. Further analysis reveals several key factors contributing to the attack's success, and we also discuss the implications for defenses as well. Project page: https://chenwu.io/attack-agent Code and data: https://github.com/ChenWu98/agent-attack △ Less

Submitted 18 June, 2024; originally announced June 2024.

Comments: 19 pages

arXiv:2406.08718 [pdf, other]

Enhancing Psychotherapy Counseling: A Data Augmentation Pipeline Leveraging Large Language Models for Counseling Conversations

Authors: Jun-Woo Kim, Ji-Eun Han, Jun-Seok Koh, Hyeon-Tae Seo, Du-Seong Chang

Abstract: We introduce a pipeline that leverages Large Language Models (LLMs) to transform single-turn psychotherapy counseling sessions into multi-turn interactions. While AI-supported online counseling services for individuals with mental disorders exist, they are often constrained by the limited availability of multi-turn training datasets and frequently fail to fully utilize therapists' expertise. Our p… ▽ More We introduce a pipeline that leverages Large Language Models (LLMs) to transform single-turn psychotherapy counseling sessions into multi-turn interactions. While AI-supported online counseling services for individuals with mental disorders exist, they are often constrained by the limited availability of multi-turn training datasets and frequently fail to fully utilize therapists' expertise. Our proposed pipeline effectively addresses these limitations. The pipeline comprises two main steps: 1) Information Extraction and 2) Multi-turn Counseling Generation. Each step is meticulously designed to extract and generate comprehensive multi-turn counseling conversations from the available datasets. Experimental results from both zero-shot and few-shot generation scenarios demonstrate that our approach significantly enhances the ability of LLMs to produce higher quality multi-turn dialogues in the context of mental health counseling. Our pipeline and dataset are publicly available https://github.com/jwkim-chat/A-Data-Augmentation-Pipeline-Leveraging-Large-Language-Models-for-Counseling-Conversations. △ Less

Submitted 12 June, 2024; originally announced June 2024.

Comments: IJCAI 2024 AI4Research workshop

arXiv:2406.00505 [pdf, other]

Improving Text Generation on Images with Synthetic Captions

Authors: Jun Young Koh, Sang Hyun Park, Joy Song

Abstract: The recent emergence of latent diffusion models such as SDXL and SD 1.5 has shown significant capability in generating highly detailed and realistic images. Despite their remarkable ability to produce images, generating accurate text within images still remains a challenging task. In this paper, we examine the validity of fine-tuning approaches in generating legible text within the image. We propo… ▽ More The recent emergence of latent diffusion models such as SDXL and SD 1.5 has shown significant capability in generating highly detailed and realistic images. Despite their remarkable ability to produce images, generating accurate text within images still remains a challenging task. In this paper, we examine the validity of fine-tuning approaches in generating legible text within the image. We propose a low-cost approach by leveraging SDXL without any time-consuming training on large-scale datasets. The proposed strategy employs a fine-tuning technique that examines the effects of data refinement levels and synthetic captions. Moreover, our results demonstrate how our small scale fine-tuning approach can improve the accuracy of text generation in different scenarios without the need of additional multimodal encoders. Our experiments show that with the addition of random letters to our raw dataset, our model's performance improves in producing well-formed visual text. △ Less

Submitted 1 June, 2024; originally announced June 2024.

Comments: 9 pages, 12 figures

arXiv:2405.18623 [pdf]

I See You: Teacher Analytics with GPT-4 Vision-Powered Observational Assessment

Authors: Unggi Lee, Yeil Jeong, Junbo Koh, Gyuri Byun, Yunseo Lee, Hyunwoong Lee, Seunmin Eun, Jewoong Moon, Cheolil Lim, Hyeoncheol Kim

Abstract: This preliminary study explores the integration of GPT-4 Vision (GPT-4V) technology into teacher analytics, focusing on its applicability in observational assessment to enhance reflective teaching practice. This research is grounded in developing a Video-based Automatic Assessment System (VidAAS) empowered by GPT-4V. Our approach aims to revolutionize teachers' assessment of students' practices by… ▽ More This preliminary study explores the integration of GPT-4 Vision (GPT-4V) technology into teacher analytics, focusing on its applicability in observational assessment to enhance reflective teaching practice. This research is grounded in developing a Video-based Automatic Assessment System (VidAAS) empowered by GPT-4V. Our approach aims to revolutionize teachers' assessment of students' practices by leveraging Generative Artificial Intelligence (GenAI) to offer detailed insights into classroom dynamics. Our research methodology encompasses a comprehensive literature review, prototype development of the VidAAS, and usability testing with in-service teachers. The study findings provide future research avenues for VidAAS design, implementation, and integration in teacher analytics, underscoring the potential of GPT-4V to provide real-time, scalable feedback and a deeper understanding of the classroom. △ Less

Submitted 30 May, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

Comments: 27 pages, 5 figures, 4 tables

arXiv:2404.07554 [pdf, other]

CAT: Contrastive Adapter Training for Personalized Image Generation

Authors: Jae Wan Park, Sang Hyun Park, Jun Young Koh, Junha Lee, Min Song

Abstract: The emergence of various adapters, including Low-Rank Adaptation (LoRA) applied from the field of natural language processing, has allowed diffusion models to personalize image generation at a low cost. However, due to the various challenges including limited datasets and shortage of regularization and computation resources, adapter training often results in unsatisfactory outcomes, leading to the… ▽ More The emergence of various adapters, including Low-Rank Adaptation (LoRA) applied from the field of natural language processing, has allowed diffusion models to personalize image generation at a low cost. However, due to the various challenges including limited datasets and shortage of regularization and computation resources, adapter training often results in unsatisfactory outcomes, leading to the corruption of the backbone model's prior knowledge. One of the well known phenomena is the loss of diversity in object generation, especially within the same class which leads to generating almost identical objects with minor variations. This poses challenges in generation capabilities. To solve this issue, we present Contrastive Adapter Training (CAT), a simple yet effective strategy to enhance adapter training through the application of CAT loss. Our approach facilitates the preservation of the base model's original knowledge when the model initiates adapters. Furthermore, we introduce the Knowledge Preservation Score (KPS) to evaluate CAT's ability to keep the former information. We qualitatively and quantitatively compare CAT's improvement. Finally, we mention the possibility of CAT in the aspects of multi-concept adapter and optimization. △ Less

Submitted 11 April, 2024; originally announced April 2024.

Comments: CVPRW 2024

arXiv:2404.03984 [pdf, other]

ROMA-iQSS: An Objective Alignment Approach via State-Based Value Learning and ROund-Robin Multi-Agent Scheduling

Authors: Chi-Hui Lin, Joewie J. Koh, Alessandro Roncone, Lijun Chen

Abstract: Effective multi-agent collaboration is imperative for solving complex, distributed problems. In this context, two key challenges must be addressed: first, autonomously identifying optimal objectives for collective outcomes; second, aligning these objectives among agents. Traditional frameworks, often reliant on centralized learning, struggle with scalability and efficiency in large multi-agent sys… ▽ More Effective multi-agent collaboration is imperative for solving complex, distributed problems. In this context, two key challenges must be addressed: first, autonomously identifying optimal objectives for collective outcomes; second, aligning these objectives among agents. Traditional frameworks, often reliant on centralized learning, struggle with scalability and efficiency in large multi-agent systems. To overcome these issues, we introduce a decentralized state-based value learning algorithm that enables agents to independently discover optimal states. Furthermore, we introduce a novel mechanism for multi-agent interaction, wherein less proficient agents follow and adopt policies from more experienced ones, thereby indirectly guiding their learning process. Our theoretical analysis shows that our approach leads decentralized agents to an optimal collective policy. Empirical experiments further demonstrate that our method outperforms existing decentralized state-based and action-based value learning strategies by effectively identifying and aligning optimal objectives. △ Less

Submitted 5 April, 2024; originally announced April 2024.

Comments: 10 pages, 3 figures, extended version of our 2024 American Control Conference publication

Journal ref: Proceedings of the 2024 American Control Conference (ACC), 2024

arXiv:2404.00930 [pdf, other]

PSYDIAL: Personality-based Synthetic Dialogue Generation using Large Language Models

Authors: Ji-Eun Han, Jun-Seok Koh, Hyeon-Tae Seo, Du-Seong Chang, Kyung-Ah Sohn

Abstract: We present a novel end-to-end personality-based synthetic dialogue data generation pipeline, specifically designed to elicit responses from large language models via prompting. We design the prompts to generate more human-like dialogues considering real-world scenarios when users engage with chatbots. We introduce PSYDIAL, the first Korean dialogue dataset focused on personality-based dialogues, c… ▽ More We present a novel end-to-end personality-based synthetic dialogue data generation pipeline, specifically designed to elicit responses from large language models via prompting. We design the prompts to generate more human-like dialogues considering real-world scenarios when users engage with chatbots. We introduce PSYDIAL, the first Korean dialogue dataset focused on personality-based dialogues, curated using our proposed pipeline. Notably, we focus on the Extraversion dimension of the Big Five personality model in our research. Experimental results indicate that while pre-trained models and those fine-tuned with a chit-chat dataset struggle to generate responses reflecting personality, models trained with PSYDIAL show significant improvements. The versatility of our pipeline extends beyond dialogue tasks, offering potential for other non-dialogue related applications. This research opens doors for more nuanced, personality-driven conversational AI in Korean and potentially other languages. Our code is publicly available at https://github.com/jiSilverH/psydial. △ Less

Submitted 1 April, 2024; originally announced April 2024.

Comments: LREC-COLING 2024 Main

arXiv:2403.09969 [pdf, other]

Prediction of Vessel Arrival Time to Pilotage Area Using Multi-Data Fusion and Deep Learning

Authors: Xiaocai Zhang, Xiuju Fu, Zhe Xiao, Haiyan Xu, Xiaoyang Wei, Jimmy Koh, Daichi Ogawa, Zheng Qin

Abstract: This paper investigates the prediction of vessels' arrival time to the pilotage area using multi-data fusion and deep learning approaches. Firstly, the vessel arrival contour is extracted based on Multivariate Kernel Density Estimation (MKDE) and clustering. Secondly, multiple data sources, including Automatic Identification System (AIS), pilotage booking information, and meteorological data, are… ▽ More This paper investigates the prediction of vessels' arrival time to the pilotage area using multi-data fusion and deep learning approaches. Firstly, the vessel arrival contour is extracted based on Multivariate Kernel Density Estimation (MKDE) and clustering. Secondly, multiple data sources, including Automatic Identification System (AIS), pilotage booking information, and meteorological data, are fused before latent feature extraction. Thirdly, a Temporal Convolutional Network (TCN) framework that incorporates a residual mechanism is constructed to learn the hidden arrival patterns of the vessels. Extensive tests on two real-world data sets from Singapore have been conducted and the following promising results have been obtained: 1) fusion of pilotage booking information and meteorological data improves the prediction accuracy, with pilotage booking information having a more significant impact; 2) using discrete embedding for the meteorological data performs better than using continuous embedding; 3) the TCN outperforms the state-of-the-art baseline methods in regression tasks, exhibiting Mean Absolute Error (MAE) ranging from 4.58 min to 4.86 min; and 4) approximately 89.41% to 90.61% of the absolute prediction residuals fall within a time frame of 10 min. △ Less

Submitted 14 March, 2024; originally announced March 2024.

Comments: The 26th IEEE International Conference on Intelligent Transportation Systems (ITSC 2023)

arXiv:2403.06433 [pdf, other]

Fine-Grained Pillar Feature Encoding Via Spatio-Temporal Virtual Grid for 3D Object Detection

Authors: Konyul Park, Yecheol Kim, Junho Koh, Byungwoo Park, Jun Won Choi

Abstract: Developing high-performance, real-time architectures for LiDAR-based 3D object detectors is essential for the successful commercialization of autonomous vehicles. Pillar-based methods stand out as a practical choice for onboard deployment due to their computational efficiency. However, despite their efficiency, these methods can sometimes underperform compared to alternative point encoding techniq… ▽ More Developing high-performance, real-time architectures for LiDAR-based 3D object detectors is essential for the successful commercialization of autonomous vehicles. Pillar-based methods stand out as a practical choice for onboard deployment due to their computational efficiency. However, despite their efficiency, these methods can sometimes underperform compared to alternative point encoding techniques such as Voxel-encoding or PointNet++. We argue that current pillar-based methods have not sufficiently captured the fine-grained distributions of LiDAR points within each pillar structure. Consequently, there exists considerable room for improvement in pillar feature encoding. In this paper, we introduce a novel pillar encoding architecture referred to as Fine-Grained Pillar Feature Encoding (FG-PFE). FG-PFE utilizes Spatio-Temporal Virtual (STV) grids to capture the distribution of point clouds within each pillar across vertical, temporal, and horizontal dimensions. Through STV grids, points within each pillar are individually encoded using Vertical PFE (V-PFE), Temporal PFE (T-PFE), and Horizontal PFE (H-PFE). These encoded features are then aggregated through an Attentive Pillar Aggregation method. Our experiments conducted on the nuScenes dataset demonstrate that FG-PFE achieves significant performance improvements over baseline models such as PointPillar, CenterPoint-Pillar, and PillarNet, with only a minor increase in computational overhead. △ Less

Submitted 11 March, 2024; originally announced March 2024.

Comments: ICRA 2024

arXiv:2402.17553 [pdf, other]

OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web

Authors: Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, Ruslan Salakhutdinov

Abstract: For decades, human-computer interaction has fundamentally been manual. Even today, almost all productive work done on the computer necessitates human input at every step. Autonomous virtual agents represent an exciting step in automating many of these menial tasks. Virtual agents would empower users with limited technical proficiency to harness the full possibilities of computer systems. They coul… ▽ More For decades, human-computer interaction has fundamentally been manual. Even today, almost all productive work done on the computer necessitates human input at every step. Autonomous virtual agents represent an exciting step in automating many of these menial tasks. Virtual agents would empower users with limited technical proficiency to harness the full possibilities of computer systems. They could also enable the efficient streamlining of numerous computer tasks, ranging from calendar management to complex travel bookings, with minimal human intervention. In this paper, we introduce OmniACT, the first-of-a-kind dataset and benchmark for assessing an agent's capability to generate executable programs to accomplish computer tasks. Our scope extends beyond traditional web automation, covering a diverse range of desktop applications. The dataset consists of fundamental tasks such as "Play the next song", as well as longer horizon tasks such as "Send an email to John Doe mentioning the time and place to meet". Specifically, given a pair of screen image and a visually-grounded natural language task, the goal is to generate a script capable of fully executing the task. We run several strong baseline language model agents on our benchmark. The strongest baseline, GPT-4, performs the best on our benchmark However, its performance level still reaches only 15% of the human proficiency in generating executable scripts capable of completing the task, demonstrating the challenge of our task for conventional web agents. Our benchmark provides a platform to measure and evaluate the progress of language model agents in automating computer tasks and motivates future work towards building multimodal models that bridge large language models and the visual grounding of computer screens. △ Less

Submitted 28 February, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

arXiv:2401.13649 [pdf, other]

VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

Authors: Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, Daniel Fried

Abstract: Autonomous agents capable of planning, reasoning, and executing actions on the web offer a promising avenue for automating computer tasks. However, the majority of existing benchmarks primarily focus on text-based agents, neglecting many natural tasks that require visual information to effectively solve. Given that most computer interfaces cater to human perception, visual information often augmen… ▽ More Autonomous agents capable of planning, reasoning, and executing actions on the web offer a promising avenue for automating computer tasks. However, the majority of existing benchmarks primarily focus on text-based agents, neglecting many natural tasks that require visual information to effectively solve. Given that most computer interfaces cater to human perception, visual information often augments textual data in ways that text-only models struggle to harness effectively. To bridge this gap, we introduce VisualWebArena, a benchmark designed to assess the performance of multimodal web agents on realistic \textit{visually grounded tasks}. VisualWebArena comprises of a set of diverse and complex web-based tasks that evaluate various capabilities of autonomous multimodal agents. To perform on this benchmark, agents need to accurately process image-text inputs, interpret natural language instructions, and execute actions on websites to accomplish user-defined objectives. We conduct an extensive evaluation of state-of-the-art LLM-based autonomous agents, including several multimodal models. Through extensive quantitative and qualitative analysis, we identify several limitations of text-only LLM agents, and reveal gaps in the capabilities of state-of-the-art multimodal language agents. VisualWebArena provides a framework for evaluating multimodal autonomous language agents, and offers insights towards building stronger autonomous agents for the web. Our code, baseline models, and data is publicly available at https://jykoh.com/vwa. △ Less

Submitted 5 June, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

Comments: Accepted to ACL 2024. 24 pages. Project page: https://jykoh.com/vwa

arXiv:2311.13691 [pdf]

Next-Generation Earth System Models: Towards Reliable Hybrid Models for Weather and Climate Applications

Authors: Tom Beucler, Erwan Koch, Sven Kotlarski, David Leutwyler, Adrien Michel, Jonathan Koh

Abstract: We review how machine learning has transformed our ability to model the Earth system, and how we expect recent breakthroughs to benefit end-users in Switzerland in the near future. Drawing from our review, we identify three recommendations. Recommendation 1: Develop Hybrid AI-Physical Models: Emphasize the integration of AI and physical modeling for improved reliability, especially for longer pr… ▽ More We review how machine learning has transformed our ability to model the Earth system, and how we expect recent breakthroughs to benefit end-users in Switzerland in the near future. Drawing from our review, we identify three recommendations. Recommendation 1: Develop Hybrid AI-Physical Models: Emphasize the integration of AI and physical modeling for improved reliability, especially for longer prediction horizons, acknowledging the delicate balance between knowledge-based and data-driven components required for optimal performance. Recommendation 2: Emphasize Robustness in AI Downscaling Approaches, favoring techniques that respect physical laws, preserve inter-variable dependencies and spatial structures, and accurately represent extremes at the local scale. Recommendation 3: Promote Inclusive Model Development: Ensure Earth System Model development is open and accessible to diverse stakeholders, enabling forecasters, the public, and AI/statistics experts to use, develop, and engage with the model and its predictions/projections. △ Less

Submitted 26 January, 2024; v1 submitted 22 November, 2023; originally announced November 2023.

Comments: 12 pages, 1 figure, submitted as part of the Swiss Academy of Engineering Sciences' 2024 whitepaper on "Artificial Intelligence for Climate Change Mitigation"

arXiv:2310.07478 [pdf, other]

Multimodal Graph Learning for Generative Tasks

Authors: Minji Yoon, Jing Yu Koh, Bryan Hooi, Ruslan Salakhutdinov

Abstract: Multimodal learning combines multiple data modalities, broadening the types and complexity of data our models can utilize: for example, from plain text to image-caption pairs. Most multimodal learning algorithms focus on modeling simple one-to-one pairs of data from two modalities, such as image-caption pairs, or audio-text pairs. However, in most real-world settings, entities of different modalit… ▽ More Multimodal learning combines multiple data modalities, broadening the types and complexity of data our models can utilize: for example, from plain text to image-caption pairs. Most multimodal learning algorithms focus on modeling simple one-to-one pairs of data from two modalities, such as image-caption pairs, or audio-text pairs. However, in most real-world settings, entities of different modalities interact with each other in more complex and multifaceted ways, going beyond one-to-one mappings. We propose to represent these complex relationships as graphs, allowing us to capture data with any number of modalities, and with complex relationships between modalities that can flexibly vary from one sample to another. Toward this goal, we propose Multimodal Graph Learning (MMGL), a general and systematic framework for capturing information from multiple multimodal neighbors with relational structures among them. In particular, we focus on MMGL for generative tasks, building upon pretrained Language Models (LMs), aiming to augment their text generation with multimodal neighbor contexts. We study three research questions raised by MMGL: (1) how can we infuse multiple neighbor information into the pretrained LMs, while avoiding scalability issues? (2) how can we infuse the graph structure information among multimodal neighbors into the LMs? and (3) how can we finetune the pretrained LMs to learn from the neighbor context in a parameter-efficient manner? We conduct extensive experiments to answer these three questions on MMGL and analyze the empirical results to pave the way for future MMGL research. △ Less

Submitted 12 October, 2023; v1 submitted 11 October, 2023; originally announced October 2023.

arXiv:2309.05032 [pdf, other]

Unified Contrastive Fusion Transformer for Multimodal Human Action Recognition

Authors: Kyoung Ok Yang, Junho Koh, Jun Won Choi

Abstract: Various types of sensors have been considered to develop human action recognition (HAR) models. Robust HAR performance can be achieved by fusing multimodal data acquired by different sensors. In this paper, we introduce a new multimodal fusion architecture, referred to as Unified Contrastive Fusion Transformer (UCFFormer) designed to integrate data with diverse distributions to enhance HAR perform… ▽ More Various types of sensors have been considered to develop human action recognition (HAR) models. Robust HAR performance can be achieved by fusing multimodal data acquired by different sensors. In this paper, we introduce a new multimodal fusion architecture, referred to as Unified Contrastive Fusion Transformer (UCFFormer) designed to integrate data with diverse distributions to enhance HAR performance. Based on the embedding features extracted from each modality, UCFFormer employs the Unified Transformer to capture the inter-dependency among embeddings in both time and modality domains. We present the Factorized Time-Modality Attention to perform self-attention efficiently for the Unified Transformer. UCFFormer also incorporates contrastive learning to reduce the discrepancy in feature distributions across various modalities, thus generating semantically aligned features for information fusion. Performance evaluation conducted on two popular datasets, UTD-MHAD and NTU RGB+D, demonstrates that UCFFormer achieves state-of-the-art performance, outperforming competing methods by considerable margins. △ Less

Submitted 10 September, 2023; originally announced September 2023.

arXiv:2305.17216 [pdf, other]

Generating Images with Multimodal Language Models

Authors: Jing Yu Koh, Daniel Fried, Ruslan Salakhutdinov

Abstract: We propose a method to fuse frozen text-only large language models (LLMs) with pre-trained image encoder and decoder models, by mapping between their embedding spaces. Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue. Ours is the first approach capable of conditioning on arbitrarily interleaved image and text inputs to… ▽ More We propose a method to fuse frozen text-only large language models (LLMs) with pre-trained image encoder and decoder models, by mapping between their embedding spaces. Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue. Ours is the first approach capable of conditioning on arbitrarily interleaved image and text inputs to generate coherent image (and text) outputs. To achieve strong performance on image generation, we propose an efficient mapping network to ground the LLM to an off-the-shelf text-to-image generation model. This mapping network translates hidden representations of text into the embedding space of the visual models, enabling us to leverage the strong text representations of the LLM for visual outputs. Our approach outperforms baseline generation models on tasks with longer and more complex language. In addition to novel image generation, our model is also capable of image retrieval from a prespecified dataset, and decides whether to retrieve or generate at inference time. This is done with a learnt decision module which conditions on the hidden representations of the LLM. Our model exhibits a wider range of capabilities compared to prior multimodal language models. It can process image-and-text inputs, and produce retrieved images, generated images, and generated text -- outperforming non-LLM based generation models across several text-to-image tasks that measure context dependence. △ Less

Submitted 13 October, 2023; v1 submitted 26 May, 2023; originally announced May 2023.

Comments: NeurIPS 2023. Project page: http://jykoh.com/gill

arXiv:2302.06833 [pdf, other]

VQ3D: Learning a 3D-Aware Generative Model on ImageNet

Authors: Kyle Sargent, Jing Yu Koh, Han Zhang, Huiwen Chang, Charles Herrmann, Pratul Srinivasan, Jiajun Wu, Deqing Sun

Abstract: Recent work has shown the possibility of training generative models of 3D content from 2D image collections on small datasets corresponding to a single object class, such as human faces, animal faces, or cars. However, these models struggle on larger, more complex datasets. To model diverse and unconstrained image collections such as ImageNet, we present VQ3D, which introduces a NeRF-based decoder… ▽ More Recent work has shown the possibility of training generative models of 3D content from 2D image collections on small datasets corresponding to a single object class, such as human faces, animal faces, or cars. However, these models struggle on larger, more complex datasets. To model diverse and unconstrained image collections such as ImageNet, we present VQ3D, which introduces a NeRF-based decoder into a two-stage vector-quantized autoencoder. Our Stage 1 allows for the reconstruction of an input image and the ability to change the camera position around the image, and our Stage 2 allows for the generation of new 3D scenes. VQ3D is capable of generating and reconstructing 3D-aware images from the 1000-class ImageNet dataset of 1.2 million training images. We achieve an ImageNet generation FID score of 16.8, compared to 69.8 for the next best baseline method. △ Less

Submitted 14 February, 2023; originally announced February 2023.

Comments: 15 pages. For visual results, please visit the project webpage at http://kylesargent.github.io/vq3d

arXiv:2301.13823 [pdf, other]

Grounding Language Models to Images for Multimodal Inputs and Outputs

Authors: Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried

Abstract: We propose an efficient method to ground pretrained text-only language models to the visual domain, enabling them to process arbitrarily interleaved image-and-text data, and generate text interleaved with retrieved images. Our method leverages the abilities of language models learnt from large scale text-only pretraining, such as in-context learning and free-form text generation. We keep the langu… ▽ More We propose an efficient method to ground pretrained text-only language models to the visual domain, enabling them to process arbitrarily interleaved image-and-text data, and generate text interleaved with retrieved images. Our method leverages the abilities of language models learnt from large scale text-only pretraining, such as in-context learning and free-form text generation. We keep the language model frozen, and finetune input and output linear layers to enable cross-modality interactions. This allows our model to process arbitrarily interleaved image-and-text inputs, and generate free-form text interleaved with retrieved images. We achieve strong zero-shot performance on grounded tasks such as contextual image retrieval and multimodal dialogue, and showcase compelling interactive abilities. Our approach works with any off-the-shelf language model and paves the way towards an effective, general solution for leveraging pretrained language models in visually grounded settings. △ Less

Submitted 13 June, 2023; v1 submitted 31 January, 2023; originally announced January 2023.

Comments: Published in ICML 2023. Project page: https://jykoh.com/fromage

arXiv:2212.00442 [pdf, other]

MGTANet: Encoding Sequential LiDAR Points Using Long Short-Term Motion-Guided Temporal Attention for 3D Object Detection

Authors: Junho Koh, Junhyung Lee, Youngwoo Lee, Jaekyum Kim, Jun Won Choi

Abstract: Most scanning LiDAR sensors generate a sequence of point clouds in real-time. While conventional 3D object detectors use a set of unordered LiDAR points acquired over a fixed time interval, recent studies have revealed that substantial performance improvement can be achieved by exploiting the spatio-temporal context present in a sequence of LiDAR point sets. In this paper, we propose a novel 3D ob… ▽ More Most scanning LiDAR sensors generate a sequence of point clouds in real-time. While conventional 3D object detectors use a set of unordered LiDAR points acquired over a fixed time interval, recent studies have revealed that substantial performance improvement can be achieved by exploiting the spatio-temporal context present in a sequence of LiDAR point sets. In this paper, we propose a novel 3D object detection architecture, which can encode LiDAR point cloud sequences acquired by multiple successive scans. The encoding process of the point cloud sequence is performed on two different time scales. We first design a short-term motion-aware voxel encoding that captures the short-term temporal changes of point clouds driven by the motion of objects in each voxel. We also propose long-term motion-guided bird's eye view (BEV) feature enhancement that adaptively aligns and aggregates the BEV feature maps obtained by the short-term voxel encoding by utilizing the dynamic motion context inferred from the sequence of the feature maps. The experiments conducted on the public nuScenes benchmark demonstrate that the proposed 3D object detector offers significant improvements in performance compared to the baseline methods and that it sets a state-of-the-art performance for certain 3D object detection categories. Code is available at https://github.com/HYjhkoh/MGTANet.git △ Less

Submitted 21 December, 2022; v1 submitted 1 December, 2022; originally announced December 2022.

Comments: Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI'23)

arXiv:2210.03112 [pdf, other]

A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning

Authors: Aishwarya Kamath, Peter Anderson, Su Wang, Jing Yu Koh, Alexander Ku, Austin Waters, Yinfei Yang, Jason Baldridge, Zarana Parekh

Abstract: Recent studies in Vision-and-Language Navigation (VLN) train RL agents to execute natural-language navigation instructions in photorealistic environments, as a step towards robots that can follow human instructions. However, given the scarcity of human instruction data and limited diversity in the training environments, these agents still struggle with complex language grounding and spatial langua… ▽ More Recent studies in Vision-and-Language Navigation (VLN) train RL agents to execute natural-language navigation instructions in photorealistic environments, as a step towards robots that can follow human instructions. However, given the scarcity of human instruction data and limited diversity in the training environments, these agents still struggle with complex language grounding and spatial language understanding. Pretraining on large text and image-text datasets from the web has been extensively explored but the improvements are limited. We investigate large-scale augmentation with synthetic instructions. We take 500+ indoor environments captured in densely-sampled 360 degree panoramas, construct navigation trajectories through these panoramas, and generate a visually-grounded instruction for each trajectory using Marky, a high-quality multilingual navigation instruction generator. We also synthesize image observations from novel viewpoints using an image-to-image GAN. The resulting dataset of 4.2M instruction-trajectory pairs is two orders of magnitude larger than existing human-annotated datasets, and contains a wider variety of environments and viewpoints. To efficiently leverage data at this scale, we train a simple transformer agent with imitation learning. On the challenging RxR dataset, our approach outperforms all existing RL agents, improving the state-of-the-art NDTW from 71.1 to 79.1 in seen environments, and from 64.6 to 66.8 in unseen test environments. Our work points to a new path to improving instruction-following agents, emphasizing large-scale imitation learning and the development of synthetic instruction generation capabilities. △ Less

Submitted 17 April, 2023; v1 submitted 6 October, 2022; originally announced October 2022.

Comments: CVPR 2023

arXiv:2210.00087 [pdf, other]

D-Align: Dual Query Co-attention Network for 3D Object Detection Based on Multi-frame Point Cloud Sequence

Authors: Junhyung Lee, Junho Koh, Youngwoo Lee, Jun Won Choi

Abstract: LiDAR sensors are widely used for 3D object detection in various mobile robotics applications. LiDAR sensors continuously generate point cloud data in real-time. Conventional 3D object detectors detect objects using a set of points acquired over a fixed duration. However, recent studies have shown that the performance of object detection can be further enhanced by utilizing spatio-temporal informa… ▽ More LiDAR sensors are widely used for 3D object detection in various mobile robotics applications. LiDAR sensors continuously generate point cloud data in real-time. Conventional 3D object detectors detect objects using a set of points acquired over a fixed duration. However, recent studies have shown that the performance of object detection can be further enhanced by utilizing spatio-temporal information obtained from point cloud sequences. In this paper, we propose a new 3D object detector, named D-Align, which can effectively produce strong bird's-eye-view (BEV) features by aligning and aggregating the features obtained from a sequence of point sets. The proposed method includes a novel dual-query co-attention network that uses two types of queries, including target query set (T-QS) and support query set (S-QS), to update the features of target and support frames, respectively. D-Align aligns S-QS to T-QS based on the temporal context features extracted from the adjacent feature maps and then aggregates S-QS with T-QS using a gated attention mechanism. The dual queries are updated through multiple attention layers to progressively enhance the target frame features used to produce the detection results. Our experiments on the nuScenes dataset show that the proposed D-Align method greatly improved the performance of a single frame-based baseline method and significantly outperformed the latest 3D object detectors. △ Less

Submitted 30 September, 2022; originally announced October 2022.

arXiv:2207.00117 [pdf, other]

WAKU-RLN-RELAY: Privacy-Preserving Peer-to-Peer Economic Spam Protection

Authors: Sanaz Taheri-Boshrooyeh, Oskar Thorén, Barry Whitehat, Wei Jie Koh, Onur Kilic, Kobi Gurkan

Abstract: In this paper, we propose WAKU-RLN-RELAY as a spam-protected gossip-based routing protocol that can run in heterogeneous networks. It features a privacy-preserving peer-to-peer (p2p) economic spam protection mechanism. WAKU-RLN-RELAY addresses the performance and privacy issues of the state-of-the-art p2p spam prevention techniques including peer scoring utilized by libp2p, and proof-of-work used… ▽ More In this paper, we propose WAKU-RLN-RELAY as a spam-protected gossip-based routing protocol that can run in heterogeneous networks. It features a privacy-preserving peer-to-peer (p2p) economic spam protection mechanism. WAKU-RLN-RELAY addresses the performance and privacy issues of the state-of-the-art p2p spam prevention techniques including peer scoring utilized by libp2p, and proof-of-work used by e.g., Whisper, the p2p messaging layer of Ethereum. In WAKU-RLN-RELAY, spam protection works by limiting the messaging rate of each network participant. Rate violation is disincentivized since it results in financial punishment where the punishment is cryptographically guaranteed. Peers who identify spammers are also rewarded. To enforce the rate limit, we adopt the suggested framework of Semaphore and its extended version, however, we modify that framework to properly address the unique requirements of a network of p2p resource-restricted users. The current work dives into the end-to-end integration of Semaphore into WAKU-RLN-RELAY, the modifications required to make it suitable for resource-limited users, and the open problems and future research directions. We also provide a proof-of-concept open-source implementation of WAKU-RLN-RELAY, and its specifications together with a rough performance evaluation. △ Less

Submitted 30 June, 2022; originally announced July 2022.

Comments: IEEE ICDCSW 2022

arXiv:2207.00116 [pdf, other]

Privacy-Preserving Spam-Protected Gossip-Based Routing

Authors: Sanaz Taheri-Boshrooyeh, Oskar Thorén, Barry Whitehat, Wei Jie Koh, Onur Kilic, Kobi Gurkan

Abstract: WAKU-RLN-RELAY is an anonymous peer-to-peer gossip-based routing protocol that features a privacy-preserving spam-protection with cryptographically guaranteed economic incentives. While being an anonymous routing protocol where routed messages are not attributable to their origin, it allows global identification and removal of spammers. It addresses the performance and privacy issues of its counte… ▽ More WAKU-RLN-RELAY is an anonymous peer-to-peer gossip-based routing protocol that features a privacy-preserving spam-protection with cryptographically guaranteed economic incentives. While being an anonymous routing protocol where routed messages are not attributable to their origin, it allows global identification and removal of spammers. It addresses the performance and privacy issues of its counterparts including proof-of-work and reputation-based schemes. Its light computational overhead makes it suitable for resource-limited environments. The spam protection works by limiting the messaging rate of each network participant where rate violation results in financial punishment. We deploy the novel construct of rate-limiting nullifier to enforce the message rate limit. We provide a proof-of-concept implementation of WAKU-RLN-RELAY to prove the efficiency and feasibility of our solution. △ Less

Submitted 30 June, 2022; originally announced July 2022.

Comments: IEEE ICDCS 2022

arXiv:2206.10789 [pdf, other]

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

Authors: Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, Yonghui Wu

Abstract: We present the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in a… ▽ More We present the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. This strategy can naturally tap into the rich body of prior work on large language models, which have seen continued advances in capabilities and performance through scaling data and model sizes. Our approach is simple: First, Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens. Second, we achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters, with a new state-of-the-art zero-shot FID score of 7.23 and finetuned FID score of 3.22 on MS-COCO. Our detailed analysis on Localized Narratives as well as PartiPrompts (P2), a new holistic benchmark of over 1600 English prompts, demonstrate the effectiveness of Parti across a wide variety of categories and difficulty aspects. We also explore and highlight limitations of our models in order to define and exemplify key areas of focus for further improvements. See https://parti.research.google/ for high-resolution images. △ Less

Submitted 21 June, 2022; originally announced June 2022.

Comments: Preprint

arXiv:2204.02960 [pdf, other]

Simple and Effective Synthesis of Indoor 3D Scenes

Authors: Jing Yu Koh, Harsh Agrawal, Dhruv Batra, Richard Tucker, Austin Waters, Honglak Lee, Yinfei Yang, Jason Baldridge, Peter Anderson

Abstract: We study the problem of synthesizing immersive 3D indoor scenes from one or more images. Our aim is to generate high-resolution images and videos from novel viewpoints, including viewpoints that extrapolate far beyond the input images while maintaining 3D consistency. Existing approaches are highly complex, with many separately trained stages and components. We propose a simple alternative: an ima… ▽ More We study the problem of synthesizing immersive 3D indoor scenes from one or more images. Our aim is to generate high-resolution images and videos from novel viewpoints, including viewpoints that extrapolate far beyond the input images while maintaining 3D consistency. Existing approaches are highly complex, with many separately trained stages and components. We propose a simple alternative: an image-to-image GAN that maps directly from reprojections of incomplete point clouds to full high-resolution RGB-D images. On the Matterport3D and RealEstate10K datasets, our approach significantly outperforms prior work when evaluated by humans, as well as on FID scores. Further, we show that our model is useful for generative data augmentation. A vision-and-language navigation (VLN) agent trained with trajectories spatially-perturbed by our model improves success rate by up to 1.5% over a state of the art baseline on the R2R benchmark. Our code will be made available to facilitate generative data augmentation and applications to downstream robotics and embodied AI tasks. △ Less

Submitted 1 December, 2022; v1 submitted 6 April, 2022; originally announced April 2022.

Comments: AAAI 2023

arXiv:2202.00195 [pdf, other]

doi 10.1109/ACCESS.2024.3376746

Federated Active Learning (F-AL): an Efficient Annotation Strategy for Federated Learning

Authors: Jin-Hyun Ahn, Kyungsang Kim, Jeongwan Koh, Quanzheng Li

Abstract: Federated learning (FL) has been intensively investigated in terms of communication efficiency, privacy, and fairness. However, efficient annotation, which is a pain point in real-world FL applications, is less studied. In this project, we propose to apply active learning (AL) and sampling strategy into the FL framework to reduce the annotation workload. We expect that the AL and FL can improve th… ▽ More Federated learning (FL) has been intensively investigated in terms of communication efficiency, privacy, and fairness. However, efficient annotation, which is a pain point in real-world FL applications, is less studied. In this project, we propose to apply active learning (AL) and sampling strategy into the FL framework to reduce the annotation workload. We expect that the AL and FL can improve the performance of each other complementarily. In our proposed federated active learning (F-AL) method, the clients collaboratively implement the AL to obtain the instances which are considered as informative to FL in a distributed optimization manner. We compare the test accuracies of the global FL models using the conventional random sampling strategy, client-level separate AL (S-AL), and the proposed F-AL. We empirically demonstrate that the F-AL outperforms baseline methods in image classification tasks. △ Less

Submitted 7 February, 2022; v1 submitted 31 January, 2022; originally announced February 2022.

Comments: 13 pages, 9 figures, submitted for conference publication

arXiv:2112.07116 [pdf, other]

Joint 3D Object Detection and Tracking Using Spatio-Temporal Representation of Camera Image and LiDAR Point Clouds

Authors: Junho Koh, Jaekyum Kim, Jinhyuk Yoo, Yecheol Kim, Dongsuk Kum, Jun Won Choi

Abstract: In this paper, we propose a new joint object detection and tracking (JoDT) framework for 3D object detection and tracking based on camera and LiDAR sensors. The proposed method, referred to as 3D DetecTrack, enables the detector and tracker to cooperate to generate a spatio-temporal representation of the camera and LiDAR data, with which 3D object detection and tracking are then performed. The det… ▽ More In this paper, we propose a new joint object detection and tracking (JoDT) framework for 3D object detection and tracking based on camera and LiDAR sensors. The proposed method, referred to as 3D DetecTrack, enables the detector and tracker to cooperate to generate a spatio-temporal representation of the camera and LiDAR data, with which 3D object detection and tracking are then performed. The detector constructs the spatio-temporal features via the weighted temporal aggregation of the spatial features obtained by the camera and LiDAR fusion. Then, the detector reconfigures the initial detection results using information from the tracklets maintained up to the previous time step. Based on the spatio-temporal features generated by the detector, the tracker associates the detected objects with previously tracked objects using a graph neural network (GNN). We devise a fully-connected GNN facilitated by a combination of rule-based edge pruning and attention-based edge gating, which exploits both spatial and temporal object contexts to improve tracking performance. The experiments conducted on both KITTI and nuScenes benchmarks demonstrate that the proposed 3D DetecTrack achieves significant improvements in both detection and tracking performances over baseline methods and achieves state-of-the-art performance among existing methods through collaboration between the detector and tracker. △ Less

Submitted 15 December, 2021; v1 submitted 13 December, 2021; originally announced December 2021.

arXiv:2110.04627 [pdf, other]

Vector-quantized Image Modeling with Improved VQGAN

Authors: Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, Yonghui Wu

Abstract: Pretraining language models with next-token prediction on massive text corpora has delivered phenomenal zero-shot, few-shot, transfer learning and multi-tasking capabilities on both generative and discriminative language tasks. Motivated by this success, we explore a Vector-quantized Image Modeling (VIM) approach that involves pretraining a Transformer to predict rasterized image tokens autoregres… ▽ More Pretraining language models with next-token prediction on massive text corpora has delivered phenomenal zero-shot, few-shot, transfer learning and multi-tasking capabilities on both generative and discriminative language tasks. Motivated by this success, we explore a Vector-quantized Image Modeling (VIM) approach that involves pretraining a Transformer to predict rasterized image tokens autoregressively. The discrete image tokens are encoded from a learned Vision-Transformer-based VQGAN (ViT-VQGAN). We first propose multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity. The improved ViT-VQGAN further improves vector-quantized image modeling tasks, including unconditional, class-conditioned image generation and unsupervised representation learning. When trained on ImageNet at $256\times256$ resolution, we achieve Inception Score (IS) of 175.1 and Fr'echet Inception Distance (FID) of 4.17, a dramatic improvement over the vanilla VQGAN, which obtains 70.6 and 17.04 for IS and FID, respectively. Based on ViT-VQGAN and unsupervised pretraining, we further evaluate the pretrained Transformer by averaging intermediate features, similar to Image GPT (iGPT). This ImageNet-pretrained VIM-L significantly beats iGPT-L on linear-probe accuracy from 60.3% to 73.2% for a similar model size. VIM-L also outperforms iGPT-XL which is trained with extra web image data and larger model size. △ Less

Submitted 4 June, 2022; v1 submitted 9 October, 2021; originally announced October 2021.

Comments: Accepted in ICLR 2022

arXiv:2105.08756 [pdf, other]

Pathdreamer: A World Model for Indoor Navigation

Authors: Jing Yu Koh, Honglak Lee, Yinfei Yang, Jason Baldridge, Peter Anderson

Abstract: People navigating in unfamiliar buildings take advantage of myriad visual, spatial and semantic cues to efficiently achieve their navigation goals. Towards equipping computational agents with similar capabilities, we introduce Pathdreamer, a visual world model for agents navigating in novel indoor environments. Given one or more previous visual observations, Pathdreamer generates plausible high-re… ▽ More People navigating in unfamiliar buildings take advantage of myriad visual, spatial and semantic cues to efficiently achieve their navigation goals. Towards equipping computational agents with similar capabilities, we introduce Pathdreamer, a visual world model for agents navigating in novel indoor environments. Given one or more previous visual observations, Pathdreamer generates plausible high-resolution 360 visual observations (RGB, semantic segmentation and depth) for viewpoints that have not been visited, in buildings not seen during training. In regions of high uncertainty (e.g. predicting around corners, imagining the contents of an unseen room), Pathdreamer can predict diverse scenes, allowing an agent to sample multiple realistic outcomes for a given trajectory. We demonstrate that Pathdreamer encodes useful and accessible visual, spatial and semantic knowledge about human environments by using it in the downstream task of Vision-and-Language Navigation (VLN). Specifically, we show that planning ahead with Pathdreamer brings about half the benefit of looking ahead at actual observations from unobserved parts of the environment. We hope that Pathdreamer will help unlock model-based approaches to challenging embodied navigation tasks such as navigating to specified objects and VLN. △ Less

Submitted 16 August, 2021; v1 submitted 18 May, 2021; originally announced May 2021.

Comments: In ICCV 2021

arXiv:2105.06887 [pdf]

A Frequency Domain Constraint for Synthetic and Real X-ray Image Super Resolution

Authors: Qing Ma, Jae Chul Koh, WonSook Lee

Abstract: Synthetic X-ray images are simulated X-ray images projected from CT data. High-quality synthetic X-ray images can facilitate various applications such as surgical image guidance systems and VR training simulations. However, it is difficult to produce high-quality arbitrary view synthetic X-ray images in real-time due to different CT slice thickness, high computational cost, and the complexity of a… ▽ More Synthetic X-ray images are simulated X-ray images projected from CT data. High-quality synthetic X-ray images can facilitate various applications such as surgical image guidance systems and VR training simulations. However, it is difficult to produce high-quality arbitrary view synthetic X-ray images in real-time due to different CT slice thickness, high computational cost, and the complexity of algorithms. Our goal is to generate high-resolution synthetic X-ray images in real-time by upsampling low-resolution images with deep learning-based super-resolution methods. Reference-based Super Resolution (RefSR) has been well studied in recent years and has shown higher performance than traditional Single Image Super-Resolution (SISR). It can produce fine details by utilizing the reference image but still inevitably generates some artifacts and noise. In this paper, we introduce frequency domain loss as a constraint to further improve the quality of the RefSR results with fine details and without obvious artifacts. To the best of our knowledge, this is the first paper utilizing the frequency domain for the loss functions in the field of super-resolution. We achieved good results in evaluating our method on both synthetic and real X-ray image datasets. △ Less

Submitted 10 August, 2021; v1 submitted 14 May, 2021; originally announced May 2021.

arXiv:2104.10386 [pdf, other]

Guided Interactive Video Object Segmentation Using Reliability-Based Attention Maps

Authors: Yuk Heo, Yeong Jun Koh, Chang-Su Kim

Abstract: We propose a novel guided interactive segmentation (GIS) algorithm for video objects to improve the segmentation accuracy and reduce the interaction time. First, we design the reliability-based attention module to analyze the reliability of multiple annotated frames. Second, we develop the intersection-aware propagation module to propagate segmentation results to neighboring frames. Third, we intr… ▽ More We propose a novel guided interactive segmentation (GIS) algorithm for video objects to improve the segmentation accuracy and reduce the interaction time. First, we design the reliability-based attention module to analyze the reliability of multiple annotated frames. Second, we develop the intersection-aware propagation module to propagate segmentation results to neighboring frames. Third, we introduce the GIS mechanism for a user to select unsatisfactory frames quickly with less effort. Experimental results demonstrate that the proposed algorithm provides more accurate segmentation results at a faster speed than conventional algorithms. Codes are available at https://github.com/yuk6heo/GIS-RAmap. △ Less

Submitted 21 April, 2021; originally announced April 2021.

Comments: accepted to CVPR2021 (oral)

arXiv:2104.06697 [pdf, other]

Revisiting Hierarchical Approach for Persistent Long-Term Video Prediction

Authors: Wonkwang Lee, Whie Jung, Han Zhang, Ting Chen, Jing Yu Koh, Thomas Huang, Hyungsuk Yoon, Honglak Lee, Seunghoon Hong

Abstract: Learning to predict the long-term future of video frames is notoriously challenging due to inherent ambiguities in the distant future and dramatic amplifications of prediction error through time. Despite the recent advances in the literature, existing approaches are limited to moderately short-term prediction (less than a few seconds), while extrapolating it to a longer future quickly leads to des… ▽ More Learning to predict the long-term future of video frames is notoriously challenging due to inherent ambiguities in the distant future and dramatic amplifications of prediction error through time. Despite the recent advances in the literature, existing approaches are limited to moderately short-term prediction (less than a few seconds), while extrapolating it to a longer future quickly leads to destruction in structure and content. In this work, we revisit hierarchical models in video prediction. Our method predicts future frames by first estimating a sequence of semantic structures and subsequently translating the structures to pixels by video-to-video translation. Despite the simplicity, we show that modeling structures and their dynamics in the discrete semantic structure space with a stochastic recurrent estimator leads to surprisingly successful long-term prediction. We evaluate our method on three challenging datasets involving car driving and human dancing, and demonstrate that it can generate complicated scene structures and motions over a very long time horizon (i.e., thousands frames), setting a new standard of video prediction with orders of magnitude longer prediction time than existing approaches. Full videos and codes are available at https://1konny.github.io/HVP/. △ Less

Submitted 14 April, 2021; originally announced April 2021.

Comments: Accepted as a conference paper at ICLR 2021

arXiv:2101.04702 [pdf, other]

Cross-Modal Contrastive Learning for Text-to-Image Generation

Authors: Han Zhang, Jing Yu Koh, Jason Baldridge, Honglak Lee, Yinfei Yang

Abstract: The output of text-to-image synthesis systems should be coherent, clear, photo-realistic scenes with high semantic fidelity to their conditioned text descriptions. Our Cross-Modal Contrastive Generative Adversarial Network (XMC-GAN) addresses this challenge by maximizing the mutual information between image and text. It does this via multiple contrastive losses which capture inter-modality and int… ▽ More The output of text-to-image synthesis systems should be coherent, clear, photo-realistic scenes with high semantic fidelity to their conditioned text descriptions. Our Cross-Modal Contrastive Generative Adversarial Network (XMC-GAN) addresses this challenge by maximizing the mutual information between image and text. It does this via multiple contrastive losses which capture inter-modality and intra-modality correspondences. XMC-GAN uses an attentional self-modulation generator, which enforces strong text-image correspondence, and a contrastive discriminator, which acts as a critic as well as a feature encoder for contrastive learning. The quality of XMC-GAN's output is a major step up from previous models, as we show on three challenging datasets. On MS-COCO, not only does XMC-GAN improve state-of-the-art FID from 24.70 to 9.33, but--more importantly--people prefer XMC-GAN by 77.3 for image quality and 74.1 for image-text alignment, compared to three other recent models. XMC-GAN also generalizes to the challenging Localized Narratives dataset (which has longer, more detailed descriptions), improving state-of-the-art FID from 48.70 to 14.12. Lastly, we train and evaluate XMC-GAN on the challenging Open Images data, establishing a strong benchmark FID score of 26.91. △ Less

Submitted 14 April, 2022; v1 submitted 12 January, 2021; originally announced January 2021.

Comments: CVPR 2021

arXiv:2011.10278 [pdf, other]

Joint Representation of Temporal Image Sequences and Object Motion for Video Object Detection

Authors: Junho Koh, Jaekyum Kim, Younji Shin, Byeongwon Lee, Seungji Yang, Jun Won Choi

Abstract: In this paper, we propose a new video object detector (VoD) method referred to as temporal feature aggregation and motion-aware VoD (TM-VoD), which produces a joint representation of temporal image sequences and object motion. The proposed TM-VoD aggregates visual feature maps extracted by convolutional neural networks applying the temporal attention gating and spatial feature alignment. This temp… ▽ More In this paper, we propose a new video object detector (VoD) method referred to as temporal feature aggregation and motion-aware VoD (TM-VoD), which produces a joint representation of temporal image sequences and object motion. The proposed TM-VoD aggregates visual feature maps extracted by convolutional neural networks applying the temporal attention gating and spatial feature alignment. This temporal feature aggregation is performed in two stages in a hierarchical fashion. In the first stage, the visual feature maps are fused at a pixel level via gated attention model. In the second stage, the proposed method aggregates the features after aligning the object features using temporal box offset calibration and weights them according to the cosine similarity measure. The proposed TM-VoD also finds the representation of the motion of objects in two successive steps. The pixel-level motion features are first computed based on the incremental changes between the adjacent visual feature maps. Then, box-level motion features are obtained from both the region of interest (RoI)-aligned pixel-level motion features and the sequential changes of the box coordinates. Finally, all these features are concatenated to produce a joint representation of the objects for VoD. The experiments conducted on the ImageNet VID dataset demonstrate that the proposed method outperforms existing VoD methods and achieves a performance comparable to that of state-of-the-art VoDs. △ Less

Submitted 20 November, 2020; originally announced November 2020.

arXiv:2011.03775 [pdf, other]

Text-to-Image Generation Grounded by Fine-Grained User Attention

Authors: Jing Yu Koh, Jason Baldridge, Honglak Lee, Yinfei Yang

Abstract: Localized Narratives is a dataset with detailed natural language descriptions of images paired with mouse traces that provide a sparse, fine-grained visual grounding for phrases. We propose TReCS, a sequential model that exploits this grounding to generate images. TReCS uses descriptions to retrieve segmentation masks and predict object labels aligned with mouse traces. These alignments are used t… ▽ More Localized Narratives is a dataset with detailed natural language descriptions of images paired with mouse traces that provide a sparse, fine-grained visual grounding for phrases. We propose TReCS, a sequential model that exploits this grounding to generate images. TReCS uses descriptions to retrieve segmentation masks and predict object labels aligned with mouse traces. These alignments are used to select and position masks to generate a fully covered segmentation canvas; the final image is produced by a segmentation-to-image generator using this canvas. This multi-step, retrieval-based approach outperforms existing direct text-to-image generation models on both automatic metrics and human evaluations: overall, its generated images are more photo-realistic and better match descriptions. △ Less

Submitted 30 March, 2021; v1 submitted 7 November, 2020; originally announced November 2020.

Comments: To appear in WACV 2021

arXiv:2010.11457 [pdf, other]

Momentum Contrast Speaker Representation Learning

Authors: Jangho Lee, Jaihyun Koh, Sungroh Yoon

Abstract: Unsupervised representation learning has shown remarkable achievement by reducing the performance gap with supervised feature learning, especially in the image domain. In this study, to extend the technique of unsupervised learning to the speech domain, we propose the Momentum Contrast for VoxCeleb (MoCoVox) as a form of learning mechanism. We pre-trained the MoCoVox on the VoxCeleb1 by implementi… ▽ More Unsupervised representation learning has shown remarkable achievement by reducing the performance gap with supervised feature learning, especially in the image domain. In this study, to extend the technique of unsupervised learning to the speech domain, we propose the Momentum Contrast for VoxCeleb (MoCoVox) as a form of learning mechanism. We pre-trained the MoCoVox on the VoxCeleb1 by implementing instance discrimination. Applying MoCoVox for speaker verification revealed that it outperforms the state-of-the-art metric learning-based approach by a large margin. We also empirically demonstrate the features of contrastive learning in the speech domain by analyzing the distribution of learned representations. Furthermore, we explored which pretext task is adequate for speaker verification. We expect that learning speaker representation without human supervision helps to address the open-set speaker recognition. △ Less

Submitted 22 October, 2020; originally announced October 2020.

arXiv:2008.00679 [pdf, other]

doi 10.1109/IROS45743.2020.9341376

Cooperative Control of Mobile Robots with Stackelberg Learning

Authors: Joewie J. Koh, Guohui Ding, Christoffer Heckman, Lijun Chen, Alessandro Roncone

Abstract: Multi-robot cooperation requires agents to make decisions that are consistent with the shared goal without disregarding action-specific preferences that might arise from asymmetry in capabilities and individual objectives. To accomplish this goal, we propose a method named SLiCC: Stackelberg Learning in Cooperative Control. SLiCC models the problem as a partially observable stochastic game compose… ▽ More Multi-robot cooperation requires agents to make decisions that are consistent with the shared goal without disregarding action-specific preferences that might arise from asymmetry in capabilities and individual objectives. To accomplish this goal, we propose a method named SLiCC: Stackelberg Learning in Cooperative Control. SLiCC models the problem as a partially observable stochastic game composed of Stackelberg bimatrix games, and uses deep reinforcement learning to obtain the payoff matrices associated with these games. Appropriate cooperative actions are then selected with the derived Stackelberg equilibria. Using a bi-robot cooperative object transportation problem, we validate the performance of SLiCC against centralized multi-agent Q-learning and demonstrate that SLiCC achieves better combined utility. △ Less

Submitted 3 August, 2020; originally announced August 2020.

Comments: 8 pages, 7 figures

ACM Class: I.2.9; I.2.6; I.2.11

Journal ref: Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020, pp. 7985-7992

arXiv:2007.08139 [pdf, other]

Interactive Video Object Segmentation Using Global and Local Transfer Modules

Authors: Yuk Heo, Yeong Jun Koh, Chang-Su Kim

Abstract: An interactive video object segmentation algorithm, which takes scribble annotations on query objects as input, is proposed in this paper. We develop a deep neural network, which consists of the annotation network (A-Net) and the transfer network (T-Net). First, given user scribbles on a frame, A-Net yields a segmentation result based on the encoder-decoder architecture. Second, T-Net transfers th… ▽ More An interactive video object segmentation algorithm, which takes scribble annotations on query objects as input, is proposed in this paper. We develop a deep neural network, which consists of the annotation network (A-Net) and the transfer network (T-Net). First, given user scribbles on a frame, A-Net yields a segmentation result based on the encoder-decoder architecture. Second, T-Net transfers the segmentation result bidirectionally to the other frames, by employing the global and local transfer modules. The global transfer module conveys the segmentation information in an annotated frame to a target frame, while the local transfer module propagates the segmentation information in a temporally adjacent frame to the target frame. By applying A-Net and T-Net alternately, a user can obtain desired segmentation results with minimal efforts. We train the entire network in two stages, by emulating user scribbles and employing an auxiliary loss. Experimental results demonstrate that the proposed interactive video object segmentation algorithm outperforms the state-of-the-art conventional algorithms. Codes and models are available at https://github.com/yuk6heo/IVOS-ATNet. △ Less

Submitted 16 July, 2020; originally announced July 2020.

arXiv:2003.09540 [pdf]

Distributed Reinforcement Learning for Cooperative Multi-Robot Object Manipulation

Authors: Guohui Ding, Joewie J. Koh, Kelly Merckaert, Bram Vanderborght, Marco M. Nicotra, Christoffer Heckman, Alessandro Roncone, Lijun Chen

Abstract: We consider solving a cooperative multi-robot object manipulation task using reinforcement learning (RL). We propose two distributed multi-agent RL approaches: distributed approximate RL (DA-RL), where each agent applies Q-learning with individual reward functions; and game-theoretic RL (GT-RL), where the agents update their Q-values based on the Nash equilibrium of a bimatrix Q-value game. We val… ▽ More We consider solving a cooperative multi-robot object manipulation task using reinforcement learning (RL). We propose two distributed multi-agent RL approaches: distributed approximate RL (DA-RL), where each agent applies Q-learning with individual reward functions; and game-theoretic RL (GT-RL), where the agents update their Q-values based on the Nash equilibrium of a bimatrix Q-value game. We validate the proposed approaches in the setting of cooperative object manipulation with two simulated robot arms. Although we focus on a small system of two agents in this paper, both DA-RL and GT-RL apply to general multi-agent systems, and are expected to scale well to large systems. △ Less

Submitted 20 March, 2020; originally announced March 2020.

Comments: 3 pages, 3 figures

ACM Class: I.2.9; I.2.6; I.2.11

Journal ref: Proceedings of the 19th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2020, pp. 1831-1833

arXiv:2002.04455 [pdf]

HRINet: Alternative Supervision Network for High-resolution CT image Interpolation

Authors: Jiawei Li, Jae Chul Koh, Won-Sook Lee

Abstract: Image interpolation in medical area is of high importance as most 3D biomedical volume images are sampled where the distance between consecutive slices significantly greater than the in-plane pixel size due to radiation dose or scanning time. Image interpolation creates a number of new slices between known slices in order to obtain an isotropic volume image. The results can be used for the higher… ▽ More Image interpolation in medical area is of high importance as most 3D biomedical volume images are sampled where the distance between consecutive slices significantly greater than the in-plane pixel size due to radiation dose or scanning time. Image interpolation creates a number of new slices between known slices in order to obtain an isotropic volume image. The results can be used for the higher quality of 3D reconstruction and visualization of human body structures. Semantic interpolation on the manifold has been proved to be very useful for smoothing image interpolation. Nevertheless, all previous methods focused on low-resolution image interpolation, and most of them work poorly on high-resolution image. We propose a novel network, High Resolution Interpolation Network (HRINet), aiming at producing high-resolution CT image interpolations. We combine the idea of ACAI and GANs, and propose a novel idea of alternative supervision method by applying supervised and unsupervised training alternatively to raise the accuracy of human organ structures in CT while keeping high quality. We compare an MSE based and a perceptual based loss optimizing methods for high quality interpolation, and show the tradeoff between the structural correctness and sharpness. Our experiments show the great improvement on 256 2 and 5122 images quantitatively and qualitatively. △ Less

Submitted 7 June, 2020; v1 submitted 11 February, 2020; originally announced February 2020.

arXiv:2002.02634 [pdf, other]

SideInfNet: A Deep Neural Network for Semi-Automatic Semantic Segmentation with Side Information

Authors: Jing Yu Koh, Duc Thanh Nguyen, Quang-Trung Truong, Sai-Kit Yeung, Alexander Binder

Abstract: Fully-automatic execution is the ultimate goal for many Computer Vision applications. However, this objective is not always realistic in tasks associated with high failure costs, such as medical applications. For these tasks, semi-automatic methods allowing minimal effort from users to guide computer algorithms are often preferred due to desirable accuracy and performance. Inspired by the practica… ▽ More Fully-automatic execution is the ultimate goal for many Computer Vision applications. However, this objective is not always realistic in tasks associated with high failure costs, such as medical applications. For these tasks, semi-automatic methods allowing minimal effort from users to guide computer algorithms are often preferred due to desirable accuracy and performance. Inspired by the practicality and applicability of the semi-automatic approach, this paper proposes a novel deep neural network architecture, namely SideInfNet that effectively integrates features learnt from images with side information extracted from user annotations. To evaluate our method, we applied the proposed network to three semantic segmentation tasks and conducted extensive experiments on benchmark datasets. Experimental results and comparison with prior work have verified the superiority of our model, suggesting the generality and effectiveness of the model in semi-automatic semantic segmentation. △ Less

Submitted 17 July, 2020; v1 submitted 7 February, 2020; originally announced February 2020.

Comments: ECCV 2020

arXiv:1910.04446 [pdf, other]

doi 10.1016/j.chaos.2019.109464

Passive network evolution promotes group welfare in complex networks

Authors: Ye Ye, Xiao Rong Hang, Jin Ming Koh, Jarosław Adam Miszczak, Kang Hao Cheong, Neng-gang Xie

Abstract: The Parrondo's paradox is a counterintuitive phenomenon in which individually losing strategies, canonically termed game A and game B, are combined to produce winning outcomes. In this paper, a co-evolution of game dynamics and network structure is adopted to study adaptability and survivability in multi-agent dynamics. The model includes action A, representing a rewiring process on the network, a… ▽ More The Parrondo's paradox is a counterintuitive phenomenon in which individually losing strategies, canonically termed game A and game B, are combined to produce winning outcomes. In this paper, a co-evolution of game dynamics and network structure is adopted to study adaptability and survivability in multi-agent dynamics. The model includes action A, representing a rewiring process on the network, and a two-branch game B, representing redistributive interactions between agents. Simulation results indicate that stochastically mixing action A and game B can produce enhanced, and even winning outcomes, despite gameB being individually losing. In other words, a Parrondo-type paradox can be achieved, but unlike canonical variants, the source of agitation is provided by passive network evolution instead of an active second game. The underlying paradoxical mechanism is analyzed, revealing that the rewiring process drives a topology shift from initial regular lattices towards scale-free characteristics, and enables exploitative behavior that grants enhanced access to the favourable branch of game B. △ Less

Submitted 10 October, 2019; originally announced October 2019.

Comments: 15 pages, 9 figures

Journal ref: Chaos, Solitons & Fractals, Vol. 130, pp. 109464 (2020)

arXiv:1811.08705 [pdf, other]

doi 10.1109/BigData.2018.8622066

Inline Detection of Domain Generation Algorithms with Context-Sensitive Word Embeddings

Authors: Joewie J. Koh, Barton Rhodes

Abstract: Domain generation algorithms (DGAs) are frequently employed by malware to generate domains used for connecting to command-and-control (C2) servers. Recent work in DGA detection leveraged deep learning architectures like convolutional neural networks (CNNs) and character-level long short-term memory networks (LSTMs) to classify domains. However, these classifiers perform poorly with wordlist-based… ▽ More Domain generation algorithms (DGAs) are frequently employed by malware to generate domains used for connecting to command-and-control (C2) servers. Recent work in DGA detection leveraged deep learning architectures like convolutional neural networks (CNNs) and character-level long short-term memory networks (LSTMs) to classify domains. However, these classifiers perform poorly with wordlist-based DGA families, which generate domains by pseudorandomly concatenating dictionary words. We propose a novel approach that combines context-sensitive word embeddings with a simple fully-connected classifier to perform classification of domains based on word-level information. The word embeddings were pre-trained on a large unrelated corpus and left frozen during the training on domain data. The resulting small number of trainable parameters enabled extremely short training durations, while the transfer of language knowledge stored in the representations allowed for high-performing models with small training datasets. We show that this architecture reliably outperformed existing techniques on wordlist-based DGA families with just 30 DGA training examples and achieved state-of-the-art performance with around 100 DGA training examples, all while requiring an order of magnitude less time to train compared to current techniques. Of special note is the technique's performance on the matsnu DGA: the classifier attained a 89.5% detection rate with a 1:1,000 false positive rate (FPR) after training on only 30 examples of the DGA domains, and a 91.2% detection rate with a 1:10,000 FPR after 90 examples. Considering that some of these DGAs have wordlists of several hundred words, our results demonstrate that this technique does not rely on the classifier learning the DGA wordlists. Instead, the classifier is able to learn the semantic signatures of the wordlist-based DGA families. △ Less

Submitted 21 November, 2018; originally announced November 2018.

Comments: 6 pages, 5 figures, 2 tables

ACM Class: K.6.5; C.2.0; I.2.7; I.2.6

Journal ref: Proceedings of the 2018 IEEE International Conference on Big Data, 2018, pp. 2966-2971

arXiv:1807.06233 [pdf, other]

Robust Deep Multi-modal Learning Based on Gated Information Fusion Network

Authors: Jaekyum Kim, Junho Koh, Yecheol Kim, Jaehyung Choi, Youngbae Hwang, Jun Won Choi

Abstract: The goal of multi-modal learning is to use complimentary information on the relevant task provided by the multiple modalities to achieve reliable and robust performance. Recently, deep learning has led significant improvement in multi-modal learning by allowing for the information fusion in the intermediate feature levels. This paper addresses a problem of designing robust deep multi-modal learnin… ▽ More The goal of multi-modal learning is to use complimentary information on the relevant task provided by the multiple modalities to achieve reliable and robust performance. Recently, deep learning has led significant improvement in multi-modal learning by allowing for the information fusion in the intermediate feature levels. This paper addresses a problem of designing robust deep multi-modal learning architecture in the presence of imperfect modalities. We introduce deep fusion architecture for object detection which processes each modality using the separate convolutional neural network (CNN) and constructs the joint feature map by combining the intermediate features from the CNNs. In order to facilitate the robustness to the degraded modalities, we employ the gated information fusion (GIF) network which weights the contribution from each modality according to the input feature maps to be fused. The weights are determined through the convolutional layers followed by a sigmoid function and trained along with the information fusion network in an end-to-end fashion. Our experiments show that the proposed GIF network offers the additional architectural flexibility to achieve robust performance in handling some degraded modalities, and show a significant performance improvement based on Single Shot Detector (SSD) for KITTI dataset using the proposed fusion network and data augmentation schemes. △ Less

Submitted 2 November, 2018; v1 submitted 17 July, 2018; originally announced July 2018.

Comments: 2018 Asian Conference on Computer Vision (ACCV)

arXiv:1711.08142 [pdf, other]

On the Feasibility of Full-duplex Large-scale MIMO Cellular Systems

Authors: Jeongwan Koh, Yeon-Geun Lim, Chan-Byoung Chae, Joonhyuk Kang

Abstract: This paper concerns the feasibility of full-duplex large-scale multiple-input-multiple-output (MIMO) cellular systems. We first propose a pilot transmission scheme and assess its performance, specifically the ergodic sum-rate. The proposed scheme -- the simultaneous pilot transmission (SPT) -- enables to reduce pilot overhead, where the pilot overhead depends on the number of antennas at the base… ▽ More This paper concerns the feasibility of full-duplex large-scale multiple-input-multiple-output (MIMO) cellular systems. We first propose a pilot transmission scheme and assess its performance, specifically the ergodic sum-rate. The proposed scheme -- the simultaneous pilot transmission (SPT) -- enables to reduce pilot overhead, where the pilot overhead depends on the number of antennas at the base station (BS), since the self-interference channel has to be estimated. We consider two multicell scenarios-- cooperative and non-cooperative multicell systems--, and derive the analytic model of the ergodic achievable sum-rate for cell-boundary users. The model is derived by applying a simple linear filter, i.e., matched filter or zero-forcing filter, to the BS. In the analytic model, we also consider large-scale fading, pilot contamination, transmitter noise and receiver distortion. Exploiting the derived analytic model, the feasibility of full-duplex large-scale MIMO systems is shown with respect to system parameters. In the end, we confirm that our analytic model matches well the numerical results and the SPT has advantages over other pilot transmission methods. △ Less

Submitted 22 November, 2017; originally announced November 2017.

Comments: 29 pages, 8 figures, submitted to transaction on wireless communications (TWC)

arXiv:1606.09187 [pdf, other]

Object Boundary Detection and Classification with Image-level Labels

Authors: Jing Yu Koh, Wojciech Samek, Klaus-Robert Müller, Alexander Binder

Abstract: Semantic boundary and edge detection aims at simultaneously detecting object edge pixels in images and assigning class labels to them. Systematic training of predictors for this task requires the labeling of edges in images which is a particularly tedious task. We propose a novel strategy for solving this task, when pixel-level annotations are not available, performing it in an almost zero-shot ma… ▽ More Semantic boundary and edge detection aims at simultaneously detecting object edge pixels in images and assigning class labels to them. Systematic training of predictors for this task requires the labeling of edges in images which is a particularly tedious task. We propose a novel strategy for solving this task, when pixel-level annotations are not available, performing it in an almost zero-shot manner by relying on conventional whole image neural net classifiers that were trained using large bounding boxes. Our method performs the following two steps at test time. Firstly it predicts the class labels by applying the trained whole image network to the test images. Secondly, it computes pixel-wise scores from the obtained predictions by applying backprop gradients as well as recent visualization algorithms such as deconvolution and layer-wise relevance propagation. We show that high pixel-wise scores are indicative for the location of semantic boundaries, which suggests that the semantic boundary problem can be approached without using edge labels during the training phase. △ Less

Submitted 25 June, 2017; v1 submitted 29 June, 2016; originally announced June 2016.

Comments: 12 pages, 2 figures, accepted for GCPR 2017 - 39th German Conference on Pattern Recognition

arXiv:1602.05335 [pdf, other]

doi 10.1109/JIOT.2016.2535165

Geo-spatial Location Spoofing Detection for Internet of Things

Authors: Jing Yang Koh, Ido Nevat, Derek Leong, Wai-Choong Wong

Abstract: We develop a new location spoofing detection algorithm for geo-spatial tagging and location-based services in the Internet of Things (IoT), called Enhanced Location Spoofing Detection using Audibility (ELSA) which can be implemented at the backend server without modifying existing legacy IoT systems. ELSA is based on a statistical decision theory framework and uses two-way time-of-arrival (TW-TOA)… ▽ More We develop a new location spoofing detection algorithm for geo-spatial tagging and location-based services in the Internet of Things (IoT), called Enhanced Location Spoofing Detection using Audibility (ELSA) which can be implemented at the backend server without modifying existing legacy IoT systems. ELSA is based on a statistical decision theory framework and uses two-way time-of-arrival (TW-TOA) information between the user's device and the anchors. In addition to the TW-TOA information, ELSA exploits the implicit available audibility information to improve detection rates of location spoofing attacks. Given TW-TOA and audibility information, we derive the decision rule for the verification of the device's location, based on the generalized likelihood ratio test. We develop a practical threat model for delay measurements spoofing scenarios, and investigate in detail the performance of ELSA in terms of detection and false alarm rates. Our extensive simulation results on both synthetic and real-world datasets demonstrate the superior performance of ELSA compared to conventional non-audibility-aware approaches. △ Less

Submitted 28 March, 2017; v1 submitted 17 February, 2016; originally announced February 2016.

Comments: A shorten version of this work has been accepted to the IEEE IoT Journal (IoT-J) on 08-Feb-2016

Journal ref: IEEE Internet of Things Journal, vol. 3, no. 6, pp. 971-978, Dec. 2016

arXiv:1601.07229 [pdf, other]

Genie: A Longitudinal Study Comparing Physical and Software-augmented Thermostats in Office Buildings

Authors: Bharathan Balaji, Jason Koh, Nadir Weibel, Yuvraj Agarwal

Abstract: Thermostats are primary interfaces for occupants of office buildings to express their comfort preferences. However, standard thermostats are often ineffective due to inaccessibility, lack of information, or limited responsiveness, leading to occupant discomfort. Software thermostats based on web or smartphone applications provide alternative interfaces to occupants with minimal deployment cost. Ho… ▽ More Thermostats are primary interfaces for occupants of office buildings to express their comfort preferences. However, standard thermostats are often ineffective due to inaccessibility, lack of information, or limited responsiveness, leading to occupant discomfort. Software thermostats based on web or smartphone applications provide alternative interfaces to occupants with minimal deployment cost. However, their usage and effectiveness have not been studied extensively in real settings. In this paper we present Genie, a novel software-augmented thermostat that we deployed and studied at our university over a period of 21 months. Our data shows that providing wider thermal control to users does not lead to system abuse and that the effect on energy consumption is minimal while improving comfort and energy awareness. We believe that increased introduction of software thermostats in office buildings will have important effects on comfort and energy consumption and we provide key design recommendations for their implementation and deployment. △ Less

Submitted 26 January, 2016; originally announced January 2016.

Comments: 12 pages

ACM Class: H.5.3

Showing 1–49 of 49 results for author: Koh, J