Aptly: Making Mobile Apps from Natural Language

Authors: Evan W. Patton, David Y. J. Kim, Ashley Granquist, Robin Liu, Arianna Scott, Jennet Zamanova, Harold Abelson

Abstract: We present Aptly, an extension of the MIT App Inventor platform enabling mobile app development via natural language powered by code-generating large language models (LLMs). Aptly complements App Inventor's block language with a text language designed to allow visual code generation via text-based LLMs. We detail the technical aspects of how the Aptly server integrates LLMs with a realtime collaboration function to facilitate the automated creation and editing of mobile apps given user instructions. The paper concludes with insights from a study of a pilot implementation involving high school students, which examines Aptly's practicality and user experience. The findings underscore Aptly's potential as a tool that democratizes app development and fosters technological creativity.

Submitted 30 April, 2024; originally announced May 2024.

Comments: 11 pages, 7 figures, 2 tables

arXiv:2404.01954 [pdf, other]

HyperCLOVA X Technical Report

Authors: Kang Min Yoo, Jaegeun Han, Sookyo In, Heewon Jeon, Jisu Jeong, Jaewook Kang, Hyunwook Kim, Kyung-Min Kim, Munhyong Kim, Sungju Kim, Donghyun Kwak, Hanock Kwak, Se Jung Kwon, Bado Lee, Dongsoo Lee, Gichang Lee, Jooho Lee, Baeseong Park, Seongjin Shin, Joonsang Yu, Seolki Baek, Sumin Byeon, Eungsup Cho, Dooseok Choe, Jeesung Han, et al. (371 additional authors not shown)

Abstract: We introduce HyperCLOVA X, a family of large language models (LLMs) tailored to the Korean language and culture, along with competitive capabilities in English, math, and coding. HyperCLOVA X was trained on a balanced mix of Korean, English, and code data, followed by instruction-tuning with high-quality human-annotated datasets while abiding by strict safety guidelines reflecting our commitment to responsible AI. The model is evaluated across various benchmarks, including comprehensive reasoning, knowledge, commonsense, factuality, coding, math, chatting, instruction-following, and harmlessness, in both Korean and English. HyperCLOVA X exhibits strong reasoning capabilities in Korean backed by a deep understanding of the language and cultural nuances. Further analysis of the inherent bilingual nature and its extension to multilingualism highlights the model's cross-lingual proficiency and strong generalization ability to untargeted languages, including machine translation between several language pairs and cross-lingual inference tasks. We believe that HyperCLOVA X can provide helpful guidance for regions or countries in developing their sovereign LLMs.

Submitted 13 April, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

Comments: 44 pages; updated authors list and fixed author names

arXiv:2403.13513 [pdf, other]

What if...?: Thinking Counterfactual Keywords Helps to Mitigate Hallucination in Large Multi-modal Models

Authors: Junho Kim, Yeon Ju Kim, Yong Man Ro

Abstract: This paper presents a way of enhancing the reliability of Large Multi-modal Models (LMMs) in addressing hallucination, where the models generate cross-modal inconsistent responses. Without additional training, we propose Counterfactual Inception, a novel method that implants counterfactual thinking into LMMs using self-generated counterfactual keywords. Our method is grounded in the concept of counterfactual thinking, a cognitive process in which humans consider alternative realities, enabling more extensive context exploration. Bridging the human cognition mechanism into LMMs, we aim for the models to engage with and generate responses that span a wider contextual scene understanding, mitigating hallucinatory outputs. We further introduce Plausibility Verification Process (PVP), a simple yet robust keyword constraint that effectively filters out sub-optimal keywords to enable the consistent triggering of counterfactual thinking in the model responses. Comprehensive analyses across various LMMs, including both open-source and proprietary models, corroborate that counterfactual thinking significantly reduces hallucination and helps to broaden contextual understanding based on true visual clues.

Submitted 21 June, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

Comments: Project page: https://ivy-lvlm.github.io/Counterfactual-Inception/

arXiv:2401.08417 [pdf, other]

Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation

Authors: Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, Young Jin Kim

Abstract: Moderate-sized large language models (LLMs) -- those with 7B or 13B parameters -- exhibit promising machine translation (MT) performance. However, even the top-performing 13B LLM-based translation models, like ALMA, do not match the performance of state-of-the-art conventional encoder-decoder translation models or larger-scale LLMs such as GPT-4. In this study, we bridge this performance gap. We first assess the shortcomings of supervised fine-tuning (SFT) for LLMs in the MT task, emphasizing the quality issues present in the reference data, despite being human-generated. Then, in contrast to SFT, which mimics reference translations, we introduce Contrastive Preference Optimization (CPO), a novel approach that trains models to avoid generating adequate but not perfect translations. Applying CPO to ALMA models with only 22K parallel sentences and 12M parameters yields significant improvements. The resulting model, called ALMA-R, can match or exceed the performance of the WMT competition winners and GPT-4 on WMT'21, WMT'22 and WMT'23 test datasets.

Submitted 2 June, 2024; v1 submitted 16 January, 2024; originally announced January 2024.

Comments: Accepted at ICML 2024

arXiv:2312.01253 [pdf]

On Merits of Faster-than-Nyquist Signaling in the Finite Blocklength Regime

Authors: Yong Jin Daniel Kim

Abstract: We identify potential merits of faster-than-Nyquist (FTN) signaling in the finite blocklength (FBL) regime. A unique aspect of FTN signaling is that it can increase the blocklength by packing more data symbols within the same time and frequency to yield a strictly higher number of independent signaling dimensions than that of Nyquist rate signaling. Using finite-blocklength information theory, we provide tight bounds on the maximum channel coding rate (MCCR) of FTN signaling for any finite time-bandwidth product. The merits are categorized into two operating regions of FTN, i.e., when the time-acceleration factor of FTN, $τ$, is above or below a certain threshold $τ_{0}$. When $τ > τ_{0}$, FTN has both higher channel capacity and MCCR than that of Nyquist rate signaling, when the utilized pulse shape is non-sinc. Since the issues associated with the ideal sinc pulse only get exacerbated when packets are short, the benefit of FTN becomes more significant in the FBL regime. On the other hand, when $τ < τ_{0}$, the channel capacity is fixed but the MCCR of FTN can continue to increase to a certain degree, thereby reducing the gap between the capacity and MCCR. This benefit is present regardless of the utilized pulse shape, including the ideal sinc pulse, and is unique to the FBL regime. Instead of increasing MCCR for fixed block error rates, FTN can alternatively lower the block error rates for fixed channel coding rates. These results imply that FTN can lower the penalty from limited channel coding over short blocklength and can improve the performance and reliability of short packet communications.

Submitted 25 April, 2024; v1 submitted 2 December, 2023; originally announced December 2023.

arXiv:2311.08590 [pdf, other]

PEMA: An Offsite-Tunable Plug-in External Memory Adaptation for Language Models

Authors: HyunJin Kim, Young Jin Kim, JinYeong Bak

Abstract: Pre-trained language models (PLMs) show impressive performance in various downstream NLP tasks. However, pre-training large language models demands substantial memory and training compute. Furthermore, due to the substantial resources required, many PLM weights are confidential. Consequently, users are compelled to share their data with model owners for fine-tuning specific tasks. To overcome the limitations, we introduce Plug-in External Memory Adaptation (PEMA), a Parameter-Efficient Fine-Tuning (PEFT) method, enabling PLM fine-tuning without requiring access to all the weights. PEMA integrates with context representations from test data during inference to perform downstream tasks. It uses external memory to store PLM-generated context representations mapped with target tokens. Our method utilizes weight matrices of LoRA-like bottlenecked adapter in the PLM's final layer to enhance efficiency. Our approach also includes Gradual Unrolling, a novel interpolation strategy to improve generation quality. We validate PEMA's effectiveness through experiments on syntactic and real datasets for machine translation and style transfer. Our findings show that PEMA outperforms other PEFT approaches in memory and latency efficiency for training, and also excels in maintaining sentence meaning and generating appropriate language and styles.

Submitted 29 March, 2024; v1 submitted 14 November, 2023; originally announced November 2023.

Comments: Accepted to NAACL 2024

arXiv:2310.02410 [pdf, other]

Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness

Authors: Young Jin Kim, Raffy Fahim, Hany Hassan Awadalla

Abstract: Large Mixture of Experts (MoE) models could achieve state-of-the-art quality on various language tasks, including the machine translation task, thanks to the efficient model scaling capability with expert parallelism. However, it has brought a fundamental issue of larger memory consumption and increased memory bandwidth bottleneck at deployment time. In this paper, we propose Mixture of Quantized Experts (MoQE), which is a simple weight-only quantization method applying ultra low-bit (down to 2-bit) quantization only to expert weights for mitigating the increased memory and latency issues of MoE models. We show that low-bit quantization together with the MoE architecture delivers a reliable model performance while reducing the memory size significantly even without any additional training in most cases. In particular, expert layers in MoE models are much more robust to quantization than conventional feedforward network (FFN) layers. In our comprehensive analysis, we show that MoE models with 2-bit expert weights can deliver better model performance than the dense model trained on the same dataset. As a result of low-bit quantization, we show the model size can be reduced by 79.6% of the original half precision floating point (fp16) MoE model. Combined with an optimized GPU runtime implementation, it also achieves 1.24X speed-up on A100 GPUs.

Submitted 3 October, 2023; originally announced October 2023.

arXiv:2309.14741 [pdf, other]

Rethinking Session Variability: Leveraging Session Embeddings for Session Robustness in Speaker Verification

Authors: Hee-Soo Heo, KiHyun Nam, Bong-Jin Lee, Youngki Kwon, Minjae Lee, You Jin Kim, Joon Son Chung

Abstract: In the field of speaker verification, session or channel variability poses a significant challenge. While many contemporary methods aim to disentangle session information from speaker embeddings, we introduce a novel approach using an additional embedding to represent the session information. This is achieved by training an auxiliary network appended to the speaker embedding extractor, which remains fixed in this training process. This results in two similarity scores: one for the speaker information and one for the session information. The latter score acts as a compensator for the former, which might be skewed due to session variations. Our extensive experiments demonstrate that session information can be effectively compensated without retraining of the embedding extractor.

Submitted 26 September, 2023; originally announced September 2023.

arXiv:2309.12306 [pdf, other]

TalkNCE: Improving Active Speaker Detection with Talk-Aware Contrastive Learning

Authors: Chaeyoung Jung, Suyeon Lee, Kihyun Nam, Kyeongha Rho, You Jin Kim, Youngjoon Jang, Joon Son Chung

Abstract: The goal of this work is Active Speaker Detection (ASD), a task to determine whether a person is speaking or not in a series of video frames. Previous works have dealt with the task by exploring network architectures while learning effective representations has been less explored. In this work, we propose TalkNCE, a novel talk-aware contrastive loss. The loss is only applied to part of the full segments where a person on the screen is actually speaking. This encourages the model to learn effective representations through the natural correspondence of speech and facial movements. Our loss can be jointly optimized with the existing objectives for training ASD models without the need for additional supervision or training data. The experiments demonstrate that our loss can be easily integrated into the existing ASD frameworks, improving their performance. Our method achieves state-of-the-art performances on AVA-ActiveSpeaker and ASW datasets.

Submitted 21 September, 2023; originally announced September 2023.
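
The TalkNCE abstract above describes a contrastive loss applied only to frames where the on-screen person is speaking. The sketch below is one plausible reading of that idea, not the authors' implementation: a generic InfoNCE loss between per-frame audio and visual embeddings, restricted to frames flagged as speaking. The function name `masked_info_nce`, the dot-product similarity, and the `temperature` value are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def masked_info_nce(audio, visual, speaking, temperature=0.1):
    """Generic InfoNCE over speaking frames only (illustrative assumption).

    audio, visual: lists of per-frame embedding vectors (same length).
    speaking: list of 0/1 flags; only frames flagged 1 enter the loss.
    For each speaking frame, the same-time visual embedding is the positive
    and the other speaking frames' visual embeddings are the negatives.
    """
    active = [i for i, flag in enumerate(speaking) if flag]
    if len(active) < 2:
        return 0.0  # no negatives available, so no contrastive signal
    total = 0.0
    for i in active:
        logits = [dot(audio[i], visual[j]) / temperature for j in active]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        positive = dot(audio[i], visual[i]) / temperature
        total += log_denominator - positive
    return total / len(active)
```

With audio and visual embeddings that agree on the speaking frames, this loss is close to zero; if the visual embeddings are permuted so that they no longer line up in time, it grows. Frames flagged as not speaking never contribute, which is the "talk-aware" part of the sketch.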

arXiv:2309.11674 [pdf, other]

A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models

Authors: Haoran Xu, Young Jin Kim, Amr Sharaf, Hany Hassan Awadalla

Abstract: Generative Large Language Models (LLMs) have achieved remarkable advancements in various NLP tasks. However, these advances have not been reflected in the translation task, especially those with moderate model sizes (i.e., 7B or 13B parameters), which still lag behind conventional supervised encoder-decoder translation models. Previous studies have attempted to improve the translation capabilities of these moderate LLMs, but their gains have been limited. In this study, we propose a novel fine-tuning approach for LLMs that is specifically designed for the translation task, eliminating the need for the abundant parallel data that traditional translation models usually depend on. Our approach consists of two fine-tuning stages: initial fine-tuning on monolingual data followed by subsequent fine-tuning on a small set of high-quality parallel data. We introduce the LLM developed through this strategy as Advanced Language Model-based trAnslator (ALMA). Based on LLaMA-2 as our underlying model, our results show that the model can achieve an average improvement of more than 12 BLEU and 12 COMET over its zero-shot performance across 10 translation directions from the WMT'21 (2 directions) and WMT'22 (8 directions) test datasets. The performance is significantly better than all prior work and even superior to the NLLB-54B model and GPT-3.5-text-davinci-003, with only 7B or 13B parameters. This method establishes the foundation for a novel training paradigm in machine translation.

Submitted 6 February, 2024; v1 submitted 20 September, 2023; originally announced September 2023.

Comments: Accepted at ICLR 2024

arXiv:2308.15772 [pdf, other]

Task-Based MoE for Multitask Multilingual Machine Translation

Authors: Hai Pham, Young Jin Kim, Subhabrata Mukherjee, David P. Woodruff, Barnabas Poczos, Hany Hassan Awadalla

Abstract: Mixture-of-experts (MoE) architecture has been proven a powerful method for diverse tasks in training deep models in many applications. However, current MoE implementations are task agnostic, treating all tokens from different tasks in the same manner. In this work, we instead design a novel method that incorporates task information into MoE models at different granular levels with shared dynamic task-based adapters. Our experiments and analysis show the advantages of our approaches over the dense and canonical MoE models on multi-task multilingual machine translations. With task-specific adapters, our models can additionally generalize to new tasks efficiently.

Submitted 24 October, 2023; v1 submitted 30 August, 2023; originally announced August 2023.

arXiv:2308.13539 [pdf, other]

Redefining Computer Science Education: Code-Centric to Natural Language Programming with AI-Based No-Code Platforms

Authors: David Y. J. Kim

Abstract: This paper delves into the evolving relationship between humans and computers in the realm of programming. Historically, programming has been a dialogue where humans meticulously crafted communication to suit machine understanding, shaping the trajectory of computer science education. However, the advent of AI-based no-code platforms is revolutionizing this dynamic. Now, humans can converse in their natural language, expecting machines to interpret and act. This shift has profound implications for computer science education. As educators, it's imperative to integrate this new dynamic into curricula. In this paper, we've explored several pertinent research questions in this transformation, which demand continued inquiry and adaptation in our educational strategies.

Submitted 18 August, 2023; originally announced August 2023.

Comments: 7 pages, 1 figure

arXiv:2308.09723 [pdf, other]

FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs

Authors: Young Jin Kim, Rawn Henry, Raffy Fahim, Hany Hassan Awadalla

Abstract: Large Language Models (LLMs) have achieved state-of-the-art performance across various language tasks but pose challenges for practical deployment due to their substantial memory requirements. Furthermore, the latest generative models suffer from high inference costs caused by the memory bandwidth bottleneck in the auto-regressive decoding process. To address these issues, we propose an efficient weight-only quantization method that reduces memory consumption and accelerates inference for LLMs. To ensure minimal quality degradation, we introduce a simple and effective heuristic approach that utilizes only the model weights of a pre-trained model. This approach is applicable to both Mixture-of-Experts (MoE) and dense models without requiring additional fine-tuning. To demonstrate the effectiveness of our proposed method, we first analyze the challenges and issues associated with LLM quantization. Subsequently, we present our heuristic approach, which adaptively finds the granularity of quantization, effectively addressing these problems. Furthermore, we implement highly efficient GPU GEMMs that perform on-the-fly matrix multiplication and dequantization, supporting the multiplication of fp16 or bf16 activations with int8 or int4 weights. We evaluate our approach on large-scale open source models such as OPT-175B and internal MoE models, showcasing minimal accuracy loss while achieving up to 3.65 times higher throughput on the same number of GPUs.

Submitted 16 August, 2023; originally announced August 2023.
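
To make the weight-only quantization idea in the FineQuant abstract above more concrete, here is a minimal sketch of group-wise symmetric quantization to signed 4-bit codes with one floating-point scale per group, followed by dequantization. It is a generic textbook scheme, not the paper's heuristic: the fixed `group_size`, the max-abs scale rule, and the function names are assumptions for illustration, and the adaptive choice of granularity described in the abstract is not modelled.

```python
def quantize_groupwise(weights, group_size=4, bits=4):
    """Symmetric weight-only quantization with one scale per group.

    Returns integer codes in [-(2**(bits-1)), 2**(bits-1) - 1] and the
    per-group scales needed to reconstruct approximate weights.
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit codes
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # An all-zero group gets a dummy scale so the division is safe.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            code = round(w / scale)
            codes.append(max(-qmax - 1, min(qmax, code)))
    return codes, scales


def dequantize_groupwise(codes, scales, group_size=4):
    """Map integer codes back to floats using each group's scale."""
    return [code * scales[i // group_size] for i, code in enumerate(codes)]
```

A smaller `group_size` gives each scale fewer weights to cover, which lowers the rounding error at the cost of storing more scales. In this sketch the reconstruction error of each weight is bounded by half of its group's scale.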

arXiv:2306.00680 [pdf, other]

Encoder-decoder multimodal speaker change detection

Authors: Jee-weon Jung, Soonshin Seo, Hee-Soo Heo, Geonmin Kim, You Jin Kim, Young-ki Kwon, Minjae Lee, Bong-Jin Lee

Abstract: The task of speaker change detection (SCD), which detects points where speakers change in an input, is essential for several applications. Several studies solved the SCD task using audio inputs only and have shown limited performance. Recently, multimodal SCD (MMSCD) models, which utilise text modality in addition to audio, have shown improved performance. In this study, the proposed model is built upon two main proposals, a novel mechanism for modality fusion and the adoption of an encoder-decoder architecture. Different to previous MMSCD works that extract speaker embeddings from extremely short audio segments, aligned to a single word, we use a speaker embedding extracted from 1.5s. A transformer decoder layer further improves the performance of an encoder-only MMSCD model. The proposed model achieves state-of-the-art results among studies that report SCD performance and is also on par with recent work that combines SCD with automatic speech recognition via human transcription.

Submitted 1 June, 2023; originally announced June 2023.

Comments: 5 pages, accepted for presentation at INTERSPEECH 2023

arXiv:2305.15630 [pdf]

doi 10.1109/TBC.2023.3311330

Multicast and Unicast Superposition Transmission in MIMO OFDMA Systems with Statistical CSIT

Authors: Yong Jin Daniel Kim, David Vargas

Abstract: We consider a downlink multicast and unicast superposition transmission in multi-layer Multiple-Input Multiple-Output (MIMO) Orthogonal Frequency Division Multiple Access (OFDMA) systems when only the statistical channel state information is available at the transmitter (CSIT). Multiple users can be scheduled by using the time/frequency resources in OFDMA, while for each scheduled user MIMO spatial multiplexing is used to transmit multiple information layers, i.e., single user (SU)-MIMO. The users only need to feed back to the base-station the rank-indicator and the long-term average channel signal-to-noise ratio to indicate a suitable number of transmission layers, a suitable modulation and coding scheme and allow the base-station to perform user scheduling. This approach is especially relevant for the delivery of common (e.g., popular live event) and independent (e.g., user personalized) content to a high number of users in deployments in the lower frequency bands operating in Frequency-Division-Duplex (FDD) mode, e.g., sub-1 GHz. We show that the optimal resource allocation that maximizes the ergodic sum-rate involves greedy user selection per OFDM subchannel and superposition transmission of one multicast signal across all subchannels and a single unicast signal per subchannel. Degree-of-freedom (DoF) analysis shows that while the lack of instantaneous CSI limits DoF of unicast messages to the minimum number of transmit antennas and receiver antennas, the multicast message obtains full DoF that increases linearly with the number of users. We present resource allocation algorithms consisting of user selection and power allocation between multicast and unicast signals in each OFDM subchannel. System level simulations in 5G rural macro-cell scenarios show overall network throughput gains in realistic network environments by superposition transmission of multicast and unicast signals.

Submitted 29 September, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

Comments: 17 pages, 10 figures, 2 tables

arXiv:2302.09210 [pdf, other]

How Good Are GPT Models at Machine Translation? A Comprehensive Evaluation

Authors: Amr Hendy, Mohamed Abdelrehim, Amr Sharaf, Vikas Raunak, Mohamed Gabr, Hitokazu Matsushita, Young Jin Kim, Mohamed Afify, Hany Hassan Awadalla

Abstract: Generative Pre-trained Transformer (GPT) models have shown remarkable capabilities for natural language generation, but their performance for machine translation has not been thoroughly investigated. In this paper, we present a comprehensive evaluation of GPT models for machine translation, covering various aspects such as quality of different GPT models in comparison with state-of-the-art research and commercial systems, effect of prompting strategies, robustness towards domain shifts and document-level translation. We experiment with eighteen different translation directions involving high and low resource languages, as well as non English-centric translations, and evaluate the performance of three GPT models: ChatGPT, GPT3.5 (text-davinci-003), and text-davinci-002. Our results show that GPT models achieve very competitive translation quality for high resource languages, while having limited capabilities for low resource languages. We also show that hybrid approaches, which combine GPT models with other translation systems, can further enhance the translation quality. We perform comprehensive analysis and human evaluation to further understand the characteristics of GPT translations. We hope that our paper provides valuable insights for researchers and practitioners in the field and helps to better understand the potential and limitations of GPT models for translation.

Submitted 17 February, 2023; originally announced February 2023.

arXiv:2211.10017 [pdf, other]

Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production

Authors: Young Jin Kim, Rawn Henry, Raffy Fahim, Hany Hassan Awadalla

Abstract: Mixture of Experts (MoE) models with conditional execution of sparsely activated layers have enabled training models with a much larger number of parameters. As a result, these models have achieved significantly better quality on various natural language processing tasks including machine translation. However, it remains challenging to deploy such models in real-life scenarios due to the large memory requirements and inefficient inference. In this work, we introduce a highly efficient inference framework with several optimization approaches to accelerate the computation of sparse models and cut down the memory consumption significantly. While we achieve up to 26x speed-up in terms of throughput, we also reduce the model size almost to one eighth of the original 32-bit float model by quantizing expert weights into 4-bit integers. As a result, we are able to deploy 136x larger models with 27% less cost and significantly better quality compared to the existing solutions. This enables a paradigm shift in deploying large scale multilingual MoE transformer models, replacing the traditional practice of distilling teacher models into dozens of smaller models per language or task.

Submitted 17 November, 2022; originally announced November 2022.

Comments: Accepted to SustaiNLP 2022 (EMNLP 2022)
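
The "almost to one eighth" figure in the abstract above follows from simple bit-width arithmetic (4 bits / 32 bits = 1/8). The sketch below only illustrates that arithmetic; it is not the paper's implementation. The function name `compressed_size_ratio` and the parameter `expert_fraction` are hypothetical, and the sketch assumes that non-expert weights stay at the baseline width and that the storage needed for quantization scales is ignored.

```python
def compressed_size_ratio(expert_fraction: float,
                          expert_bits: int = 4,
                          baseline_bits: int = 32) -> float:
    """Return compressed size divided by original size.

    Assumption (not from the abstract): only the expert weights are
    quantized, all other weights keep the baseline width, and the
    overhead of quantization scales is ignored.
    """
    if not 0.0 <= expert_fraction <= 1.0:
        raise ValueError("expert_fraction must lie in [0, 1]")
    # Share of the original size still occupied by the quantized experts.
    quantized_part = expert_fraction * expert_bits / baseline_bits
    # Share occupied by weights that were left untouched.
    untouched_part = 1.0 - expert_fraction
    return quantized_part + untouched_part


if __name__ == "__main__":
    # Hypothetical expert fractions, chosen only for illustration.
    for fraction in (1.0, 0.99, 0.95):
        print(f"expert fraction {fraction:.2f}: "
              f"size ratio {compressed_size_ratio(fraction):.4f}")
```

Under these assumptions the ratio reaches exactly one eighth only when every parameter is an expert weight; any unquantized remainder keeps the result slightly above it, which is consistent with the abstract's "almost".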

arXiv:2211.04768 [pdf, other]

Absolute decision corrupts absolutely: conservative online speaker diarisation

Authors: Youngki Kwon, Hee-Soo Heo, Bong-Jin Lee, You Jin Kim, Jee-weon Jung

Abstract: Our focus lies in developing an online speaker diarisation framework which demonstrates robust performance across diverse domains. In online speaker diarisation, outputs generated in real-time are irreversible, and a few misjudgements in the early phase of an input session can lead to catastrophic results. We hypothesise that cautiously increasing the number of estimated speakers is of paramount importance among many other factors. Thus, our proposed framework includes decreasing the number of speakers by one when the system judges that an increase in the past was faulty. We also adopt dual buffers, checkpoints and centroids, where checkpoints are combined with silhouette coefficients to estimate the number of speakers and centroids represent speakers. Again, we believe that more than one centroid can be generated from one speaker. Thus we design a clustering-based label matching technique to assign labels in real-time. The resulting system is lightweight yet surprisingly effective. The system demonstrates state-of-the-art performance on DIHARD 2 and 3 datasets, where it is also competitive in AMI and VoxConverse test sets.

Submitted 9 November, 2022; originally announced November 2022.

Comments: 5 pages, 2 figures, 4 tables, submitted to ICASSP

arXiv:2211.04060 [pdf, other]

High-resolution embedding extractor for speaker diarisation

Authors: Hee-Soo Heo, Youngki Kwon, Bong-Jin Lee, You Jin Kim, Jee-weon Jung

Abstract: Speaker embedding extractors significantly influence the performance of clustering-based speaker diarisation systems. Conventionally, only one embedding is extracted from each speech segment. However, because of the sliding window approach, a segment easily includes two or more speakers owing to speaker change points. This study proposes a novel embedding extractor architecture, referred to as a high-resolution embedding extractor (HEE), which extracts multiple high-resolution embeddings from each speech segment. HEE consists of a feature-map extractor and an enhancer, where the enhancer with the self-attention mechanism is the key to success. The enhancer of HEE replaces the aggregation process; instead of a global pooling layer, the enhancer combines relative information to each frame via attention leveraging the global context. Extracted dense frame-level embeddings can each represent a speaker. Thus, multiple speakers can be represented by different frame-level features in each segment. We also propose a training framework that artificially generates mixture data to train the proposed HEE. Through experiments on five evaluation sets, including four public datasets, the proposed HEE demonstrates at least 10% improvement on each evaluation set, except for one dataset, where our analysis indicates that rapid speaker changes are less frequent.

Submitted 8 November, 2022; originally announced November 2022.

Comments: 5 pages, 2 figures, 3 tables, submitted to ICASSP

arXiv:2210.07592 [pdf, other]

TSP-Bot: Robotic TSP Pen Art using High-DoF Manipulators

Authors: Daeun Song, Eunjung Lim, Jiyoon Park, Minjung Jung, Young J. Kim

Abstract: TSP art is an art form for drawing an image using piecewise-continuous line segments. We present TSP-Bot, a robotic pen drawing system capable of creating complicated TSP pen art on a planar surface using multiple colors. The system begins by converting a colored raster image into a set of points that represent the image's tone, which can be controlled by adjusting the point density. Next, the system finds a piecewise-continuous linear path that visits each point exactly once, which is equivalent to solving a Traveling Salesman Problem (TSP). The path is simplified with fewer points using bounded approximation and smoothed and optimized using Bezier spline curves with bounded curvature. Our robotic drawing system consisting of single or dual manipulators with fingered grippers and a mobile platform performs the drawing task by following the resulting complex and sophisticated path composed of thousands of TSP sites. As a result, our system can draw complicated and visually pleasing TSP pen art.

Submitted 10 April, 2024; v1 submitted 14 October, 2022; originally announced October 2022.

arXiv:2210.07590 [pdf, other]

Stroke-based Rendering and Planning for Robotic Performance of Artistic Drawing

Authors: Ivaylo Ilinkin, Daeun Song, Young J. Kim

Abstract: We present a new robotic drawing system based on stroke-based rendering (SBR). Our motivation is the artistic quality of the whole performance. Not only should the generated strokes in the final drawing resemble the input image, but the stroke sequence should also exhibit a human artist's planning process. Thus, when a robot executes the drawing task, both the drawing results and the way the robot executes would look artistic. Our SBR system is based on image segmentation and depth estimation. It generates the drawing strokes in an order that allows for the intended shape to be perceived quickly and for its detailed features to be filled in and emerge gradually when observed by the human. This ordering represents a stroke plan that the drawing robot should follow to create an artistic rendering of images. We experimentally demonstrate that our SBR-based drawing makes visually pleasing artistic images, and our robotic system can replicate the result with proper sequences of stroke drawing.

Submitted 3 March, 2023; v1 submitted 14 October, 2022; originally announced October 2022.

Comments: Submitted to IEEE IROS 2023

arXiv:2210.07535 [pdf, other]

AutoMoE: Heterogeneous Mixture-of-Experts with Adaptive Computation for Efficient Neural Machine Translation

Authors: Ganesh Jawahar, Subhabrata Mukherjee, Xiaodong Liu, Young Jin Kim, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan, Ahmed Hassan Awadallah, Sebastien Bubeck, Jianfeng Gao

Abstract: Mixture-of-Expert (MoE) models have obtained state-of-the-art performance in Neural Machine Translation (NMT) tasks. Existing works in MoE mostly consider a homogeneous design where the same number of experts of the same size are placed uniformly throughout the network. Furthermore, existing MoE works do not consider computational constraints (e.g., FLOPs, latency) to guide their design. To this end, we develop AutoMoE -- a framework for designing heterogeneous MoE's under computational constraints. AutoMoE leverages Neural Architecture Search (NAS) to obtain efficient sparse MoE sub-transformers with 4x inference speedup (CPU) and FLOPs reduction over manually designed Transformers, with parity in BLEU score over dense Transformer and within 1 BLEU point of MoE SwitchTransformer, on aggregate over benchmark datasets for NMT. Heterogeneous search space with dense and sparsely activated Transformer modules (e.g., how many experts? where to place them? what should be their sizes?) allows for adaptive compute -- where different amounts of computations are used for different tokens in the input. Adaptivity comes naturally from routing decisions which send tokens to experts of different sizes. AutoMoE code, data, and trained models are available at https://aka.ms/AutoMoE.

Submitted 7 June, 2023; v1 submitted 14 October, 2022; originally announced October 2022.

Comments: ACL 2023 Findings

arXiv:2209.02966 [pdf, other]

ExpTrialMng: A Universal Experiment Trial Manager for AR/VR/MR Experiments based on Unity

Authors: Jinwook Kim, Yee Joon Kim, Jeongmi Lee

Abstract: Based on the improvement of recent virtual and augmented reality (VR and AR) Head Mounted Display (HMD), there have been attempts to adopt VR and AR in various fields. Since VR and AR could provide more immersive experimental environments and stimuli than 2D settings in a cost-efficient way, psychological and cognitive researchers are particularly interested in using these platforms. However, there is still an entry barrier for researchers who are not familiar with Unity programming, and current VR/AR HMDs could also cause unexpected errors during the experiment. Therefore, we developed a Unity library that can be adopted in various experiments universally and assist researchers in developing their own. Our library provides functions related to trial assignment and results saving. That way, researchers can easily implement the essential functions of their psychological experiments. We also made a function that enables proceeding with the experiment from a specific trial point to handle unexpected errors caused by HMD tracking loss issues during the experiment. We expect our library could invite researchers from various disciplines and help them acquire valuable insights in VR/AR environments.

Submitted 7 September, 2022; originally announced September 2022.

Comments: 5 pages, 3 figures, https://github.com/jinwook31/Unity-Experiment-Trial-Manager

arXiv:2208.06874 [pdf, other]

Fast Vocabulary Projection Method via Clustering for Multilingual Machine Translation on GPU

Authors: Hossam Amer, Young Jin Kim, Mohamed Afify, Hitokazu Matsushita, Hany Hassan Awadallah

Abstract: Multilingual Neural Machine Translation has been showing great success using transformer models. Deploying these models is challenging because they usually require large vocabulary (vocab) sizes for various languages. This limits the speed of predicting the output tokens in the last vocab projection layer. To alleviate these challenges, this paper proposes a fast vocabulary projection method via clustering which can be used for multilingual transformers on GPUs. First, we offline split the vocab search space into disjoint clusters given the hidden context vector of the decoder output, which results in much smaller vocab columns for vocab projection. Second, at inference time, the proposed method predicts the clusters and candidate active tokens for hidden context vectors at the vocab projection. This paper also includes analysis of different ways of building these clusters in multilingual settings. Our results show end-to-end speed gains in float16 GPU inference up to 25% while maintaining the BLEU score and slightly increasing memory cost. The proposed method speeds up the vocab projection step itself by up to 2.6x. We also conduct an extensive human evaluation to verify the proposed method preserves the quality of the translations from the original model.

Submitted 14 August, 2022; originally announced August 2022.

Comments: 12 pages, accepted at AMTA-2022 (Association for Machine Translation in the Americas Conference)

arXiv:2206.03715 [pdf, other]

Modularized Transfer Learning with Multiple Knowledge Graphs for Zero-shot Commonsense Reasoning

Authors: Yu Jin Kim, Beong-woo Kwak, Youngwook Kim, Reinald Kim Amplayo, Seung-won Hwang, Jinyoung Yeo

Abstract: Commonsense reasoning systems should be able to generalize to diverse reasoning cases. However, most state-of-the-art approaches depend on expensive data annotations and overfit to a specific benchmark without learning how to perform general semantic reasoning. To overcome these drawbacks, zero-shot QA systems have shown promise as a robust learning scheme by transforming a commonsense knowledge graph (KG) into synthetic QA-form samples for model training. Considering the increasing variety of commonsense KGs, this paper aims to extend the zero-shot transfer learning scenario into multiple-source settings, where different KGs can be utilized synergetically. Towards this goal, we propose to mitigate the loss of knowledge from the interference among the different knowledge sources, by developing a modular variant of the knowledge aggregation as a new zero-shot commonsense reasoning framework. Results on five commonsense reasoning benchmarks demonstrate the efficacy of our framework, improving the performance with multiple KGs.

Submitted 22 June, 2022; v1 submitted 8 June, 2022; originally announced June 2022.

Comments: Accepted to NAACL2022

arXiv:2205.14336 [pdf, other]

Gating Dropout: Communication-efficient Regularization for Sparsely Activated Transformers

Authors: Rui Liu, Young Jin Kim, Alexandre Muzio, Hany Hassan Awadalla

Abstract: Sparsely activated transformers, such as Mixture of Experts (MoE), have received great interest due to their outrageous scaling capability which enables dramatic increases in model size without significant increases in computational cost. To achieve this, MoE models replace the feedforward sub-layer with Mixture-of-Experts sub-layer in transformers and use a gating network to route each token to its assigned experts. Since the common practice for efficient training of such models requires distributing experts and tokens across different machines, this routing strategy often incurs huge cross-machine communication cost because tokens and their assigned experts likely reside in different machines. In this paper, we propose Gating Dropout, which allows tokens to ignore the gating network and stay at their local machines, thus reducing the cross-machine communication. Similar to traditional dropout, we also show that Gating Dropout has a regularization effect during training, resulting in improved generalization performance. We validate the effectiveness of Gating Dropout on multilingual machine translation tasks. Our results demonstrate that Gating Dropout improves a state-of-the-art MoE model with faster wall-clock time convergence rates and better BLEU scores for a variety of model sizes and datasets.

Submitted 4 July, 2022; v1 submitted 28 May, 2022; originally announced May 2022.

Comments: Accepted to ICML 2022

arXiv:2203.08488 [pdf, other]

Pushing the limits of raw waveform speaker recognition

Authors: Jee-weon Jung, You Jin Kim, Hee-Soo Heo, Bong-Jin Lee, Youngki Kwon, Joon Son Chung

Abstract: In recent years, speaker recognition systems based on raw waveform inputs have received increasing attention. However, the performance of such systems is typically inferior to the state-of-the-art handcrafted feature-based counterparts, which demonstrate equal error rates under 1% on the popular VoxCeleb1 test set. This paper proposes a novel speaker recognition model based on raw waveform inputs. The model incorporates recent advances in machine learning and speaker verification, including the Res2Net backbone module and multi-layer feature aggregation. Our best model achieves an equal error rate of 0.89%, which is competitive with the state-of-the-art models based on handcrafted features, and outperforms the best model based on raw waveform inputs by a large margin. We also explore the application of the proposed model in the context of self-supervised learning framework. Our self-supervised model outperforms single phase-based existing works in this line of research. Finally, we show that self-supervised pre-training is effective for the semi-supervised scenario where we only have a small set of labelled training data, along with a larger set of unlabelled examples.

Submitted 28 March, 2022; v1 submitted 16 March, 2022; originally announced March 2022.

Comments: submitted to INTERSPEECH 2022 as a conference paper. 5 pages, 2 figures, 5 tables

arXiv:2201.11661 [pdf, other]

TrustAL: Trustworthy Active Learning using Knowledge Distillation

Authors: Beong-woo Kwak, Youngwook Kim, Yu Jin Kim, Seung-won Hwang, Jinyoung Yeo

Abstract: Active learning can be defined as iterations of data labeling, model training, and data acquisition, until sufficient labels are acquired. A traditional view of data acquisition is that, through iterations, knowledge from human labels and models is implicitly distilled to monotonically increase the accuracy and label consistency. Under this assumption, the most recently trained model is a good surrogate for the current labeled data, from which data acquisition is requested based on uncertainty/diversity. Our contribution is debunking this myth and proposing a new objective for distillation. First, we found example forgetting, which indicates the loss of knowledge learned across iterations. Second, for this reason, the last model is no longer the best teacher. To mitigate such forgotten knowledge, we select one of its predecessor models as a teacher, by our proposed notion of "consistency". We show that this novel distillation is distinctive in the following three aspects. First, consistency helps avoid forgetting labels. Second, consistency improves both the uncertainty and diversity of labeled data. Lastly, consistency redeems defective labels produced by human annotators.

Submitted 26 January, 2022; originally announced January 2022.

Comments: Accepted to AAAI2022

arXiv:2110.04260 [pdf, other]

Taming Sparsely Activated Transformer with Stochastic Experts

Authors: Simiao Zuo, Xiaodong Liu, Jian Jiao, Young Jin Kim, Hany Hassan, Ruofei Zhang, Tuo Zhao, Jianfeng Gao

Abstract: Sparsely activated models (SAMs), such as Mixture-of-Experts (MoE), can easily scale to have outrageously large amounts of parameters without significant increase in computational cost. However, SAMs are reported to be parameter inefficient such that larger models do not always lead to better performance. While most on-going research focuses on improving SAMs by exploring methods of routing inputs to experts, our analysis reveals that such research might not lead to the solution we expect, i.e., the commonly-used routing methods based on gating mechanisms do not work better than randomly routing inputs to experts. In this paper, we propose a new expert-based model, THOR (Transformer witH StOchastic ExpeRts). Unlike classic expert-based models, such as the Switch Transformer, experts in THOR are randomly activated for each input during training and inference. THOR models are trained using a consistency regularized loss, where experts learn not only from training data but also from other experts as teachers, such that all the experts make consistent predictions. We validate the effectiveness of THOR on machine translation tasks. Results show that THOR models are more parameter efficient in that they significantly outperform the Transformer and MoE models across various settings. For example, in multilingual translation, THOR outperforms the Switch Transformer by 2 BLEU points, and obtains the same BLEU score as that of a state-of-the-art MoE model that is 18 times larger. Our code is publicly available at: https://github.com/microsoft/Stochastic-Mixture-of-Experts.

Submitted 3 February, 2022; v1 submitted 8 October, 2021; originally announced October 2021.

Comments: ICLR 2022

arXiv:2110.03380 [pdf, other]

Advancing the dimensionality reduction of speaker embeddings for speaker diarisation: disentangling noise and informing speech activity

Authors: You Jin Kim, Hee-Soo Heo, Jee-weon Jung, Youngki Kwon, Bong-Jin Lee, Joon Son Chung

Abstract: The objective of this work is to train noise-robust speaker embeddings adapted for speaker diarisation. Speaker embeddings play a crucial role in the performance of diarisation systems, but they often capture spurious information such as noise, adversely affecting performance. Our previous work has proposed an auto-encoder-based dimensionality reduction module to help remove the redundant information. However, they do not explicitly separate such information and have also been found to be sensitive to hyper-parameter values. To this end, we propose two contributions to overcome these issues: (i) a novel dimensionality reduction framework that can disentangle spurious information from the speaker embeddings; (ii) the use of speech activity vector to prevent the speaker code from representing the background noise. Through a range of experiments conducted on four datasets, our approach consistently demonstrates the state-of-the-art performance among models without system fusion.

Submitted 3 November, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

Comments: This paper was submitted to ICASSP 2023

arXiv:2110.03361 [pdf, other]

Multi-scale speaker embedding-based graph attention networks for speaker diarisation

Authors: Youngki Kwon, Hee-Soo Heo, Jee-weon Jung, You Jin Kim, Bong-Jin Lee, Joon Son Chung

Abstract: The objective of this work is effective speaker diarisation using multi-scale speaker embeddings. Typically, there is a trade-off between the ability to recognise short speaker segments and the discriminative power of the embedding, according to the segment length used for embedding extraction. To this end, recent works have proposed the use of multi-scale embeddings where segments with varying lengths are used. However, the scores are combined using a weighted summation scheme where the weights are fixed after the training phase, whereas the importance of segment lengths can differ within a single session. To address this issue, we present three key contributions in this paper: (1) we propose graph attention networks for multi-scale speaker diarisation; (2) we design scale indicators to utilise scale information of each embedding; (3) we adapt the attention-based aggregation to utilise a pre-computed affinity matrix from multi-scale embeddings. We demonstrate the effectiveness of our method in various datasets where the speaker confusion, which constitutes the primary metric, drops by over 10% relative on average compared to the baseline.

Submitted 7 October, 2021; originally announced October 2021.

Comments: 5 pages, 2 figures, submitted to ICASSP as a conference paper

arXiv:2109.10465 [pdf, other]

Scalable and Efficient MoE Training for Multitask Multilingual Models

Authors: Young Jin Kim, Ammar Ahmad Awan, Alexandre Muzio, Andres Felipe Cruz Salinas, Liyang Lu, Amr Hendy, Samyam Rajbhandari, Yuxiong He, Hany Hassan Awadalla

Abstract: The Mixture of Experts (MoE) models are an emerging class of sparsely activated deep learning models that have sublinear compute costs with respect to their parameters. In contrast with dense models, the sparse architecture of MoE offers opportunities for drastically growing model size with significant accuracy gain while consuming much lower compute budget. However, supporting large scale MoE training also has its own set of system and modeling challenges. To overcome the challenges and embrace the opportunities of MoE, we first develop a system capable of scaling MoE models efficiently to trillions of parameters. It combines multi-dimensional parallelism and heterogeneous memory technologies harmoniously with MoE to empower 8x larger models on the same hardware compared with existing work. Besides boosting system efficiency, we also present new training methods to improve MoE sample efficiency and leverage expert pruning strategy to improve inference time efficiency. By combining the efficient system and training methods, we are able to significantly scale up large multitask multilingual models for language generation which results in a great improvement in model accuracy. A model trained with 10 billion parameters on 50 languages can achieve state-of-the-art performance in Machine Translation (MT) and multilingual natural language generation tasks. The system support of efficient MoE training has been implemented and open-sourced with the DeepSpeed library.

Submitted 21 September, 2021; originally announced September 2021.

arXiv:2108.07640 [pdf, other]

Look Who's Talking: Active Speaker Detection in the Wild

Authors: You Jin Kim, Hee-Soo Heo, Soyeon Choe, Soo-Whan Chung, Yoohwan Kwon, Bong-Jin Lee, Youngki Kwon, Joon Son Chung

Abstract: In this work, we present a novel audio-visual dataset for active speaker detection in the wild. A speaker is considered active when his or her face is visible and the voice is audible simultaneously. Although active speaker detection is a crucial pre-processing step for many audio-visual tasks, there is no existing dataset of natural human speech to evaluate the performance of active speaker detection. We therefore curate the Active Speakers in the Wild (ASW) dataset which contains videos and co-occurring speech segments with dense speech activity labels. Videos and timestamps of audible segments are parsed and adopted from VoxConverse, an existing speaker diarisation dataset that consists of videos in the wild. Face tracks are extracted from the videos and active segments are annotated based on the timestamps of VoxConverse in a semi-automatic way. Two reference systems, a self-supervised system and a fully supervised one, are evaluated on the dataset to provide the baseline performances of ASW. Cross-domain evaluation is conducted in order to show the negative effect of dubbed videos in the training data.

Submitted 17 August, 2021; originally announced August 2021.

Comments: To appear in Interspeech 2021. Data will be available from https://github.com/clovaai/lookwhostalking

arXiv:2107.08737 [pdf, other]

Synthesizing Human Faces using Latent Space Factorization and Local Weights (Extended Version)

Authors: Minyoung Kim, Young J. Kim

Abstract: We propose a 3D face generative model with local weights to increase the model's variations and expressiveness. The proposed model allows partial manipulation of the face while still learning the whole face mesh. For this purpose, we address an effective way to extract local facial features from the entire data and explore a way to manipulate them during a holistic generation. First, we factorize the latent space of the whole face to the subspace indicating different parts of the face. In addition, local weights generated by non-negative matrix factorization are applied to the factorized latent space so that the decomposed part space is semantically meaningful. We experiment with our model and observe that effective facial part manipulation is possible and that the model's expressiveness is improved.

Submitted 19 July, 2021; originally announced July 2021.

Comments: Extended version of the paper to be published in Computer Graphics International 2021 (LNCS Proceeding Papers)

arXiv:2105.02580 [pdf, other]

Time-Aware Q-Networks: Resolving Temporal Irregularity for Deep Reinforcement Learning

Authors: Yeo Jin Kim, Min Chi

Abstract: Deep Reinforcement Learning (DRL) has shown outstanding performance on inducing effective action policies that maximize expected long-term return on many complex tasks. Much of the DRL work has focused on sequences of events with discrete time steps and ignores the irregular time intervals between consecutive events. In many real-world domains, data often consists of temporal sequences with irregular time intervals, and it is important to consider the time intervals between temporal events to capture latent progressive patterns of states. In this work, we present a general Time-Aware RL framework: Time-aware Q-Networks (TQN), which takes into account physical time intervals within a deep RL framework. TQN deals with time irregularity from two aspects: 1) elapsed time in the past and an expected next observation time for time-aware state approximation, and 2) action time window for the future for time-aware discounting of rewards. Experimental results show that by capturing the underlying structures in the sequences with time irregularities from both aspects, TQNs significantly outperform DQN in four types of contexts with irregular time intervals. More specifically, our results show that in classic RL tasks such as CartPole and MountainCar and Atari benchmark with randomly segmented time intervals, time-aware discounting alone is more important, while in the real-world tasks such as nuclear reactor operation and septic patient treatment with intrinsic time intervals, both time-aware state and time-aware discounting are crucial. Moreover, to improve the agent's learning capacity, we explored three boosting methods: Double networks, Dueling networks, and Prioritized Experience Replay, and our results show that for the two real-world tasks, combining all three boosting methods with TQN is especially effective.

Submitted 6 May, 2021; originally announced May 2021.

Comments: 36 pages, 27 figures
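
Note: the abstract above mentions time-aware discounting of rewards but does not give a formula. As a rough illustration only, the sketch below assumes one simple way such discounting could work: the discount exponent is the elapsed physical time divided by a reference interval `tau`, so that regular intervals of length `tau` recover the usual per-step discount. The function name, the `tau` parameter, and the formula itself are assumptions made here, not the formulation used in TQN.

```python
def time_aware_return(rewards, intervals, gamma=0.99, tau=1.0):
    """Discounted return under irregular time intervals (illustrative only).

    rewards[i]   -- reward received at event i
    intervals[i] -- physical time between event i and event i + 1
    gamma        -- discount factor per reference interval
    tau          -- reference interval length (hypothetical parameter)

    Assumed rule: a reward observed after `elapsed` time units is
    weighted by gamma ** (elapsed / tau).
    """
    total = 0.0
    elapsed = 0.0
    for reward, dt in zip(rewards, intervals):
        total += (gamma ** (elapsed / tau)) * reward
        elapsed += dt  # advance the clock by the irregular interval
    return total


if __name__ == "__main__":
    # Regular unit intervals behave like ordinary step-based discounting.
    print(time_aware_return([1, 1, 1], [1, 1, 1], gamma=0.5))
    # Longer gaps between events discount later rewards more heavily.
    print(time_aware_return([1, 1, 1], [2, 2, 2], gamma=0.5))
```

Under this assumed rule, two trajectories with identical rewards but different gaps between events receive different returns, which is the kind of distinction step-based discounting cannot make.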

arXiv:2105.00568 [pdf, other]

InferNet for Delayed Reinforcement Tasks: Addressing the Temporal Credit Assignment Problem

Authors: Markel Sanz Ausin, Hamoon Azizsoltani, Song Ju, Yeo Jin Kim, Min Chi

Abstract: The temporal Credit Assignment Problem (CAP) is a well-known and challenging task in AI. While Reinforcement Learning (RL), especially Deep RL, works well when immediate rewards are available, it can fail when only delayed rewards are available or when the reward function is noisy. In this work, we propose delegating the CAP to a Neural Network-based algorithm named InferNet that explicitly learns to infer the immediate rewards from the delayed rewards. The effectiveness of InferNet was evaluated on two online RL tasks: a simple GridWorld and 40 Atari games; and two offline RL tasks: GridWorld and a real-life Sepsis treatment task. For all tasks, the effectiveness of using the InferNet inferred rewards is compared against the immediate and the delayed rewards with two settings: with noisy rewards and without noise. Overall, our results show that the effectiveness of InferNet is robust against noisy reward functions and is an effective add-on mechanism for solving temporal CAP in a wide range of RL tasks, from classic RL simulation environments to a real-world RL problem and for both online and offline learning.

Submitted 2 May, 2021; originally announced May 2021.

arXiv:2104.02879 [pdf, other]

Adapting Speaker Embeddings for Speaker Diarisation

Authors: Youngki Kwon, Jee-weon Jung, Hee-Soo Heo, You Jin Kim, Bong-Jin Lee, Joon Son Chung

Abstract: The goal of this paper is to adapt speaker embeddings for solving the problem of speaker diarisation. The quality of speaker embeddings is paramount to the performance of speaker diarisation systems. Despite this, prior works in the field have directly used embeddings designed only to be effective on the speaker verification task. In this paper, we propose three techniques that can be used to better adapt the speaker embeddings for diarisation: dimensionality reduction, attention-based embedding aggregation, and non-speech clustering. A wide range of experiments is performed on various challenging datasets. The results demonstrate that all three techniques contribute positively to the performance of the diarisation system achieving an average relative improvement of 25.07% in terms of diarisation error rate over the baseline.

Submitted 6 April, 2021; originally announced April 2021.

Comments: 5 pages, 2 figures, 3 tables, submitted to Interspeech as a conference paper

arXiv:2011.10348 [pdf, other]

Accelerating Probabilistic Volumetric Mapping using Ray-Tracing Graphics Hardware

Authors: Heajung Min, Kyung Min Han, Young J. Kim

Abstract: Probabilistic volumetric mapping (PVM) represents a 3D environmental map for an autonomous robotic navigational task. A popular implementation such as Octomap is widely used in the robotics community for such a purpose. The Octomap relies on octree to represent a PVM and its main bottleneck lies in massive ray-shooting to determine the occupancy of the underlying volumetric voxel grids. In this paper, we propose GPU-based ray shooting to drastically improve the ray shooting performance in Octomap. Our main idea is based on the use of recent ray-tracing RTX GPU, mainly designed for real-time photo-realistic computer graphics and the accompanying graphics API, known as DXR. Our ray-shooting first maps leaf-level voxels in the given octree to a set of axis-aligned bounding boxes (AABBs) and employs massively parallel ray shooting on them using GPUs to find free and occupied voxels. These are fed back into CPU to update the voxel occupancy and restructure the octree. In our experiments, we have observed more than three-orders-of-magnitude performance improvement in terms of ray shooting using ray-tracing RTX GPU over a state-of-the-art Octomap CPU implementation, where the benchmarking environments consist of more than 77K points and 25K~34K voxel grids.

Submitted 2 December, 2020; v1 submitted 20 November, 2020; originally announced November 2020.

Comments: Submitted to IEEE International Conference on Robotics and Automation

arXiv:2011.09772 [pdf, other]

doi 10.1109/LRA.2021.3088797

Solving Footstep Planning as a Feasibility Problem using L1-norm Minimization (Extended Version)

Authors: Daeun Song, Pierre Fernbach, Thomas Flayols, Andrea Del Prete, Nicolas Mansard, Steve Tonneau, Young J. Kim

Abstract: One challenge of legged locomotion on uneven terrains is to deal with both the discrete problem of selecting a contact surface for each footstep and the continuous problem of placing each footstep on the selected surface. Consequently, footstep planning can be addressed with a Mixed Integer Program (MIP), an elegant but computationally-demanding method, which can make it unsuitable for online planning. We reformulate the MIP into a cardinality problem, then approximate it as a computationally efficient l1-norm minimisation, called SL1M. Moreover, we improve the performance and convergence of SL1M by combining it with a sampling-based root trajectory planner to prune irrelevant surface candidates. Our tests on the humanoid Talos in four representative scenarios show that SL1M always converges faster than MIP. For scenarios when the combinatorial complexity is small (< 10 surfaces per step), SL1M converges at least two times faster than MIP with no need for pruning. In more complex cases, SL1M converges up to 100 times faster than MIP with the help of pruning. Moreover, pruning can also improve the MIP computation time. The versatility of the framework is shown with additional tests on the quadruped robot ANYmal.

Submitted 16 May, 2021; v1 submitted 19 November, 2020; originally announced November 2020.

Comments: Extended version of the paper to be published in IEEE Robotics and Automation Letters

Journal ref: IEEE Robotics and Automation Letters, Volume 6, Issue 3, July 2021

arXiv:2010.13382 [pdf, other]

FastFormers: Highly Efficient Transformer Models for Natural Language Understanding

Authors: Young Jin Kim, Hany Hassan Awadalla

Abstract: Transformer-based models are the state-of-the-art for Natural Language Understanding (NLU) applications. Models are getting bigger and better on various tasks. However, Transformer models remain computationally challenging since they are not efficient at inference-time compared to traditional approaches. In this paper, we present FastFormers, a set of recipes to achieve efficient inference-time performance for Transformer-based models on various NLU tasks. We show how carefully utilizing knowledge distillation, structured pruning and numerical optimization can lead to drastic improvements on inference efficiency. We provide effective recipes that can guide practitioners to choose the best settings for various NLU tasks and pretrained models. Applying the proposed recipes to the SuperGLUE benchmark, we achieve from 9.8x up to 233.9x speed-up compared to out-of-the-box models on CPU. On GPU, we also achieve up to 12.4x speed-up with the presented methods. We show that FastFormers can drastically reduce cost of serving 100 million requests from 4,223 USD to just 18 USD on an Azure F16s_v2 instance. This translates to a sustainable runtime by reducing energy consumption 6.9x - 125.8x according to the metrics used in the SustaiNLP 2020 shared task.

Submitted 26 October, 2020; originally announced October 2020.

Comments: Accepted to SustaiNLP 2020 at EMNLP 2020

arXiv:2005.08606 [pdf, other]

End-to-End Lip Synchronisation Based on Pattern Classification

Authors: You Jin Kim, Hee Soo Heo, Soo-Whan Chung, Bong-Jin Lee

Abstract: The goal of this work is to synchronise audio and video of a talking face using deep neural network models. Existing works have trained networks on proxy tasks such as cross-modal similarity learning, and then computed similarities between audio and video frames using a sliding window approach. While these methods demonstrate satisfactory performance, the networks are not trained directly on the task. To this end, we propose an end-to-end trained network that can directly predict the offset between an audio stream and the corresponding video stream. The similarity matrix between the two modalities is first computed from the features, then the inference of the offset can be considered to be a pattern recognition problem where the matrix is considered equivalent to an image. The feature extractor and the classifier are trained jointly. We demonstrate that the proposed approach outperforms the previous work by a large margin on LRS2 and LRS3 datasets.

Submitted 19 March, 2021; v1 submitted 18 May, 2020; originally announced May 2020.

Comments: Accepted to SLT 2021

arXiv:1911.06144 [pdf, other]

A Penetration Metric for Deforming Tetrahedra using Object Norm

Authors: Jisu Kim, Young J. Kim

Abstract: In this paper, we propose a novel penetration metric, called deformable penetration depth PDd, to define a measure of inter-penetration between two linearly deforming tetrahedra using the object norm. First of all, we show that a distance metric for a tetrahedron deforming between two configurations can be found in closed form based on object norm. Then, we show that the PDd between an intersecting pair of static and deforming tetrahedra can be found by solving a quadratic programming (QP) problem in terms of the distance metric with non-penetration constraints. We also show that the PDd between two intersecting, deforming tetrahedra can be found by solving a similar QP problem under some assumption on penetrating directions, and it can also be accelerated by an order of magnitude using pre-calculated penetration direction. We have implemented our algorithm on a standard PC platform using an off-the-shelf QP optimizer, and experimentally show that both the static/deformable and deformable/deformable tetrahedra cases can be solved in a few to tens of milliseconds. Finally, we demonstrate that our penetration metric is three times smaller (or tighter) than the classical, rigid penetration depth metric in our experiments.

Submitted 13 November, 2019; originally announced November 2019.

Comments: Published in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019

arXiv:1901.09351 [pdf, other]

Automated Quality Control in Image Segmentation: Application to the UK Biobank Cardiac MR Imaging Study

Authors: Robert Robinson, Vanya V. Valindria, Wenjia Bai, Ozan Oktay, Bernhard Kainz, Hideaki Suzuki, Mihir M. Sanghvi, Nay Aung, José Miguel Paiva, Filip Zemrak, Kenneth Fung, Elena Lukaschuk, Aaron M. Lee, Valentina Carapella, Young Jin Kim, Stefan K. Piechnik, Stefan Neubauer, Steffen E. Petersen, Chris Page, Paul M. Matthews, Daniel Rueckert, Ben Glocker

Abstract: Background: The trend towards large-scale studies including population imaging poses new challenges in terms of quality control (QC). This is a particular issue when automatic processing tools, e.g. image segmentation methods, are employed to derive quantitative measures or biomarkers for later analyses. Manual inspection and visual QC of each segmentation isn't feasible at large scale. However, it's important to be able to automatically detect when a segmentation method fails so as to avoid inclusion of wrong measurements into subsequent analyses which could lead to incorrect conclusions. Methods: To overcome this challenge, we explore an approach for predicting segmentation quality based on Reverse Classification Accuracy, which enables us to discriminate between successful and failed segmentations on a per-case basis. We validate this approach on a new, large-scale manually-annotated set of 4,800 cardiac magnetic resonance scans. We then apply our method to a large cohort of 7,250 cardiac MRI on which we have performed manual QC. Results: We report results used for predicting segmentation quality metrics including Dice Similarity Coefficient (DSC) and surface-distance measures. As initial validation, we present data for 400 scans demonstrating 99% accuracy for classifying low and high quality segmentations using predicted DSC scores. As further validation we show high correlation between real and predicted scores and 95% classification accuracy on 4,800 scans for which manual segmentations were available. We mimic real-world application of the method on 7,250 cardiac MRI where we show good agreement between predicted quality metrics and manual visual QC scores. Conclusions: We show that RCA has the potential for accurate and fully automatic segmentation QC on a per-case basis in the context of large-scale population imaging as in the UK Biobank Imaging Study.

Submitted 27 January, 2019; originally announced January 2019.

Comments: 14 pages, 7 figures, Journal of Cardiovascular Magnetic Resonance

arXiv:1811.00912 [pdf]

Two-Layered Superposition of Broadcast/Multicast and Unicast Signals in Multiuser OFDMA Systems

Authors: David Vargas, Yong Jin Daniel Kim

Abstract: We study optimal delivery strategies of one common and $K$ independent messages from a source to multiple users in wireless environments. In particular, two-layered superposition of broadcast/multicast and unicast signals is considered in a downlink multiuser OFDMA system. In the literature and industry, the two-layer superposition is often considered as a pragmatic approach to make a compromise between the simple but suboptimal orthogonal multiplexing (OM) and the optimal but complex fully-layered non-orthogonal multiplexing. In this work, we show that only two layers are necessary to achieve the maximum sum-rate when the common message has higher priority than the $K$ individual unicast messages, and OM cannot be sum-rate optimal in general. We develop an algorithm that finds the optimal power allocation over the two layers and across the OFDMA radio resources in static channels and a class of fading channels. Two main use-cases are considered: i) Multicast and unicast multiplexing when $K$ users with uplink capabilities request both common and independent messages, and ii) broadcast and unicast multiplexing when the common message targets receive-only devices and $K$ users with uplink capabilities additionally request independent messages. Finally, we develop a transceiver design for broadcast/multicast and unicast superposition transmission based on LTE-A-Pro physical layer and show with numerical evaluations in mobile environments with multipath propagation that the capacity improvements can be translated into significant practical performance gains compared to the orthogonal schemes in the 3GPP specifications. We also analyze the impact of real channel estimation and show that significant gains in terms of spectral efficiency or coverage area are still available even with estimation errors and imperfect interference cancellation for the two-layered superposition system.

Submitted 4 December, 2019; v1 submitted 2 November, 2018; originally announced November 2018.

arXiv:1806.06244 [pdf, other]

Real-time Prediction of Segmentation Quality

Authors: Robert Robinson, Ozan Oktay, Wenjia Bai, Vanya Valindria, Mihir Sanghvi, Nay Aung, José Paiva, Filip Zemrak, Kenneth Fung, Elena Lukaschuk, Aaron Lee, Valentina Carapella, Young Jin Kim, Bernhard Kainz, Stefan Piechnik, Stefan Neubauer, Steffen Petersen, Chris Page, Daniel Rueckert, Ben Glocker

Abstract: Recent advances in deep learning based image segmentation methods have enabled real-time performance with human-level accuracy. However, occasionally even the best method fails due to low image quality, artifacts or unexpected behaviour of black box algorithms. Being able to predict segmentation quality in the absence of ground truth is of paramount importance in clinical practice, but also in large-scale studies to avoid the inclusion of invalid data in subsequent analysis. In this work, we propose two approaches of real-time automated quality control for cardiovascular MR segmentations using deep learning. First, we train a neural network on 12,880 samples to predict Dice Similarity Coefficients (DSC) on a per-case basis. We report a mean average error (MAE) of 0.03 on 1,610 test samples and 97% binary classification accuracy for separating low and high quality segmentations. Secondly, in the scenario where no manually annotated data is available, we train a network to predict DSC scores from estimated quality obtained via a reverse testing strategy. We report an MAE=0.14 and 91% binary classification accuracy for this case. Predictions are obtained in real-time which, when combined with real-time segmentation methods, enables instant feedback on whether an acquired scan is analysable while the patient is still in the scanner. This further enables new applications of optimising image acquisition towards best possible analysis results.

Submitted 16 June, 2018; originally announced June 2018.

Comments: Accepted at MICCAI 2018
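
Note: the Dice Similarity Coefficient (DSC) that the abstract above uses as its quality target has a standard definition, DSC = 2|A ∩ B| / (|A| + |B|) for two binary masks A and B. The following is a minimal sketch of that definition on flat 0/1 sequences; the function name and the convention of returning 1.0 when both masks are empty are choices made here for illustration, not details taken from the paper.

```python
def dice_coefficient(pred, truth):
    """Dice Similarity Coefficient between two binary masks.

    pred, truth -- equal-length sequences of 0/1 (or falsy/truthy) labels,
                   e.g. a flattened segmentation mask.
    """
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of elements")
    # |A ∩ B|: positions labelled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # |A| + |B|: foreground counts of each mask.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 0, 0]
    print(dice_coefficient(predicted, reference))
```

A score of 1 means the masks coincide and 0 means they share no foreground element, which is why a threshold on predicted DSC can serve as a low/high quality classifier.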

arXiv:1712.00010 [pdf, ps, other]

Highrisk Prediction from Electronic Medical Records via Deep Attention Networks

Authors: You Jin Kim, Yun-Geun Lee, Jeong Whun Kim, Jin Joo Park, Borim Ryu, Jung-Woo Ha

Abstract: Predicting high-risk vascular diseases is a significant issue in the medical domain. Most predicting methods predict the prognosis of patients from pathological and radiological measurements, which are expensive and require much time to be analyzed. Here we propose deep attention models that predict the onset of high-risk vascular disease using only the symbolic medical history sequences of hypertension patients, such as ICD-10 and pharmacy codes: Medical History-based Prediction using Attention Network (MeHPAN). We demonstrate two types of attention models based on 1) bidirectional gated recurrent unit (R-MeHPAN) and 2) 1D convolutional multilayer model (C-MeHPAN). The two MeHPAN models are evaluated on approximately 50,000 hypertension patients with respect to precision, recall, f1-measure and area under the curve (AUC). Experimental results show that our MeHPAN methods outperform standard classification models. Comparing the two MeHPANs, R-MeHPAN provides better discriminative capability with respect to all metrics, while C-MeHPAN offers much shorter training time with competitive accuracy.

Submitted 30 November, 2017; originally announced December 2017.

Comments: Accepted poster at NIPS 2017 Workshop on Machine Learning for Health (https://ml4health.github.io/2017/)

arXiv:1710.09289 [pdf, other]

Automated cardiovascular magnetic resonance image analysis with fully convolutional networks

Authors: Wenjia Bai, Matthew Sinclair, Giacomo Tarroni, Ozan Oktay, Martin Rajchl, Ghislain Vaillant, Aaron M. Lee, Nay Aung, Elena Lukaschuk, Mihir M. Sanghvi, Filip Zemrak, Kenneth Fung, Jose Miguel Paiva, Valentina Carapella, Young Jin Kim, Hideaki Suzuki, Bernhard Kainz, Paul M. Matthews, Steffen E. Petersen, Stefan K. Piechnik, Stefan Neubauer, Ben Glocker, Daniel Rueckert

Abstract: Cardiovascular magnetic resonance (CMR) imaging is a standard imaging modality for assessing cardiovascular diseases (CVDs), the leading cause of death globally. CMR enables accurate quantification of the cardiac chamber volume, ejection fraction and myocardial mass, providing information for diagnosis and monitoring of CVDs. However, for years, clinicians have been relying on manual approaches for CMR image analysis, which is time consuming and prone to subjective errors. It is a major clinical challenge to automatically derive quantitative and clinically relevant information from CMR images. Deep neural networks have shown a great potential in image pattern recognition and segmentation for a variety of tasks. Here we demonstrate an automated analysis method for CMR images, which is based on a fully convolutional network (FCN). The network is trained and evaluated on a large-scale dataset from the UK Biobank, consisting of 4,875 subjects with 93,500 pixelwise annotated images. The performance of the method has been evaluated using a number of technical metrics, including the Dice metric, mean contour distance and Hausdorff distance, as well as clinically relevant measures, including left ventricle (LV) end-diastolic volume (LVEDV) and end-systolic volume (LVESV), LV mass (LVM); right ventricle (RV) end-diastolic volume (RVEDV) and end-systolic volume (RVESV). By combining FCN with a large-scale annotated dataset, the proposed automated method achieves a high performance on par with human experts in segmenting the LV and RV on short-axis CMR images and the left atrium (LA) and right atrium (RA) on long-axis CMR images.

Submitted 22 May, 2018; v1 submitted 25 October, 2017; originally announced October 2017.

Comments: Accepted for publication by Journal of Cardiovascular Magnetic Resonance

arXiv:1704.02724 [pdf, other]

CanvoX: High-resolution VR Painting in Large Volumetric Canvas

Authors: Yeojin Kim, Byungmoon Kim, Jiyang Kim, Young J. Kim

Abstract: With virtual reality, digital painting on 2D canvases is now being extended to 3D spaces. Tilt Brush and Oculus Quill are widely accepted among artists as tools that pave the way to a new form of art - 3D immersive painting. Current 3D painting systems are only a start, emitting textured triangular geometries. In this paper, we advance this new art of 3D painting to 3D volumetric painting that enables an artist to draw a huge scene with full control of spatial color fields. Inspired by the fact that 2D paintings often use vast space to paint background and small but detailed space for foreground, we claim that supporting a large canvas in varying detail is essential for 3D painting. In order to help artists focus and audiences navigate the large canvas space, we provide small artist-defined areas, called rooms, that serve as beacons for artist-suggested scales, spaces, and locations for the intended appreciation view of the painting. Artists and audiences can easily transport themselves between different rooms. Technically, our canvas is represented as an array of deep octrees of depth 24 or higher, built on CPU for volume painting and on GPU for volume rendering using accurate ray casting. On the CPU side, we design an efficient iterative algorithm to refine or coarsen the octree, as a result of volumetric painting strokes, at highly interactive rates, and update the corresponding GPU textures. Then we use GPU-based ray casting algorithms to render the volumetric painting result. We explore precision issues stemming from ray-casting the octree of high depth, and provide a new analysis and verification. From our experimental results as well as the positive feedback from the participating artists, we strongly believe that our new 3D volume painting system can open up a new possibility for VR-driven digital art medium to professional artists as well as to novice users.

Submitted 10 April, 2017; originally announced April 2017.

arXiv:1508.06181 [pdf, other]

doi 10.1145/2077341.2077346

PolyDepth: Real-time Penetration Depth Computation using Iterative Contact-Space Projection

Authors: Changsoo Je, Min Tang, Youngeun Lee, Minkyoung Lee, Young J. Kim

Abstract: We present a real-time algorithm that finds the Penetration Depth (PD) between general polygonal models based on iterative and local optimization techniques. Given an in-collision configuration of an object in configuration space, we find an initial collision-free configuration using several methods such as centroid difference, maximally clear configuration, motion coherence, random configuration, and sampling-based search. We project this configuration onto a local contact space using a variant of continuous collision detection algorithm and construct a linear convex cone around the projected configuration. We then formulate a new projection of the in-collision configuration onto the convex cone as a Linear Complementarity Problem (LCP), which we solve using a type of Gauss-Seidel iterative algorithm. We repeat this procedure until a locally optimal PD is obtained. Our algorithm can process complicated models consisting of tens of thousands of triangles at interactive rates.

Submitted 25 August, 2015; originally announced August 2015.

Comments: Presented in ACM SIGGRAPH 2012. 15 pages, 23 figures

ACM Class: I.2.9; I.3.5; I.3.7; I.6.8

Journal ref: ACM Transactions on Graphics (ToG 2012), Volume 31, Issue 1, Article 5, pp. 1-14, January 1, 2012

arXiv:1403.1048 [pdf]

doi 10.3938/jkps.64.341

Network Structures between Strategies in Iterated Prisoners' Dilemma Games

Authors: Young Jin Kim, Myungkyoon Roh, Seung-Woo Son

Abstract: We use replicator dynamics to study an iterated prisoners' dilemma game with memory. In this study, we investigate the characteristics of all 32 possible strategies with a single-step memory by observing the results when each strategy encounters another one. Based on these results, we define similarity measures between the 32 strategies and perform a network analysis of the relationship between the strategies by constructing a strategies network. Interestingly, we find that a win-lose circulation, like rock-paper-scissors, exists between strategies and that the circulation results from one unusual strategy.

Submitted 5 March, 2014; originally announced March 2014.

Journal ref: The Korean Physical Society February 2014, Volume 64, Issue 3, pp 341-345
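
Note: the count of 32 strategies in the abstract above is consistent with a common encoding of deterministic single-step-memory strategies: one opening move plus one response for each of the four possible outcomes of the previous round, giving 2^5 = 32 combinations. The sketch below enumerates strategies under that assumed encoding and plays two of them against each other; the representation and the names `all_memory_one_strategies` and `play` are illustrative and may differ from the paper's own encoding.

```python
from itertools import product

MOVES = ("C", "D")  # cooperate, defect
# Previous-round outcomes as (my last move, opponent's last move).
OUTCOMES = [("C", "C"), ("C", "D"), ("D", "C"), ("D", "D")]


def all_memory_one_strategies():
    """Enumerate every (opening move, response table) pair: 2**5 = 32."""
    strategies = []
    for first, *responses in product(MOVES, repeat=5):
        strategies.append((first, dict(zip(OUTCOMES, responses))))
    return strategies


def play(strategy_a, strategy_b, rounds):
    """Return the list of (move_a, move_b) pairs over `rounds` rounds."""
    first_a, table_a = strategy_a
    first_b, table_b = strategy_b
    a, b = first_a, first_b
    history = [(a, b)]
    for _ in range(rounds - 1):
        # Each player looks up its response from its own point of view.
        a, b = table_a[(a, b)], table_b[(b, a)]
        history.append((a, b))
    return history


if __name__ == "__main__":
    print(len(all_memory_one_strategies()))
    # Tit-for-tat: open with C, then copy the opponent's last move.
    tit_for_tat = ("C", {("C", "C"): "C", ("C", "D"): "D",
                         ("D", "C"): "C", ("D", "D"): "D"})
    always_defect = ("D", {outcome: "D" for outcome in OUTCOMES})
    print(play(tit_for_tat, always_defect, 3))
```

Running every pair of the 32 strategies through `play` and scoring the histories with a payoff matrix would give the kind of pairwise results from which similarity measures between strategies could be built; the payoff values are left out here because the abstract does not state them.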

Showing 1–50 of 50 results for author: Kim, Y J